Fix Sim2Conn read hanging after peer close#12906
Conversation
When a peer closed a simulated connection, the receiver coroutine would co_await Never() after stopReceive fired, leaving readers with no way to discover the connection was done. Add an incomingClosed AsyncVar that is set after all in-flight bytes have been delivered, so read() and whenReadable() can surface connection_failed instead of blocking forever. Care is taken to switch to self->process before setting incomingClosed, since stopReceive's delay may fire on the closing peer's process.
|
@tclinkenbeard-oai could you have a look at this? I am not sure if this is a latent problem or specifically related to coroutine conversion @daleiz Thanks for sending this in. Do you happen to have any specific repro info, e.g. a commit hash + random seed + buggify value + test name that you have seen problems? |
|
Thanks, this looks like a good fix. I'm curious what simulation failures you saw from the underlying bug. I think |
|
Closing and reopening to trigger CI |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
|
This was not discovered by an FDB simulator test. I hit it in one of our own simulator-based tests In that setup, once |
fixed |
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
|
It looks like there are some ~1/1k test failures due to |
fixed |
|
@daleiz having a look. FWIW Claude review notes are attached if any of this is helpful (I have not looked closely at it). |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Rewrite Sim2Conn termination-sensitive waits and process hops with race-based helpers, keep in-flight bytes deliverable after peer death, and clarify shutdown semantics in the receiver/readable paths. Also make the simulator HTTP response mutex ref-counted so callback actors cannot outlive a stack-allocated FlowMutex when a connection closes early.
|
I checked the latest correctness failures and fixed them: I also fixed a separate lifetime bug in |
Thanks, I'm hoping to have a deeper look tomorrow. BTW the amount of changes here continues to grow -- I am not really sure if there is anything you can do about that, but any way to break this down into smaller PRs would be helpful. Also on a related note, I am not sure if it is possible to come up with a test case that reproduces the original issue you saw, but if there is, that would be beneficial on general principle. But not sure what's easy/possible/etc. |
When a peer closed a simulated connection, the receiver coroutine would
co_await Never()afterstopReceivefired, leaving readers with no way to discover the connection was done. Add anincomingClosed AsyncVarthat is set after all in-flight bytes have been delivered, soread()andwhenReadable()can surfaceconnection_failedinstead of blocking forever.Care is taken to switch to
self->processbefore settingincomingClosed, sincestopReceive's delay may fire on the closing peer's process.