Skip to content

Fix Sim2Conn read hanging after peer close#12906

Open
daleiz wants to merge 4 commits into
apple:mainfrom
daleiz:sim2conn-read-close-fix
Open

Fix Sim2Conn read hanging after peer close#12906
daleiz wants to merge 4 commits into
apple:mainfrom
daleiz:sim2conn-read-close-fix

Conversation

@daleiz
Copy link
Copy Markdown
Contributor

@daleiz daleiz commented Mar 31, 2026

When a peer closed a simulated connection, the receiver coroutine would co_await Never() after stopReceive fired, leaving readers with no way to discover the connection was done. Add an incomingClosed AsyncVar that is set after all in-flight bytes have been delivered, so read() and whenReadable() can surface connection_failed instead of blocking forever.

Care is taken to switch to self->process before setting incomingClosed, since stopReceive's delay may fire on the closing peer's process.

When a peer closed a simulated connection, the receiver coroutine would
co_await Never() after stopReceive fired, leaving readers with no way to
discover the connection was done. Add an incomingClosed AsyncVar that is
set after all in-flight bytes have been delivered, so read() and
whenReadable() can surface connection_failed instead of blocking forever.

Care is taken to switch to self->process before setting incomingClosed,
since stopReceive's delay may fire on the closing peer's process.
@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Mar 31, 2026

@tclinkenbeard-oai could you have a look at this? I am not sure if this is a latent problem or specifically related to coroutine conversion

@daleiz Thanks for sending this in. Do you happen to have any specific repro info, e.g. a commit hash + random seed + buggify value + test name that you have seen problems?

@tclinkenbeard-oai tclinkenbeard-oai self-requested a review March 31, 2026 16:37
@tclinkenbeard-oai
Copy link
Copy Markdown
Collaborator

Thanks, this looks like a good fix. I'm curious what simulation failures you saw from the underlying bug. I think connectionMonitor should usually detect the peer closure at a higher level

@tclinkenbeard-oai
Copy link
Copy Markdown
Collaborator

Closing and reopening to trigger CI

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 720af66
  • Duration 0:40:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 720af66
  • Duration 0:43:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 720af66
  • Duration 0:54:47
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 720af66
  • Duration 0:54:52
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@daleiz
Copy link
Copy Markdown
Contributor Author

daleiz commented Apr 2, 2026

@tclinkenbeard-oai

This was not discovered by an FDB simulator test. I hit it in one of our own simulator-based tests
built on top of FDB's simulated IConnection, without going through FlowTransport / Peer, so
connectionMonitor was not involved.

In that setup, once read() returns 0, the caller depends on onReadable() to either become ready when more
bytes arrive or throw once the connection is no longer readable. The bug in Sim2Conn was that after peer
close, it could reach a state where no more bytes could ever arrive, but onReadable() still waited forever.

@daleiz
Copy link
Copy Markdown
Contributor Author

daleiz commented Apr 2, 2026

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 720af66
  • Duration 0:54:47
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

fixed

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 720af66
  • Duration 2:15:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 546a812
  • Duration 0:25:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 546a812
  • Duration 0:44:08
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 546a812
  • Duration 0:45:15
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 546a812
  • Duration 0:47:39
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 546a812
  • Duration 2:12:23
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@tclinkenbeard-oai tclinkenbeard-oai self-requested a review April 2, 2026 19:25
@tclinkenbeard-oai
Copy link
Copy Markdown
Collaborator

It looks like there are some ~1/1k test failures due to incomingClosed.set(true) waking readers causing invalid Sim2Conn object access

@daleiz
Copy link
Copy Markdown
Contributor Author

daleiz commented Apr 3, 2026

It looks like there are some ~1/1k test failures due to incomingClosed.set(true) waking readers causing invalid Sim2Conn object access

fixed

@daleiz daleiz closed this Apr 7, 2026
@daleiz daleiz reopened this Apr 7, 2026
@gxglass gxglass self-requested a review April 7, 2026 20:11
@gxglass gxglass self-requested a review April 7, 2026 20:11
@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Apr 7, 2026

@daleiz having a look. FWIW Claude review notes are attached if any of this is helpful (I have not looked closely at it).
pr12906_review.md

@gxglass gxglass closed this Apr 7, 2026
@gxglass gxglass reopened this Apr 7, 2026
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: feaddc5
  • Duration 0:26:49
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: feaddc5
  • Duration 0:36:35
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: feaddc5
  • Duration 0:43:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: feaddc5
  • Duration 0:46:38
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: feaddc5
  • Duration 0:56:31
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: feaddc5
  • Duration 1:03:32
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: feaddc5
  • Duration 2:05:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Rewrite Sim2Conn termination-sensitive waits and process hops with race-based helpers, keep in-flight bytes deliverable after peer death, and clarify shutdown semantics in the receiver/readable paths.

Also make the simulator HTTP response mutex ref-counted so callback actors cannot outlive a stack-allocated FlowMutex when a connection closes early.
@daleiz
Copy link
Copy Markdown
Contributor Author

daleiz commented Apr 8, 2026

@gxglass @tclinkenbeard-oai

I checked the latest correctness failures and fixed them: pass=9998 fail=2 and pass=9999 fail=1, all 3 failed seeds now pass locally. Please run the tests again.

I also fixed a separate lifetime bug in HTTPServer, where the response mutex used by callback actors could outlive the stack frame that created it. That mutex is now held through a ref-counted wrapper.

@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Apr 8, 2026

@gxglass @tclinkenbeard-oai

I checked the latest correctness failures and fixed them: pass=9998 fail=2 and pass=9999 fail=1, all 3 failed seeds now pass locally. Please run the tests again.

I also fixed a separate lifetime bug in HTTPServer, where the response mutex used by callback actors could outlive the stack frame that created it. That mutex is now held through a ref-counted wrapper.

Thanks, I'm hoping to have a deeper look tomorrow. BTW the amount of changes here continues to grow -- I am not really sure if there is anything you can do about that, but any way to break this down into smaller PRs would be helpful. Also on a related note, I am not sure if it is possible to come up with a test case that reproduces the original issue you saw, but if there is, that would be beneficial on general principle. But not sure what's easy/possible/etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants