Fix peer disconnect detection in waitValueOrSignal #12935

Merged
spraza merged 2 commits into apple:release-7.3 from saintstack:dd_7.3.disconnect_fix
Apr 13, 2026

@saintstack
Contributor

@saintstack saintstack commented Apr 3, 2026

Add a when() clause watching peer->disconnect in waitValueOrSignal (genericactors.actor.h) so dead connections (e.g., from NAT timeouts) are detected immediately instead of hanging indefinitely waiting on a connection the lower layer has already replaced.

We saw this in an incident while waiting on a long reply over a network with frequent disconnects: the low-level FDB transport would make a new connection, but the high-level caller would keep waiting until it timed out on the original one.

Here's the call chain showing where requests get stuck:


  DD (Data Distributor)
  │
  ├─► trackShardMetrics() [DDShardTracker.actor.cpp:265]
  │   │
  │   └─► loop {
  │       │
  │       └─► waitStorageMetrics() [NativeAPI.actor.cpp:5904]
  │           │
  │           └─► loop {  ◄── RETRY LOOP (catches wrong_shard_server, all_alternatives_failed)
  │               │
  │               └─► waitStorageMetricsWithLocation()
  │                   │
  │                   └─► waitStorageMetricsMultipleLocations() [NativeAPI.actor.cpp:5764]
  │                       │
  │                       ├─► loadBalance(SS_1, waitMetrics) ──► fx[0]
  │                       ├─► loadBalance(SS_2, waitMetrics) ──► fx[1]
  │                       ├─► loadBalance(SS_3, waitMetrics) ──► fx[2]
  │                       │
  │                       └─► waitForAll(fx)  ◄── BLOCKS until ALL complete
  │                           │
  │                           │   If ANY loadBalance is stuck, waitForAll blocks forever
  │                           ▼

  Inside each loadBalance() call:

  loadBalance() [LoadBalance.actor.h]
  │
  └─► loop {  ◄── RETRY LOOP (retries on broken_promise, request_maybe_delivered)
      │
      └─► getReply()
          │
          └─► waitValueOrSignal() [genericactors.actor.h:366]
              │
              └─► loop {
                  │
                  ├─► choose {
                  │   │
                  │   ├─► when(X x = wait(value)) { return x; }     // Wait for SS response
                  │   │
                  │   └─► when(wait(signal)) { return error; }      // Wait for failure signal
                  │   }
                  │
                  └─► catch (broken_promise) {
                          │
                          ├─► endpointNotFound(endpoint)  // Notify failure monitor
                          │
                          └─► value = Never()  ◄── BUG: Waits instead of returning error!
                              │
                              │   Loop continues, waiting on:
                              │   - value = Never() (never completes)
                              │   - signal (may never fire if endpoint looks healthy)
                              │
                              └─► STUCK
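
To make the stuck state concrete, here is a minimal sketch in plain C++ (not Flow, and not the code in this PR). It models one pass over the choose block with and without a disconnect check. The names `WaitState`, `Outcome`, `pollOnce`, and `allResolved` are hypothetical and exist only for this illustration; the real actor uses Flow futures, and the exact error it returns is not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for what one waitValueOrSignal call can observe.
struct WaitState {
    bool hasValue = false;       // reply arrived; false forever models value = Never()
    bool failureSignal = false;  // failure monitor marked the endpoint as failed
    bool peerDisconnect = false; // transport dropped the connection to the peer
};

enum class Outcome { Reply, Error, StillWaiting };

// One evaluation of the choose block.
// watchDisconnect == false models the old behavior (value and signal only);
// watchDisconnect == true models the added when() clause on the peer disconnect.
Outcome pollOnce(const WaitState& s, bool watchDisconnect) {
    if (s.hasValue) return Outcome::Reply;
    if (s.failureSignal) return Outcome::Error;
    if (watchDisconnect && s.peerDisconnect) return Outcome::Error;
    return Outcome::StillWaiting;
}

// Models the waitForAll(fx) condition: the caller can only make progress once
// every per-replica wait has left the StillWaiting state.
bool allResolved(const std::vector<WaitState>& fx, bool watchDisconnect) {
    return std::all_of(fx.begin(), fx.end(), [&](const WaitState& s) {
        return pollOnce(s, watchDisconnect) != Outcome::StillWaiting;
    });
}
```

With three replicas where two have replied and the third has only seen a disconnect, `allResolved(fx, false)` stays false (the stuck case above), while `allResolved(fx, true)` becomes true because the third wait now ends with an error that the retry loop in the caller can act on.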

Also fixes compile errors with fmt::join and template specializations that appear with newer compilers.

Includes unit tests for the peer disconnect detection.

20260409-211017-stack-79f3f7f73910ccd5 compressed=True data_size=50913761 duration=4223818 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:50:46 sanity=False started=100000 stopped=20260409-220103 submitted=20260409-211017 timeout=5400 username=stack

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: ebd2114
  • Duration 0:03:51
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: ebd2114
  • Duration 0:08:19
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: ebd2114
  • Duration 0:08:43
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: ebd2114
  • Duration 0:08:50
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack force-pushed the dd_7.3.disconnect_fix branch from ebd2114 to 0fdc2e8 on April 10, 2026 00:07
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 0fdc2e8
  • Duration 0:04:10
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 0fdc2e8
  • Duration 0:08:38
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 0fdc2e8
  • Duration 0:08:43
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 0fdc2e8
  • Duration 0:08:45
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack force-pushed the dd_7.3.disconnect_fix branch from 0fdc2e8 to 157f046 on April 10, 2026 00:18
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 157f046
  • Duration 0:04:22
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 157f046
  • Duration 0:08:20
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 157f046
  • Duration 0:08:31
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 157f046
  • Duration 0:08:37
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 0fdc2e8
  • Duration 0:33:07
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 0fdc2e8
  • Duration 0:39:53
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass
Collaborator

gxglass commented Apr 10, 2026

Will look closer later. FYI:

image

@gxglass gxglass self-requested a review April 10, 2026 02:22
Collaborator

@gxglass gxglass left a comment

Looks great. I took the liberty of penciling this in as an additional AI (action item) on the postmortem where this was involved.

@gxglass gxglass requested a review from spraza April 10, 2026 21:02
@saintstack
Contributor Author

Thanks for the reviews, @gxglass and @spraza.

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: e99893f
  • Duration 0:03:52
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: e99893f
  • Duration 0:08:46
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: e99893f
  • Duration 0:08:38
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: e99893f
  • Duration 0:08:49
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza spraza merged commit a042081 into apple:release-7.3 Apr 13, 2026
0 of 6 checks passed
@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: e99893f
  • Duration 0:33:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: e99893f
  • Duration 0:39:54
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

saintstack added a commit to saintstack/foundationdb that referenced this pull request Apr 18, 2026
Forward-port from 7.3 to 7.4.

Add a when() clause watching peer->disconnect in waitValueOrSignal
(genericactors.actor.h) so dead connections (e.g., from NAT timeouts)
are detected immediately instead of hanging indefinitely waiting on a
connection the lower layer has already replaced.

We saw this in an incident where waiting on a long reply on a network
with frequent disconnects; low level fdb would make a new connection
but high-level would wait until we timed out on the original.

Includes unit tests for the peer disconnect detection.
saintstack added a commit to saintstack/foundationdb that referenced this pull request Apr 24, 2026
Forward-port from 7.3 to 7.4.

Add a when() clause watching peer->disconnect in waitValueOrSignal
(genericactors.actor.h) so dead connections (e.g., from NAT timeouts)
are detected immediately instead of hanging indefinitely waiting on a
connection the lower layer has already replaced.

We saw this in an incident where waiting on a long reply on a network
with frequent disconnects; low level fdb would make a new connection
but high-level would wait until we timed out on the original.

Includes unit tests for the peer disconnect detection.
saintstack added a commit that referenced this pull request Apr 27, 2026
* Fix peer disconnect detection in waitValueOrSignal (#12935)

Forward-port from 7.3 to 7.4.

Add a when() clause watching peer->disconnect in waitValueOrSignal
(genericactors.actor.h) so dead connections (e.g., from NAT timeouts)
are detected immediately instead of hanging indefinitely waiting on a
connection the lower layer has already replaced.

We saw this in an incident where waiting on a long reply on a network
with frequent disconnects; low level fdb would make a new connection
but high-level would wait until we timed out on the original.

Includes unit tests for the peer disconnect detection.

* Add suppressFor to WaitValueOrSignalPeerDisconnect and remove redundant include
saintstack added a commit that referenced this pull request May 5, 2026
* Fix peer disconnect detection in waitValueOrSignal (#12935)

Forward-port from 7.3 to main.

Add a when() clause watching peer->disconnect in waitValueOrSignal
(genericactors.actor.h) so dead connections (e.g., from NAT timeouts)
are detected immediately instead of hanging indefinitely waiting on a
connection the lower layer has already replaced.

We saw this in an incident where waiting on a long reply on a network
with frequent disconnects; low level fdb would make a new connection
but high-level would wait until we timed out on the original.

Includes unit tests for the peer disconnect detection.

* Remove redundant FlowTransport.h include (already included via fdbrpc.h)

* Address PR feedback: check knownUnauthorized on peer disconnect, fix test comments