fix(libp2p): default abortConnectionOnPingFailure to false#3463

Open
marshalleq wants to merge 1 commit into libp2p:main from marshalleq:fix/connection-monitor-abort-default

@marshalleq marshalleq commented Apr 18, 2026

Motivation

`@libp2p/connection-monitor` defaults to `abortConnectionOnPingFailure: true`. On any probe failure it calls `conn.abort(err)`, which cascades through the muxer (GoAway) and every multiplexed stream, and on `@libp2p/tcp` terminates the underlying MA-connection with a TCP RST.

Under normal WAN conditions (trans-continental RTT, intermittent event-loop contention on either peer, transient network jitter), a single 10s probe exchange failing is routine. The current default treats that failure as evidence that the connection is dead and destroys it.

Evidence

The evidence comes from a five-host deployment across three continents (Germany, Australia, New Zealand) running libp2p 3.x with yamux keepalive enabled (`enableKeepAlive: true`, `keepAliveInterval: 10_000`). Baseline disconnect rates per host, measured over a 30-minute window:

| Host | Role | Disconnect rate |
| --- | --- | --- |
| bootstrap (IONOS, DE) | VPS bootstrap | ~36/hr |
| bootstrap-de (Contabo, DE) | VPS bootstrap | ~42/hr |
| station3 (Contabo, AU) | VPS node | ~25/hr |
| station .11 (TrueNAS, NZ) | LAN guardian | ~48/hr |
| station .13 (LAN, NZ) | LAN guardian | ~62/hr |

Packet capture on the AU→NZ link confirmed the Node process itself sends the TCP RST mid-traffic, in the same event-loop tick as a preceding yamux GoAway burst. Instrumenting every abort call site in production produced stack traces rooted overwhelmingly at `connection-monitor.ts` → `conn.abort(err)`.

Setting `abortConnectionOnPingFailure: false` on all five hosts dropped the measured disconnect rate to 0/hr on every host over multi-hour windows, with no other change required. Yamux keepalive handled liveness without the RST cascade. A full writeup of the hypotheses tested and their results exists, but currently only as an offline copy in StreamResetInvestigation.md; I can share it on request.

Change

Flips `DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE` from `true` to `false`. The option is still honoured when set explicitly. Applications that want the aggressive behaviour opt in rather than opt out.
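
To make the effect of the flipped default concrete, here is a minimal, self-contained sketch of how an optional flag can fall back to a default constant. It assumes the option is resolved with nullish coalescing; `MonitorInit` and `resolveAbortOnPingFailure` are hypothetical names used for illustration and are not taken from the library source.

```typescript
// Hypothetical model of option defaulting; not the libp2p implementation.
const DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE = false

interface MonitorInit {
  // When omitted, the default constant above applies.
  abortConnectionOnPingFailure?: boolean
}

function resolveAbortOnPingFailure (init: MonitorInit = {}): boolean {
  // `??` only falls back on undefined/null, so an explicit `false` or `true` is honoured.
  return init.abortConnectionOnPingFailure ?? DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE
}
```

Under this model, callers that pass nothing get the non-aborting behaviour, while callers that pass `abortConnectionOnPingFailure: true` keep the previous behaviour.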

Tests

  • Two existing tests that relied on the old default (`should abort a connection that times out`, `should abort a connection that fails`) now pass `abortConnectionOnPingFailure: true` explicitly — their intent was always to test the abort mechanism, not the default.
  • New regression test `should not abort a connection that fails by default` guards against accidental regression of the new default.
  • All 9 connection-monitor tests pass.
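
The behaviour these tests pin down can be modelled without libp2p at all. The sketch below is an illustration under assumptions, not the actual test code: `FakeConnection` and `handlePingFailure` are hypothetical stand-ins for a connection and for the monitor's failure path.

```typescript
// Hypothetical stand-ins for illustration; the real tests live in the libp2p repository.
interface AbortableConnection {
  abort: (err: Error) => void
}

class FakeConnection implements AbortableConnection {
  // Records every error passed to abort() so a test can inspect it.
  public readonly abortedWith: Error[] = []

  abort (err: Error): void {
    this.abortedWith.push(err)
  }
}

// Assumed shape of the failure path: abort only when the flag is explicitly enabled.
function handlePingFailure (
  conn: AbortableConnection,
  err: Error,
  abortConnectionOnPingFailure: boolean = false
): void {
  if (abortConnectionOnPingFailure) {
    conn.abort(err)
  }
}
```

With the flag omitted, a failed probe leaves the fake connection untouched; with the flag set to `true`, the connection is aborted with the probe error.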

Alternative considered

A more conservative middle ground would be to require N consecutive probe failures before aborting (with N defaulting to, e.g., 3), preserving the dead-connection-detection intent while eliminating the false-positive cascade. Happy to follow up with that if maintainers prefer, but changing the default to `false` is the simpler, safer change and matches what every deployment we're aware of ends up doing manually after hitting this problem.
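
For reference, the consecutive-failure alternative could look roughly like the sketch below. It is not part of this PR; `ConsecutiveFailureGate` and its method names are hypothetical, and the threshold of 3 is only the example value mentioned above.

```typescript
// Hypothetical sketch of the "N consecutive failures" alternative; not part of this PR.
class ConsecutiveFailureGate {
  private failures = 0
  private readonly threshold: number

  constructor (threshold: number = 3) {
    if (!Number.isInteger(threshold) || threshold < 1) {
      throw new RangeError('threshold must be a positive integer')
    }
    this.threshold = threshold
  }

  // A successful probe resets the streak of failures.
  recordSuccess (): void {
    this.failures = 0
  }

  // Returns true once the number of consecutive failures reaches the threshold.
  recordFailure (): boolean {
    this.failures++
    return this.failures >= this.threshold
  }
}
```

A monitor using such a gate would keep one instance per connection and abort only when `recordFailure()` returns `true`, so an isolated failed probe between successful ones never tears the connection down.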

A single 10s ping exchange failing is not reliable evidence that a
connection is dead under real WAN conditions. Trans-continental RTT,
event-loop contention, and transient network jitter routinely cause
probes to fail. Yamux keepalive detects genuinely dead connections at
the muxer layer without the aggressive RST cascade that
conn.abort() triggers, so the upstream default was causing measurable
self-harm at non-trivial scale.

Deployment evidence: a five-host network spanning three continents
(Germany, Australia, New Zealand) with yamux keepalive enabled showed
25–62 disconnects/hr/host under the upstream default. Setting
abortConnectionOnPingFailure: false on all five hosts dropped the
measured rate to 0/hr/host over multi-hour windows. No other change
was required. Pcap confirmed the Node process itself sent the TCP RSTs
mid-traffic, in the same event-loop tick as a preceding yamux GoAway
burst rooted at connection-monitor.ts.

Applications that still want the aggressive behaviour can opt in
explicitly. Existing tests that relied on the old default have been
updated to pass abortConnectionOnPingFailure: true explicitly, and a
new regression test asserts the new default behaviour.

Signed-off-by: Marshalleq <13308996+marshalleq@users.noreply.github.com>