fix(libp2p): default abortConnectionOnPingFailure to false by marshalleq · Pull Request #3463 · libp2p/js-libp2p

marshalleq · 2026-04-18T01:02:41Z

Motivation

`@libp2p/connection-monitor` defaults to `abortConnectionOnPingFailure: true`. On any probe failure this calls `conn.abort(err)`, which cascades through the muxer (GoAway) and every multiplexed stream, and terminates the underlying MA-connection with a TCP RST on `@libp2p/tcp`.

Under normal WAN conditions — trans-continental RTT, intermittent event-loop contention on either peer, transient network jitter — a single 10s probe exchange failing is routine. The current default treats it as evidence the connection is dead and destroys it.

Evidence

The measurements come from a five-host deployment across three continents (Germany, Australia, New Zealand) on libp2p 3.x with yamux keepalive enabled (`enableKeepAlive: true`, `keepAliveInterval: 10_000`). Baseline disconnect rates per host over a 30-minute window, under the current default:

| Host | Role | Disconnect rate |
| --- | --- | --- |
| bootstrap (IONOS, DE) | VPS bootstrap | ~36/hr |
| bootstrap-de (Contabo, DE) | VPS bootstrap | ~42/hr |
| station3 (Contabo, AU) | VPS node | ~25/hr |
| station .11 (TrueNAS, NZ) | LAN guardian | ~48/hr |
| station .13 (LAN, NZ) | LAN guardian | ~62/hr |

Packet capture on the AU→NZ link confirmed the Node process itself sends the TCP RST mid-traffic, in the same event-loop tick as a preceding yamux GoAway burst. Instrumenting every abort call site in production produced stack traces rooted overwhelmingly at `connection-monitor.ts` → `conn.abort(err)`.

Setting `abortConnectionOnPingFailure: false` on all five hosts dropped the measured disconnect rate to 0/hr on every host over multi-hour windows. No other change was required; yamux keepalive handled liveness without the RST cascade. I have a full writeup of the hypotheses tested and their results, but it currently exists only as an offline copy (StreamResetInvestigation.md). I can share it if required.
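For reference, the override used on the five hosts amounts to two small option objects. The sketch below uses local stand-in interfaces rather than the real libp2p types so that it runs on its own. How the objects are wired in is an assumption (a `connectionMonitor` key in the node config and the yamux factory's init argument) and should be checked against the installed libp2p version.

```typescript
// Local stand-in types that mirror only the options named in this report.
// They are NOT imported from libp2p; treat the shapes as assumptions.
interface ConnectionMonitorOptions {
  abortConnectionOnPingFailure?: boolean
}

interface YamuxKeepAliveOptions {
  enableKeepAlive?: boolean
  keepAliveInterval?: number // milliseconds
}

// Opt out of aborting the connection when a single probe fails.
const connectionMonitorOptions: ConnectionMonitorOptions = {
  abortConnectionOnPingFailure: false
}

// Keep liveness detection at the muxer layer, as in the deployment above.
const yamuxOptions: YamuxKeepAliveOptions = {
  enableKeepAlive: true,
  keepAliveInterval: 10_000
}
```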

Change

Flips `DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE` from `true` to `false`. The option is still honoured when set explicitly. Applications that want the aggressive behaviour opt in rather than opt out.
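A minimal sketch of the intended semantics, assuming the option is resolved with a nullish fallback to the constant. `MonitorInit`, `AbortableConnection`, `resolveAbortOnPingFailure` and `handlePingFailure` are hypothetical names for illustration, not the actual `connection-monitor.ts` source.

```typescript
// Simplified stand-ins; only the constant name comes from the PR description.
const DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE = false // was `true` before this change

interface MonitorInit {
  abortConnectionOnPingFailure?: boolean
}

interface AbortableConnection {
  abort(err: Error): void
}

// An explicit `true` or `false` is honoured; only `undefined` falls back
// to the default.
function resolveAbortOnPingFailure (init: MonitorInit = {}): boolean {
  return init.abortConnectionOnPingFailure ?? DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE
}

// Returns true if the connection was aborted in response to the failure.
function handlePingFailure (conn: AbortableConnection, err: Error, init: MonitorInit = {}): boolean {
  if (resolveAbortOnPingFailure(init)) {
    conn.abort(err)
    return true
  }
  return false
}
```

With no option set, a failed probe leaves the connection alone; passing `{ abortConnectionOnPingFailure: true }` restores the previous behaviour.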

Tests

- Two existing tests that relied on the old default (`should abort a connection that times out`, `should abort a connection that fails`) now pass `abortConnectionOnPingFailure: true` explicitly; their intent was always to test the abort mechanism, not the default.
- New regression test `should not abort a connection that fails by default` guards against accidental regression of the new default.
- All 9 connection-monitor tests pass.

Alternative considered

A more conservative middle-ground would be to require N consecutive probe failures before aborting (with N defaulting to e.g. 3), preserving the dead-connection-detection intent while eliminating the false-positive cascade. Happy to follow up with that if maintainers prefer — but changing the default to `false` is the simpler, strictly-safer change and matches what every deployment we're aware of ends up doing manually after hitting this problem.
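The N-consecutive-failures alternative could look roughly like the sketch below. `ConsecutiveFailureGate` is a hypothetical helper, not part of libp2p or of this PR; the threshold of 3 is only the example value mentioned above.

```typescript
// Hypothetical helper: counts consecutive probe failures and reports when
// the configured threshold has been reached. Any success resets the count.
class ConsecutiveFailureGate {
  private failures = 0
  private readonly threshold: number

  constructor (threshold: number = 3) {
    if (!Number.isInteger(threshold) || threshold < 1) {
      throw new RangeError('threshold must be a positive integer')
    }
    this.threshold = threshold
  }

  // Call after a successful probe.
  recordSuccess (): void {
    this.failures = 0
  }

  // Call after a failed probe; returns true once the caller should abort.
  recordFailure (): boolean {
    this.failures++
    return this.failures >= this.threshold
  }
}
```

A monitor would keep one gate per connection and call `conn.abort(err)` only when `recordFailure()` returns `true`, so an isolated failed probe followed by a success never tears the connection down.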

A single 10s ping exchange failing is not reliable evidence that a connection is dead under real WAN conditions. Trans-continental RTT, event-loop contention, and transient network jitter routinely cause probes to fail. Yamux keepalive detects genuinely dead connections at the muxer layer without the aggressive RST cascade that conn.abort() triggers, so the upstream default was causing measurable self-harm at non-trivial scale.

Deployment evidence: a five-host network spanning three continents (Germany, Australia, New Zealand) with yamux keepalive enabled showed 25–62 disconnects/hr/host under the upstream default. Setting abortConnectionOnPingFailure: false on all five hosts dropped the measured rate to 0/hr/host over multi-hour windows. No other change was required. Pcap confirmed the Node process itself sent the TCP RSTs mid-traffic, in the same event-loop tick as a preceding yamux GoAway burst rooted at connection-monitor.ts.

Applications that still want the aggressive behaviour can opt in explicitly. Existing tests that relied on the old default have been updated to pass abortConnectionOnPingFailure: true explicitly, and a new regression test asserts the new default behaviour.

Signed-off-by: Marshalleq <13308996+marshalleq@users.noreply.github.com>

marshalleq requested a review from a team as a code owner April 18, 2026 01:02

marshalleq mentioned this pull request Apr 18, 2026

fix(libp2p): release probe stream slot on ConnectionMonitor ping error #3464

Open

marshalleq wants to merge 1 commit into libp2p:main from marshalleq:fix/connection-monitor-abort-default
