fix(libp2p): default abortConnectionOnPingFailure to false#3463
Open
marshalleq wants to merge 1 commit intolibp2p:mainfrom
Open
fix(libp2p): default abortConnectionOnPingFailure to false#3463marshalleq wants to merge 1 commit intolibp2p:mainfrom
marshalleq wants to merge 1 commit intolibp2p:mainfrom
Conversation
A single 10s ping exchange failing is not reliable evidence that a connection is dead under real WAN conditions. Trans-continental RTT, event-loop contention, and transient network jitter routinely cause probes to fail. Yamux keepalive detects genuinely dead connections at the muxer layer without the aggressive RST cascade that conn.abort() triggers, so the upstream default was causing measurable self-harm at non-trivial scale. Deployment evidence: a five-host network spanning three continents (Germany, Australia, New Zealand) with yamux keepalive enabled showed 25–62 disconnects/hr/host under the upstream default. Setting abortConnectionOnPingFailure: false on all five hosts dropped the measured rate to 0/hr/host over multi-hour windows. No other change was required. Pcap confirmed the Node process itself sent the TCP RSTs mid-traffic, in the same event-loop tick as a preceding yamux GoAway burst rooted at connection-monitor.ts. Applications that still want the aggressive behaviour can opt in explicitly. Existing tests that relied on the old default have been updated to pass abortConnectionOnPingFailure: true explicitly, and a new regression test asserts the new default behaviour. Signed-off-by: Marshalleq <13308996+marshalleq@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
@libp2p/connection-monitordefaults toabortConnectionOnPingFailure: true. On any probe failure this callsconn.abort(err), which cascades through the muxer (GoAway) and every multiplexed stream, and terminates the underlying MA-connection with a TCP RST on@libp2p/tcp.Under normal WAN conditions — trans-continental RTT, intermittent event-loop contention on either peer, transient network jitter — a single 10s probe exchange failing is routine. The current default treats it as evidence the connection is dead and destroys it.
Evidence
A five-host deployment across three continents (Germany, Australia, New Zealand) on libp2p 3.x with yamux keepalive enabled (
enableKeepAlive: true,keepAliveInterval: 10_000). Baseline measurements per host over a 30-minute window:Packet capture on the AU→NZ link confirmed the Node process itself sends the TCP RST mid-traffic, in the same event-loop tick as a preceding yamux GoAway burst. Instrumenting every abort call site in production produced stack traces rooted overwhelmingly at `connection-monitor.ts` → `conn.abort(err)`.
Setting `abortConnectionOnPingFailure: false` on all five hosts dropped the measured disconnect rate to 0/hr across all hosts over multi-hour windows — no other change was required. Yamux keepalive handled liveness without the RST cascade. I do have a full writeup available of the hypotheses tested and their results to get to this point, but it's currently an offline copy in StreamResetInvestigation.md. Please advise if required.
Change
Flips `DEFAULT_ABORT_CONNECTION_ON_PING_FAILURE` from `true` to `false`. The option is still honoured when set explicitly. Applications that want the aggressive behaviour opt in rather than opt out.
Tests
Alternative considered
A more conservative middle-ground would be to require N consecutive probe failures before aborting (with N defaulting to e.g. 3), preserving the dead-connection-detection intent while eliminating the false-positive cascade. Happy to follow up with that if maintainers prefer — but changing the default to `false` is the simpler, strictly-safer change and matches what every deployment we're aware of ends up doing manually after hitting this problem.