fix(libp2p): release probe stream slot on ConnectionMonitor ping error#3464
Open
marshalleq wants to merge 3 commits intolibp2p:mainfrom
Open
fix(libp2p): release probe stream slot on ConnectionMonitor ping error#3464marshalleq wants to merge 3 commits intolibp2p:mainfrom
marshalleq wants to merge 3 commits intolibp2p:mainfrom
Conversation
The ConnectionMonitor heartbeat opens an outbound /ipfs/ping/1.0.0 stream, runs Promise.all([bs.write, bs.read]), then awaits stream.close(). If the signal aborts during write or read (a routine outcome under WAN latency or transient event-loop contention), the await stream.close() line is never reached. The outbound stream remains in the muxer's state and continues to count against libp2p's maxOutboundStreams limit for the protocol. The default maxOutboundStreams for /ipfs/ping/1.0.0 is 1, so a single leaked stream permanently starves the monitor on that connection: the next probe 10s later fails with TooManyOutboundProtocolStreamsError, which itself leaks another slot via the same missing cleanup, and every subsequent probe on that connection fails fast with no bytes on the wire. This patch captures the stream reference at the top of the try block and, in the catch, aborts it if close() was not reached (status is not already 'closed' or 'aborted'). Releases the slot in exactly the error cases it was being leaked. stream.abort() rather than close() because at this point we already know the probe failed — we don't want to wait on a graceful close that may itself block on the aborted signal. Includes a regression test that stubs a stream whose write throws (simulating signal-abort during the exchange) and asserts that the catch block calls stream.abort(). Signed-off-by: Marshalleq <13308996+marshalleq@users.noreply.github.com>
Removed comments explaining stream handling in case of errors.
dozyio
approved these changes
Apr 18, 2026
Collaborator
|
Thanks @marshalleq - will get this merged shortly |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
The
ConnectionMonitorheartbeat opens an outbound/ipfs/ping/1.0.0stream, runsPromise.all([bs.write, bs.read]), then awaitsstream.close({ signal }). If the signal aborts duringwriteorread— which happens routinely under WAN latency or transient event-loop contention — theawait stream.close(...)line is never reached. The outbound stream remains in the muxer's state and continues to count against libp2p'smaxOutboundStreamslimit for the protocol.The default
maxOutboundStreamsfor/ipfs/ping/1.0.0is 1. A single leaked stream therefore permanently starves the monitor on that connection: the next probe 10s later fails withTooManyOutboundProtocolStreamsError, which itself leaks another slot via the same missing cleanup, and every subsequent probe on that connection fails fast with no bytes on the wire.Evidence
Observed in production across a five-host deployment. With
abortConnectionOnPingFailure: falseset (see #3463), probe failures are no longer catastrophic, but the stream-slot leak causes the monitor to silently lose its ability to measure RTT on any connection where a probe has failed at least once.Logs from teardown instrumentation show:
at ~6 events/hour/host. At scale (1M nodes × several connections each) this is noticeable event-loop pressure from work that produces no useful output.
Change
Capture the stream reference at the top of the
tryblock. In thecatch, ifclose()was not reached (status is not already'closed'or'aborted'), callstream.abort(err)to release the slot.abortrather thanclosebecause at this point we already know the probe failed — we don't want to wait on a graceful close that may itself block on the aborted signal.Tests
should abort the probe stream when the ping exchange failsstubs a stream whosesendthrows (simulating signal-abort during the exchange) and assertsstream.abortis called from the catch block.Notes
close()had completed before a subsequent throw (rare but cheap to check).