Skip to content

fix(libp2p): release probe stream slot on ConnectionMonitor ping error#3464

Open
marshalleq wants to merge 3 commits intolibp2p:mainfrom
marshalleq:fix/connection-monitor-stream-leak
Open

fix(libp2p): release probe stream slot on ConnectionMonitor ping error#3464
marshalleq wants to merge 3 commits intolibp2p:mainfrom
marshalleq:fix/connection-monitor-stream-leak

Conversation

@marshalleq
Copy link
Copy Markdown

Motivation

The ConnectionMonitor heartbeat opens an outbound /ipfs/ping/1.0.0 stream, runs Promise.all([bs.write, bs.read]), then awaits stream.close({ signal }). If the signal aborts during write or read — which happens routinely under WAN latency or transient event-loop contention — the await stream.close(...) line is never reached. The outbound stream remains in the muxer's state and continues to count against libp2p's maxOutboundStreams limit for the protocol.

The default maxOutboundStreams for /ipfs/ping/1.0.0 is 1. A single leaked stream therefore permanently starves the monitor on that connection: the next probe 10s later fails with TooManyOutboundProtocolStreamsError, which itself leaks another slot via the same missing cleanup, and every subsequent probe on that connection fails fast with no bytes on the wire.

Evidence

Observed in production across a five-host deployment. With abortConnectionOnPingFailure: false set (see #3463), probe failures are no longer catastrophic, but the stream-slot leak causes the monitor to silently lose its ability to measure RTT on any connection where a probe has failed at least once.

Logs from teardown instrumentation show:

TooManyOutboundProtocolStreamsError: Too many outbound protocol streams for protocol "/ipfs/ping/1.0.0" - 2/1
  at Connection.newStream (libp2p/src/connection.ts:113)

at ~6 events/hour/host. At scale (1M nodes × several connections each) this is noticeable event-loop pressure from work that produces no useful output.

Change

Capture the stream reference at the top of the try block. In the catch, if close() was not reached (status is not already 'closed' or 'aborted'), call stream.abort(err) to release the slot. abort rather than close because at this point we already know the probe failed — we don't want to wait on a graceful close that may itself block on the aborted signal.

Tests

  • New regression test should abort the probe stream when the ping exchange fails stubs a stream whose send throws (simulating signal-abort during the exchange) and asserts stream.abort is called from the catch block.
  • All 9 existing connection-monitor tests still pass.

Notes

The ConnectionMonitor heartbeat opens an outbound /ipfs/ping/1.0.0
stream, runs Promise.all([bs.write, bs.read]), then awaits
stream.close(). If the signal aborts during write or read (a routine
outcome under WAN latency or transient event-loop contention), the
await stream.close() line is never reached. The outbound stream
remains in the muxer's state and continues to count against libp2p's
maxOutboundStreams limit for the protocol.

The default maxOutboundStreams for /ipfs/ping/1.0.0 is 1, so a single
leaked stream permanently starves the monitor on that connection: the
next probe 10s later fails with TooManyOutboundProtocolStreamsError,
which itself leaks another slot via the same missing cleanup, and
every subsequent probe on that connection fails fast with no bytes on
the wire.

This patch captures the stream reference at the top of the try block
and, in the catch, aborts it if close() was not reached (status is
not already 'closed' or 'aborted'). Releases the slot in exactly the
error cases it was being leaked. stream.abort() rather than close()
because at this point we already know the probe failed — we don't
want to wait on a graceful close that may itself block on the aborted
signal.

Includes a regression test that stubs a stream whose write throws
(simulating signal-abort during the exchange) and asserts that the
catch block calls stream.abort().

Signed-off-by: Marshalleq <13308996+marshalleq@users.noreply.github.com>
@marshalleq marshalleq requested a review from a team as a code owner April 18, 2026 01:03
@dozyio
Copy link
Copy Markdown
Collaborator

dozyio commented Apr 18, 2026

Thanks @marshalleq - will get this merged shortly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants