Skip to content

Handle quick channel closing#11428

Merged
smartprogrammer93 merged 4 commits into
masterfrom
fix-rlpx-stale-closed-session
May 4, 2026
Merged

Handle quick channel closing#11428
smartprogrammer93 merged 4 commits into
masterfrom
fix-rlpx-stale-closed-session

Conversation

@flcl42
Copy link
Copy Markdown
Contributor

@flcl42 flcl42 commented Apr 30, 2026

Fixes #11232

Changes

  • Handle quick channel closing

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 30, 2026

Claude finished @flcl42's task in 3m 7s —— View job


PR Review

  • Gather context and read changed files
  • Read relevant rule files
  • Analyze changes for correctness, edge cases, security, and performance
  • Post review findings

Summary

This PR fixes a race window in the RLPx session lifecycle: when a channel closes, OnChannelCloseCompleted fires immediately but the MarkDisconnected call is intentionally delayed by 1 second (to allow any in-flight disconnect-reason packets to arrive). During that 1-second window, HasOpenSession still returned true for the dying session — causing new incoming sessions from the same peer to be rejected as "already connected."

The fix is correct and targeted:

  • Session.MarkChannelClosed() is called immediately when the channel close task completes, before scheduling the delay.
  • HasOpenSession is extended to treat a channel-closed session as not open, breaking the stale-session deadlock.
  • A regression test is included.

Findings

Medium — Downcast coupling in HasOpenSession (see inline comment on PeerManager.cs:1115)

HasOpenSession downcasts ISession to the concrete Session class to read IsChannelClosed. Both types are internal to Nethermind.Network, so access is fine today, but the check silently becomes a no-op for any non-Session ISession implementation (test doubles, future variants). This means:

  • Test doubles that implement ISession directly won't benefit from the channel-closed short-circuit, which could make certain integration tests misleading.
  • There's no compiler enforcement that new ISession implementations expose IsChannelClosed.

The idiomatic fix is to add bool IsChannelClosed { get; } to ISession. If that's considered out of scope for this bug fix, a comment explaining the assumption would help future maintainers. See the inline suggestion on the file.

Low — Test stale session never reaches MarkDisconnected

Will_replace_channel_closed_incoming_session_before_delayed_disconnect verifies the happy path (fresh session becomes the active peer) but doesn't exercise the end-state of the stale session. In production, after 1 second, MarkDisconnectedAfterCloseDelay will call staleSession.MarkDisconnected(ConnectionClosed, Remote, ...). If the stale session is already attached to a peer at that point, OnDisconnected in PeerManager would attempt to clean it up. The test doesn't verify that this delayed clean-up doesn't corrupt the state (e.g., accidentally deactivating the fresh session's peer). This is acceptable for an initial regression test but worth noting for follow-up.

Low — InitSession sets up unused channel substitute

IChannel channel = Substitute.For<IChannel>();
context.Channel.Returns(channel);

Session.Init does not access _context.Channel during initialization (only during EnableSnappy), so the channel substitute is set up but never exercised. Harmless but adds noise.


Correctness & Thread Safety ✓

  • _isChannelClosed is volatile bool, which is correct: writes are immediately visible across threads; no torn reads; write-once semantics fit perfectly.
  • MarkChannelClosed() is idempotent — safe to call multiple times.
  • The HasOpenSession logic is correct for both Session (checks channel-closed) and any other ISession (falls back to IsClosing only, matching prior behaviour).
  • The 1-second delay in MarkDisconnectedAfterCloseDelay is preserved — this PR doesn't disturb the existing graceful-disconnect flow for the case where a disconnect reason is received just after channel close.

Overall

1 Medium, 2 Low findings. The fix is directionally correct and the test is a solid regression guard. The medium finding (downcast coupling) should be acknowledged — either by adding IsChannelClosed to ISession or by leaving an explanatory comment — before merge.

Comment thread src/Nethermind/Nethermind.Network/PeerManager.cs Outdated
@flcl42 flcl42 marked this pull request as ready for review May 1, 2026 07:53
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 1, 2026

Claude finished @flcl42's task in 3m 49s —— View job


PR Review

  • Gather context and read changed files
  • Read relevant rule files
  • Analyze changes for correctness, edge cases, security, and performance
  • Post review findings

Summary

This PR fixes a real and reproducible race in the RLPx session lifecycle. When a transport channel closes, MarkChannelClosed() is now called immediately, and HasOpenSession now treats a channel-closed session as not open. This breaks the 1-second window where a dying session blocked new incoming connections from the same peer.

The previous Medium finding from the April 30 review (downcast coupling in HasOpenSession) has been resolved by adding bool IsChannelClosed => false; as a default interface member (DIM) to ISession. HasOpenSession now calls session.IsChannelClosed directly through the interface — no downcast.


Findings

Low — MarkChannelClosed() downcast in RlpxHost is unavoidably asymmetric

ISession.IsChannelClosed is readable through the interface (DIM on ISession), but MarkChannelClosed() is internal and thus can only be called through the concrete Session type. OnChannelCloseCompleted in RlpxHost.cs therefore still does:

if (state is Session session)
{
    session.MarkChannelClosed();
}

This is correct and intentional given that MarkChannelClosed is an internal implementation detail that cannot be part of the public ISession interface. If the cast fails (impossible in practice since InitializeChannel always passes a concrete Session as state), the call silently no-ops. A brief comment would help future readers understand why the readable property is on the interface but the write method is not:

// MarkChannelClosed() is internal and cannot be on ISession; the readable property IsChannelClosed is.
if (state is Session session)
{
    session.MarkChannelClosed();
}

Low — Test leaves peer.OutSession stale after the scenario

Will_replace_channel_closed_incoming_session_before_delayed_disconnect correctly verifies:

  1. freshSession becomes peer.InSession
  2. Calling staleSession.MarkDisconnected(...) does not evict the fresh session ✓

However, after staleSession.MarkDisconnected fires, peer.OutSession (null in this test) is not exercised. In the corresponding production path, peer.OutSession could point to the stale session if the closed connection was outgoing — and after OnDisconnected returns early (session ID mismatch), peer.OutSession is never cleared. This is a pre-existing concern and a low-risk follow-up, not a blocker.


Correctness & Thread Safety ✓

  • volatile bool _isChannelClosed is correct: write-once semantics, immediate visibility across threads, no torn reads.
  • MarkChannelClosed() is idempotent — safe to call multiple times.
  • The HasOpenSession short-circuit logic is correct for all four state combinations (IsClosing × IsChannelClosed).
  • ResolveSessionConflict is not entered in the fixed case because IsConnected(peer) = false when the only session is channel-closed, so CanAttachSessionDirectly returns true directly.
  • The 1-second MarkDisconnectedAfterCloseDelay is preserved and fires correctly regardless of whether _shutdownCts has been cancelled.
  • OnDisconnected correctly ignores the stale session's delayed disconnect by checking session ID against activePeer.InSession and activePeer.OutSession.

Overall

0 Critical, 0 High, 0 Medium findings. The prior Medium (downcast in HasOpenSession) is resolved. The fix is correct, thread-safe, and well-tested. Ready to merge.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 1, 2026

Claude finished @flcl42's task in 4m 40s —— View job


PR Review

  • Gather context and read changed files
  • Read relevant rule files
  • Analyze changes for correctness, edge cases, security, and performance
  • Post review findings

Summary

This review revisits the same 5 changed files against origin/master. The fix closes a 1-second race window where a dying RLPx session blocked new incoming connections from the same peer: MarkChannelClosed() is now called immediately on channel-close, and HasOpenSession treats a channel-closed session as not open.

The two prior medium findings are resolved:

  • April 30 Medium (downcast coupling in HasOpenSession): resolved by adding bool IsChannelClosed => false; as a default interface member on ISession.
  • May 1 Low (test not calling staleSession.MarkDisconnected): test now calls it at line 214 and asserts the fresh session survives.

Findings

Low — ConnectedPeers / ConnectedPeersCount now immediately excludes channel-closed peers

ConnectedPeers (PeerManager.cs:100) and ConnectedPeersCount (PeerManager.cs:104) both delegate to IsConnected(peer)HasOpenSession. With the fix, a peer whose transport channel has closed — but whose 1-second MarkDisconnected delay hasn't expired yet — is immediately excluded from both properties.

The connection selection loop at line 536 is safe because it checks _peerPool.ActivePeers.TryGetValue(peer.Node.Id, out _) separately, and the peer stays in ActivePeers for the full delay window. But ConnectedPeersCount is surfaced via parity_netPeers (ParityRpcModule.cs:88), so callers will see a 1-second dip in reported peer count per disconnection event. This is arguably more accurate than the prior behavior but is a subtle behavioral change worth noting.

Low — Missing comment explaining the MarkChannelClosed() downcast in RlpxHost

IsChannelClosed is readable through ISession (DIM), but MarkChannelClosed() is internal and can only be called through the concrete Session type. The if (state is Session session) pattern is therefore intentional — if the cast fails (impossible in practice) the call silently no-ops. A short comment would future-proof this:

// MarkChannelClosed() is internal; IsChannelClosed is on ISession. Cast is always safe here.
if (state is Session session)
{
    session.MarkChannelClosed();
}

Low — OUT-direction session scenario not covered by regression test

The test covers the case where the stale and fresh sessions are both ConnectionDirection.In. If the stale session was outgoing (peer.OutSession), the fix is symmetric — HasOpenSession(peer.OutSession) returns false when IsChannelClosed = true, so CanAttachSessionDirectly still returns true. Correctness holds, but this path has no targeted test case.


Correctness & Thread Safety ✓

  • volatile bool _isChannelClosed is correct for write-once semantics: immediate cross-thread visibility, no torn reads.
  • MarkChannelClosed() is idempotent.
  • HasOpenSession logic is correct for all state combinations (IsClosing × IsChannelClosed).
  • The outbound-connection selection loop (PeerManager.cs:536) guards against premature reconnect attempts via _peerPool.ActivePeers.TryGetValue(peer.Node.Id, out _), which holds true for the full 1-second delay — no spurious outbound connection is possible.
  • OnDisconnected correctly ignores the stale session's delayed disconnect via session-ID mismatch check (PeerManager.cs:1179), leaving the fresh session's peer state untouched.
  • The 1-second MarkDisconnectedAfterCloseDelay is preserved and fires correctly.

Overall

0 Critical, 0 High, 0 Medium. All prior findings are resolved or carry-over Lows. The fix is correct, thread-safe, and well-tested. Ready to merge.

@smartprogrammer93 smartprogrammer93 merged commit 93b3e3b into master May 4, 2026
445 checks passed
@smartprogrammer93 smartprogrammer93 deleted the fix-rlpx-stale-closed-session branch May 4, 2026 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nethermind RLPx session recovery can fail after a small malformed connection storm

4 participants