Fix close/listen race in PlainNettyTransport that leaks listener ports#475
Open
codexcoder21 wants to merge 1 commit into
Open
Conversation
A concurrent `close()` racing an in-flight `listen()` on the same transport
can leak the listen socket — the OS file descriptor stays open after both
calls return, so the next bind to the same port from this JVM fails with
`BindException: Address already in use`.
## Root cause
`listen()` writes the listener channel into the `listeners` map under
`synchronized(this@PlainNettyTransport)`, but AFTER calling
`listener.bind(addr)`. At that point the bind task is already on the boss
event loop queue (so the port WILL be bound once the event loop drains the
queue). `close()` reads the same map WITHOUT acquiring the transport
monitor.
In the microsecond window between `bind()` being submitted and the
synchronized `listeners += ...` running, a concurrent `close()`:
1. observes an empty `listeners` map and schedules no channel close;
2. proceeds straight to `shutdownGracefully()`;
3. — but Netty's event loop still runs the queued bind task before the
loop terminates (the queue is drained on shutdown), so the port DOES
get bound; and
4. nothing ever calls `close()` on the now-orphaned listener channel.
The OS file descriptor stays open for the lifetime of the JVM, and every
subsequent bind to the same port from this process fails.
## Fix
- `close()` snapshots `listeners` and `channels` under
`synchronized(this@PlainNettyTransport)`, the same monitor that
`listen()` / `dial()` / `registerChannel()` use to mutate them.
- `listen()` holds that monitor across the entire closed-check ->
`listener.bind(addr)` -> `listeners += ...` sequence so the map state is
never "bind submitted but listener not recorded".
- `registerChannel()` performs its `closed`-check and `channels += ch`
under one acquire so the dial path has the same atomicity.
- `closed` is marked `@Volatile`. The synchronized blocks already give us
happens-before; the annotation is for the cheap, lock-free reads that
still occur inside the dial fast-path before entering `registerChannel`.
The reorder honors the existing constraint that `bind()` is called on the
calling thread (no Netty callback into the transport during bind), so
holding the transport monitor across the bind() submission is safe — bind
is microseconds of work to queue the task on the boss event loop, and the
actual socket bind happens on the event loop where this monitor is never
acquired.
## Regression test
`PlainNettyTransportConcurrentListenCloseTest`'s "parallel listen and close
from separate threads" variant launches `listen()` and `close()` on two
threads released by a `CountDownLatch` barrier and verifies the listen
port is bindable after both return. Against the unfixed code the test
deterministically leaks the port by the 3rd iteration (198/200 iterations
failing); against the fix it passes consistently across 200 iterations.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A concurrent
close()racing an in-flightlisten()on the same transport can leak the listen socket. The OS file descriptor stays open after both calls return, so the next bind to the same port from this JVM fails withBindException: Address already in use— for the lifetime of the JVM.Root cause
listen()writes the listener channel into thelistenersmap undersynchronized(this@PlainNettyTransport), but AFTER callinglistener.bind(addr). At that point the bind task is already on the boss event loop queue, so the port WILL be bound once the event loop drains it.close()reads the same map WITHOUT acquiring the transport monitor.In the microsecond window between
bind()being submitted and the synchronizedlisteners += ...running, a concurrentclose():listenersmap and schedules no channel close;shutdownGracefully();close()on the now-orphaned listener channel.The OS file descriptor stays open for the lifetime of the JVM, and every subsequent bind to the same port from this process fails.
Fix
close()snapshotslistenersandchannelsundersynchronized(this@PlainNettyTransport), the same monitor thatlisten()/dial()/registerChannel()use to mutate them.listen()holds that monitor across the entireclosed-check →listener.bind(addr)→listeners += ...sequence so the map state is never "bind submitted but listener not recorded".registerChannel()performs itsclosed-check andchannels += chunder one acquire so the dial path has the same atomicity.closedis marked@Volatilefor the cheap pre-acquire read in the dial fast path.The reorder is safe because
bind()is called on the calling thread and only queues the actual socket bind for the event loop — bind is microseconds of work to submit the task, and the event loop never acquires the transport monitor, so holding it across the bind submission cannot deadlock.Regression test
PlainNettyTransportConcurrentListenCloseTestlauncheslisten()andclose()from separate threads released by aCountDownLatchbarrier and verifies the listen port is bindable after both return. Against the unfixed code the test deterministically leaks the port by the 3rd iteration (198/200 iterations fail withBindException: Address already in use); against the fix it passes consistently across 200 iterations.Test plan
./gradlew spotlessCheckand./gradlew :libp2p:detektpassPlainNettyTransportConcurrentListenCloseTestpasses consistentlyPlainNettyTransport*test suites passDiscovery context
The race was caught downstream by a
UrlResolverproject where many short-lived libp2p hosts get created and torn down on a fixed listen port. On slow / CPU-contended CI it surfaced as "listen port N not released within 2 seconds of host.stop()" with cascading "Address already in use" on every subsequent iteration. The same fix has been validated end-to-end there.