Skip to content

test(qwp): fix flaky port-bind race in WebSocket client tests#47

Open
puzpuzpuz wants to merge 1 commit into
mainfrom
fix-flaky-test-port-bind-race
Open

test(qwp): fix flaky port-bind race in WebSocket client tests#47
puzpuzpuz wants to merge 1 commit into
mainfrom
fix-flaky-test-port-bind-race

Conversation

@puzpuzpuz

@puzpuzpuz puzpuzpuz commented Jun 12, 2026

Copy link
Copy Markdown

Problem

The WebSocket client test suite intermittently failed with java.net.BindException: Address already in use, e.g.:

QwpQueryClientWalkTrackerTest.testWalk_426UpgradeRequiredIsTransportNotTerminal:114
  » Bind Address already in use   (at TestWebSocketServer.start)

The cause is a time-of-check-to-time-of-use port race. TestPorts.findUnusedPort() opened an ephemeral ServerSocket, read its port, then closed it — releasing the port. TestWebSocketServer.start() re-bound that port only later, so in the gap another process (or the very next findUnusedPort() call, since nothing held the port) could take it. The failure is timing- and load-dependent, so it shows up as a rare flake on busy CI runners rather than deterministically.

Fix

TestWebSocketServer now binds its loopback listener eagerly in the constructor, holds it for the server's whole lifetime, and exposes the OS-assigned port via getPort(). start() just launches the accept loop on the already-bound socket. Owning the port from allocation to teardown closes the race window entirely.

Call sites no longer pre-allocate a port: each test reads server.getPort() after constructing the server. findUnusedPort() remains only where a test points a client at a deliberately-dead endpoint (no server bound). The two raw ServerSocket fixtures that carried the same race (Always401Fixture, Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern; the other raw fixtures already used it. The change also removes two latent same-class hazards the migration surfaced: the sequential fixed-port allocPort() in DurableAckIntegrationTest and the port1 + 50 guess in CleanShutdownNoReplayTest.

Tradeoffs

Eager binding means the listener is accept-able at the TCP level from construction rather than from start(). For "server arrives late" tests this changes the cause of a pre-start() connection failure (upgrade timeout instead of
connection-refused) but not the outcome: there is no accept loop until start(), so the client still fails and retries, and the assertions are about end state. Those tests pass unchanged. A minor side effect is some benign Handshake failed server logs when a stale pre-start() connection is drained after the accept loop starts.

The diff is broad (19 files) because the helper is widely used, but each call-site change is mechanical: drop the pre-allocated port argument, read getPort() instead.

Test plan

  • Ran the full core test suite on the rebased branch: 2314 tests, 0 failures, 0 errors, 1 skipped
  • Confirmed the originally-failing QwpQueryClientWalkTrackerTest passes
  • Verified the timing-sensitive classes (InitialConnectAsyncTest, InitialConnectRetryTest, CloseDrainTest, RecoveryReplayTest, ReconnectTest) pass, including the "server arrives late" scenarios
  • Grepped to confirm no remaining new TestWebSocketServer(port, ...) calls and no raw ServerSocket bound to a pre-selected port

TestPorts.findUnusedPort() opened an ephemeral ServerSocket, read its
port, then closed it -- releasing the port. TestWebSocketServer.start()
re-bound that port only later, so another process (or the very next
findUnusedPort() call) could grab it in the gap, surfacing as a flaky
"BindException: Address already in use" on loaded CI runners.

TestWebSocketServer now binds its loopback listener eagerly in the
constructor, holds it for the server's whole lifetime, and exposes the
OS-assigned port via getPort(); start() just launches the accept loop on
the already-bound socket. Owning the port from allocation to teardown
closes the race window entirely.

Every test that starts a server now reads server.getPort() instead of
pre-allocating a port via findUnusedPort(); findUnusedPort() stays only
where a test points a client at a deliberately-dead endpoint. The two
raw ServerSocket fixtures that shared the same race (Always401Fixture,
Auth401AfterFirstConnectionFixture) adopt the same eager-bind pattern;
the other raw fixtures already used it. The change also removes two
latent same-class hazards the migration surfaced: the sequential
fixed-port allocPort() in DurableAckIntegrationTest and the port1 + 50
guess in CleanShutdownNoReplayTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@puzpuzpuz puzpuzpuz self-assigned this Jun 12, 2026
@puzpuzpuz puzpuzpuz added the bug Something isn't working label Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant