fix: Eliminate flaky parallel test failures caused by port collisions and timing races#558
Merged
hoffmaen merged 2 commits intocloudfoundry:developfrom Apr 27, 2026
Conversation
0aea76d to
0d41dbe
Compare
7fe6d5b to
65b0f07
Compare
403e2ad to
b5fc538
Compare
b5fc538 to
ae2d781
Compare
34c0c21 to
58acbe5
Compare
- Assign each Ginkgo parallel process a dedicated port range in [61000,65534] (above the Linux ephemeral range) to prevent cross-process port collisions - Add ReservePort/ReleasePort/ReleaseAllPorts to hold ports open until external processes (gorouter, NATS) bind them, eliminating the TOCTOU race - Apply correct reserve→release→start ordering across all 6 NATS call sites and all gorouter exec.Command launch sites - Fix flaky timing assertions in 4 test files by bracketing with before/after timestamps instead of comparing to time.Now() after the fact - Fix HTTP 100 Continue test: replace NewResponse(100) with a header-less http.Response literal to avoid spurious Connection: close on the proxy - Stop gorouter before NATS in AfterEach to prevent log.Fatal → os.Exit(1) killing the test binary before Ginkgo captures the result fix: Fix additional findings
8ec3e04 to
7031916
Compare
b1tamara
approved these changes
Apr 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Eliminates a class of flaky parallel test failures in the gorouter test suite.
Port allocation overhaul (
test_util/ports.go)net.Listen(":0")(kernel-assigned ephemeral port, immediately closed) with a dedicated per-Ginkgo-process port range in[61000, 65534]— above the Linux default ephemeral range (32768–60999) so the OS never steals ports between allocation andlisten().ReservePort()/ReleasePort()/ReleaseAllPorts()to hold a listener open until an external process (gorouter binary, NATS server) is ready to bind, eliminating the TOCTOU race.NextAvailPort()for in-process bindings;ReservePort()+ReleasePort()/ReleaseAllPorts()for external-process bindings.Applied across all affected test suites
integration/— all ports for gorouter and NATS binaries now useReservePort();ReleaseAllPorts()called just before everyexec.Command(gorouterPath, ...)launch, including raw launch sites that bypassed the helper functions.router/router_test.go,router_drain_test.go— NATS port usesReservePort(); in-process router ports remainNextAvailPort().mbus/subscriber_test.go,common/component_test.go— NATS port usesReservePort().NewNATSRunner→ReleasePort→Startordering enforced in all 6 call sites.test/common/app.goListen()andTlsListen()now callnet.Listen/tls.Listeneagerly before handing off tohttp.Server.Serve(), so the port is bound immediately (eliminates a secondary TOCTOU window for test app backends).Flaky timing assertions fixed (4 test files)
proxy/round_tripper/proxy_round_tripper_test.go—time.Now()was captured afterRoundTrip()+ channel receive; deadline comparison used ±11 ms / ±6 ms tolerance. Fixed by capturing the reference time before the call and widening tolerance to the full timeout value.handlers/requestinfo_test.go,handlers/reporter_test.go,proxy/proxy_test.go— same pattern; replaced with precisebefore ≤ timestamp ≤ afterbrackets.HTTP 100 Continue test fix (
proxy/proxy_test.go)test_util.NewResponse()always setsConnection: close, which is correct for final responses but wrong for 100 Continue — the proxy would close the client connection after forwarding the header, causing a rareunexpected EOFon the subsequent read. ReplacedNewResponse(100)with a plainhttp.Responseliteral (no headers) in both affected backend handlers.Other
logger/logger.go— addedfmt.Fprintf(os.Stderr, ...)beforeos.Exit(1)inFatal()so the message is guaranteed to reach stderr even if the slog handler's buffer hasn't flushed.6705/6706ingdpr_test.gowithReservePort().AfterEach/StopAndCleanupblocks to prevent the subscriber'sClosedCBfrom callinglog.Fatal→os.Exit(1)and killing the test binary before Ginkgo can capture the result.Backward Compatibility
Breaking Change? No
Note on AI usage
Parts of this code and tests were developed with assistance from Claude Code (claude-opus-4-20250514).