From 0559e6456bfc79946c839c83e831e55fb9a240c6 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 15:40:45 +0900 Subject: [PATCH] test(keyviz): replace fixed sleep with eventually in TestHotKeysAggregatorRaceFree CI flake observed on Actions run 26765510693 (PR #902 branch but unrelated to that PR -- the keyviz/ change is on main and surfaced there by chance): --- FAIL: TestHotKeysAggregatorRaceFree (0.11s) hot_keys_test.go:356: Expected value not to be nil. Root cause: same wall-clock racing pattern as the SQS-throttle and Redis-TTL flake series (PR #891, #903). The test: 1. Launches an aggregator goroutine (Step=5ms tick). 2. Spawns 8 workers each calling Observe 500 times. 3. Sleeps 50ms. 4. Asserts both per-route snapshots are non-nil. The aggregator publishes a snapshot only on a tick (or on ctx-done drain). Under -race on a slow CI runner, the Run goroutine itself can be slow to schedule -- the 50ms wait may not contain a single actual tick worth of scheduler time, so hotKeysSnap stays at its initial nil atomic.Pointer load and the assertion fires. Fix: require.Eventually with a 3-second budget + 5ms poll cadence, asserting both snapshots non-nil. The test still pins the load-bearing property (snapshot reads are lock-free; the atomic.Pointer.Load works without contention with the publisher); only the WAIT mechanism changes from "sleep blindly and hope" to "poll until ready or fail". Total time bound is at most 3 s on a slow runner, ~5-50 ms in the common case. Validation: go test ./keyviz/ -run TestHotKeysAggregatorRaceFree -race -count=5 -> ok 1.1s gofmt + golangci-lint -> 0 issues Caller audit: test-only file; no Go callers affected. The flake linage matches the structural fixes in PR #891/#903 (replace fixed-time-window sleep with require.Eventually); future flakes in this category should follow the same pattern. --- keyviz/hot_keys_test.go | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/keyviz/hot_keys_test.go b/keyviz/hot_keys_test.go index 78cb87113..869ee0763 100644 --- a/keyviz/hot_keys_test.go +++ b/keyviz/hot_keys_test.go @@ -348,13 +348,19 @@ func TestHotKeysAggregatorRaceFree(t *testing.T) { }(w) } wg.Wait() - // Let the aggregator drain + publish a few ticks. - time.Sleep(50 * time.Millisecond) - // Snapshot reads must be lock-free and consistent. - s1 := s.HotKeysSnapshot(1) - s2 := s.HotKeysSnapshot(2) - require.NotNil(t, s1) - require.NotNil(t, s2) + // Wait for the aggregator to drain the queue and publish at least + // one tick-driven snapshot for each route. The previous + // time.Sleep(50ms) raced the aggregator's tick (Step=5ms) on slow + // -race CI runners — the Run goroutine could be slow to schedule + // at all, leaving hotKeysSnap as its initial nil atomic.Pointer + // load by the time the assertions fired. require.Eventually with + // a 3-second budget + 5ms poll cadence rides out any scheduler + // jitter; snapshot reads themselves remain lock-free + // (atomic.Pointer.Load), which is what this test pins. Observed + // failure: Actions run 26765510693. + require.Eventually(t, func() bool { + return s.HotKeysSnapshot(1) != nil && s.HotKeysSnapshot(2) != nil + }, 3*time.Second, 5*time.Millisecond, "aggregator must publish a snapshot for each route within the budget") cancel() wgRun.Wait() }