test(keyviz): Phase 2-A bench coverage + concurrent-burst exact-count test#682
Conversation
📝 Walkthrough

This pull request adds comprehensive performance benchmarks and a concurrency correctness test to the keyviz MemSampler.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces a comprehensive suite of benchmarks and a concurrency test for the MemSampler to track performance characteristics like contention, route registration cost, and flush latency. Feedback was provided regarding the benchmark implementations: BenchmarkObserveParallel incorrectly causes contention by using a local variable for slot indexing across goroutines, BenchmarkRegisterRoute suffers from O(N²) complexity by scaling the data structure with the iteration count, and BenchmarkFlush includes significant setup overhead within the timed loop.
```go
b.RunParallel(func(pb *testing.PB) {
	var i uint64
	for pb.Next() {
		i++
		s.Observe((i%numRoutes)+1, OpWrite, 16, 64)
	}
})
```
The implementation of BenchmarkObserveParallel does not align with its description. The comment states that each goroutine hits a distinct slot to avoid contention, but since i is a local variable within the RunParallel closure, every goroutine starts at i=0 and iterates through the same sequence of slots. This results in high contention on the atomic counters of the first few slots, which likely masks the performance of the atomic.Pointer load and map lookup you intended to measure. To fix this, you can use an atomic counter to give each goroutine a unique starting offset. This aligns with the preference for using atomic operations to manage hot paths and reduce contention.
```go
var gid atomic.Uint64
b.RunParallel(func(pb *testing.PB) {
	offset := gid.Add(1)
	var i uint64
	for pb.Next() {
		s.Observe(((offset+i)%numRoutes)+1, OpWrite, 16, 64)
		i++
	}
})
```

References:
- For frequently accessed fields that require monotonic updates, prefer atomic operations over mutexes to improve performance on hot paths.
```go
func BenchmarkRegisterRoute(b *testing.B) {
	s := NewMemSampler(MemSamplerOptions{Step: time.Second, HistoryColumns: 4, MaxTrackedRoutes: b.N + 1})
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		id := uint64(i + 1) //nolint:gosec // i bounded by b.N
		if !s.RegisterRoute(id, []byte{byte(id)}, []byte{byte(id) + 1}) {
			b.Fatalf("RegisterRoute(%d) returned false at i=%d", id, i)
		}
	}
}
```
This benchmark uses b.N both as the number of iterations and as the scale of the data structure (via MaxTrackedRoutes and the number of registered routes). As b.N increases, each RegisterRoute call becomes more expensive due to the COW operation on an increasingly large table. This prevents the benchmark from reaching a stable ns/op value and makes the results dependent on the total execution time. It is better to benchmark the registration cost at specific, fixed table sizes (e.g., using sub-benchmarks) to ensure reproducible and meaningful results.
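One way to act on this suggestion is to benchmark at fixed table sizes so b.N controls only the iteration count. The sketch below is illustrative, not the PR's code: `toyTable` and `registerCOW` are hypothetical stand-ins for the real `MemSampler` COW route table, and `testing.Benchmark` is used so the pattern runs as a plain program.

```go
package main

import (
	"fmt"
	"testing"
)

// toyTable is a hypothetical stand-in for the sampler's immutable COW route
// table; the real MemSampler is not reproduced here.
type toyTable map[uint64]struct{}

// registerCOW copies the whole table and adds one route, mimicking the
// copy-on-write insert whose cost the benchmark wants to pin.
func registerCOW(t toyTable, id uint64) toyTable {
	next := make(toyTable, len(t)+1)
	for k, v := range t {
		next[k] = v
	}
	next[id] = struct{}{}
	return next
}

func main() {
	// Benchmark registration at a few fixed table sizes instead of letting the
	// table grow with b.N: ns/op is then stable and comparable across runs.
	for _, size := range []int{64, 256, 1024} {
		base := make(toyTable, size)
		for i := 0; i < size; i++ {
			base[uint64(i)] = struct{}{}
		}
		res := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				_ = registerCOW(base, uint64(size)) // COW at a constant size
			}
		})
		fmt.Printf("size=%d: %v\n", size, res)
	}
}
```

In the real benchmark each size would be a `b.Run` sub-benchmark calling `s.RegisterRoute` against a pre-loaded sampler; the point is that the table size stays fixed while `b.N` varies.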
```go
for i := 0; i < b.N; i++ {
	s.Flush()
	clk.Advance(time.Second)
	// Reseed so the next Flush still has counters to drain.
	for r := uint64(1); r <= numRoutes; r++ {
		s.Observe(r, OpWrite, 16, 64)
	}
}
```
The Observe loop used to reseed the sampler is included in the timed portion of the benchmark. While Observe is fast, calling it 1024 times per iteration adds significant overhead (roughly 20-25% based on your reported numbers) that obscures the actual cost of Flush. Consider using b.StopTimer() and b.StartTimer() around the reseeding loop to isolate the measurement of the Flush operation.
```go
for i := 0; i < b.N; i++ {
	s.Flush()
	clk.Advance(time.Second)
	b.StopTimer()
	// Reseed so the next Flush still has counters to drain.
	for r := uint64(1); r <= numRoutes; r++ {
		s.Observe(r, OpWrite, 16, 64)
	}
	b.StartTimer()
}
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7e9a97909
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
	}
	const expectedPerRoute = uint64(writersPerRoute * opsPerWriter)
	const expectedBytesPerRoute = expectedPerRoute * uint64(keyLen+valueLen)
	for _, row := range cols[0].Rows {
```
There was a problem hiding this comment.
Assert all routes are present in burst-count test
This test only validates rows that appear in cols[0].Rows, but it never checks that there is one row per registered route. Because Flush omits zero-value rows, a regression that drops all writes for one or more routes would simply omit those rows and still pass this test, weakening the stated “no counts lost” guarantee under concurrent burst.
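A sketch of that assertion, with a hypothetical `Row` shape standing in for the real snapshot types (field names here are assumptions, not the keyviz API):

```go
package main

import "fmt"

// Row mirrors the shape of a snapshot row for illustration only; the field
// names are assumptions, not the real keyviz types.
type Row struct {
	RouteID uint64
	Count   uint64
}

// checkAllRoutesPresent indexes rows by route ID and reports any registered
// route that is missing, so a Flush that silently drops a route's counts can
// no longer pass by simply omitting its (zero-value) row.
func checkAllRoutesPresent(rows []Row, registered []uint64) error {
	rowsByRoute := make(map[uint64]Row, len(rows))
	for _, row := range rows {
		rowsByRoute[row.RouteID] = row
	}
	for _, id := range registered {
		if _, ok := rowsByRoute[id]; !ok {
			return fmt.Errorf("route %d missing from snapshot rows", id)
		}
	}
	return nil
}

func main() {
	rows := []Row{{RouteID: 1, Count: 10}, {RouteID: 3, Count: 7}}
	fmt.Println(checkAllRoutesPresent(rows, []uint64{1, 3}))    // <nil>
	fmt.Println(checkAllRoutesPresent(rows, []uint64{1, 2, 3})) // route 2 missing
}
```

After the presence check passes, the per-route counter assertions can iterate over `registered` instead of over whatever rows happened to survive Flush.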
```go
var i uint64
for pb.Next() {
	i++
	s.Observe((i%numRoutes)+1, OpWrite, 16, 64)
}
```
There was a problem hiding this comment.
Use per-worker shard in ObserveParallel benchmark
The benchmark comments say each goroutine should hit a distinct slot, but i is local to each RunParallel worker and starts at zero, so workers generate the same route sequence and contend on the same counters. This measures cross-goroutine counter contention rather than the intended shared table.Load cost, which can skew the baseline and mask the specific regression this benchmark is meant to catch.
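A self-contained illustration of the disjoint-shard idea (worker counts and ops below are illustrative, not the PR's benchmark values): each goroutine claims a unique shard via an atomic counter before entering its loop, so no two of the first numRoutes/routesPerShard workers ever touch the same counters.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	numRoutes      = 256
	routesPerShard = 4
)

// runBurst launches workers that each claim a disjoint shard of route slots
// via an atomic counter, then returns the grand total across all counters.
func runBurst(workers int, opsPerWorker uint64) uint64 {
	var nextShard atomic.Uint64
	counters := make([]atomic.Uint64, numRoutes)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Claim a shard up front; workers walk disjoint counter ranges.
			shardIndex := nextShard.Add(1) - 1
			shardBase := (shardIndex * routesPerShard) % numRoutes
			for i := uint64(0); i < opsPerWorker; i++ {
				counters[shardBase+(i%routesPerShard)].Add(1)
			}
		}()
	}
	wg.Wait()
	var total uint64
	for i := range counters {
		total += counters[i].Load()
	}
	return total
}

func main() {
	fmt.Println(runBurst(8, 1000)) // 8 workers × 1000 ops each → 8000
}
```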
…ount test

Closes the "Benchmark gate green" half of the Phase 2-A exit criteria in docs/admin_ui_key_visualizer_design.md §12. Adds four benchmarks that pin the cost shape of the keyviz package:

- BenchmarkObserveParallel — contention profile of the hot path. Each goroutine hits a distinct slot, so any future change that introduces a shared mutex on Observe shows up as a sharp drop in ns/op.
- BenchmarkRegisterRoute — COW route-table mutation cost; flat ns/op (vs growing-with-N) would mean the immutable-table invariant has been silently traded for a shared mutable map.
- BenchmarkFlush — per-step drain over 1024 live routes; pins the swap-and-publish cost and catches an O(N^2) scan regression.
- BenchmarkSnapshot — read-side pivot over 1024 routes x 64 columns; the dominant term in the admin handler's response latency.

Local Apple M1 Max baselines (200 ms benchtime):

- ObserveParallel: 20.6 ns/op, 0 allocs
- RegisterRoute: 123.9 us/op, 24 allocs
- Flush: 96.4 us/op
- Snapshot: 1.86 ms/op

Adds TestObserveExactCountUnderConcurrentBurst — 32 routes x 8 writers x 4000 ops, all gated on a sync.WaitGroup so every goroutine starts simultaneously. The test pins the Phase 2-A invariant that no counts are lost under concurrent burst (sampleRate is 1 today, every Observe must be reflected exactly in the post-Flush Snapshot). When the design §5.2 sub-sampling estimator lands, this test must be updated alongside the new +/-5% / 95% CI assertion.

Pure-test addition; no production code changes.
Four items from the round-1 reviews:

1. Codex P1: TestObserveExactCountUnderConcurrentBurst only validated rows that appeared in cols[0].Rows. A future Flush regression that silently dropped a registered route would still pass — the loop would just iterate fewer times. Build a route ID -> row index and assert every registered route is present, then validate counters.
2. Gemini medium / Codex P2: BenchmarkObserveParallel claimed each goroutine hit a distinct slot, but the per-worker counter started at zero so all workers walked the same route sequence in lockstep. That defeated the "no contention" claim. Use atomic.Add to assign each worker a disjoint shard of route IDs (4 routes per shard, shardIndex * 4 base offset modulo numRoutes).
3. Gemini medium: BenchmarkRegisterRoute used b.N as both the iteration count AND the table size, producing O(N^2) total work. Go reports ns/op = total / b.N, so a constant-per-call cost read as growing with N. Use a fixed table size (1024 pre-loaded) and churn one ID in/out per iteration so b.N controls only the iteration count.
4. Gemini medium: BenchmarkFlush included the per-iteration reseed loop in the timed range, inflating ns/op by ~25% and letting a real Flush regression hide behind Observe variance. Bracket the reseed with b.StopTimer / b.StartTimer.

New M1 Max baselines (200 ms benchtime):

- ObserveParallel: 4.4 ns/op (was 20.6 — workers now genuinely parallel)
- RegisterRoute: 162.2 us/op (constant; was reading as growing-with-N)
- Flush: 82.3 us/op (was 96.4 — reseed excluded)
- Snapshot: 1.92 ms/op (unchanged; was timed correctly already)
a7e9a97 to 19abb41
@claude review Round-1 follow-ups (19abb41) addressing all 4 items from Gemini + Codex on PR #682:
New M1 Max baselines (200 ms benchtime):
Claude finished @bootjp's task in 2m 12s — Round-2 Review (19abb41)
All four fixes are correctly implemented. Two items worth addressing before merge. P1 —
| Fix | Status | Notes |
|---|---|---|
| Codex P1 / route presence | ✅ Correct | rowsByRoute map + per-route ok check covers the exact regression Codex identified |
| Codex P2 / parallel shard | ✅ Correct (GOMAXPROCS ≤ 16) | atomic.Add assigns disjoint shardBase per worker; see P2 above for the high-core edge case |
| Gemini / RegisterRoute O(N²) | ✅ Correct | Pre-load 1024, MaxTrackedRoutes = 1025, churn one ID per iteration. RegisterRoute checks len(slots) < MaxTrackedRoutes so first-iteration add (table at 1024) succeeds cleanly |
| Gemini / Flush timer isolation | ✅ Correct | b.StopTimer / b.StartTimer bracket the reseed loop; clock advance is also inside the stopped window so each Flush sees a fresh time window |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 19abb415cc
```go
shardIndex := nextShard.Add(1) - 1
shardBase := (shardIndex * routesPerShard) % numRoutes
```
Size benchmark shards to parallel worker count
BenchmarkObserveParallel still reintroduces cross-goroutine counter contention when RunParallel launches more than 16 workers (its default is GOMAXPROCS): with numRoutes=64 and routesPerShard=4, shardBase := (shardIndex * routesPerShard) % numRoutes wraps and assigns multiple workers to the same 4-route shard. In that environment, the benchmark no longer isolates the shared table.Load cost it claims to pin, so ns/op becomes host-core-count dependent and can mask/blur hot-path regressions.
🧹 Nitpick comments (2)

keyviz/sampler_test.go (2)

1145-1168: `retiredSlots` grows unboundedly across `b.N` iterations.

`RemoveRoute` enqueues the slot onto `s.retiredSlots` and the bench never calls `Flush` to drain it, so for typical `b.N` (millions) the queue accumulates millions of entries. That doesn't change the per-iteration COW cost you're trying to pin, but it will inflate the `allocs/op` and `B/op` figures with the slice-growth amortization, and any future change that traverses `retiredSlots` during `RegisterRoute` would silently turn this benchmark superlinear.

Consider periodically flushing (e.g. every iteration with `b.StopTimer`/`StartTimer` around it, similar to the pattern in `BenchmarkFlush`) so the retired queue stays bounded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler_test.go` around lines 1145 - 1168, The benchmark BenchmarkRegisterRoute lets s.retiredSlots grow unbounded because RemoveRoute enqueues slots but the test never calls Flush; modify the benchmark to periodically drain retiredSlots (e.g., call s.Flush) while excluding flush time from measurement by surrounding it with b.StopTimer() / b.StartTimer(), or call Flush every iteration with the same stop/start pattern (similar to BenchmarkFlush) so retiredSlots stays bounded and allocs/op/B/op reflect the intended RegisterRoute costs; locate the benchmark function BenchmarkRegisterRoute and add the timed/un-timed Flush calls around the chosen flush cadence.
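A minimal sketch of the suggested cadence, assuming a toy `work`/`drain` pair in place of the real `RegisterRoute`/`RemoveRoute` churn and `Flush`; `testing.Benchmark` lets the pattern run as a plain program:

```go
package main

import (
	"fmt"
	"testing"
)

// work and drain are hypothetical stand-ins: work plays the role of the
// RegisterRoute/RemoveRoute churn that enqueues retired slots, drain the role
// of s.Flush emptying the retired-slot queue.
var retired []int

func work(i int) { retired = append(retired, i) }
func drain()     { retired = retired[:0] }

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			work(i) // timed: the per-iteration cost we want to pin
			if i%1024 == 1023 {
				// Untimed: keep the retired queue bounded without letting
				// the drain cost leak into ns/op.
				b.StopTimer()
				drain()
				b.StartTimer()
			}
		}
	})
	fmt.Println(res)
}
```

The 1024-iteration cadence is arbitrary; any fixed period keeps the queue (and its `allocs/op` amortization) bounded while the timer only ever sees `work`.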
1110-1133: Shard mapping collides once `GOMAXPROCS > 16`, defeating the "distinct slot per goroutine" claim.

`(shardIndex * routesPerShard) % numRoutes` with `numRoutes=64, routesPerShard=4` produces only `numRoutes/routesPerShard = 16` distinct `shardBase` values (0, 4, 8, …, 60). On hosts where `b.RunParallel` spawns more than 16 workers (e.g. 24/32-core CI runners, EPYC dev boxes), the 17th+ goroutine wraps to `shardBase=0` and contends on the same `atomic.AddUint64` counters as worker 0 — exactly the false-sharing scenario the comment above promises is avoided. The bench would silently report contention numbers as the steady-state hot-path cost on those machines.

Either bump `numRoutes` (and the sampler's `MaxTrackedRoutes`) so the shard space exceeds any plausible `GOMAXPROCS`, or size it from `runtime.GOMAXPROCS(0)` at setup time.

♻️ One way to make the disjointness explicit:

```diff
-func BenchmarkObserveParallel(b *testing.B) {
-	const (
-		numRoutes      = 64
-		routesPerShard = 4
-	)
-	s := NewMemSampler(MemSamplerOptions{Step: time.Second, HistoryColumns: 4, MaxTrackedRoutes: numRoutes})
+func BenchmarkObserveParallel(b *testing.B) {
+	const routesPerShard = 4
+	// Ensure numRoutes/routesPerShard >= any plausible GOMAXPROCS so each
+	// pb worker maps to a disjoint shard even on large CI runners.
+	numRoutes := routesPerShard * runtime.GOMAXPROCS(0) * 4
+	s := NewMemSampler(MemSamplerOptions{Step: time.Second, HistoryColumns: 4, MaxTrackedRoutes: numRoutes})
 	for r := uint64(1); r <= numRoutes; r++ {
 		if !s.RegisterRoute(r, []byte{byte(r)}, []byte{byte(r) + 1}) {
 			b.Fatalf("RegisterRoute(%d) returned false", r)
 		}
 	}
 	var nextShard atomic.Uint64
 	b.ReportAllocs()
 	b.ResetTimer()
 	b.RunParallel(func(pb *testing.PB) {
 		shardIndex := nextShard.Add(1) - 1
-		shardBase := (shardIndex * routesPerShard) % numRoutes
+		shardBase := (shardIndex * routesPerShard) % uint64(numRoutes)
 		var i uint64
 		for pb.Next() {
 			s.Observe(shardBase+(i%routesPerShard)+1, OpWrite, 16, 64)
 			i++
 		}
 	})
 }
```

(Add `"runtime"` to the import block if you take this path.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler_test.go` around lines 1110 - 1133, BenchmarkObserveParallel's shardBase calculation (shardIndex * routesPerShard) % numRoutes only yields 16 distinct shards with numRoutes=64 and routesPerShard=4, so when b.RunParallel spawns >16 workers goroutines wrap and contend; fix by ensuring the shard space exceeds possible GOMAXPROCS: compute numRoutes based on runtime.GOMAXPROCS(0) (e.g., numRoutes := max(64, routesPerShard * runtime.GOMAXPROCS(0))) or bump numRoutes and the sampler MaxTrackedRoutes accordingly, and pass that value into NewMemSampler(MemSamplerOptions{... MaxTrackedRoutes: numRoutes}) so each parallel worker gets a distinct shardBase in BenchmarkObserveParallel.
Two items from Claude bot's round-2 on 19abb41:

- P1: drop the //nolint:gosec in runConcurrentBurst. Per CLAUDE.md ("avoid //nolint - refactor instead"), change the helper signature to take numRoutes as uint64 and iterate in that type from the start, eliminating the int->uint64 conversion that triggered gosec G115.
- P2: bump BenchmarkObserveParallel's numRoutes from 64 to 256 so numRoutes / routesPerShard >= 64 disjoint shards. With 16 shards, RunParallel on a host with GOMAXPROCS > 16 wrapped multiple workers onto the same 4-route shard, silently reintroducing the shared-counter contention this benchmark exists to catch. 256 keeps the benchmark meaningful up to GOMAXPROCS = 64 without changing its character.
@claude review Round-2 follow-ups (3310a1a):
Claude finished @bootjp's task in 3m 3s — Round-3 Review (3310a1a)
Both round-2 items are correctly implemented. The PR is clean. P1 —
| Item | Status |
|---|---|
| P1 — no //nolint:gosec in added code | ✅ |
| P2 — numRoutes=256, 64 disjoint shards | ✅ |
| Codex P1 — route presence check via rowsByRoute map | ✅ (unchanged from round-2) |
| Gemini — RegisterRoute O(N²) fixed | ✅ (unchanged from round-2) |
| Gemini — Flush timer isolation | ✅ (unchanged from round-2) |
Ready to merge from a test-quality standpoint.
Summary

Closes the "Benchmark gate green" half of the Phase 2-A exit criteria in docs/admin_ui_key_visualizer_design.md §12. Pure-test addition; no production code changes.

Four new benchmarks pin the cost shape of the `keyviz` package:

- `BenchmarkObserveParallel` — hot-path contention profile (the shared `atomic.Pointer.Load` cost)
- `BenchmarkRegisterRoute` — COW route-table mutation cost
- `BenchmarkFlush` — per-step drain cost
- `BenchmarkSnapshot` — read-side pivot cost

`TestObserveExactCountUnderConcurrentBurst` — 32 routes × 8 writers × 4000 ops gated on a `sync.WaitGroup` so every goroutine starts simultaneously. Pins the Phase 2-A invariant that no counts are lost under concurrent burst: `sampleRate = 1` today, so every `Observe` must be reflected exactly in the post-Flush `Snapshot`. When the design §5.2 sub-sampling estimator lands, this test must be updated alongside the new ±5% / 95%-CI assertion.

Why not the ±5% accuracy SLO test?

Per design §5.2, the SLO is for the Horvitz–Thompson estimator that runs when the adaptive controller raises `sampleRate` above 1 under burst. That controller is not yet implemented — the current code path always counts every Observe — so a ±5% / 95%-CI assertion would be testing the absence of a feature. This PR pins the current "exact counting" invariant instead; the SLO test lands with the sub-sampling implementation in a follow-up that updates §5.2 to reflect the chosen controller shape.

Five-lens self-review

Clean under the race detector (`go test -race`).

Test plan

- `go test -race -count=1 ./keyviz/...` — clean
- `go test -bench='Benchmark(ObserveParallel|RegisterRoute|Flush|Snapshot)$' -benchtime=200ms -run='^$' ./keyviz/...` — runs and reports the numbers above
- `golangci-lint run ./keyviz/...` — clean