
test(keyviz): Phase 2-A bench coverage + concurrent-burst exact-count test #682

Merged
bootjp merged 3 commits into main from feat/keyviz-bench-accuracy
Apr 27, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 26, 2026

Summary

Closes the "Benchmark gate green" half of the Phase 2-A exit criteria in docs/admin_ui_key_visualizer_design.md §12. Pure-test addition; no production code changes.

Four new benchmarks pin the cost shape of the keyviz package:

| Benchmark | What it pins | Apple M1 Max @ 200 ms |
| --- | --- | --- |
| BenchmarkObserveParallel | Hot-path contention — distinct slot per goroutine; the only shared work is the table atomic.Pointer.Load | 20.6 ns/op, 0 allocs |
| BenchmarkRegisterRoute | COW route-table mutation cost; flat ns/op would signal a shift to a shared mutable map | 123.9 µs/op, 24 allocs |
| BenchmarkFlush | Per-step drain over 1024 live routes; catches O(N²) scan regressions | 96.4 µs/op |
| BenchmarkSnapshot | Read-side pivot over 1024 routes × 64 columns; the dominant term in admin handler latency | 1.86 ms/op |

TestObserveExactCountUnderConcurrentBurst — 32 routes × 8 writers × 4000 ops gated on a sync.WaitGroup so every goroutine starts simultaneously. Pins the Phase 2-A invariant that no counts are lost under concurrent burst: sampleRate = 1 today, so every Observe must be reflected exactly in the post-Flush Snapshot. When the design §5.2 sub-sampling estimator lands, this test must be updated alongside the new ±5% / 95%-CI assertion.

Why not the ±5% accuracy SLO test?

Per design §5.2, the SLO is for the Horvitz–Thompson estimator that runs when the adaptive controller raises sampleRate above 1 under burst. That controller is not yet implemented — the current code path always counts every Observe — so a ±5% / 95%-CI assertion would be testing the absence of a feature. This PR pins the current "exact counting" invariant instead; the SLO test lands with the sub-sampling implementation in a follow-up that updates §5.2 to reflect the chosen controller shape.

Five-lens self-review

  1. Data loss — n/a; tests only.
  2. Concurrency / distributed — the burst test is the concurrency check; with race detector on (go test -race) it's clean.
  3. Performance — adds CI-runnable benchmarks; baseline ns/op recorded above.
  4. Data consistency — pins the exact-counting invariant explicitly so a regression that loses counts on the hot path is caught.
  5. Test coverage — the burst test plus four benchmarks are the new coverage. Existing tests unchanged.

Test plan

  • go test -race -count=1 ./keyviz/... — clean
  • go test -bench='Benchmark(ObserveParallel|RegisterRoute|Flush|Snapshot)$' -benchtime=200ms -run='^$' ./keyviz/... — runs and reports the numbers above
  • golangci-lint run ./keyviz/... — clean

Summary by CodeRabbit

  • Tests
    • Added performance benchmarks for concurrent operations, route registration, flush operations, and snapshot functionality
    • Added concurrency correctness test to verify data integrity and accurate counting under concurrent load

@coderabbitai

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 49 minutes and 39 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b11036e5-6e60-4dd1-bfc1-604d7cd6897f

📥 Commits

Reviewing files that changed from the base of the PR and between 19abb41 and 3310a1a.

📒 Files selected for processing (1)
  • keyviz/sampler_test.go
📝 Walkthrough

This pull request adds comprehensive performance benchmarks and a concurrency correctness test to keyviz/sampler_test.go. The additions cover hot-path Observe operations, route registration churn, flush costs, snapshot performance, and concurrent operation safety.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Performance Benchmarks (keyviz/sampler_test.go) | Added four benchmarks targeting distinct operations: BenchmarkObserveParallel (sharded concurrent Observe calls), BenchmarkRegisterRoute (route registration churn), BenchmarkFlush (flush cost isolation), and BenchmarkSnapshot (snapshot allocation overhead). Each isolates specific measurement concerns using timer control and preloading strategies. |
| Concurrency Correctness (keyviz/sampler_test.go) | Added TestObserveExactCountUnderConcurrentBurst with a runConcurrentBurst helper to verify that concurrent Observe calls produce correct snapshot state: exactly one column, all registered routes present, and exact per-route counts and bytes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately reflects the main changes: adding Phase 2-A benchmark coverage and a concurrent-burst exact-count test to the keyviz package. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |
| Linked Issues Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive suite of benchmarks and a concurrency test for the MemSampler to track performance characteristics like contention, route registration cost, and flush latency. Feedback was provided regarding the benchmark implementations: BenchmarkObserveParallel incorrectly causes contention by using a local variable for slot indexing across goroutines, BenchmarkRegisterRoute suffers from O(N²) complexity by scaling the data structure with the iteration count, and BenchmarkFlush includes significant setup overhead within the timed loop.

Comment thread keyviz/sampler_test.go
Comment on lines +1113 to +1119
b.RunParallel(func(pb *testing.PB) {
var i uint64
for pb.Next() {
i++
s.Observe((i%numRoutes)+1, OpWrite, 16, 64)
}
})
Contributor


medium

The implementation of BenchmarkObserveParallel does not align with its description. The comment states that each goroutine hits a distinct slot to avoid contention, but since i is a local variable within the RunParallel closure, every goroutine starts at i=0 and iterates through the same sequence of slots. This results in high contention on the atomic counters of the first few slots, which likely masks the performance of the atomic.Pointer load and map lookup you intended to measure. To fix this, you can use an atomic counter to give each goroutine a unique starting offset. This aligns with the preference for using atomic operations to manage hot paths and reduce contention.

	var gid atomic.Uint64
	b.RunParallel(func(pb *testing.PB) {
		offset := gid.Add(1)
		var i uint64
		for pb.Next() {
			s.Observe(((offset+i)%numRoutes)+1, OpWrite, 16, 64)
			i++
		}
	})
References
  1. For frequently accessed fields that require monotonic updates, prefer atomic operations over mutexes to improve performance on hot paths.

Comment thread keyviz/sampler_test.go
Comment on lines +1128 to +1138
func BenchmarkRegisterRoute(b *testing.B) {
s := NewMemSampler(MemSamplerOptions{Step: time.Second, HistoryColumns: 4, MaxTrackedRoutes: b.N + 1})
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
id := uint64(i + 1) //nolint:gosec // i bounded by b.N
if !s.RegisterRoute(id, []byte{byte(id)}, []byte{byte(id) + 1}) {
b.Fatalf("RegisterRoute(%d) returned false at i=%d", id, i)
}
}
}
Contributor


medium

This benchmark has $O(N^2)$ complexity because it uses b.N both for the number of iterations and as the scale of the data structure (via MaxTrackedRoutes and the number of registered routes). As b.N increases, each RegisterRoute call becomes more expensive due to the COW operation on an increasingly large table. This prevents the benchmark from reaching a stable ns/op value and makes the results dependent on the total execution time. It is better to benchmark the registration cost at specific, fixed table sizes (e.g., using sub-benchmarks) to ensure reproducible and meaningful results.

Comment thread keyviz/sampler_test.go
Comment on lines +1164 to +1171
for i := 0; i < b.N; i++ {
s.Flush()
clk.Advance(time.Second)
// Reseed so the next Flush still has counters to drain.
for r := uint64(1); r <= numRoutes; r++ {
s.Observe(r, OpWrite, 16, 64)
}
}
Contributor


medium

The Observe loop used to reseed the sampler is included in the timed portion of the benchmark. While Observe is fast, calling it 1024 times per iteration adds significant overhead (roughly 20-25% based on your reported numbers) that obscures the actual cost of Flush. Consider using b.StopTimer() and b.StartTimer() around the reseeding loop to isolate the measurement of the Flush operation.

	for i := 0; i < b.N; i++ {
		s.Flush()
		clk.Advance(time.Second)
		b.StopTimer()
		// Reseed so the next Flush still has counters to drain.
		for r := uint64(1); r <= numRoutes; r++ {
			s.Observe(r, OpWrite, 16, 64)
		}
		b.StartTimer()
	}


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7e9a97909

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread keyviz/sampler_test.go
}
const expectedPerRoute = uint64(writersPerRoute * opsPerWriter)
const expectedBytesPerRoute = expectedPerRoute * uint64(keyLen+valueLen)
for _, row := range cols[0].Rows {

P1: Assert all routes are present in burst-count test

This test only validates rows that appear in cols[0].Rows, but it never checks that there is one row per registered route. Because Flush omits zero-value rows, a regression that drops all writes for one or more routes would simply omit those rows and still pass this test, weakening the stated “no counts lost” guarantee under concurrent burst.


Comment thread keyviz/sampler_test.go Outdated
Comment on lines +1114 to +1117
var i uint64
for pb.Next() {
i++
s.Observe((i%numRoutes)+1, OpWrite, 16, 64)

P2: Use per-worker shard in ObserveParallel benchmark

The benchmark comments say each goroutine should hit a distinct slot, but i is local to each RunParallel worker and starts at zero, so workers generate the same route sequence and contend on the same counters. This measures cross-goroutine counter contention rather than the intended shared table.Load cost, which can skew the baseline and mask the specific regression this benchmark is meant to catch.


bootjp added 2 commits April 27, 2026 17:55
…ount test

Closes the "Benchmark gate green" half of the Phase 2-A exit criteria
in docs/admin_ui_key_visualizer_design.md §12.

Adds four benchmarks that pin the cost shape of the keyviz package:

- BenchmarkObserveParallel — contention profile of the hot path. Each
  goroutine hits a distinct slot, so any future change that introduces
  a shared mutex on Observe shows up as a sharp drop in ns/op.
- BenchmarkRegisterRoute — COW route-table mutation cost; flat ns/op
  (vs growing-with-N) would mean the immutable-table invariant has
  been silently traded for a shared mutable map.
- BenchmarkFlush — per-step drain over 1024 live routes; pins the
  swap-and-publish cost and catches an O(N^2) scan regression.
- BenchmarkSnapshot — read-side pivot over 1024 routes x 64 columns;
  the dominant term in the admin handler's response latency.

Local Apple M1 Max baselines (200 ms benchtime):
- ObserveParallel: 20.6 ns/op, 0 allocs
- RegisterRoute:   123.9 us/op, 24 allocs
- Flush:           96.4 us/op
- Snapshot:        1.86 ms/op

Adds TestObserveExactCountUnderConcurrentBurst — 32 routes x 8
writers x 4000 ops, all gated on a sync.WaitGroup so every goroutine
starts simultaneously. The test pins the Phase 2-A invariant that no
counts are lost under concurrent burst (sampleRate is 1 today, every
Observe must be reflected exactly in the post-Flush Snapshot). When
the design 5.2 sub-sampling estimator lands, this test must be
updated alongside the new +/-5% / 95% CI assertion.

Pure-test addition; no production code changes.
Four items from the round-1 reviews:

1. Codex P1: TestObserveExactCountUnderConcurrentBurst only validated
   rows that appeared in cols[0].Rows. A future Flush regression that
   silently dropped a registered route would still pass — the loop
   would just iterate fewer times. Build a route ID -> row index and
   assert every registered route is present, then validate counters.

2. Gemini medium / Codex P2: BenchmarkObserveParallel claimed each
   goroutine hit a distinct slot, but the per-worker counter started
   at zero so all workers walked the same route sequence in lockstep.
   That defeated the "no contention" claim. Use atomic.Add to assign
   each worker a disjoint shard of route IDs (4 routes per shard,
   shardIndex * 4 base offset modulo numRoutes).

3. Gemini medium: BenchmarkRegisterRoute used b.N as both the
   iteration count AND the table size, producing O(N^2) total work.
   Go reports ns/op = total / b.N, so a constant-per-call cost read
   as growing with N. Use a fixed table size (1024 pre-loaded) and
   churn one ID in/out per iteration so b.N controls only the
   iteration count.

4. Gemini medium: BenchmarkFlush included the per-iteration reseed
   loop in the timed range, inflating ns/op by ~25% and letting a
   real Flush regression hide behind Observe variance. Bracket the
   reseed with b.StopTimer / b.StartTimer.

New M1 Max baselines (200 ms benchtime):
- ObserveParallel:  4.4 ns/op (was 20.6 — workers now genuinely parallel)
- RegisterRoute:  162.2 us/op (constant; was reading as growing-with-N)
- Flush:           82.3 us/op (was 96.4 — reseed excluded)
- Snapshot:         1.92 ms/op (unchanged; was timed correctly already)
@bootjp bootjp force-pushed the feat/keyviz-bench-accuracy branch from a7e9a97 to 19abb41 on April 27, 2026 08:58
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-1 follow-ups (19abb41) addressing all 4 items from Gemini + Codex on PR #682:

  • Codex P1 / route presence: TestObserveExactCountUnderConcurrentBurst now indexes rows by RouteID and asserts every registered route is present, so a future Flush regression that silently drops a route is caught instead of being masked by a loop that just iterates fewer times.
  • Codex P2 / Gemini medium / parallel shard: BenchmarkObserveParallel assigns each worker a disjoint shard of route IDs via atomic.Add. The previous draft had every worker start its counter at zero, so workers walked the same routes in lockstep. Now they actually hit distinct slots.
  • Gemini medium / RegisterRoute O(N²): fixed pre-load of 1024 routes; the benchmark now churns one ID in/out per iteration. b.N controls only the iteration count, so reported ns/op is constant per call rather than growing with N.
  • Gemini medium / Flush timed reseed: b.StopTimer / b.StartTimer bracket the reseed loop. Reported cost dropped from 96.4 µs/op to 82.3 µs/op once the 1024 Observe calls were excluded.

New M1 Max baselines (200 ms benchtime):

| Bench | Before | After |
| --- | --- | --- |
| ObserveParallel | 20.6 ns/op | 4.4 ns/op |
| RegisterRoute | 124 µs/op (growing with N) | 162 µs/op (constant) |
| Flush | 96.4 µs/op | 82.3 µs/op |
| Snapshot | 1.86 ms/op | 1.92 ms/op |

go test -race -count=1 ./keyviz/... clean. golangci-lint run ./keyviz/... clean.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 2m 12s


Round-2 Review (19abb41)

  • Gather context
  • Read keyviz/sampler_test.go
  • Verify each of the 4 fixes
  • Check for new issues
  • Post final review

All four fixes are correctly implemented. Two items worth addressing before merge.


P1 — //nolint:gosec conflicts with project convention

runConcurrentBurst (line 1315) suppresses gosec G115 with a comment:

for r := 1; r <= numRoutes; r++ {
    routeID := uint64(r) //nolint:gosec // r bounded by numRoutes

CLAUDE.md is explicit: "Avoid //nolint — refactor instead." The fix is a one-liner — iterate with uint64 from the start and the cast disappears entirely:

for r := uint64(1); r <= uint64(numRoutes); r++ {
    for w := 0; w < writersPerRoute; w++ {
        go func(routeID uint64) {
            // ...
        }(r)
    }
}



P2 — BenchmarkObserveParallel loses disjoint-shard guarantee at GOMAXPROCS > 16

With numRoutes = 64 and routesPerShard = 4, there are exactly 64 / 4 = 16 disjoint shards. The shard index wraps via (shardIndex * routesPerShard) % numRoutes, so goroutine 16 gets shardBase = 0 (same as goroutine 0), goroutine 17 gets shardBase = 4 (same as goroutine 1), etc. On a 32- or 64-core CI runner the benchmark silently starts measuring shared-counter contention again — the exact regression it was designed to detect.

Bumping numRoutes to 256 (same routesPerShard = 4) gives 64 disjoint shards and keeps the benchmark meaningful up to GOMAXPROCS = 64, without changing its character or O(N) cost.



Verification of each fix

| Fix | Status | Notes |
| --- | --- | --- |
| Codex P1 / route presence | ✅ Correct | rowsByRoute map + per-route ok check covers the exact regression Codex identified |
| Codex P2 / parallel shard | ✅ Correct (GOMAXPROCS ≤ 16) | atomic.Add assigns a disjoint shardBase per worker; see P2 above for the high-core edge case |
| Gemini / RegisterRoute O(N²) | ✅ Correct | Pre-load 1024, MaxTrackedRoutes = 1025, churn one ID per iteration. RegisterRoute checks len(slots) < MaxTrackedRoutes, so the first-iteration add (table at 1024) succeeds cleanly |
| Gemini / Flush timer isolation | ✅ Correct | b.StopTimer / b.StartTimer bracket the reseed loop; the clock advance is also inside the stopped window, so each Flush sees a fresh time window |



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19abb415cc


Comment thread keyviz/sampler_test.go
Comment on lines +1125 to +1126
shardIndex := nextShard.Add(1) - 1
shardBase := (shardIndex * routesPerShard) % numRoutes

P2: Size benchmark shards to parallel worker count

BenchmarkObserveParallel still reintroduces cross-goroutine counter contention when RunParallel launches more than 16 workers (its default is GOMAXPROCS): with numRoutes=64 and routesPerShard=4, shardBase := (shardIndex * routesPerShard) % numRoutes wraps and assigns multiple workers to the same 4-route shard. In that environment, the benchmark no longer isolates the shared table.Load cost it claims to pin, so ns/op becomes host-core-count dependent and can mask/blur hot-path regressions.



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
keyviz/sampler_test.go (2)

1145-1168: retiredSlots grows unboundedly across b.N iterations.

RemoveRoute enqueues the slot onto s.retiredSlots and the bench never calls Flush to drain it, so for typical b.N (millions) the queue accumulates millions of entries. That doesn't change the per-iteration COW cost you're trying to pin, but it will inflate the allocs/op and B/op figures with the slice-growth amortization, and any future change that traverses retiredSlots during RegisterRoute would silently turn this benchmark superlinear.

Consider periodically flushing (e.g. every iteration with b.StopTimer/StartTimer around it, similar to the pattern in BenchmarkFlush) so the retired queue stays bounded.


1110-1133: Shard mapping collides once GOMAXPROCS > 16, defeating the "distinct slot per goroutine" claim.

(shardIndex * routesPerShard) % numRoutes with numRoutes=64, routesPerShard=4 produces only numRoutes/routesPerShard = 16 distinct shardBase values (0, 4, 8, …, 60). On hosts where b.RunParallel spawns more than 16 workers (e.g. 24/32-core CI runners, EPYC dev boxes), the 17th+ goroutine wraps to shardBase=0 and contends on the same atomic.AddUint64 counters as worker 0 — exactly the false-sharing scenario the comment above promises is avoided. The bench would silently report contention numbers as the steady-state hot-path cost on those machines.

Either bump numRoutes (and the sampler's MaxTrackedRoutes) so the shard space exceeds any plausible GOMAXPROCS, or size it from runtime.GOMAXPROCS(0) at setup time.

♻️ One way to make the disjointness explicit
-func BenchmarkObserveParallel(b *testing.B) {
-	const (
-		numRoutes      = 64
-		routesPerShard = 4
-	)
-	s := NewMemSampler(MemSamplerOptions{Step: time.Second, HistoryColumns: 4, MaxTrackedRoutes: numRoutes})
+func BenchmarkObserveParallel(b *testing.B) {
+	const routesPerShard = 4
+	// Ensure numRoutes/routesPerShard >= any plausible GOMAXPROCS so each
+	// pb worker maps to a disjoint shard even on large CI runners.
+	// uint64 end-to-end so the loop and modulo below type-check without casts.
+	numRoutes := uint64(routesPerShard * runtime.GOMAXPROCS(0) * 4)
+	s := NewMemSampler(MemSamplerOptions{Step: time.Second, HistoryColumns: 4, MaxTrackedRoutes: int(numRoutes)})
 	for r := uint64(1); r <= numRoutes; r++ {
 		if !s.RegisterRoute(r, []byte{byte(r)}, []byte{byte(r) + 1}) {
 			b.Fatalf("RegisterRoute(%d) returned false", r)
 		}
 	}
 	var nextShard atomic.Uint64
 	b.ReportAllocs()
 	b.ResetTimer()
 	b.RunParallel(func(pb *testing.PB) {
 		shardIndex := nextShard.Add(1) - 1
-		shardBase := (shardIndex * routesPerShard) % numRoutes
+		shardBase := (shardIndex * routesPerShard) % uint64(numRoutes)
 		var i uint64
 		for pb.Next() {
 			s.Observe(shardBase+(i%routesPerShard)+1, OpWrite, 16, 64)
 			i++
 		}
 	})
 }

(Add "runtime" to the import block if you take this path.)


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97c16ac8-cab2-4b8e-a52e-92d6c2fb3bf4

📥 Commits

Reviewing files that changed from the base of the PR and between df5ea5e and 19abb41.

📒 Files selected for processing (1)
  • keyviz/sampler_test.go

Two items from Claude bot's round-2 on 19abb41:

- P1: drop the //nolint:gosec in runConcurrentBurst. Per CLAUDE.md
  ("avoid //nolint - refactor instead"), change the helper signature
  to take numRoutes as uint64 and iterate in that type from the
  start, eliminating the int->uint64 conversion that triggered
  gosec G115.
- P2: bump BenchmarkObserveParallel's numRoutes from 64 to 256 so
  numRoutes / routesPerShard >= 64 disjoint shards. With 16 shards,
  RunParallel on a host with GOMAXPROCS > 16 wrapped multiple
  workers onto the same 4-route shard, silently reintroducing the
  shared-counter contention this benchmark exists to catch.
  256 keeps the benchmark meaningful up to GOMAXPROCS = 64 without
  changing its character.
@bootjp
Owner Author

bootjp commented Apr 27, 2026

@claude review

Round-2 follow-ups (3310a1a):

  • P1 / nolint:gosec: runConcurrentBurst now takes numRoutes uint64. Iterating in that type from the start eliminates the int → uint64 conversion that triggered gosec G115; the suppression is gone.
  • P2 / shard count: BenchmarkObserveParallel's numRoutes bumped 64 → 256. With routesPerShard = 4 that's 64 disjoint shards, so RunParallel keeps the distinct-slot guarantee up to GOMAXPROCS = 64. New M1 Max baseline at 200 ms benchtime: 3.5 ns/op (was 4.4 ns/op; the shard increase actually frees up a bit more parallelism).

go test -race -count=1 ./keyviz/... clean. golangci-lint run ./keyviz/... clean.

@claude

claude Bot commented Apr 27, 2026

Claude finished @bootjp's task in 3m 3s


Round-3 Review (3310a1a)

  • Read keyviz/sampler_test.go
  • Verify P1 fix (no //nolint:gosec, uint64 loop)
  • Verify P2 fix (numRoutes = 256, 64 disjoint shards)
  • Check remaining open items from CodeRabbit round-2
  • Post final review

Both round-2 items are correctly implemented. The PR is clean.

P1 — //nolint:gosec removed ✅

runConcurrentBurst (line 1323) now takes numRoutes uint64 and the loop iterates in that type from the start:

func runConcurrentBurst(s *MemSampler, numRoutes uint64, ...) {
    for r := uint64(1); r <= numRoutes; r++ {
        routeID := r   // clean copy, no int→uint64 cast

No //nolint annotation appears anywhere in the PR-added code. The only //nolint:gosec in the file (line 405) is in a pre-existing test not touched by this PR. ✅

P2 — 64 disjoint shards ✅

numRoutes = 256 / routesPerShard = 4 → 64 disjoint shards (lines 1114-1115). The shardBase := (shardIndex * routesPerShard) % numRoutes formula can assign at most 64 unique shards before wrapping, so the distinct-slot guarantee holds up to GOMAXPROCS=64. The comment block (lines 1104-1111) documents the rationale. ✅

CodeRabbit nitpick — retiredSlots in BenchmarkRegisterRoute (not a blocker)

CodeRabbit noted retiredSlots could grow unboundedly in BenchmarkRegisterRoute since RemoveRoute enqueues the slot and Flush is never called. In practice this is not an issue: TestReRegisterIndividualReusesRetiredSlot locks in that RegisterRoute reclaims a retired slot when called within the grace window, and the benchmark calls RegisterRoute microseconds after RemoveRoute — well within any grace period. The retiredSlots queue stays at 0–1 entries throughout. Addressing this explicitly (via a periodic b.StopTimer/Flush) would make the benchmark's allocation numbers more clearly document the COW cost, but it's not a correctness problem.

Verification summary

| Item | Status |
| --- | --- |
| P1 — no //nolint:gosec in added code | ✅ |
| P2 — numRoutes=256, 64 disjoint shards | ✅ |
| Codex P1 — route presence check via rowsByRoute map | ✅ (unchanged from round-2) |
| Gemini — RegisterRoute O(N²) fixed | ✅ (unchanged from round-2) |
| Gemini — Flush timer isolation | ✅ (unchanged from round-2) |

Ready to merge from a test-quality standpoint.


@bootjp bootjp merged commit 3f35b68 into main Apr 27, 2026
9 checks passed
@bootjp bootjp deleted the feat/keyviz-bench-accuracy branch April 27, 2026 09:43