feat(keyviz): in-memory Sampler + ring buffer + RunFlusher (#639)
Conversation
First slice of the Phase 2 Key Visualizer per docs/admin_ui_key_visualizer_design.md §5. This commit only adds the keyviz/ package — wiring into kv.ShardedCoordinator, adapter.AdminServer.GetKeyVizMatrix, and main.go follow in subsequent commits to keep diffs reviewable.

Highlights:
- Sampler interface (nil-safe contract: a nil receiver makes Observe / RegisterRoute / Flush / Snapshot all no-ops, so callers wiring an unconfigured sampler compile to zero overhead).
- MemSampler hot path: atomic.Pointer[routeTable].Load + map lookup + atomic.AddUint64 ×2 (count + bytes). Matches design §10's per-call budget — no allocation, no mutex.
- Flush drains via atomic.SwapUint64; no pointer retirement, so no late-writer race past the snapshot. Idle slots are skipped from the emitted MatrixColumn.
- COW route-table publish (RegisterRoute / RemoveRoute) under routesMu off the hot path; one atomic.Pointer.Store on publish.
- Route-budget cap (--keyvizMaxTrackedRoutes, default 10000): excess routes fold into a virtual aggregate slot (MatrixRow.Aggregate=true + a MemberRoutes list), per design §5.3.
- ringBuffer with chronological [from, to) Range queries; oldest entries drop on overflow.
- RunFlusher convenience for the step-interval ticker so call sites don't repeat the boilerplate.

Tests under -race cover: the nil-safe contract, a basic Observe→Flush roundtrip, no counts lost across a flush boundary under writer churn, route-budget coarsening, snapshot range filtering, ring eviction, RemoveRoute, unknown-route Observe, and concurrent register+observe.
📝 Walkthrough: adds an in-memory heatmap sampling system built from a concurrency-safe MemSampler, a ringBuffer of flushed history, and a RunFlusher loop.
Sequence Diagrams

sequenceDiagram
actor Client
participant RunFlusher
participant Ticker
participant MemSampler
participant ringBuffer
Client->>RunFlusher: RunFlusher(ctx, sampler, step)
RunFlusher->>Ticker: start ticker(step)
loop while not ctx.Done
Ticker-->>RunFlusher: tick
RunFlusher->>MemSampler: Flush()
MemSampler->>MemSampler: swap and collect rows
MemSampler->>ringBuffer: Push(MatrixColumn)
end
Client->>RunFlusher: cancel ctx
RunFlusher->>Ticker: stop
RunFlusher-->>Client: return
sequenceDiagram
actor App
participant MemSampler
participant RouteTable
participant AtomicCounters
participant ringBuffer
App->>MemSampler: RegisterRoute(id, start,end)
MemSampler->>RouteTable: copy-on-write update
loop Concurrent Observes
App->>MemSampler: Observe(routeID, op, keyLen, valueLen)
MemSampler->>RouteTable: lockless lookup
MemSampler->>AtomicCounters: atomic increments
end
App->>MemSampler: RemoveRoute(routeID)
MemSampler->>MemSampler: mark retired slot (grace window)
App->>MemSampler: Snapshot(from,to)
MemSampler->>ringBuffer: Range(from,to) -> deep-copied MatrixColumns
ringBuffer-->>MemSampler: returned snapshot
MemSampler-->>App: MatrixColumn[]
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 252487ac42
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
bucket.MemberRoutes = append(bucket.MemberRoutes, routeID)
// Extend bucket end if the new route reaches further right.
if len(end) == 0 || (len(bucket.End) != 0 && bytesGT(end, bucket.End)) {
	bucket.End = cloneBytes(end)
Avoid mutating shared virtual bucket before publishing COW table
In RegisterRoute, this branch mutates bucket in place (MemberRoutes, Start, End) even though copyRouteTable only shallow-copies slot pointers, so the same routeSlot is still reachable from the currently published table. When RunFlusher/Flush runs concurrently and MaxTrackedRoutes is exceeded, readers can observe partially updated metadata and race on non-atomic fields, violating the immutable-snapshot contract for Observe/Flush.
delete(next.slots, routeID)
delete(next.virtualForRoute, routeID)
next.sortedSlots = rebuildSorted(next)
Preserve pending counters when removing a route
RemoveRoute deletes the slot from the active table immediately, but Flush only drains counters present in the currently loaded table. If a route is removed before the next flush tick, any counts accumulated since the previous flush become unreachable and are dropped from history, which undercounts traffic during route churn.
}
next := copyRouteTable(cur)
delete(next.slots, routeID)
delete(next.virtualForRoute, routeID)
Update aggregate member metadata when deleting virtual routes
When removing a coarsened route, only virtualForRoute[routeID] is deleted; the shared aggregate slot's MemberRoutes list is never pruned. As a result, later snapshots can still report deleted route IDs inside aggregate rows, producing stale drill-down metadata and incorrect route membership for virtual buckets.
Three Codex findings on commit 252487a:

P1 — RegisterRoute mutated a shared virtual bucket (MemberRoutes, Start, End) before publishing the COW table, but copyRouteTable only shallow-copies slot pointers, so the same routeSlot is reachable from the still-published table. A concurrent Flush reading MemberRoutes would observe a half-extended slice. Fix: add metaMu (sync.RWMutex) to routeSlot; RegisterRoute holds it Lock() while mutating; Flush emits rows via slot.snapshotMeta() under RLock(). Observe never reads metadata, so the hot path stays lockless.

P2 — RemoveRoute deleted the slot from the active table immediately, so any counts accumulated since the last Flush became unreachable (the slot is GC'd before the next Flush sees it). Fix: queue removed individual slots into MemSampler.retiredSlots; Flush drains the retired queue alongside the live table and clears it. Counts are preserved across route churn.

P2 — RemoveRoute on a virtual-bucket member only deleted the virtualForRoute mapping; the shared bucket's MemberRoutes list still listed the removed RouteID, leaking stale drill-down metadata. Fix: prune the bucket's MemberRoutes list in place under metaMu while holding routesMu (no concurrent registers).

Tests added (all under -race):
- TestRemoveRouteHarvestsPendingCounts: 7 reads + Remove + Flush emits a row with Reads=7.
- TestRemoveVirtualMemberPrunesMemberRoutes: a removed virtual member is not advertised in subsequent Snapshot rows.
- TestRegisterDoesNotRaceFlushOnVirtualBucket: the -race detector catches the P1 regression if metaMu is dropped.
@claude[agent] review
LGTM - Excellent implementation! 🚀 This is high-quality systems code that correctly implements Phase 2 of the Key Visualizer design. All tests pass under -race.

Minor future enhancements were noted as non-blocking. The implementation matches the design specification exactly. Ready to wire into the coordinator.
💡 Codex Review
Reviewed commit: e492ad5f4c
out := make([]MatrixColumn, hi-lo)
copy(out, all[lo:hi])
Deep-copy row slices before returning range snapshots
Range only copies MatrixColumn structs, so each returned column still shares its Rows backing array with the ring buffer. Any caller-side mutation (for example, sorting/rewriting rows while shaping an API response) will mutate stored history and can race with concurrent Flush reads/writes, which corrupts later snapshots despite the comment promising callers can mutate the result freely.
start = s.Start
end = s.End
Clone start/end bounds when snapshotting slot metadata
snapshotMeta returns s.Start and s.End by reference, so MatrixRow entries in flushed columns alias live slot metadata. If a consumer mutates returned row bounds, it mutates sampler state for future coarsening/sorting decisions; this is especially problematic because these byte slices are exported in the snapshot API.
Codex round-2 P2: Snapshot consumers were aliasing live sampler state. snapshotMeta returned Start/End by reference (so flushed rows tracked later RegisterRoute mutations), and ringBuffer.Range only copied MatrixColumn structs (so a concurrent Flush could rotate the underlying Rows slice, and caller mutation could corrupt stored history). Fix: clone Start/End in snapshotMeta and add a cloneColumn helper that ringBuffer.Range uses to materialise a fully-owned column per call. Regression test TestSnapshotReturnsDeepCopy mutates a returned row's bounds and verifies that a follow-up snapshot is unaffected.
@claude[agent] review
LGTM - Critical deep-copy fix! 🎯 HEAD commit (347afd1) addresses a P2 memory safety issue where snapshot consumers aliased live sampler state through shared backing arrays.

Overall PR verdict (all 3 commits): Production-ready. All memory safety issues addressed through iterative refinement with excellent regression test coverage. Ready to wire into the coordinator.
💡 Codex Review
Reviewed commit: 347afd14e8
s.retiredMu.Lock()
retired := s.retiredSlots
s.retiredSlots = nil
s.retiredMu.Unlock()
Keep retired slots alive across late Observe writers
The retired-slot drain path clears retiredSlots before draining (retired := s.retiredSlots; s.retiredSlots = nil), so an Observe call that loaded the old table before RemoveRoute can still execute Add after that one-time drain and write into a slot that is no longer reachable by any future Flush. In that concurrency window, increments are permanently dropped, which undercounts traffic during route churn; this needs an additional grace period/epoch before retiring the slot permanently.
if bytesLT(start, bucket.Start) {
	bucket.Start = cloneBytes(start)
}
Reorder sorted slots after virtual bucket start shifts
When a new over-budget route folds into an existing virtual bucket, this block can lower bucket.Start, but the bucket remains at its old position in next.sortedSlots. If the new route start is earlier than the bucket’s previous start, later Flush iterations over tbl.sortedSlots are no longer in key order, breaking the sampler’s own sorted by Start ordering contract and producing misordered matrix rows.
Codex round-3 P1: RemoveRoute previously queued the retired slot for a single Flush and then dropped it. An Observe goroutine that loaded the pre-RemoveRoute table snapshot could still complete its atomic.Add into the slot after that drain, silently losing the increment. Fix: hold each retired slot for retainedFlushes (=2) cycles instead, so the late writer is harvested by the second drain.

Codex round-3 P2: Folding an over-budget route into an existing virtual bucket may lower bucket.Start, but the bucket kept its old position in next.sortedSlots, so Flush emitted matrix rows out of key order. Fix: rebuild sortedSlots whenever Start is lowered, restoring the API ordering contract.

Refactor: extracted a foldIntoBucket helper so RegisterRoute stays under the cyclop budget. Regression tests TestRetiredSlotGracePeriod and TestRegisterFoldLowerStartReorders cover both fixes.
@claude[agent] review
💡 Codex Review
Reviewed commit: 7c6ce9560b
for _, r := range s.retiredSlots {
	col.Rows = appendDrainedRow(col.Rows, r.slot)
	r.remaining--
Preserve sort order when appending retired rows
Flush builds col.Rows from tbl.sortedSlots (start-key order) and then appends retired-slot rows afterward, so a removed route with a lower Start is emitted out of order. During route churn this violates the sampler’s own “sorted by Start” ordering contract and can destabilize row alignment for matrix consumers that rely on deterministic key order between columns. Sort col.Rows (or merge retired rows in order) before pushing to history.
filtered := bucket.MemberRoutes[:0]
for _, m := range bucket.MemberRoutes {
	if m != routeID {
		filtered = append(filtered, m)
	}
Defer virtual-member pruning until post-flush drain
When removing a virtual-member route, this code immediately drops the route ID from MemberRoutes, but the shared bucket counters still include that route’s pre-removal increments until the next Flush. In that window, the flushed row reports traffic that came from the removed route while advertising only the remaining members (misattributed metadata during churn). Keep the removed ID in membership until the pending bucket counters for that interval are drained.
Actionable comments posted: 2
🧹 Nitpick comments (2)
keyviz/sampler.go (2)
220-238: `opts.Step` is normalized but never read by `MemSampler`.

`Step` is defaulted on line 220 and stored on `s.opts`, but nothing on the sampler consumes it — only `RunFlusher` accepts a `step` parameter directly from its caller. If the intent is for callers to query the configured step (e.g., to align a ticker), exposing a small accessor would help; otherwise the field is purely informational. Optional cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler.go` around lines 220 - 238, The opts.Step value is normalized and saved on MemSampler but never used; either make it accessible or remove it—prefer adding an accessor so callers (e.g., the code that starts RunFlusher) can read the configured step. Add a method on MemSampler like Step() int (or appropriate type) that returns s.opts.Step and update any caller using an external step to call sampler.Step() instead of passing a separate value; ensure MemSampler.RunFlusher continues to accept an override if desired, but document/align callers to use the new Step() accessor so the normalized default is actually consumed.
419-444: Retired-slot rows break the Start-order invariant during the grace window.

`Flush` emits live slots in `sortedSlots` order (which `foldIntoBucket` and `appendSorted` go to lengths to maintain — `TestRegisterFoldLowerStartReorders` pins this contract), then appends rows from `retiredSlots` in queue order. If a retired slot's `Start` falls between two live slots' `Start`s, the resulting `MatrixColumn.Rows` is no longer monotone by `Start` for `retainedFlushes` cycles. The mixed live+retired ordering case is not covered by the existing tests. Suggest sorting once after all rows are appended, or merging the retired drain into the same Start-ordered iteration:

Proposed fix

```diff
 col := MatrixColumn{At: s.now()}
 tbl := s.table.Load()
 for _, slot := range tbl.sortedSlots {
 	col.Rows = appendDrainedRow(col.Rows, slot)
 }
 s.retiredMu.Lock()
 keep := s.retiredSlots[:0]
 for _, r := range s.retiredSlots {
 	col.Rows = appendDrainedRow(col.Rows, r.slot)
 	r.remaining--
 	if r.remaining > 0 {
 		keep = append(keep, r)
 	}
 }
 s.retiredSlots = keep
 s.retiredMu.Unlock()
+sort.SliceStable(col.Rows, func(i, j int) bool {
+	return bytesLT(col.Rows[i].Start, col.Rows[j].Start)
+})
 s.historyMu.Lock()
```

A regression test that registers two live routes with `Start`s straddling a retired slot's `Start`, removes the middle route pre-flush, and asserts `Rows` is sorted would lock this in.

🤖 Prompt for AI Agents
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@keyviz/sampler.go`:
- Around line 16-19: Update the package comment that describes Observe in
sampler.go: it currently claims Observe does four atomic.AddUint64 calls but the
implementation (function Observe) performs one unconditional atomic.AddUint64
for the count and at most one more for bytes (skipped when keyLen == 0 &&
valueLen == 0); change the wording to state "at most two atomic.AddUint64 calls"
(or equivalent) and briefly note the conditional second increment to accurately
reflect the hot-path cost.
- Around line 215-240: The docstring for NewMemSampler is incorrect: the
function never returns nil (it always constructs and returns a *MemSampler and
newRingBuffer clamps small capacities), so update the comment on NewMemSampler
to remove the claim about returning nil for opts.HistoryColumns < 0 and instead
state that zero fields fall back to defaults and that negative HistoryColumns
are treated the same as small values (newRingBuffer will clamp to minimum
capacity); reference the MemSamplerOptions.HistoryColumns behavior and the
newRingBuffer clamping in the comment for clarity.
---
Nitpick comments:
In `@keyviz/sampler.go`:
- Around line 220-238: The opts.Step value is normalized and saved on MemSampler
but never used; either make it accessible or remove it—prefer adding an accessor
so callers (e.g., the code that starts RunFlusher) can read the configured step.
Add a method on MemSampler like Step() int (or appropriate type) that returns
s.opts.Step and update any caller using an external step to call sampler.Step()
instead of passing a separate value; ensure MemSampler.RunFlusher continues to
accept an override if desired, but document/align callers to use the new Step()
accessor so the normalized default is actually consumed.
- Around line 419-444: Flush builds MatrixColumn.Rows by first appending live
slots from s.table.Load().sortedSlots then appending retired slots from
s.retiredSlots, which can violate the Start-order invariant; change
MemSampler.Flush to produce Rows in Start order by either (a) merging retired
slot draining into the same Start-ordered iteration used for sortedSlots (so
retired rows are appended at the correct position during the single scan), or
(b) append all rows as now but run a stable sort by Start on col.Rows before
pushing to s.history (preserving tiebreakers); update references to
appendDrainedRow usage accordingly and add a regression test that registers two
live routes with Starts straddling a retired Start, removes the middle route
before Flush, and asserts MatrixColumn.Rows is sorted by Start for
retainedFlushes cycles.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b777b9c7-3b03-4528-9db5-5c88800d7286
📒 Files selected for processing (5)
- keyviz/flusher.go
- keyviz/flusher_test.go
- keyviz/ring_buffer.go
- keyviz/sampler.go
- keyviz/sampler_test.go
Round-4 review fixes for PR #639:

Codex P2 (sampler.go:419-444), duplicated by a CodeRabbit finding: Flush appended retired-slot rows after the live sortedSlots scan, so a removed route with a Start between two live slots produced a column whose Rows were not monotone by Start. Fix: add a final sort.SliceStable on col.Rows so the API contract holds across the retainedFlushes window. TestFlushSortsMixedLiveAndRetiredRows is the regression test.

Codex P2 (sampler.go:404): Removing a virtual-bucket member pruned the routeID from MemberRoutes immediately, but the bucket counters still held that route's pre-removal increments — the next flushed row mixed the removed route's traffic into the bucket while advertising only the remaining members. Fix: defer the prune across retainedFlushes via a pendingPrunes queue so row attribution stays accurate during the grace window. TestRemoveVirtualMemberPruneDeferred and an updated TestRemoveVirtualMemberPrunesMemberRoutes pin both halves of the contract.

CodeRabbit nits: the package comment said Observe does four atomic.AddUint64 calls (it's at most two); the NewMemSampler doc claimed it returns nil for negative HistoryColumns (it never does — newRingBuffer clamps). Updated both. Added a Step() accessor so callers wiring RunFlusher can read the configured interval rather than threading it through two config paths. Refactored Flush into drainRetiredSlots and advancePendingPrunes helpers to keep cyclomatic complexity under budget.
@claude[agent] review
💡 Codex Review
Reviewed commit: 156e9f50b7
case isVirtual:
	// Defer pruning until after the bucket's pre-removal counters
	// have been drained. While the prune is pending the routeID
	// stays in MemberRoutes so the next few Flush rows attribute
	// the bucket's mixed counters to all members that contributed
Drain orphaned virtual bucket before removing last member
This isVirtual branch only schedules a member prune, but it never preserves the bucket counters when the removed route is the bucket's last remaining mapping. After delete(next.virtualForRoute, routeID) and rebuildSorted(next), that bucket can become unreachable, so Flush no longer drains it and all pre-removal counts are dropped (reproducible with MaxTrackedRoutes=1: register one over-budget route, observe it, remove it, then flush yields no rows). The retired-slot path currently protects only isIndividual, so virtual buckets also need a drain/grace path when they become orphaned.
Codex round-5 P1: When RemoveRoute removes the last member of a virtual bucket, delete(virtualForRoute, routeID) leaves no path back to the bucket, so rebuildSorted drops it from sortedSlots and Flush silently loses any pre-removal counters the bucket still holds (plus any in-flight late-Observe writers hitting the prior table snapshot). Fix: detect the now-orphaned bucket via bucketStillReferenced and queue it on retiredSlots with the same retainedFlushes grace, so Flush keeps draining it across the grace window. The pendingPrune entry already keeps MemberRoutes intact for those drains, so attribution remains correct. Regression test TestRemoveLastVirtualMemberHarvestsBucket sets MaxTrackedRoutes=1, observes the over-budget route, removes it, and asserts the bucket counts survive into the next snapshot column.
@claude[agent] review

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@claude review
Claude finished @bootjp's task in 4m 11s.

Code Review

Verdict: LGTM — production-ready for this phase. Six rounds of iterative review have driven this to excellent quality. All P1/P2 issues from prior rounds are addressed. Detailed notes below.

What's resolved from prior reviews
All regression tests pass under -race.

Remaining minor findings (non-blocking)

1. Interface-nil vs typed-nil for the Sampler interface. The nil-safe guarantee applies to a typed-nil *MemSampler receiver; a Sampler interface variable holding that typed-nil still compares non-nil. Worth a one-line comment in the coordinator wiring or a note in the package doc. No change needed in this PR.

2. The comment says the function returns the bucket "closest to the right of start", but the fallback loop (second pass) returns the first aggregate in key order, and the function returns nil when no virtual bucket exists yet — the caller creates one. The in-code fallback comment is the accurate description:

// Fallback: if no bucket contains start, return the first aggregate
// in key order so over-budget routes collapse into a single global bucket.

3. Package doc mentions one thing while line 26 lists another; a small doc/code consistency nit.

4. HistoryColumns normalization only catches zero:

if opts.HistoryColumns == 0 {
	opts.HistoryColumns = DefaultHistoryColumns
}

A negative value skips the default and lands in newRingBuffer's clamp. Treating any non-positive value as default would be tighter:

if opts.HistoryColumns <= 0 {
	opts.HistoryColumns = DefaultHistoryColumns
}

5. No benchmark backs the asserted hot-path cost. The package comment asserts specific hot-path properties (single pointer load, at most two atomic adds); a micro-benchmark would pin them.

Code correctness spot-checks

Summary

This is well-engineered systems code with thorough race-detector coverage. The hot path is genuinely lock-free, the COW pattern is correctly implemented, and the grace-period mechanism handles the edge cases that tripped up earlier rounds. Items 1–5 above are all documentation/minor correctness improvements; none are blockers. Ready to proceed to the coordinator wiring PR.
💡 Codex Review
Reviewed commit: 6346ac8618
// table snapshot will have completed its single atomic.Add long
// before the second flush tick (Step is at human timescales,
// microseconds suffice).
const retainedFlushes = 2
Make late-writer grace independent of flush frequency
Fresh evidence in this commit: the late-writer protection now expires after a fixed retainedFlushes = 2, so retirement is tied to flush count rather than elapsed time. If Step is configured small (for example, millisecond-scale), a goroutine that loaded the old table before RemoveRoute but is delayed by scheduler/GC for more than two ticks can still write after the slot has been dropped from retiredSlots, causing silent undercount during route churn despite the intended grace logic.
The PR-triggered claude-code-review workflow runs on every PR but the action exits without posting a review comment (see run 24935980219 on PR #637). The mention-triggered claude.yml works correctly (verified on PR #639). Disable the auto flow until the root cause is identified. https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ
## Summary

- Rename `claude-code-review.yml` to add a `.disable` suffix, stopping the PR-triggered automatic review workflow.
- The mention-triggered `claude.yml` remains active.

## Background

- The automatic review did run in [run 24935980219 (PR #637)](https://github.com/bootjp/elastickv/actions/runs/24935980219/job/73021502681?pr=637), but no review comment was posted to the PR.
- Mention-triggered reviews (`@claude`) were confirmed to work in a comment on PR #639.
- Stop the automatic flow until the root-cause investigation is complete.

## Test plan

- [ ] Confirm the `Claude Code Review` workflow does **not** start on this PR
- [ ] Confirm the `@claude` mention still responds on another PR
- [ ] Confirm the existing Go test, Lint, and CodeQL workflows are unaffected

https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ

_Generated by [Claude Code](https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ)_
🧹 Nitpick comments (4)
keyviz/sampler.go (2)
734-745: Prefer `bytes.Compare` from the standard library.

`bytesLT`/`bytesGT`/`bytesLE`/`bytesGE` re-implement what `bytes.Compare` already provides (and the latter is asm-optimized on most architectures). Using the stdlib helper trims four hand-rolled comparators down to one-liners and removes a small re-invented wheel that the test suite (`flushedRowsSorted`) is also using to assert ordering.

♻️ Proposed refactor

Add `"bytes"` to the import block and replace the helpers:

```diff
+import "bytes"

-func bytesLT(a, b []byte) bool {
-	for i := 0; i < len(a) && i < len(b); i++ {
-		if a[i] != b[i] {
-			return a[i] < b[i]
-		}
-	}
-	return len(a) < len(b)
-}
-
-func bytesLE(a, b []byte) bool { return !bytesGT(a, b) }
-func bytesGE(a, b []byte) bool { return !bytesLT(a, b) }
-func bytesGT(a, b []byte) bool { return bytesLT(b, a) }
+func bytesLT(a, b []byte) bool { return bytes.Compare(a, b) < 0 }
+func bytesLE(a, b []byte) bool { return bytes.Compare(a, b) <= 0 }
+func bytesGT(a, b []byte) bool { return bytes.Compare(a, b) > 0 }
+func bytesGE(a, b []byte) bool { return bytes.Compare(a, b) >= 0 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler.go` around lines 734 - 745, Replace the custom byte-comparison helpers with the standard library bytes.Compare: remove bytesLT, bytesGT, bytesLE, bytesGE and update callers to use bytes.Compare(a,b) < 0, > 0, <= 0, >= 0 respectively; add "bytes" to the import block so comparisons are asm-optimized and consistent with the test suite (e.g., update any usage in flushedRowsSorted or other functions to call bytes.Compare).
373-375: Doc nit: comment doesn't quite match `findVirtualBucket`'s selection rule.

The "closest-by-start virtual bucket" wording in `RegisterRoute` (lines 373–374) suggests proximity-based matching, but `findVirtualBucket` actually returns the first aggregate whose `[Start, End)` contains `start`, falling back to the first aggregate in `sortedSlots` order (i.e., lowest `Start`) when none does. Worth tightening so future readers don't try to "fix" the matching to be distance-based.

📝 Suggested wording

```diff
-	// Coarsening: route is folded into the closest-by-start virtual
-	// bucket (or one is created if no virtual bucket exists yet).
+	// Coarsening: route is folded into the first virtual bucket whose
+	// [Start, End) covers `start`, or — if none does — into the first
+	// existing virtual bucket in sortedSlots order. A new bucket is
+	// created only when no virtual bucket exists yet.
```

Also applies to: 702-723
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler.go` around lines 373 - 375, The comment in RegisterRoute stating "closest-by-start virtual bucket" is inaccurate: findVirtualBucket selects the first aggregate whose [Start, End) contains start, and if none matches returns the first aggregate in sortedSlots (lowest Start), not the nearest by distance. Update the comment near RegisterRoute (and the similar comment block around the other occurrence) to describe the actual rule: prefer the first aggregate that contains start, otherwise fall back to the first aggregate in sortedSlots (lowest Start), and reference findVirtualBucket by name so readers understand the exact selection logic.

keyviz/sampler_test.go (2)
307-314: Use `findAggregateRow` here for resilience to future setup changes.

`cols[len(cols)-1].Rows[0]` works today only because route 1 had no traffic across the test, so its idle slot is dropped from the column and the bucket happens to be `Rows[0]`. If anyone later adds an `Observe(1, …)` to extend coverage (or reorders setup so route 1's `Start` sorts after the bucket's), this assertion will silently start checking the wrong row. The file already has `findAggregateRow(t, …)` for exactly this — using it makes the intent ("look at the aggregate") explicit.

♻️ Proposed change

```diff
-	cols := s.Snapshot(time.Time{}, time.Time{})
-	agg := cols[len(cols)-1].Rows[0]
+	agg := findAggregateRow(t, lastSnapshotColumn(t, s).Rows)
 	if memberRoutesContain(agg.MemberRoutes, 2) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler_test.go` around lines 307 - 314, The test currently selects the aggregate row via cols[len(cols)-1].Rows[0], which is brittle; replace that direct index with the existing helper findAggregateRow(t, cols[len(cols)-1]) to reliably locate the aggregate row regardless of bucket ordering or dropped idle slots, i.e. assign agg := findAggregateRow(t, cols[len(cols)-1]) and keep the subsequent member-route assertions unchanged.
357-374: Helper name is mildly misleading.

`setupVirtualBucketWithThreeMembers` actually wires up one individual slot (route 1) plus a virtual bucket whose `MemberRoutes` is `[2, 3]` (i.e., two members), not a virtual bucket with three members. Easy to misread when adding new tests on top of this helper. Consider e.g. `setupOneIndividualPlusVirtualBucket` or `setupVirtualBucketWithTwoMembers`.
Verify each finding against the current code and only fix it if needed. In `@keyviz/sampler_test.go` around lines 357 - 374, The helper function setupVirtualBucketWithThreeMembers is misnamed because it registers one individual route (route 1) and a virtual bucket with two member routes (routes 2 and 3); rename the helper to a clearer name such as setupOneIndividualPlusVirtualBucket or setupVirtualBucketWithTwoMembers and update all call sites accordingly (tests using setupVirtualBucketWithThreeMembers), and optionally update any comments/docstrings inside the function; locate the function by name and adjust references to MemSampler/newTestSampler/RegisterRoute usages so behavior stays identical while the name reflects its actual setup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e99d6985-a507-4f18-8270-9f24b62fda1a
📒 Files selected for processing (2)
- keyviz/sampler.go
- keyviz/sampler_test.go
Codex round-9 P2: RegisterRoute remapped a previously removed virtual routeID back into a bucket but never cleared the queued pendingPrunes entry for the same routeID. After grace, advancePendingPrunes still ran the prune, so MemberRoutes lost a route that Observe was actively counting against the bucket — silent under-attribution during route-churn / rebalancing where the same RouteID can disappear and reappear within the grace window. Cancel any pendingPrune with a matching routeID at the top of RegisterRoute (under routesMu, briefly grabbing retiredMu). Add a defensive dedup in foldIntoBucket so a re-fold with the routeID still present in MemberRoutes does not produce a duplicate entry. TestReRegisterDuringPruneGraceCancelsPrune drives Remove → Register → Advance(graceWindow + 1s) → Flush and asserts the routeID is present in MemberRoutes exactly once.
@claude review

Claude finished @bootjp's task in 3m 41s — View job

Code Review
| Area | Verdict |
|---|---|
| Hot path: `atomic.Pointer.Load` + immutable map + ≤2 `atomic.Add` | Lock-free ✓ |
| COW publish under `routesMu`; readers see either old or new table atomically | ✓ |
| `foldIntoBucket` mutates shared slot under `metaMu.Lock`; `snapshotMeta` reads under `metaMu.RLock` | Race-free ✓ |
| `rebuildSorted` reads `slot.Start` without `metaMu` — only called under `routesMu`; concurrent `snapshotMeta` is read/read | Safe ✓ |
| Lock ordering: `routesMu` → `retiredMu` everywhere (`RegisterRoute`, `RemoveRoute`, `cancelPendingPrune`); `Flush` acquires `retiredMu` independently — no deadlock cycle | ✓ |
| Flush can load an old table and drain `retiredSlots` for the same slot — second drain sees all-zero counters, idle check skips it, no double-count | ✓ |
| `drainRetiredSlots` in-place filter: `keep := retired[:0]`; range captures original len; writes never overtake reads | ✓ |
| Final drain on grace-expiry flush: `appendDrainedRow` fires before entry is dropped | ✓ |
| `clearTail` / `clearPruneTail` zero dropped backing-array entries; `cancelPendingPrune` also calls `clearPruneTail` | GC-safe ✓ |
| `drainRetiredSlots` runs before `advancePendingPrunes` under same `retiredMu`; orphaned bucket emits with full `MemberRoutes` before prune fires | ✓ |
| `cancelPendingPrune` before idempotency check — no-op for already-tracked routes; essential for re-registration inside grace | ✓ |
| `bucketStillReferenced` checks `next.virtualForRoute` (post-delete copy) | ✓ |
| `copyRouteTable` shares `*routeSlot` pointers; all metadata mutations go through `metaMu.Lock()` | ✓ |
| `sort.SliceStable` after live + retired drain — mixed-source Rows are monotone by Start | ✓ |
| `ringBuffer.snapshotOrdered` when `pos=0` (just wrapped): `buf[0:]` then `buf[:0]` — all elements once | ✓ |
| `ringBuffer.Range` deep-copies via `cloneColumn`; `snapshotMeta` clones Start/End/MemberRoutes | ✓ |
| All non-positive option values fall back to defaults (Step, HistoryColumns, MaxTrackedRoutes, MaxMemberRoutesPerSlot) | ✓ |
| `keyLen` / `valueLen` guards prevent negative-to-uint64 cast overflow | ✓ |
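The hot-path and flush rows above condense into a toy model: readers `Load` an immutable copy-on-write table, writers publish a new one, and Flush drains counters with `Swap` so a late increment lands in the next step rather than being lost. All names here are illustrative stand-ins, not the real `MemSampler`:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// slot counters are drained with Swap, never Load+Store, so a concurrent
// increment lands either in this flush or the next one — never lost.
type slot struct {
	reads atomic.Uint64
	bytes atomic.Uint64
}

type table struct{ slots map[uint64]*slot }

type sampler struct{ tbl atomic.Pointer[table] }

// observe is the lock-free hot path: one pointer Load, one map lookup,
// at most two atomic adds.
func (s *sampler) observe(routeID, n uint64) {
	t := s.tbl.Load()
	if t == nil {
		return
	}
	if sl, ok := t.slots[routeID]; ok {
		sl.reads.Add(1)
		sl.bytes.Add(n)
	}
}

// register publishes a copy-on-write table; the real code serializes
// publishers with a mutex off the hot path.
func (s *sampler) register(routeID uint64) {
	next := &table{slots: map[uint64]*slot{}}
	if cur := s.tbl.Load(); cur != nil {
		for id, sl := range cur.slots {
			next.slots[id] = sl
		}
	}
	next.slots[routeID] = &slot{}
	s.tbl.Store(next)
}

// flush drains counters atomically and skips idle (all-zero) slots.
func (s *sampler) flush() map[uint64]uint64 {
	out := map[uint64]uint64{}
	t := s.tbl.Load()
	if t == nil {
		return out
	}
	for id, sl := range t.slots {
		n, b := sl.reads.Swap(0), sl.bytes.Swap(0)
		if n == 0 && b == 0 {
			continue
		}
		out[id] = n
	}
	return out
}

func main() {
	s := &sampler{}
	s.register(7)
	s.observe(7, 128)
	s.observe(7, 64)
	fmt.Println(s.flush()) // map[7:2]
	fmt.Println(s.flush()) // map[]: idle slot skipped
}
```

Because table contents are never mutated after publish, `observe` needs no lock at all; only the publish path pays for the map copy.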
Remaining minor findings (non-blocking)

1. Brittle row index in TestRemoveVirtualMemberPrunesMemberRoutes [sampler_test.go:308]

```go
agg := cols[len(cols)-1].Rows[0]
```

This works today because route 1 receives no Observe calls after setup, so it's idle and absent from Rows, making the aggregate row land at index 0. If a future test extension adds an `Observe(1, …)` or reorders setup, this silently starts checking the wrong row. The file already has `findAggregateRow(t, …)` for exactly this purpose:

```go
// suggested fix:
agg := findAggregateRow(t, lastSnapshotColumn(t, s).Rows)
```

2. Misleading helper name setupVirtualBucketWithThreeMembers [sampler_test.go:357]

With `MaxTrackedRoutes: 1`, route 1 is individual; routes 2 and 3 fold into one virtual bucket with two members. The name "ThreeMembers" suggests three virtual members. Consider `setupOneIndividualPlusVirtualBucket` or `setupVirtualBucketWithTwoMembers`.

3. RegisterRoute comment "closest-by-start" doesn't match findVirtualBucket behavior [sampler.go:382–383]

```go
// Coarsening: route is folded into the closest-by-start virtual
// bucket (or one is created if no virtual bucket exists yet).
```

`findVirtualBucket` actually prefers the first aggregate whose `[Start, End)` contains `start`, then falls back to the first aggregate in `sortedSlots` order (lowest `Start`), not nearest-by-distance. The docstring on `findVirtualBucket` itself (lines 743–748) is accurate; this call-site comment is not.

```go
// Coarsening: route is folded into the first virtual bucket whose
// [Start, End) covers start, or — if none does — into the first
// existing virtual bucket in Start order. A new bucket is created
// only when no virtual bucket exists yet.
```
4. bytesLT/GT/LE/GE re-implement bytes.Compare [sampler.go:775–786]
The four hand-rolled comparators are correct but functionally identical to bytes.Compare, which is asm-optimized on most architectures:
```go
func bytesLT(a, b []byte) bool { return bytes.Compare(a, b) < 0 }
func bytesLE(a, b []byte) bool { return bytes.Compare(a, b) <= 0 }
func bytesGT(a, b []byte) bool { return bytes.Compare(a, b) > 0 }
func bytesGE(a, b []byte) bool { return bytes.Compare(a, b) >= 0 }
```

Since these helpers appear on the admin/flush path (not hot-path Observe), the practical impact is nil — purely a cleanup.
5. Interface nil vs typed nil — coordinator wiring note (non-blocking, no change needed here)
The nil-safe guard if s == nil { return } in Observe/Flush etc. fires only for a typed nil ((*MemSampler)(nil)). If the coordinator declares var s keyviz.Sampler and never assigns it, calling s.Observe(…) panics at the interface dispatch. The coordinator wiring PR should initialise the disabled default as:
```go
sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)
```

Also worth noting in that PR: RunFlusher accepts `*MemSampler` (not `Sampler`), so callers holding a Sampler-typed variable need a type assertion — or can call `s.(*MemSampler).Step()` for ticker alignment.
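The typed-nil vs interface-nil distinction in this note is easy to demonstrate standalone. The `Sampler`/`MemSampler` types below are minimal stand-ins, not the keyviz package:

```go
package main

import "fmt"

type Sampler interface{ Observe(routeID uint64) }

type MemSampler struct{ count uint64 }

func (s *MemSampler) Observe(routeID uint64) {
	if s == nil { // nil-safe contract: typed-nil receiver is a no-op
		return
	}
	s.count++
}

func main() {
	// A typed nil carries a concrete type, so dispatch reaches the
	// nil-receiver guard inside Observe.
	var typedNil Sampler = (*MemSampler)(nil)
	typedNil.Observe(1)
	fmt.Println("typed nil: ok")

	// A never-assigned interface is (nil type, nil value): there is no
	// method table to dispatch into, so the call panics.
	var untyped Sampler
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("untyped nil: panics")
		}
	}()
	untyped.Observe(1)
}
```

This is why the wiring suggestion above initializes the disabled default with an explicit `(*keyviz.MemSampler)(nil)` conversion rather than leaving the interface variable unset.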
Test coverage
All corner cases pinned by regression tests across 9 review rounds:
| Test | Pins |
|---|---|
| TestRetiredSlotGracePeriod | Wall-clock grace, late-writer simulation |
| TestRemoveVirtualMemberPruneDeferred | Prune deferred inside grace window |
| TestRemoveLastVirtualMemberHarvestsBucket | Orphaned bucket drainage |
| TestFlushSortsMixedLiveAndRetiredRows | Row ordering with retired slots |
| TestRegisterFoldLowerStartReorders | sortedSlots after Start lowered |
| TestSnapshotReturnsDeepCopy | Deep-copy isolation |
| TestRegisterDoesNotRaceFlushOnVirtualBucket | -race regression for metaMu path |
| TestMemberRoutesCappedAtConfiguredCap | MemberRoutes payload bound |
| TestRetiredTailClearedAfterDrop | GC retention fix |
| TestNonPositiveOptionsFallBackToDefaults | Negative options clamped |
| TestStepAccessor | Step() for coordinator wiring |
| TestReRegisterDuringPruneGraceCancelsPrune | Latest commit: cancel-on-re-register |
| BenchmarkObserveHit / BenchmarkObserveMiss | Zero-alloc hot-path claim |
Ready for the coordinator wiring PR.
Round-10 CodeRabbit nits (all non-blocking polish on top of round-9):

- Replace hand-rolled bytesLT/LE/GT/GE with bytes.Compare wrappers so we share the asm-optimized stdlib path and shrink four comparators to one-liners.
- RegisterRoute coarsening comment said "closest-by-start virtual bucket"; the actual rule is "first bucket containing start, else first aggregate in sortedSlots order." Rewrite to match findVirtualBucket's selection.
- TestRemoveVirtualMemberPrunesMemberRoutes used cols[len(cols)-1].Rows[0]; switch to findAggregateRow so a future setup change can't silently shift the assertion onto the wrong row.
- Rename setupVirtualBucketWithThreeMembers (which actually creates one individual + a 2-member bucket) to setupOneIndividualPlusVirtualBucket so callers don't get misled.
@claude review

Claude finished @bootjp's task in 3m 42s — View job

Code Review
Verdict: LGTM — production-ready. One minor naming issue introduced in the latest commit.

What the latest commit resolves
One actionable item (introduced in this commit)
The local variable `bytes` in Observe now shadows the newly imported `bytes` package. Simple rename:

```diff
-	bytes := uint64(0)
-	if keyLen > 0 {
-		bytes += uint64(keyLen)
-	}
-	if valueLen > 0 {
-		bytes += uint64(valueLen)
-	}
-	switch op {
-	case OpRead:
-		slot.reads.Add(1)
-		if bytes > 0 {
-			slot.readBytes.Add(bytes)
-		}
-	case OpWrite:
-		slot.writes.Add(1)
-		if bytes > 0 {
-			slot.writeBytes.Add(bytes)
-		}
+	byteCount := uint64(0)
+	if keyLen > 0 {
+		byteCount += uint64(keyLen)
+	}
+	if valueLen > 0 {
+		byteCount += uint64(valueLen)
+	}
+	switch op {
+	case OpRead:
+		slot.reads.Add(1)
+		if byteCount > 0 {
+			slot.readBytes.Add(byteCount)
+		}
+	case OpWrite:
+		slot.writes.Add(1)
+		if byteCount > 0 {
+			slot.writeBytes.Add(byteCount)
+		}
```

Correctness spot-checks on current HEAD
Non-blocking notes for future PRs

1. Interface nil vs typed nil:

```go
sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)
```

Typed nil, not an interface nil, so the `s == nil` receiver guard still fires.

2. Final Flush on graceful shutdown — RunFlusher doesn't execute a final tick, so the wiring should call `s.Flush()` once after it returns.

3.

Ready to merge once the `bytes` → `byteCount` rename lands.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41ab078d14
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
s.routesMu.Lock()
defer s.routesMu.Unlock()

s.cancelPendingPrune(routeID)
```
Keep deferred prune when route rejoins as individual
RegisterRoute unconditionally calls cancelPendingPrune(routeID) before it knows whether the route will rejoin a virtual bucket or become an individual slot. When a removed virtual member is re-registered after capacity frees up (len(next.slots) < MaxTrackedRoutes), that cancellation suppresses the only cleanup path for the old bucket membership, so MemberRoutes can retain a route that no longer contributes to the bucket, causing persistent metadata misattribution and duplicate route IDs across rows during churn.
```go
bucket := findVirtualBucket(next.sortedSlots, start)
if bucket == nil {
	bucket = &routeSlot{
		RouteID: routeID,
```
Decouple virtual bucket RouteID from real route IDs
A new aggregate bucket stores RouteID as the first folded real route ID, but that field is never reassigned as members are removed or changed. If that original route ID is later registered as an individual route, flush output can contain both an aggregate row and an individual row with the same RouteID, which makes row identity ambiguous and breaks per-route interpretation of the matrix under route churn.
Codex round-11 P2 #1: cancelPendingPrune ran unconditionally at the top of RegisterRoute. When a removed virtual member rejoined as an individual slot (capacity freed up), the cancel suppressed the only cleanup path for the old bucket's MemberRoutes — leaving the routeID listed on the bucket forever despite no longer contributing traffic. Replace with cancelPendingPruneFor(bucket, routeID), called only after we know the route is rejoining the same bucket; prunes against other buckets (or where the route lands as an individual slot) fire normally and clean up the old bucket's metadata. Codex round-11 P2 #2: a freshly created virtual bucket stamped its RouteID with the first folded real route ID. If that real route was later removed and re-registered as an individual slot, flushed columns could show two rows with the same RouteID — one aggregate, one individual. Add an atomic.Uint64 virtualIDCounter on MemSampler and hand out synthetic IDs from the high end of uint64 (MaxUint64, MaxUint64-1, …). The synthetic space cannot collide with real route IDs (assigned from the low end by the coordinator), so row identity stays unambiguous under churn. Claude bot nit: rename the local `bytes` variable in Observe to `byteCount` so it no longer shadows the new `bytes` stdlib import. Tests: TestVirtualBucketRouteIDIsSynthetic + TestRejoinAsIndividualLetsBucketPruneFire.
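The synthetic-ID trick in this commit is compact enough to verify in isolation. The sketch assumes only what the commit states: real route IDs are assigned from the low end of uint64, so IDs handed out from the high end cannot collide:

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// virtualIDCounter starts at 0; adding ^uint64(0) (i.e. -1 mod 2^64)
// walks downward from MaxUint64, keeping virtual-bucket IDs disjoint
// from coordinator-assigned real IDs at the low end.
var virtualIDCounter atomic.Uint64

func nextVirtualBucketID() uint64 {
	return virtualIDCounter.Add(^uint64(0)) // 0 + (2^64 - 1) wraps to MaxUint64
}

func main() {
	fmt.Println(nextVirtualBucketID() == math.MaxUint64)   // true
	fmt.Println(nextVirtualBucketID() == math.MaxUint64-1) // true
}
```

Each call is a single atomic add, so generating a bucket ID stays as cheap as the counters it sits next to.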
@claude review

Claude finished @bootjp's task in 4m 32s — View job

Code Review
| Area | Verdict |
|---|---|
| Hot path: single `atomic.Pointer.Load` + map lookup + at most 2 `atomic.Add` | Lock-free ✓ |
| `cancelPendingPruneFor` called only from virtual/coarsening path; individual path leaves pending prunes intact | ✓ |
| New bucket gets synthetic ID from high end of uint64; real IDs assigned from low end by coordinator | No-collision ✓ |
| `cancelPendingPruneFor(newBucket, routeID)` when brand-new bucket is created: no prune for new bucket exists → no-op; old-bucket prune still queued | ✓ |
| `cancelPendingPruneFor` acquires `retiredMu` while caller holds `routesMu`; `Flush` acquires `retiredMu` independently — same lock ordering as before | No deadlock ✓ |
| `foldIntoBucket` mutates shared slot under `metaMu.Lock`; `snapshotMeta` reads under `metaMu.RLock` | Race-free ✓ |
| `drainRetiredSlots` runs before `advancePendingPrunes` in same `retiredMu` lock; orphaned bucket drained with full `MemberRoutes` before prune fires | Attribution correct ✓ |
| `clearTail` / `clearPruneTail` zero dropped entries; `cancelPendingPruneFor` also calls `clearPruneTail` | GC-safe ✓ |
| `sort.SliceStable` after live + retired drain | Start-monotone ✓ |
| `ringBuffer.Range` deep-copies via `cloneColumn`; `snapshotMeta` clones Start/End/MemberRoutes | ✓ |
| All non-positive option fields fall back to defaults | ✓ |
| `BenchmarkObserveHit` / `BenchmarkObserveMiss` pin zero-alloc hot-path claim | ✓ |
Minor findings (non-blocking)
1. nextVirtualBucketID comment could be clearer [sampler.go:616]
```go
return s.virtualIDCounter.Add(^uint64(0)) // subtract 1; counter starts at MaxUint64+1
```

"Counter starts at MaxUint64+1" is technically correct (MaxUint64+1 ≡ 0 mod 2⁶⁴, which is atomic.Uint64's initial value), but readers who don't immediately see the uint64 wrap may be confused. A one-liner clarification:

```go
// atomic.Uint64 starts at 0; first call returns MaxUint64, second MaxUint64-1, etc.
```
The test adds no traffic to bucket B before removing route 3. Because appendDrainedRow skips idle (all-zero) slots, bucket B never emits a row in any snapshot column. The assertion if r.Aggregate && memberRoutesContain(r.MemberRoutes, 3) can never find a row to assert on — it passes trivially with OR without the cancelPendingPruneFor scoping fix.
To make this test actually detect the regression, add an Observe call on route 3 BEFORE removing it, so B has non-zero counters and produces a row during the grace drain cycle:
```go
// After RegisterRoute(3) folds into B:
s.Observe(3, OpRead, 0, 0) // now B has pending counts
// Then remove, advance, flush as-is...
```

Without this, the test still serves as a crash/race regression guard, but doesn't verify the prune-fires-for-individual-rejoin logic.
3. Interface nil vs typed nil — coordinator wiring note [sampler.go:317]
Pre-existing note: if s == nil in Observe/Flush only fires for a typed nil ((*MemSampler)(nil)), not an uninitialized Sampler interface. The coordinator wiring PR should initialise the disabled default as:
```go
sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)
```

No change needed here.
4. Final flush on graceful shutdown [flusher.go:9]
Pre-existing note: RunFlusher doesn't execute the final tick. The coordinator / main.go wiring PR should call s.Flush() once after RunFlusher returns to harvest the in-progress step.
Test coverage (complete list at HEAD)
| Test | Pins |
|---|---|
| TestRetiredSlotGracePeriod | Wall-clock grace, late-writer simulation |
| TestRemoveVirtualMemberPruneDeferred | Prune deferred inside grace window |
| TestRemoveLastVirtualMemberHarvestsBucket | Orphaned bucket drainage |
| TestFlushSortsMixedLiveAndRetiredRows | Row ordering with retired slots |
| TestRegisterFoldLowerStartReorders | sortedSlots after Start lowered |
| TestSnapshotReturnsDeepCopy | Deep-copy isolation |
| TestRegisterDoesNotRaceFlushOnVirtualBucket | -race regression for metaMu path |
| TestMemberRoutesCappedAtConfiguredCap | MemberRoutes payload bound |
| TestRetiredTailClearedAfterDrop | GC retention fix |
| TestNonPositiveOptionsFallBackToDefaults | Negative options clamped |
| TestStepAccessor | Step() for coordinator wiring |
| TestReRegisterDuringPruneGraceCancelsPrune | Same-bucket cancel on re-register |
| TestVirtualBucketRouteIDIsSynthetic | Synthetic ID, no real-ID collision |
| TestRejoinAsIndividualLetsBucketPruneFire | Individual rejoin doesn't suppress old-bucket prune |
| BenchmarkObserveHit / BenchmarkObserveMiss | Zero-alloc hot-path claim |
Ready to merge. The two items worth fixing before the coordinator wiring PR are finding #2 (test robustness) and the note on graceWindow-aligned typed-nil initialization (#3).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25061d6876
```go
	RouteID: routeID,
	Start:   cloneBytes(start),
	End:     cloneBytes(end),
}
next.slots[routeID] = slot
```
Reuse retired slot when re-registering same route ID
When a route is removed, RemoveRoute keeps its old slot in retiredSlots for grace-draining, but this RegisterRoute branch always allocates a brand-new slot for the same routeID. If that ID is re-registered before grace expires (e.g., Observe(1) -> RemoveRoute(1) -> RegisterRoute(1) -> Observe(1) -> Flush()), Flush drains both the new live slot and the retired old slot, producing two rows with the same RouteID in one column and splitting counts across them. This breaks row identity for consumers that expect one row per route and makes per-route totals ambiguous during churn/resync flows.
Codex round-12 P2: Re-registering a removed RouteID inside the grace window allocated a brand-new *routeSlot. Flush then drained both the new live slot AND the old retired slot in the same column, producing two rows with the same RouteID and splitting counts between them. Add reclaimRetiredSlot — when RegisterRoute lands a real (non-aggregate) routeID and a retired slot with that ID is queued, pull it off retiredSlots and reuse it. The old counters and any in-flight late-Observe writes converge with new traffic on the same slot, and Flush emits a single row. TestReRegisterIndividualReusesRetiredSlot asserts one row per RouteID per column with correctly accumulated counts. Claude bot nits: - nextVirtualBucketID comment now spells out the atomic.Uint64 wrap-around explicitly instead of the cryptic "MaxUint64+1" form. - TestRejoinAsIndividualLetsBucketPruneFire was vacuously true — the bucket had no traffic so it never emitted a row, making the prune assertion unreachable. Restructure with two virtual members (so removing one leaves the bucket alive, not orphaned), drive traffic through the surviving member, and flush twice past grace so the post-prune MemberRoutes lands in a snapshot row.
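The reclaim behavior this commit describes reduces to one invariant: one slot, and therefore one flushed row, per RouteID, even across remove/re-register inside the grace window. A toy model, with all names (`samp`, `rslot`, `reclaim` logic) illustrative rather than the real keyviz code:

```go
package main

import "fmt"

type rslot struct {
	routeID uint64
	reads   uint64
}

type samp struct {
	live    map[uint64]*rslot
	retired []*rslot
}

// remove parks the slot in the retired list for grace-draining.
func (s *samp) remove(id uint64) {
	if sl, ok := s.live[id]; ok {
		delete(s.live, id)
		s.retired = append(s.retired, sl)
	}
}

// register reclaims a retired slot with the same ID instead of
// allocating a fresh one, so old and new counts merge.
func (s *samp) register(id uint64) {
	for i, sl := range s.retired {
		if sl.routeID == id {
			s.retired = append(s.retired[:i], s.retired[i+1:]...)
			s.live[id] = sl
			return
		}
	}
	s.live[id] = &rslot{routeID: id}
}

func (s *samp) observe(id uint64) {
	if sl, ok := s.live[id]; ok {
		sl.reads++
	}
}

// flush emits one (routeID, reads) row per slot, live and retired alike;
// without reclaim, a re-registered ID would appear in two rows here.
func (s *samp) flush() [][2]uint64 {
	var rows [][2]uint64
	for id, sl := range s.live {
		rows = append(rows, [2]uint64{id, sl.reads})
	}
	for _, sl := range s.retired {
		rows = append(rows, [2]uint64{sl.routeID, sl.reads})
	}
	return rows
}

func main() {
	s := &samp{live: map[uint64]*rslot{}}
	s.register(1)
	s.observe(1)
	s.remove(1)
	s.register(1) // inside grace: reclaims the retired slot
	s.observe(1)
	rows := s.flush()
	fmt.Println(len(rows), rows[0][1]) // 1 2: one row, both counts merged
}
```

The real fix additionally updates Start/End metadata under the metadata lock before republishing; this sketch only shows the row-identity property.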
@claude review

Claude finished @bootjp's task in 4m 25s — View job

Code Review

Verdict: LGTM — ready to merge. Full read of all five files at HEAD.

What the latest commit resolves

Codex round-12 P2: duplicate rows on same-RouteID re-registration within the grace window.

Correctness analysis of reclaimRetiredSlot:
| Invariant | Status |
|---|---|
| Lock ordering preserved: `routesMu` → `retiredMu`; `Flush` holds `retiredMu` alone | ✓ No deadlock |
| Slot removed from `retiredSlots` before Start/End update under `metaMu.Lock()` — no concurrent Flush iteration reads the reclaimed slot during the update window | ✓ Race-free |
| Reclaimed slot not yet published in new table during metadata update — no concurrent `snapshotMeta` call possible | ✓ Safe |
| `appendSorted` reads `slot.Start` after `metaMu.Unlock()` — correct updated value used for insertion position | ✓ |
| `!r.slot.Aggregate` guard prevents matching a synthetic virtual-bucket ID against a coincidental real-route ID collision | ✓ |
| `copyRouteTable(cur)` starts from current `next.sortedSlots` which excludes the reclaimed slot (removed by prior RemoveRoute) — `appendSorted` inserts at the correct new position | ✓ |
| `MemberRoutes` on reclaimed slot is nil (individual slots never have MemberRoutes) — no stale membership after reuse | ✓ |
| `clearTail` in `reclaimRetiredSlot` zeroes the dropped backing-array entry — no `*routeSlot` pointer lingers via reused capacity | ✓ GC-safe |
TestReRegisterIndividualReusesRetiredSlot validity:
The test calls Observe(1, OpRead, …) before RemoveRoute(1) and again after RegisterRoute(1), then flushes once. It asserts exactly one row for RouteID=1 in the column with Reads=2. Without the fix, the test would see two rows (one live with 1 read, one retired drain with 1 read) and fail the count != 1 check. The test is correctly structured and exercises the exact defect path. ✓
TestRejoinAsIndividualLetsBucketPruneFire fix:
The prior version was vacuous — the bucket had no traffic before RemoveRoute(3), so appendDrainedRow never emitted it and the assertion was unreachable. The restructured test:
- Folds routes 2 and 3 into a bucket (so route 2 keeps the bucket alive after route 3 is removed)
- Calls `Observe(2)` and `Observe(3)` before removals, giving the bucket non-zero counters
- Frees capacity by removing route 1, removes route 3 (queues prune), re-registers route 3 as individual
- Advances past grace, drives traffic through route 2 between two flushes
- On the second flush the bucket emits a row with post-prune `MemberRoutes`
The assertion !memberRoutesContain(agg.MemberRoutes, 3) is now reachable and meaningful. ✓
Full correctness spot-checks (HEAD)
| Area | Verdict |
|---|---|
| Hot path: `atomic.Pointer.Load` + map lookup + ≤2 `atomic.Add` | Lock-free ✓ |
| COW publish under `routesMu`; readers see either old or new table atomically | ✓ |
| `foldIntoBucket` mutates shared slot under `metaMu.Lock()`; `snapshotMeta` reads under `metaMu.RLock()` | Race-free ✓ |
| `rebuildSorted` reads `slot.Start` without `metaMu` — only under `routesMu`; concurrent `snapshotMeta` reads the old published table | Safe ✓ |
| `drainRetiredSlots` in-place filter — standard `keep := retired[:0]`; range captures original len | Correct ✓ |
| `reclaimRetiredSlot` removes entry before `clearTail` — no off-by-one, no retained pointer | ✓ |
| `cancelPendingPruneFor` called only in virtual/coarsening path; individual path (`reclaimRetiredSlot`) does not cancel unrelated prunes | ✓ |
| Virtual bucket IDs from high end of uint64; real IDs from low end — no collision space | ✓ |
| `drainRetiredSlots` before `advancePendingPrunes` in same `retiredMu` lock — orphaned bucket drained with full `MemberRoutes` before prune fires | ✓ |
| `sort.SliceStable` after live + retired drain — monotone Start across all row sources | ✓ |
| `ringBuffer.Range` deep-copies via `cloneColumn`; `snapshotMeta` clones Start/End/MemberRoutes | ✓ |
| All non-positive options fall back to defaults (Step, HistoryColumns, MaxTrackedRoutes, MaxMemberRoutesPerSlot) | ✓ |
| `appendDrainedRow` keyLen/valueLen guards prevent negative-to-uint64 overflow | ✓ |
| `BenchmarkObserveHit` / `BenchmarkObserveMiss` pin zero-alloc hot-path claim | ✓ |
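The two ringBuffer rows above (ordered snapshot, and the wrap case where `pos == 0`) are easy to check against a minimal ring. This sketch is illustrative, not the real keyviz `ringBuffer`:

```go
package main

import "fmt"

// ring is a fixed-capacity buffer that overwrites its oldest entry on
// overflow; pos always points at the next write (= oldest when full).
type ring struct {
	buf  []int
	pos  int
	full bool
}

func newRing(n int) *ring { return &ring{buf: make([]int, n)} }

func (r *ring) push(v int) {
	r.buf[r.pos] = v
	r.pos++
	if r.pos == len(r.buf) {
		r.pos = 0
		r.full = true
	}
}

// snapshotOrdered returns elements oldest-first: buf[pos:] then buf[:pos].
// When pos == 0 just after a wrap, buf[0:] already IS the chronological
// order and buf[:0] contributes nothing — every element appears once.
func (r *ring) snapshotOrdered() []int {
	if !r.full {
		return append([]int(nil), r.buf[:r.pos]...)
	}
	out := append([]int(nil), r.buf[r.pos:]...)
	return append(out, r.buf[:r.pos]...)
}

func main() {
	r := newRing(3)
	for v := 1; v <= 5; v++ { // 1 and 2 are evicted on overflow
		r.push(v)
	}
	fmt.Println(r.snapshotOrdered()) // [3 4 5]
}
```

Returning a fresh slice from `snapshotOrdered` mirrors the deep-copy discipline noted in the table: callers can mutate the result without corrupting the ring.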
Non-blocking notes for future PRs (no change needed here)
1. Interface nil vs typed nil [sampler.go:317]
if s == nil { return } in Observe/Flush/etc. fires only for a typed nil receiver. If the coordinator declares var s keyviz.Sampler without assigning, s.Observe(…) panics at the interface dispatch. The coordinator wiring PR should initialize the disabled default as:
```go
sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)
```

2. Final Flush on graceful shutdown [flusher.go]
RunFlusher doesn't execute the final tick. The main.go / coordinator wiring PR should call s.Flush() once after RunFlusher returns to harvest the in-progress step before process exit.
Test coverage at HEAD (full list)
| Test | Pins |
|---|---|
| TestRetiredSlotGracePeriod | Wall-clock grace, late-writer simulation |
| TestRemoveVirtualMemberPruneDeferred | Prune deferred inside grace window |
| TestRemoveLastVirtualMemberHarvestsBucket | Orphaned bucket drainage |
| TestFlushSortsMixedLiveAndRetiredRows | Row ordering with retired slots |
| TestRegisterFoldLowerStartReorders | sortedSlots after Start lowered |
| TestSnapshotReturnsDeepCopy | Deep-copy isolation |
| TestRegisterDoesNotRaceFlushOnVirtualBucket | -race regression for metaMu path |
| TestMemberRoutesCappedAtConfiguredCap | MemberRoutes payload bound |
| TestRetiredTailClearedAfterDrop | GC retention fix |
| TestNonPositiveOptionsFallBackToDefaults | Negative options clamped |
| TestStepAccessor | Step() for coordinator wiring |
| TestReRegisterDuringPruneGraceCancelsPrune | Same-bucket cancel on re-register |
| TestVirtualBucketRouteIDIsSynthetic | Synthetic ID, no real-ID collision |
| TestRejoinAsIndividualLetsBucketPruneFire | Individual rejoin doesn't suppress old-bucket prune (now non-vacuous) |
| TestReRegisterIndividualReusesRetiredSlot | New — single row per RouteID on re-registration within grace |
| BenchmarkObserveHit / BenchmarkObserveMiss | Zero-alloc hot-path claim |
Ready for the coordinator wiring PR.
…645)

## Summary

- Adds a `WithSampler(s keyviz.Sampler) *ShardedCoordinator` option (mirroring `WithLeaseReadObserver`) plus a single `observeMutation` call inside `groupMutations`.
- Each resolved `(RouteID, mutation)` pair produces one `sampler.Observe(routeID, OpWrite, len(Key), len(Value))` call before the mutation is grouped by `GroupID`. Reads do not reach this path; transactional dispatch reuses `groupMutations`, so it gets wired automatically.
- `DelPrefix` is intentionally not observed — `dispatchDelPrefixBroadcast` broadcasts to every shard rather than resolving a single `RouteID`, so per-route attribution is out of scope for this slice.
- Nil-safety: an interface-nil `c.sampler` is guarded at the call site; the `keyviz.MemSampler` contract also tolerates a typed-nil receiver, so a disabled sampler costs one branch off the hot path.

Implements task §5.1 from `docs/admin_ui_key_visualizer_design.md` (split out of the keyviz design originally landed in #639).

## Test plan

- [x] `TestShardedCoordinatorObservesEveryDispatchedMutation` — cross-shard Put batch; verifies one `Observe` per element with engine-resolved `RouteID`, `OpWrite`, and exact keyLen/valueLen.
- [x] `TestShardedCoordinatorWithoutSamplerStaysSafe` — dispatches succeed with no `WithSampler` call (interface-nil) and with a typed-nil `*MemSampler`.
- [x] `go test -race -count=1 ./kv/... ./keyviz/...` clean.
- [x] `golangci-lint run ./kv/...` clean.

## Summary by CodeRabbit

* **New Features**
  * Optional sampling for mutation dispatches: when enabled, each routed mutation records routing, operation kind, and key/value size metrics; no effect when disabled.
* **Tests**
  * Added tests covering sampler integration, cross-shard dispatch observations, and safe behavior when no sampler or a typed-nil sampler is configured.
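The call-site shape described above (an option that installs the sampler, plus a single guarded observe per resolved mutation) can be sketched roughly as follows. All types here are simplified stand-ins, not the actual `kv`/`keyviz` definitions:

```go
package main

import "fmt"

// Reduced stand-ins for the real keyviz / kv types.
type Op int

const OpWrite Op = 1

type Sampler interface {
	Observe(routeID uint64, op Op, keyLen, valueLen int)
}

type mutation struct{ Key, Value []byte }

type ShardedCoordinator struct{ sampler Sampler }

// WithSampler mirrors the option described above: it installs the
// sampler and returns the coordinator for chaining.
func (c *ShardedCoordinator) WithSampler(s Sampler) *ShardedCoordinator {
	c.sampler = s
	return c
}

// observeMutation guards against an interface-nil sampler at the call
// site, so a disabled sampler costs exactly one branch.
func (c *ShardedCoordinator) observeMutation(routeID uint64, m mutation) {
	if c.sampler == nil {
		return
	}
	c.sampler.Observe(routeID, OpWrite, len(m.Key), len(m.Value))
}

type printSampler struct{}

func (printSampler) Observe(routeID uint64, op Op, keyLen, valueLen int) {
	fmt.Println(routeID, keyLen, valueLen)
}

func main() {
	c := (&ShardedCoordinator{}).WithSampler(printSampler{})
	c.observeMutation(7, mutation{Key: []byte("k"), Value: []byte("val")})
	// With no sampler configured, the same call is a no-op:
	(&ShardedCoordinator{}).observeMutation(7, mutation{Key: []byte("k")})
}
```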
## Summary

Phase 2-B of the Key Visualizer design (`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into the existing `web/admin/` SPA rather than building a parallel dashboard. Doc + implementation in one PR (doc commit first, per CLAUDE.md design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling `/admin/api/v1/keyviz/matrix` with series picker (writes / reads / write_bytes / read_bytes), row-budget input (clamped at 1024), and off / 5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue — spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: `bucket_id`, `start`, `end`, `aggregate`, `route_count`, `route_ids` (with truncation indicator). `start`/`end` decode through a printable-or-hex preview so binary keys do not render as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in `lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208 kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 / #646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed status, lands with this PR. Parent design §12 phase table is split into 2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out, open).

## Five-lens self-review

1. **Data loss** — n/a; the SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; a single browser tab polling a single handler. Sampler concurrency was already covered by Phase 2-A tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under the §10 budget at 1024 × 500. Auto-refresh defaults off; the 5-second cadence is the lower bound (sampler flush is 1 s).
4. **Data consistency** — the SPA renders whatever the handler returns; consistency guarantees come from the existing leader-issued counters in the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go build ./internal/admin/...` clean (embed glob unaffected); `go test ./internal/admin/...` clean. Manual verification documented in the design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to `internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — the handler is currently node-local. Phase 2-C will add a cross-node admin RPC; this PR will pick up the aggregate view automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.
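The hand-rolled colour ramp amounts to piecewise-linear interpolation across five stops. The real implementation is TypeScript in `lib/colorRamp.ts`; the Go sketch below only illustrates the technique, and the particular stop colours are invented for the example:

```go
package main

import "fmt"

type rgb struct{ R, G, B uint8 }

// Five illustrative stops from cold (dark blue) to hot (red).
// These are NOT the colours the SPA actually uses.
var stops = []rgb{
	{0, 0, 96}, {0, 128, 255}, {0, 255, 128}, {255, 224, 0}, {255, 32, 0},
}

// ramp maps t in [0, 1] onto the stop list, linearly interpolating
// between the two surrounding stops.
func ramp(t float64) rgb {
	if t <= 0 {
		return stops[0]
	}
	if t >= 1 {
		return stops[len(stops)-1]
	}
	scaled := t * float64(len(stops)-1)
	i := int(scaled)          // index of the lower stop
	f := scaled - float64(i)  // fractional position between the stops
	a, b := stops[i], stops[i+1]
	lerp := func(x, y uint8) uint8 {
		return uint8(float64(x) + f*(float64(y)-float64(x)))
	}
	return rgb{lerp(a.R, b.R), lerp(a.G, b.G), lerp(a.B, b.B)}
}

func main() {
	fmt.Println(ramp(0), ramp(0.5), ramp(1))
}
```

Keeping value 0 as the page background (rather than `ramp(0)`) is what makes truly cold routes visually distinct from merely cool ones.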
## Summary

First slice of Phase 2 of the Key Visualizer per `docs/admin_ui_key_visualizer_design.md` §5. This PR only adds the `keyviz/` package; wiring into `kv.ShardedCoordinator`, `adapter.AdminServer.GetKeyVizMatrix`, and `main.go` will follow in subsequent PRs to keep diffs reviewable.
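The core of the package is the MemSampler hot path — one atomic route-table load, one map lookup, two atomic adds, with the route table published copy-on-write off the hot path. A reduced sketch of that shape (types simplified; the real MemSampler also distinguishes reads from writes, drains via `atomic.SwapUint64` on flush, and folds excess routes into an aggregate slot):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// slot holds the per-route counters bumped on the hot path.
type slot struct {
	count atomic.Uint64
	bytes atomic.Uint64
}

// routeTable is published copy-on-write: readers Load it atomically;
// writers build a new map under a mutex (not shown) and Store it.
type routeTable struct{ slots map[uint64]*slot }

type MemSampler struct {
	routes atomic.Pointer[routeTable]
}

// Observe is the hot path: one atomic pointer load, one map lookup,
// and two atomic adds — no allocation, no mutex.
func (s *MemSampler) Observe(routeID uint64, keyLen, valueLen int) {
	rt := s.routes.Load()
	if rt == nil {
		return
	}
	sl, ok := rt.slots[routeID]
	if !ok {
		return // unknown route in this sketch (real code may fold into an aggregate slot)
	}
	sl.count.Add(1)
	sl.bytes.Add(uint64(keyLen + valueLen))
}

func main() {
	s := &MemSampler{}
	s.routes.Store(&routeTable{slots: map[uint64]*slot{7: {}}})
	s.Observe(7, 3, 5)
	s.Observe(7, 1, 1)
	sl := s.routes.Load().slots[7]
	fmt.Println(sl.count.Load(), sl.bytes.Load()) // 2 10
}
```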