
feat(keyviz): in-memory Sampler + ring buffer + RunFlusher#639

Merged
bootjp merged 17 commits into main from feat/keyviz-sampler
Apr 25, 2026
Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 25, 2026

Summary

First slice of Phase 2 of the Key Visualizer per docs/admin_ui_key_visualizer_design.md §5. This PR only adds the `keyviz/` package; wiring into `kv.ShardedCoordinator`, `adapter.AdminServer.GetKeyVizMatrix`, and `main.go` will follow in subsequent PRs to keep diffs reviewable.

What's in

  • `Sampler` interface — nil-safe (nil receiver makes every method a no-op).
  • `MemSampler` hot path: `atomic.Pointer[routeTable].Load` + map lookup + 2× `atomic.AddUint64` (count + bytes). Matches design §10's per-call budget.
  • `Flush` drains via `atomic.SwapUint64` — no pointer retirement, no late-writer race past the snapshot.
  • COW route table publish (`RegisterRoute` / `RemoveRoute`) under `routesMu` off the hot path; one `atomic.Pointer.Store` on publish.
  • Route-budget cap (`MaxTrackedRoutes`, default 10000): excess routes fold into a virtual aggregate slot (`MatrixRow.Aggregate=true` + `MemberRoutes`), per design §5.3.
  • `ringBuffer` with chronological [from, to) Range queries; oldest entries drop on overflow.
  • `RunFlusher` convenience for the step-interval ticker.

What's out (later PRs)

  • Coordinator call site (`sampler.Observe(routeID, op, …)`).
  • `AdminServer.GetKeyVizMatrix` RPC implementation.
  • `main.go` flag wiring + sampler construction.
  • Adaptive sub-sampling (§5.2) — MVP samples every op.
  • Persistence (§5.6) — Phase 3.

Test plan

  • Unit tests under `-race`: nil-safe contract, basic Observe→Flush, no-counts-lost across flush boundary under writer churn, route-budget coarsening, snapshot range filtering, ring eviction, RemoveRoute, unknown-route Observe, concurrent register+observe.
  • `go vet ./...` and `golangci-lint` clean.

Summary by CodeRabbit

  • New Features
    • In-memory sampling to track reads/writes with size metrics.
    • Periodic flushing of accumulated samples at a configurable interval.
    • Route-based tracking with automatic aggregation when route capacity is exceeded.
    • Time-range snapshot retrieval with retention and eviction of old samples.
    • Graceful flusher lifecycle that starts/stops with the application without losing data.

First slice of the Phase 2 Key Visualizer per
docs/admin_ui_key_visualizer_design.md §5. This commit only adds the
keyviz/ package — wiring into kv.ShardedCoordinator,
adapter.AdminServer.GetKeyVizMatrix, and main.go follow in subsequent
commits to keep diffs reviewable.

Highlights:
- Sampler interface (nil-safe contract: nil receiver makes Observe /
  RegisterRoute / Flush / Snapshot all no-ops, so callers wiring an
  unconfigured sampler compile to zero overhead).
- MemSampler hot path: atomic.Pointer[routeTable].Load + map lookup +
  atomic.AddUint64 ×2 (count + bytes). Matches design §10's per-call
  budget — no allocation, no mutex.
- Flush drains via atomic.SwapUint64; no pointer retirement, so no
  late-writer race past the snapshot. Idle slots skipped from the
  emitted MatrixColumn.
- COW route table publish (RegisterRoute / RemoveRoute) under
  routesMu off the hot path; one atomic.Pointer.Store on publish.
- Route-budget cap (--keyvizMaxTrackedRoutes default 10000): excess
  routes fold into a virtual aggregate slot (MatrixRow.Aggregate=true
  + MemberRoutes list), per design §5.3.
- ringBuffer with chronological [from, to) Range queries; oldest
  entries drop on overflow.
- RunFlusher convenience for the step-interval ticker so call sites
  don't repeat the boilerplate.

Tests under -race cover: nil-safe contract, basic Observe→Flush
roundtrip, no-counts-lost across flush boundary under writer churn,
route budget coarsening, snapshot range filtering, ring eviction,
RemoveRoute, unknown-route Observe, concurrent register+observe.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 41 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 41 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81e2d34c-8c74-4aa1-98e5-35ab84d2e6d5

📥 Commits

Reviewing files that changed from the base of the PR and between b45233d and 927b194.

📒 Files selected for processing (2)
  • keyviz/sampler.go
  • keyviz/sampler_test.go
📝 Walkthrough

Adds an in-memory heatmap sampling system: a concurrency-safe MemSampler that observes and atomically aggregates operations into time-stamped columns, a fixed-capacity ringBuffer for historical snapshots with time-range queries, and a RunFlusher helper that periodically flushes samples until context cancellation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Sampler core — `keyviz/sampler.go` | New MemSampler implementation: atomic per-slot counters, lockless lookups, copy-on-write route table, virtual-bucket folding for over-budget routes, retired-slot grace draining, Flush, Snapshot, and Step APIs. |
| Ring buffer — `keyviz/ring_buffer.go` | New unexported ringBuffer storing MatrixColumn entries in chronological circular storage; supports Push, ordered snapshot, binary-search time-range bounds, and returns deep copies to avoid concurrent mutation. |
| Flusher — `keyviz/flusher.go` | New exported RunFlusher(ctx, s *MemSampler, step time.Duration) that normalizes step, starts a ticker to call MemSampler.Flush() periodically, blocks on ctx.Done(), and handles a nil sampler by waiting for cancellation. |
| Tests & Benchmarks — `keyviz/sampler_test.go`, `keyviz/flusher_test.go` | Extensive unit tests and benchmarks: sampler concurrency, observe/flush semantics, virtual-bucket behavior, snapshot deep-copy invariants, ring buffer eviction, retired-slot harvesting, flusher lifecycle, and performance microbenchmarks. |

Sequence Diagrams

```mermaid
sequenceDiagram
    actor Client
    participant RunFlusher
    participant Ticker
    participant MemSampler
    participant ringBuffer

    Client->>RunFlusher: RunFlusher(ctx, sampler, step)
    RunFlusher->>Ticker: start ticker(step)
    loop while not ctx.Done
        Ticker-->>RunFlusher: tick
        RunFlusher->>MemSampler: Flush()
        MemSampler->>MemSampler: swap and collect rows
        MemSampler->>ringBuffer: Push(MatrixColumn)
    end
    Client->>RunFlusher: cancel ctx
    RunFlusher->>Ticker: stop
    RunFlusher-->>Client: return
```
```mermaid
sequenceDiagram
    actor App
    participant MemSampler
    participant RouteTable
    participant AtomicCounters
    participant ringBuffer

    App->>MemSampler: RegisterRoute(id, start, end)
    MemSampler->>RouteTable: copy-on-write update
    loop Concurrent Observes
        App->>MemSampler: Observe(routeID, op, keyLen, valueLen)
        MemSampler->>RouteTable: lockless lookup
        MemSampler->>AtomicCounters: atomic increments
    end
    App->>MemSampler: RemoveRoute(routeID)
    MemSampler->>MemSampler: mark retired slot (grace window)
    App->>MemSampler: Snapshot(from, to)
    MemSampler->>ringBuffer: Range(from, to) -> deep-copied MatrixColumns
    ringBuffer-->>MemSampler: returned snapshot
    MemSampler-->>App: MatrixColumn[]
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I nibble counters in rows of time,
Rings hold histories, tidy and prime,
Ticker hops, flushes fall in line,
Routes fold gently when limits align,
I twitch my nose — metrics look fine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 49.12%; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically summarizes the main changes: introducing three core components (MemSampler, ring buffer, RunFlusher) for the keyviz package. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |




@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 252487ac42

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread keyviz/sampler.go (Outdated)
Comment on lines +284 to +287

```go
bucket.MemberRoutes = append(bucket.MemberRoutes, routeID)
// Extend bucket end if the new route reaches further right.
if len(end) == 0 || (len(bucket.End) != 0 && bytesGT(end, bucket.End)) {
	bucket.End = cloneBytes(end)
```

P1 — Avoid mutating shared virtual bucket before publishing COW table

In RegisterRoute, this branch mutates bucket in place (MemberRoutes, Start, End) even though copyRouteTable only shallow-copies slot pointers, so the same routeSlot is still reachable from the currently published table. When RunFlusher/Flush runs concurrently and MaxTrackedRoutes is exceeded, readers can observe partially updated metadata and race on non-atomic fields, violating the immutable-snapshot contract for Observe/Flush.

Comment thread keyviz/sampler.go
Comment on lines +315 to +317

```go
delete(next.slots, routeID)
delete(next.virtualForRoute, routeID)
next.sortedSlots = rebuildSorted(next)
```

P2 — Preserve pending counters when removing a route

RemoveRoute deletes the slot from the active table immediately, but Flush only drains counters present in the currently loaded table. If a route is removed before the next flush tick, any counts accumulated since the previous flush become unreachable and are dropped from history, which undercounts traffic during route churn.

Comment thread keyviz/sampler.go

```go
}
next := copyRouteTable(cur)
delete(next.slots, routeID)
delete(next.virtualForRoute, routeID)
```

P2 — Update aggregate member metadata when deleting virtual routes

When removing a coarsened route, only virtualForRoute[routeID] is deleted; the shared aggregate slot's MemberRoutes list is never pruned. As a result, later snapshots can still report deleted route IDs inside aggregate rows, producing stale drill-down metadata and incorrect route membership for virtual buckets.

Three Codex findings on commit 252487a:

P1 — RegisterRoute mutated a shared virtual bucket (MemberRoutes,
Start, End) before publishing the COW table, but copyRouteTable only
shallow-copies slot pointers, so the same routeSlot is reachable from
the still-published table. A concurrent Flush reading MemberRoutes
would observe a half-extended slice. Add metaMu (sync.RWMutex) to
routeSlot; RegisterRoute holds it Lock() while mutating; Flush emits
rows via slot.snapshotMeta() under RLock(). Observe never reads
metadata so the hot path stays lockless.

P2 — RemoveRoute deleted the slot from the active table immediately,
so any counts accumulated since the last Flush became unreachable
(the slot is GC'd before the next Flush sees it). Queue removed
individual slots into MemSampler.retiredSlots; Flush drains the
retired queue alongside the live table and clears it. Counts are
preserved across route churn.

P2 — RemoveRoute on a virtual-bucket member only deleted the
virtualForRoute mapping; the shared bucket's MemberRoutes list still
listed the removed RouteID, leaking stale drill-down metadata. Now
prune the bucket's MemberRoutes list in place under metaMu while
holding routesMu (no concurrent registers).

Tests added (all under -race):
- TestRemoveRouteHarvestsPendingCounts: 7 reads + Remove + Flush
  emits a row with Reads=7.
- TestRemoveVirtualMemberPrunesMemberRoutes: removed virtual member
  is not advertised in subsequent Snapshot rows.
- TestRegisterDoesNotRaceFlushOnVirtualBucket: -race detector
  catches the P1 regression if metaMu is dropped.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review

@Claude

Claude AI commented Apr 25, 2026

@claude[agent] review

LGTM - Excellent implementation! 🚀

This is high-quality systems code that correctly implements Phase 2 of the Key Visualizer design. All tests pass with -race detector.

Strengths:

  • Hot-path performance is optimal: single atomic Load + map lookup + atomic counters (zero allocations, no mutex)
  • Concurrency safety: lock-free COW pattern correctly implemented
  • HEAD commit (e492ad5) fixes 3 critical races with surgical precision:
    • P1: Virtual bucket metadata now protected by metaMu RWLock
    • P2: Retired slots queued for final harvest (no lost counts)
    • P2: Stale MemberRoutes properly pruned on removal
  • Flush protocol uses textbook-perfect atomic.SwapUint64 drain (no late-writer race)
  • Nil-safe contract throughout (zero overhead when disabled)
  • Test coverage is comprehensive including regression tests for the fixes
  • Documentation clearly references design sections

Minor future enhancements (non-blocking):

  • Add BenchmarkObserve to validate hot-path overhead claims
  • Consider rate-limited debug metric for dropped unknown-route observations

The implementation matches the design specification exactly. Ready to wire into the coordinator.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e492ad5f4c


Comment thread keyviz/ring_buffer.go (Outdated)
Comment on lines +65 to +66

```go
out := make([]MatrixColumn, hi-lo)
copy(out, all[lo:hi])
```

P2 — Deep-copy row slices before returning range snapshots

Range only copies MatrixColumn structs, so each returned column still shares its Rows backing array with the ring buffer. Any caller-side mutation (for example, sorting/rewriting rows while shaping an API response) will mutate stored history and can race with concurrent Flush reads/writes, which corrupts later snapshots despite the comment promising callers can mutate the result freely.

Comment thread keyviz/sampler.go (Outdated)
Comment on lines +169 to +170

```go
start = s.Start
end = s.End
```

P2 — Clone start/end bounds when snapshotting slot metadata

snapshotMeta returns s.Start and s.End by reference, so MatrixRow entries in flushed columns alias live slot metadata. If a consumer mutates returned row bounds, it mutates sampler state for future coarsening/sorting decisions; this is especially problematic because these byte slices are exported in the snapshot API.

Codex round-2 P2: Snapshot consumers were aliasing live sampler state.
snapshotMeta returned Start/End by reference (so flushed rows tracked
later RegisterRoute mutations) and ringBuffer.Range only copied
MatrixColumn structs (so concurrent Flush could rotate the underlying
Rows slice, and caller mutation could corrupt stored history).

Clone Start/End in snapshotMeta and add a cloneColumn helper that
ringBuffer.Range uses to materialise a fully-owned column per call.
Regression test TestSnapshotReturnsDeepCopy mutates a returned row's
bounds and verifies a follow-up snapshot is unaffected.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review

@Claude

Claude AI commented Apr 25, 2026

@claude[agent] review

LGTM - Critical deep-copy fix! 🎯

HEAD commit (347afd1) addresses a P2 memory safety issue where Snapshot was returning aliased references to internal state.

Fixes:

  • snapshotMeta now clones Start/End byte slices (prevents corruption when RegisterRoute extends virtual bucket bounds)
  • ringBuffer.Range now deep-copies via cloneColumn helper (prevents shared row slices between caller and ring buffer)
  • cloneColumn comprehensively clones all mutable fields (Rows slice, Start/End bytes, MemberRoutes)

Impact:

  • Performance: Negligible - Snapshot is an admin API called at human timescales, not on hot path
  • Correctness: Essential - Without this, external callers could corrupt stored history and concurrent Flush could race

Testing:

  • TestSnapshotReturnsDeepCopy regression test aggressively mutates returned data and verifies isolation
  • All tests pass under -race detector ✅

Overall PR verdict (all 3 commits): Production-ready. All memory safety issues addressed through iterative refinement with excellent regression test coverage. Ready to wire into coordinator.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 347afd14e8


Comment thread keyviz/sampler.go
Comment on lines +400 to +403

```go
s.retiredMu.Lock()
retired := s.retiredSlots
s.retiredSlots = nil
s.retiredMu.Unlock()
```

P1 — Keep retired slots alive across late Observe writers

The retired-slot drain path clears retiredSlots before draining (retired := s.retiredSlots; s.retiredSlots = nil), so an Observe call that loaded the old table before RemoveRoute can still execute Add after that one-time drain and write into a slot that is no longer reachable by any future Flush. In that concurrency window, increments are permanently dropped, which undercounts traffic during route churn; this needs an additional grace period/epoch before retiring the slot permanently.

Comment thread keyviz/sampler.go (Outdated)
Comment on lines +328 to +330

```go
if bytesLT(start, bucket.Start) {
	bucket.Start = cloneBytes(start)
}
```

P2 — Reorder sorted slots after virtual bucket start shifts

When a new over-budget route folds into an existing virtual bucket, this block can lower bucket.Start, but the bucket remains at its old position in next.sortedSlots. If the new route start is earlier than the bucket's previous start, later Flush iterations over tbl.sortedSlots are no longer in key order, breaking the sampler's own sorted-by-Start ordering contract and producing misordered matrix rows.

Codex round-3 P1: RemoveRoute previously queued the retired slot for a
single Flush and then dropped it. An Observe goroutine that loaded the
pre-RemoveRoute table snapshot could still complete its atomic.Add into
the slot after that drain, silently losing the increment. Hold each
retired slot for retainedFlushes (=2) cycles instead — the late writer
is harvested by the second drain.

Codex round-3 P2: Folding an over-budget route into an existing virtual
bucket may lower bucket.Start, but the bucket kept its old position in
next.sortedSlots, so Flush emitted matrix rows out of key order. Rebuild
sortedSlots whenever Start is lowered, restoring the API ordering
contract.

Refactor: extracted foldIntoBucket helper so RegisterRoute stays under
the cyclop budget. Regression tests TestRetiredSlotGracePeriod and
TestRegisterFoldLowerStartReorders cover both fixes.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c6ce9560b


Comment thread keyviz/sampler.go (Outdated)
Comment on lines +431 to +433

```go
for _, r := range s.retiredSlots {
	col.Rows = appendDrainedRow(col.Rows, r.slot)
	r.remaining--
```

P2 — Preserve sort order when appending retired rows

Flush builds col.Rows from tbl.sortedSlots (start-key order) and then appends retired-slot rows afterward, so a removed route with a lower Start is emitted out of order. During route churn this violates the sampler's own "sorted by Start" ordering contract and can destabilize row alignment for matrix consumers that rely on deterministic key order between columns. Sort col.Rows (or merge retired rows in order) before pushing to history.

Comment thread keyviz/sampler.go (Outdated)
Comment on lines +400 to +404

```go
filtered := bucket.MemberRoutes[:0]
for _, m := range bucket.MemberRoutes {
	if m != routeID {
		filtered = append(filtered, m)
	}
```

P2 — Defer virtual-member pruning until post-flush drain

When removing a virtual-member route, this code immediately drops the route ID from MemberRoutes, but the shared bucket counters still include that route's pre-removal increments until the next Flush. In that window, the flushed row reports traffic that came from the removed route while advertising only the remaining members (misattributed metadata during churn). Keep the removed ID in membership until the pending bucket counters for that interval are drained.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
keyviz/sampler.go (2)

220-238: opts.Step is normalized but never read by MemSampler.

Step is defaulted on line 220 and stored on s.opts, but nothing on the sampler consumes it — only RunFlusher accepts a step parameter directly from its caller. If the intent is for callers to query the configured step (e.g., to align a ticker), exposing a small accessor would help; otherwise the field is purely informational. Optional cleanup.


419-444: Retired-slot rows break the Start-order invariant during the grace window.

Flush emits live slots in sortedSlots order (which foldIntoBucket and appendSorted go to lengths to maintain — TestRegisterFoldLowerStartReorders pins this contract), then appends rows from retiredSlots in queue order. If a retired slot's Start falls between two live slots' Starts, the resulting MatrixColumn.Rows is no longer monotone by Start for retainedFlushes cycles. The mixed live+retired ordering case is not covered by the existing tests.

Suggest sorting once after all rows are appended, or merging the retired drain into the same Start-ordered iteration:

Proposed fix:

```diff
 	col := MatrixColumn{At: s.now()}
 	tbl := s.table.Load()
 	for _, slot := range tbl.sortedSlots {
 		col.Rows = appendDrainedRow(col.Rows, slot)
 	}

 	s.retiredMu.Lock()
 	keep := s.retiredSlots[:0]
 	for _, r := range s.retiredSlots {
 		col.Rows = appendDrainedRow(col.Rows, r.slot)
 		r.remaining--
 		if r.remaining > 0 {
 			keep = append(keep, r)
 		}
 	}
 	s.retiredSlots = keep
 	s.retiredMu.Unlock()

+	sort.SliceStable(col.Rows, func(i, j int) bool {
+		return bytesLT(col.Rows[i].Start, col.Rows[j].Start)
+	})
+
 	s.historyMu.Lock()
```
A regression test that registers two live routes with Start straddling a retired slot's Start, removes the middle route pre-flush, and asserts Rows is sorted would lock this in.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@keyviz/sampler.go`:
- Around line 16-19: Update the package comment that describes Observe in
sampler.go: it currently claims Observe does four atomic.AddUint64 calls but the
implementation (function Observe) performs one unconditional atomic.AddUint64
for the count and at most one more for bytes (skipped when keyLen == 0 &&
valueLen == 0); change the wording to state "at most two atomic.AddUint64 calls"
(or equivalent) and briefly note the conditional second increment to accurately
reflect the hot-path cost.
- Around line 215-240: The docstring for NewMemSampler is incorrect: the
function never returns nil (it always constructs and returns a *MemSampler and
newRingBuffer clamps small capacities), so update the comment on NewMemSampler
to remove the claim about returning nil for opts.HistoryColumns < 0 and instead
state that zero fields fall back to defaults and that negative HistoryColumns
are treated the same as small values (newRingBuffer will clamp to minimum
capacity); reference the MemSamplerOptions.HistoryColumns behavior and the
newRingBuffer clamping in the comment for clarity.

---

Nitpick comments:
In `@keyviz/sampler.go`:
- Around line 220-238: The opts.Step value is normalized and saved on MemSampler
but never used; either make it accessible or remove it—prefer adding an accessor
so callers (e.g., the code that starts RunFlusher) can read the configured step.
Add a method on MemSampler like Step() int (or appropriate type) that returns
s.opts.Step and update any caller using an external step to call sampler.Step()
instead of passing a separate value; ensure MemSampler.RunFlusher continues to
accept an override if desired, but document/align callers to use the new Step()
accessor so the normalized default is actually consumed.
- Around line 419-444: Flush builds MatrixColumn.Rows by first appending live
slots from s.table.Load().sortedSlots then appending retired slots from
s.retiredSlots, which can violate the Start-order invariant; change
MemSampler.Flush to produce Rows in Start order by either (a) merging retired
slot draining into the same Start-ordered iteration used for sortedSlots (so
retired rows are appended at the correct position during the single scan), or
(b) append all rows as now but run a stable sort by Start on col.Rows before
pushing to s.history (preserving tiebreakers); update references to
appendDrainedRow usage accordingly and add a regression test that registers two
live routes with Starts straddling a retired Start, removes the middle route
before Flush, and asserts MatrixColumn.Rows is sorted by Start for
retainedFlushes cycles.
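Option (b) above — one stable sort over the mixed rows — can be sketched in isolation. `MatrixRow` here is a minimal stand-in with only the `Start` field, not the package's real type:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Hypothetical stand-in for the package's MatrixRow: only Start matters
// for the ordering invariant under discussion.
type MatrixRow struct {
	Start []byte
}

func main() {
	// Live rows are appended first, then a retired row whose Start falls
	// between them; one stable sort restores the Start-order invariant.
	rows := []MatrixRow{
		{Start: []byte("a")},
		{Start: []byte("m")},
		{Start: []byte("f")}, // retired slot drained after the live scan
	}
	sort.SliceStable(rows, func(i, j int) bool {
		return bytes.Compare(rows[i].Start, rows[j].Start) < 0
	})
	for _, r := range rows {
		fmt.Println(string(r.Start))
	}
}
```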


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b777b9c7-3b03-4528-9db5-5c88800d7286

📥 Commits

Reviewing files that changed from the base of the PR and between eb1ec90 and 7c6ce95.

📒 Files selected for processing (5)
  • keyviz/flusher.go
  • keyviz/flusher_test.go
  • keyviz/ring_buffer.go
  • keyviz/sampler.go
  • keyviz/sampler_test.go

Comment thread keyviz/sampler.go Outdated
Comment thread keyviz/sampler.go
Round-4 review fixes for PR #639:

Codex P2 (sampler.go:419-444) and CodeRabbit duplicate finding: Flush
appended retired-slot rows after the live sortedSlots scan, so a
removed route with a Start between two live slots produced a column
whose Rows were not monotone by Start. Add a final sort.SliceStable on
col.Rows so the API contract holds across the retainedFlushes window.
TestFlushSortsMixedLiveAndRetiredRows is the regression.

Codex P2 (sampler.go:404): Removing a virtual-bucket member pruned the
routeID from MemberRoutes immediately, but the bucket counters still
held that route's pre-removal increments — the next flushed row mixed
the removed route's traffic into the bucket while advertising only the
remaining members. Defer the prune across retainedFlushes via a
pendingPrunes queue so the row attribution stays accurate during the
grace window. TestRemoveVirtualMemberPruneDeferred + an updated
TestRemoveVirtualMemberPrunesMemberRoutes pin both halves of the
contract.

CodeRabbit nits: package comment said Observe does four atomic.AddUint64
calls (it's at most two); NewMemSampler doc claimed it returns nil for
negative HistoryColumns (it never does — newRingBuffer clamps).
Updated both. Added a Step() accessor so callers wiring RunFlusher can
read the configured interval rather than threading it through two
config paths.

Refactored Flush into drainRetiredSlots + advancePendingPrunes
helpers to keep cyclomatic complexity under budget.
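The `Step()` accessor added above can be sketched as follows; the field layout is hypothetical and only the accessor shape matters:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch: MemSampler stores the normalized step so callers
// wiring the flusher can read it back instead of threading the interval
// through a second config path.
type MemSampler struct {
	step time.Duration // normalized in the constructor; defaults applied
}

// Step returns the flush interval the sampler was configured with.
func (s *MemSampler) Step() time.Duration { return s.step }

func main() {
	s := &MemSampler{step: 15 * time.Second}
	// The caller starting the flusher ticker reads the configured value:
	fmt.Println(s.Step())
}
```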
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156e9f50b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread keyviz/sampler.go
Comment on lines +417 to +421
case isVirtual:
// Defer pruning until after the bucket's pre-removal counters
// have been drained. While the prune is pending the routeID
// stays in MemberRoutes so the next few Flush rows attribute
// the bucket's mixed counters to all members that contributed


P1 Badge Drain orphaned virtual bucket before removing last member

This isVirtual branch only schedules a member prune, but it never preserves the bucket counters when the removed route is the bucket's last remaining mapping. After delete(next.virtualForRoute, routeID) and rebuildSorted(next), that bucket can become unreachable, so Flush no longer drains it and all pre-removal counts are dropped (reproducible with MaxTrackedRoutes=1: register one over-budget route, observe it, remove it, then flush yields no rows). The retired-slot path currently protects only isIndividual, so virtual buckets also need a drain/grace path when they become orphaned.


Codex round-5 P1: When RemoveRoute removes the last member of a
virtual bucket, delete(virtualForRoute, routeID) leaves no path back
to the bucket, so rebuildSorted drops it from sortedSlots and Flush
silently loses any pre-removal counters the bucket still holds (plus
any in-flight late-Observe writers hitting the prior table snapshot).

Detect the now-orphaned bucket via bucketStillReferenced and queue it
on retiredSlots with the same retainedFlushes grace so Flush keeps
draining it across the grace window. The pendingPrune entry already
keeps MemberRoutes intact for those drains, so attribution remains
correct.

Regression test TestRemoveLastVirtualMemberHarvestsBucket sets
MaxTrackedRoutes=1, observes the over-budget route, removes it, and
asserts the bucket counts survive into the next snapshot column.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude[agent] review

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 4m 11s —— View job


Code Review

Verdict: LGTM — production-ready for this phase.

Six rounds of iterative review have driven this to excellent quality. All P1/P2 issues from prior rounds are addressed. Detailed notes below.


What's resolved from prior reviews

| Issue | Fix |
| --- | --- |
| P1 virtual-bucket metadata race | `metaMu` RWLock on `routeSlot` + `snapshotMeta()` |
| P1 orphaned virtual bucket on last-member remove | `bucketStillReferenced` + orphan added to `retiredSlots` |
| P2 retired-slot late-writer race | `retainedFlushes` grace queue + 2-cycle drain |
| P2 deferred member pruning | `pendingPrunes` + `advancePendingPrunes` |
| P2 sort-order violation (live + retired rows) | `sort.SliceStable` at end of `Flush` |
| P2 deep copy in `Snapshot` | `cloneColumn` + `cloneBytes` in `ringBuffer.Range` |
| Doc: "four atomic.AddUint64" | corrected to "at most two" |
| Doc: `NewMemSampler` returns nil | claim removed |
| Missing `Step()` accessor | added `func (s *MemSampler) Step() time.Duration` |

All regression tests pass under -race (verified in test-run history).


Remaining minor findings (non-blocking)

1. Interface-nil vs typed-nil for the Sampler nil-safe contract [sampler.go:59–65]

The nil-safe guarantee on MemSampler.Observe only fires when the coordinator stores a typed nil ((*MemSampler)(nil)), not a nil Sampler interface value. If the coordinator declares var s keyviz.Sampler and never assigns it, calling s.Observe(...) panics. The coordinator PR should either:

  • Store *MemSampler directly (sacrifices the abstraction), or
  • Store (*MemSampler)(nil) as the disabled default:
    sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)

Worth a one-line comment in the coordinator wiring or a note in the package doc. No change needed in this PR.
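The typed-nil contract can be demonstrated standalone. The `Sampler`/`MemSampler` shapes below are minimal stand-ins, not the package's real definitions:

```go
package main

import "fmt"

// Minimal reproduction of the typed-nil contract discussed above.
type Sampler interface{ Observe(routeID uint64) }

type MemSampler struct{}

// Nil-safe method: a typed-nil *MemSampler receiver is a valid no-op.
func (s *MemSampler) Observe(routeID uint64) {
	if s == nil {
		return // disabled sampler: every call is a no-op
	}
	// ... real counting elsewhere
}

func main() {
	// Disabled default as recommended: the interface holds a typed nil,
	// so dynamic dispatch still reaches the nil-safe method body.
	var ok Sampler = (*MemSampler)(nil)
	ok.Observe(1) // no panic

	// A bare nil interface has no dynamic type; calling through it panics.
	var bad Sampler
	fmt.Println(ok == nil, bad == nil)
	// bad.Observe(1) // would panic: nil interface dispatch
}
```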

2. findVirtualBucket doc vs actual fallback behavior [sampler.go:642]

The comment says the function returns the bucket "closest to the right of start" but the fallback loop (second for) actually returns the first aggregate in sortedSlots order. This is intentional (all excess routes collapse into one global bucket over time) but the doc is misleading. Suggest:

// ... returns nil when no virtual bucket exists yet — caller creates one.
// Fallback: if no bucket contains start, return the first aggregate
// in key order so over-budget routes collapse into a single global bucket.

Fix this →

3. Package doc mentions ApplySplit / ApplyMerge [sampler.go:26]

Line 26 lists ApplySplit, ApplyMerge as hot-path callers. These methods don't exist yet (future PRs). Since the current doc reads as a usage guide for callers, this can confuse readers scanning the package today. Safe to add (future) or move to a separate "Planned" section.

4. HistoryColumns < 0 silently becomes capacity 1 [sampler.go:244]

if opts.HistoryColumns == 0 {
    opts.HistoryColumns = DefaultHistoryColumns
}

A negative value skips the default and lands in newRingBuffer which clamps to 1. The doc now says "clamped to a safe minimum" which is accurate, but cap=1 (only ever the last column) might surprise callers who pass -1 thinking it means "unlimited". Consider catching < 0 alongside == 0:

if opts.HistoryColumns <= 0 {
    opts.HistoryColumns = DefaultHistoryColumns
}

Fix this →

5. No BenchmarkObserve

The package comment asserts specific hot-path properties (single Load, no alloc, no mutex). A benchmark would pin these claims and catch any future regression. Not blocking, but recommended before the coordinator wiring PR since that's when real load profiling will matter.
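A benchmark along the suggested lines might look like this. The `slot` layout is a stand-in that mimics the claimed hot-path work (one atomic pointer load, two atomic adds), driven through `testing.Benchmark` so it runs as a plain program:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// Stand-in for a route slot: two counters bumped per observation.
type slot struct{ reads, readBytes atomic.Uint64 }

func main() {
	var table atomic.Pointer[slot]
	table.Store(&slot{})

	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			s := table.Load() // single atomic pointer load
			s.reads.Add(1)
			s.readBytes.Add(42)
		}
	})
	// The claim to pin: zero allocations per Observe-shaped call.
	fmt.Println("allocs/op:", res.AllocsPerOp())
}
```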


Code correctness spot-checks

  • drainRetiredSlots in-place filter: keep := retired[:0] is the standard Go filter-in-place idiom; the range header captures the original length so writes never overtake reads. ✓
  • ringBuffer.snapshotOrdered when pos=0 (just wrapped): buf[0:] then buf[:0] — all elements once, none twice. ✓
  • appendDrainedRow negative keyLen/valueLen safety: > 0 guards prevent negative cast to large uint64. ✓
  • Flush ordering invariant: live rows → retired drain → sort.SliceStable — retired rows land in the sorted output regardless of their position in the grace queue. ✓
  • Last-member retire + pending prune both reference same *routeSlot: on the final flush cycle, drainRetiredSlots runs before advancePendingPrunes, so the row is emitted with routeID still in MemberRoutes, then prune fires. Attribution is correct for all drain cycles. ✓
  • copyRouteTable shared slot pointers: virtual bucket *routeSlot pointers are shared between slots/virtualForRoute copies — mutations must (and do) go through metaMu. ✓
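The wrap-around case checked above can be reproduced on a toy ring; `ring` here is an illustrative stand-in for the package's `ringBuffer`, keeping only the index arithmetic:

```go
package main

import "fmt"

// Once the ring has wrapped, chronological order is buf[pos:] followed by
// buf[:pos]; at pos == 0 that degenerates to buf[0:] + buf[:0], visiting
// every element exactly once.
type ring struct {
	buf  []int
	pos  int // next write index
	full bool
}

func (r *ring) push(v int) {
	r.buf[r.pos] = v
	r.pos = (r.pos + 1) % len(r.buf)
	if r.pos == 0 {
		r.full = true
	}
}

func (r *ring) snapshotOrdered() []int {
	if !r.full {
		return append([]int(nil), r.buf[:r.pos]...)
	}
	out := append([]int(nil), r.buf[r.pos:]...)
	return append(out, r.buf[:r.pos]...)
}

func main() {
	r := &ring{buf: make([]int, 3)}
	for v := 1; v <= 6; v++ { // wraps twice; pos ends back at 0
		r.push(v)
	}
	fmt.Println(r.snapshotOrdered())
}
```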

Summary

This is well-engineered systems code with thorough race-detector coverage. The hot-path is genuinely lock-free, the COW pattern is correctly implemented, and the grace-period mechanism handles the edge cases that tripped up earlier rounds. Items 1–5 above are all documentation/minor correctness improvements; none are blockers. Ready to proceed to the coordinator wiring PR.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6346ac8618


Comment thread keyviz/sampler.go Outdated
// table snapshot will have completed its single atomic.Add long
// before the second flush tick (Step is at human timescales,
// microseconds suffice).
const retainedFlushes = 2


P2 Badge Make late-writer grace independent of flush frequency

Fresh evidence in this commit: the late-writer protection now expires after a fixed retainedFlushes = 2, so retirement is tied to flush count rather than elapsed time. If Step is configured small (for example, millisecond-scale), a goroutine that loaded the old table before RemoveRoute but is delayed by scheduler/GC for more than two ticks can still write after the slot has been dropped from retiredSlots, causing silent undercount during route churn despite the intended grace logic.
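The time-based alternative the finding points at can be sketched as follows; the names (`retiredSlot`, `lateWriterGrace`, `keepDraining`) are illustrative, not the package's API:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch: retire a slot only once a wall-clock grace window has elapsed,
// so a tiny Step cannot expire late-writer protection after two ticks.
type retiredSlot struct {
	retiredAt time.Time
}

const lateWriterGrace = 2 * time.Second // independent of flush frequency

// keepDraining reports whether Flush should still drain this slot.
func keepDraining(s retiredSlot, now time.Time) bool {
	return now.Sub(s.retiredAt) < lateWriterGrace
}

func main() {
	s := retiredSlot{retiredAt: time.Unix(1000, 0)}
	// With a millisecond-scale Step, many flush ticks fit inside the
	// grace window; the slot survives them all:
	fmt.Println(keepDraining(s, time.Unix(1000, 0).Add(50*time.Millisecond)))
	fmt.Println(keepDraining(s, time.Unix(1000, 0).Add(3*time.Second)))
}
```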


bootjp pushed a commit that referenced this pull request Apr 25, 2026
The PR-triggered claude-code-review workflow runs on every PR but the
action exits without posting a review comment (see run 24935980219 on
PR #637). The mention-triggered claude.yml works correctly (verified on
PR #639). Disable the auto flow until the root cause is identified.

https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ
bootjp added a commit that referenced this pull request Apr 25, 2026

## Summary
- Rename `claude-code-review.yml` with a `.disable` suffix to stop the
auto-review workflow triggered when a PR opens.
- The mention-triggered `claude.yml` stays enabled.

## Background
- The auto review ran as [run 24935980219 (PR
#637)](https://github.com/bootjp/elastickv/actions/runs/24935980219/job/73021502681?pr=637)
but never posted a review comment to the PR.
- Mention triggering (`@claude`) was confirmed working in [a PR #639
comment](#639 (comment)).
- The auto flow stays disabled until the root cause is identified.

## Test plan
- [ ] Confirm the `Claude Code Review` workflow does **not** trigger on this PR
- [ ] Confirm `@claude` mentions still respond on another PR
- [ ] Confirm the existing Go test, lint, and CodeQL workflows are unaffected

https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ

---
_Generated by [Claude
Code](https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ)_

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
keyviz/sampler.go (2)

734-745: Prefer bytes.Compare from the standard library.

bytesLT/bytesGT/bytesLE/bytesGE re-implement what bytes.Compare already provides (and the latter is asm-optimized on most architectures). Using the stdlib helper trims four hand-rolled comparators down to one-liners and removes a small re-invented wheel that the test suite (flushedRowsSorted) is also using to assert ordering.

♻️ Proposed refactor

Add "bytes" to the import block and replace the helpers:

+import "bytes"
-func bytesLT(a, b []byte) bool {
-	for i := 0; i < len(a) && i < len(b); i++ {
-		if a[i] != b[i] {
-			return a[i] < b[i]
-		}
-	}
-	return len(a) < len(b)
-}
-
-func bytesLE(a, b []byte) bool { return !bytesGT(a, b) }
-func bytesGE(a, b []byte) bool { return !bytesLT(a, b) }
-func bytesGT(a, b []byte) bool { return bytesLT(b, a) }
+func bytesLT(a, b []byte) bool { return bytes.Compare(a, b) < 0 }
+func bytesLE(a, b []byte) bool { return bytes.Compare(a, b) <= 0 }
+func bytesGT(a, b []byte) bool { return bytes.Compare(a, b) > 0 }
+func bytesGE(a, b []byte) bool { return bytes.Compare(a, b) >= 0 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keyviz/sampler.go` around lines 734 - 745, Replace the custom byte-comparison
helpers with the standard library bytes.Compare: remove bytesLT, bytesGT,
bytesLE, bytesGE and update callers to use bytes.Compare(a,b) < 0, > 0, <= 0, >=
0 respectively; add "bytes" to the import block so comparisons are asm-optimized
and consistent with the test suite (e.g., update any usage in flushedRowsSorted
or other functions to call bytes.Compare).

373-375: Doc nit: comment doesn't quite match findVirtualBucket's selection rule.

The "closest-by-start virtual bucket" wording in RegisterRoute (lines 373–374) suggests proximity-based matching, but findVirtualBucket actually returns the first aggregate whose [Start, End) contains start, falling back to the first aggregate in sortedSlots order (i.e., lowest Start) when none does. Worth tightening so future readers don't try to "fix" the matching to be distance-based.

📝 Suggested wording
-	// Coarsening: route is folded into the closest-by-start virtual
-	// bucket (or one is created if no virtual bucket exists yet).
+	// Coarsening: route is folded into the first virtual bucket whose
+	// [Start, End) covers `start`, or — if none does — into the first
+	// existing virtual bucket in sortedSlots order. A new bucket is
+	// created only when no virtual bucket exists yet.

Also applies to: 702-723

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keyviz/sampler.go` around lines 373 - 375, The comment in RegisterRoute
stating "closest-by-start virtual bucket" is inaccurate: findVirtualBucket
selects the first aggregate whose [Start, End) contains start, and if none
matches returns the first aggregate in sortedSlots (lowest Start), not the
nearest by distance. Update the comment near RegisterRoute (and the similar
comment block around the other occurrence) to describe the actual rule: prefer
the first aggregate that contains start, otherwise fall back to the first
aggregate in sortedSlots (lowest Start), and reference findVirtualBucket by name
so readers understand the exact selection logic.
keyviz/sampler_test.go (2)

307-314: Use findAggregateRow here for resilience to future setup changes.

cols[len(cols)-1].Rows[0] works today only because route 1 had no traffic across the test, so its idle slot is dropped from the column and the bucket happens to be Rows[0]. If anyone later adds an Observe(1, …) to extend coverage (or reorders setup so route 1's Start sorts after the bucket's), this assertion will silently start checking the wrong row. The file already has findAggregateRow(t, …) for exactly this — using it makes the intent ("look at the aggregate") explicit.

♻️ Proposed change
-	cols := s.Snapshot(time.Time{}, time.Time{})
-	agg := cols[len(cols)-1].Rows[0]
+	agg := findAggregateRow(t, lastSnapshotColumn(t, s).Rows)
 	if memberRoutesContain(agg.MemberRoutes, 2) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keyviz/sampler_test.go` around lines 307 - 314, The test currently selects
the aggregate row via cols[len(cols)-1].Rows[0], which is brittle; replace that
direct index with the existing helper findAggregateRow(t, cols[len(cols)-1]) to
reliably locate the aggregate row regardless of bucket ordering or dropped idle
slots, i.e. assign agg := findAggregateRow(t, cols[len(cols)-1]) and keep the
subsequent member-route assertions unchanged.

357-374: Helper name is mildly misleading.

setupVirtualBucketWithThreeMembers actually wires up one individual slot (route 1) plus a virtual bucket whose MemberRoutes is [2, 3] (i.e., two members), not a virtual bucket with three members. Easy to misread when adding new tests on top of this helper. Consider e.g. setupOneIndividualPlusVirtualBucket or setupVirtualBucketWithTwoMembers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keyviz/sampler_test.go` around lines 357 - 374, The helper function
setupVirtualBucketWithThreeMembers is misnamed because it registers one
individual route (route 1) and a virtual bucket with two member routes (routes 2
and 3); rename the helper to a clearer name such as
setupOneIndividualPlusVirtualBucket or setupVirtualBucketWithTwoMembers and
update all call sites accordingly (tests using
setupVirtualBucketWithThreeMembers), and optionally update any
comments/docstrings inside the function; locate the function by name and adjust
references to MemSampler/newTestSampler/RegisterRoute usages so behavior stays
identical while the name reflects its actual setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e99d6985-a507-4f18-8270-9f24b62fda1a

📥 Commits

Reviewing files that changed from the base of the PR and between 7c6ce95 and b45233d.

📒 Files selected for processing (2)
  • keyviz/sampler.go
  • keyviz/sampler_test.go

Codex round-9 P2: RegisterRoute remapped a previously removed virtual
routeID back into a bucket but never cleared the queued
pendingPrunes entry for the same routeID. After grace,
advancePendingPrunes still ran the prune, so MemberRoutes lost a
route that Observe was actively counting against the bucket — silent
under-attribution during route-churn / rebalancing where the same
RouteID can disappear and reappear within the grace window.

Cancel any pendingPrune with a matching routeID at the top of
RegisterRoute (under routesMu, briefly grabbing retiredMu). Add a
defensive dedup in foldIntoBucket so a re-fold with the routeID still
present in MemberRoutes does not produce a duplicate entry.

TestReRegisterDuringPruneGraceCancelsPrune drives Remove → Register
→ Advance(graceWindow + 1s) → Flush and asserts the routeID is
present in MemberRoutes exactly once.
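The cancel-on-re-register rule described above can be sketched on its own. The `pendingPrune` shape and `cancelPendingPrune` signature here are hypothetical stand-ins for the package's internals:

```go
package main

import "fmt"

// Illustrative sketch: a pending member-prune queued by RemoveRoute is
// dropped when the same routeID registers again inside the grace window.
type pendingPrune struct {
	routeID uint64
	flushes int // remaining grace flushes
}

// cancelPendingPrune removes any queued prune for routeID, zeroing the
// dropped tail entry so the backing array does not retain it.
func cancelPendingPrune(queue []pendingPrune, routeID uint64) []pendingPrune {
	keep := queue[:0]
	for _, p := range queue {
		if p.routeID != routeID {
			keep = append(keep, p)
		}
	}
	for i := len(keep); i < len(queue); i++ {
		queue[i] = pendingPrune{} // clear the tail
	}
	return keep
}

func main() {
	q := []pendingPrune{{routeID: 7, flushes: 2}, {routeID: 9, flushes: 1}}
	// Route 7 re-registers inside grace: its queued prune must not fire.
	q = cancelPendingPrune(q, 7)
	fmt.Println(len(q), q[0].routeID)
}
```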
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 3m 41s —— View job


Code Review — feat/keyviz-sampler

  • Reading CLAUDE.md and repo context
  • Reading changed files (sampler.go, ring_buffer.go, flusher.go, tests)
  • Analyzing correctness, concurrency safety, and design
  • Posting final review

Verdict: LGTM — ready to merge.

Full read of all five files at HEAD (541bc92, "keyviz: cancel pending member-prune on re-registration"). All P1/P2 issues from prior rounds are resolved. The latest commit correctly addresses the Codex round-9 finding about stale prune entries surviving re-registration. Notes below.


Latest commit correctness (cancelPendingPrune)

cancelPendingPrune(routeID) is called at the top of RegisterRoute, before the idempotency check. For the common case (route already individually tracked) it's a harmless no-op. For the fix target — a virtual member removed then re-registered inside grace — it correctly cancels the deferred memberPrune before foldIntoBucket re-adds the routeID to MemberRoutes, preventing grace expiry from silently removing a live route.

Edge case verified: if the removed route was the last member of a bucket (bucket orphaned and added to retiredSlots in addition to pendingPrunes), cancelPendingPrune cancels only the memberPrune entry — the retiredSlot entry for the old (orphaned) bucket remains and drains across its grace window as intended. A new bucket is created on re-registration since the old one is no longer in sortedSlots. This is correct.

TestReRegisterDuringPruneGraceCancelsPrune pins both the cancellation and the no-duplicate-entry contract. ✓


Full correctness spot-checks (HEAD)

  • Hot path: atomic.Pointer.Load + immutable map + ≤2 atomic.Add — lock-free ✓
  • COW publish under routesMu; readers see either the old or the new table atomically
  • foldIntoBucket mutates the shared slot under metaMu.Lock; snapshotMeta reads under metaMu.RLock — race-free ✓
  • rebuildSorted reads slot.Start without metaMu — only called under routesMu; a concurrent snapshotMeta is read/read — safe ✓
  • Lock ordering: routesMu → retiredMu everywhere (RegisterRoute, RemoveRoute, cancelPendingPrune); Flush acquires retiredMu independently — no deadlock cycle
  • Flush can load an old table and drain retiredSlots for the same slot — the second drain sees all-zero counters, the idle check skips it, no double-count
  • drainRetiredSlots in-place filter: keep := retired[:0]; range captures the original length; writes never overtake reads
  • Final drain on the grace-expiry flush: appendDrainedRow fires before the entry is dropped
  • clearTail / clearPruneTail zero dropped backing-array entries; cancelPendingPrune also calls clearPruneTail — GC-safe ✓
  • drainRetiredSlots runs before advancePendingPrunes under the same retiredMu; an orphaned bucket emits with full MemberRoutes before the prune fires
  • cancelPendingPrune runs before the idempotency check — a no-op for already-tracked routes; essential for re-registration inside grace
  • bucketStillReferenced checks next.virtualForRoute (the post-delete copy)
  • copyRouteTable shares *routeSlot pointers; all metadata mutations go through metaMu.Lock()
  • sort.SliceStable after the live + retired drain — mixed-source Rows are monotone by Start
  • ringBuffer.snapshotOrdered when pos=0 (just wrapped): buf[0:] then buf[:0] — all elements once
  • ringBuffer.Range deep-copies via cloneColumn; snapshotMeta clones Start/End/MemberRoutes
  • All non-positive option values fall back to defaults (Step, HistoryColumns, MaxTrackedRoutes, MaxMemberRoutesPerSlot)
  • keyLen / valueLen guards prevent a negative-to-uint64 cast overflow
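The `keep := retired[:0]` filter-in-place idiom referenced above, shown on plain ints: the backing array is reused, then the dropped tail is zeroed so the array stops retaining its entries (the `clearTail` step). The slice contents here are illustrative:

```go
package main

import "fmt"

func main() {
	retired := []int{1, 2, 3, 4, 5}
	keep := retired[:0] // shares retired's backing array
	for _, v := range retired { // range captures the original length
		if v%2 == 1 { // keep odd entries
			keep = append(keep, v) // writes never overtake reads
		}
	}
	for i := len(keep); i < len(retired); i++ {
		retired[i] = 0 // clear the tail (would be nil for pointer slices)
	}
	fmt.Println(keep, retired)
}
```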

Remaining minor findings (non-blocking)

1. Brittle row index in TestRemoveVirtualMemberPrunesMemberRoutes [sampler_test.go:308]

agg := cols[len(cols)-1].Rows[0]

This works today because route 1 receives no Observe calls after setup, so it's idle and absent from Rows, making the aggregate row land at index 0. If a future test extension adds an Observe(1, …) or reorders setup, this silently starts checking the wrong row. The file already has findAggregateRow(t, …) for exactly this purpose:

// suggested fix:
agg := findAggregateRow(t, lastSnapshotColumn(t, s).Rows)

Fix this →

2. Misleading helper name setupVirtualBucketWithThreeMembers [sampler_test.go:357]

With MaxTrackedRoutes: 1, route 1 is individual; routes 2 and 3 fold into one virtual bucket with two members. The name "ThreeMembers" suggests three virtual members. Consider setupOneIndividualPlusVirtualBucket or setupVirtualBucketWithTwoMembers.

Fix this →

3. RegisterRoute comment "closest-by-start" doesn't match findVirtualBucket behavior [sampler.go:382–383]

// Coarsening: route is folded into the closest-by-start virtual
// bucket (or one is created if no virtual bucket exists yet).

findVirtualBucket actually prefers the first aggregate whose [Start, End) contains start, then falls back to the first aggregate in sortedSlots order (lowest Start), not nearest-by-distance. The docstring on findVirtualBucket itself (lines 743–748) is accurate; this call-site comment is not.

// Coarsening: route is folded into the first virtual bucket whose
// [Start, End) covers start, or — if none does — into the first
// existing virtual bucket in Start order. A new bucket is created
// only when no virtual bucket exists yet.

Fix this →

4. bytesLT/GT/LE/GE re-implement bytes.Compare [sampler.go:775–786]

The four hand-rolled comparators are correct but functionally identical to bytes.Compare, which is asm-optimized on most architectures:

func bytesLT(a, b []byte) bool { return bytes.Compare(a, b) < 0 }
func bytesLE(a, b []byte) bool { return bytes.Compare(a, b) <= 0 }
func bytesGT(a, b []byte) bool { return bytes.Compare(a, b) > 0 }
func bytesGE(a, b []byte) bool { return bytes.Compare(a, b) >= 0 }

Since these helpers appear on the admin/flush path (not hot-path Observe), the practical impact is nil — purely a cleanup. Fix this →

5. Interface nil vs typed nil — coordinator wiring note (non-blocking, no change needed here)

The nil-safe guard if s == nil { return } in Observe/Flush etc. fires only for a typed nil ((*MemSampler)(nil)). If the coordinator declares var s keyviz.Sampler and never assigns it, calling s.Observe(…) panics at the interface dispatch. The coordinator wiring PR should initialise the disabled default as:

sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)

Also worth noting in that PR: RunFlusher accepts *MemSampler (not Sampler), so callers holding a Sampler-typed variable need a type assertion — or can call s.(*MemSampler).Step() for ticker alignment.


Test coverage

All corner cases pinned by regression tests across 9 review rounds:

| Test | Pins |
| --- | --- |
| TestRetiredSlotGracePeriod | Wall-clock grace, late-writer simulation |
| TestRemoveVirtualMemberPruneDeferred | Prune deferred inside grace window |
| TestRemoveLastVirtualMemberHarvestsBucket | Orphaned bucket drainage |
| TestFlushSortsMixedLiveAndRetiredRows | Row ordering with retired slots |
| TestRegisterFoldLowerStartReorders | sortedSlots after Start lowered |
| TestSnapshotReturnsDeepCopy | Deep-copy isolation |
| TestRegisterDoesNotRaceFlushOnVirtualBucket | -race regression for metaMu path |
| TestMemberRoutesCappedAtConfiguredCap | MemberRoutes payload bound |
| TestRetiredTailClearedAfterDrop | GC retention fix |
| TestNonPositiveOptionsFallBackToDefaults | Negative options clamped |
| TestStepAccessor | Step() for coordinator wiring |
| BenchmarkObserveHit / BenchmarkObserveMiss | Zero-alloc hot-path claim |

Ready for the coordinator wiring PR.

Round-10 CodeRabbit nits (all non-blocking polish on top of round-9):

- Replace hand-rolled bytesLT/LE/GT/GE with bytes.Compare wrappers
  so we share the asm-optimized stdlib path and shrink four
  comparators to one-liners.
- RegisterRoute coarsening comment said "closest-by-start virtual
  bucket"; the actual rule is "first bucket containing start, else
  first aggregate in sortedSlots order." Rewrite to match
  findVirtualBucket's selection.
- TestRemoveVirtualMemberPrunesMemberRoutes used
  cols[len(cols)-1].Rows[0]; switch to findAggregateRow so a future
  setup change can't silently shift the assertion onto the wrong row.
- Rename setupVirtualBucketWithThreeMembers (which actually creates
  one individual + a 2-member bucket) to
  setupOneIndividualPlusVirtualBucket so callers don't get misled.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 3m 42s —— View job


Code Review

  • Read CLAUDE.md and repo context
  • Read all changed files at HEAD (41ab078)
  • Verified git log / full diff from origin/main
  • Analyzed concurrency, correctness, and design

Verdict: LGTM — production-ready. One minor naming issue introduced in the latest commit.

The latest commit (41ab078 keyviz: stdlib bytes.Compare + doc/test cleanup) polished every remaining CodeRabbit nitpick cleanly. All P1/P2 issues from prior review rounds are resolved.


What the latest commit resolves

Item Status
Custom bytesLT/LE/GT/GE → stdlib bytes.Compare wrappers ✅
RegisterRoute coarsening comment inaccuracy ("closest-by-start") ✅
TestRemoveVirtualMemberPrunesMemberRoutes: brittle Rows[0] → findAggregateRow ✅
setupVirtualBucketWithThreeMembers misnomer → setupOneIndividualPlusVirtualBucket ✅

One actionable item (introduced in this commit)

bytes local variable in Observe now shadows the bytes package import [sampler.go:320]

The import "bytes" was added in this commit for the bytes.Compare wrappers, but Observe already declared a local bytes := uint64(0). The function compiles and runs correctly (the bytes package is not referenced inside Observe), but the shadowing is a reader trap — someone scanning Observe could mistake bytes as a package reference. gocritic may flag this.

Simple rename:

-	bytes := uint64(0)
-	if keyLen > 0 {
-		bytes += uint64(keyLen)
-	}
-	if valueLen > 0 {
-		bytes += uint64(valueLen)
-	}
-	switch op {
-	case OpRead:
-		slot.reads.Add(1)
-		if bytes > 0 {
-			slot.readBytes.Add(bytes)
-		}
-	case OpWrite:
-		slot.writes.Add(1)
-		if bytes > 0 {
-			slot.writeBytes.Add(bytes)
-		}
+	byteCount := uint64(0)
+	if keyLen > 0 {
+		byteCount += uint64(keyLen)
+	}
+	if valueLen > 0 {
+		byteCount += uint64(valueLen)
+	}
+	switch op {
+	case OpRead:
+		slot.reads.Add(1)
+		if byteCount > 0 {
+			slot.readBytes.Add(byteCount)
+		}
+	case OpWrite:
+		slot.writes.Add(1)
+		if byteCount > 0 {
+			slot.writeBytes.Add(byteCount)
+		}



Correctness spot-checks on current HEAD

Area Verdict
Hot path: single atomic.Pointer.Load + map lookup + at most 2 atomic.Add — no alloc, no mutex ✅ Lock-free
COW publish: copyRouteTable + mutation + atomic.Pointer.Store under routesMu ✅
foldIntoBucket mutates shared slot under metaMu.Lock; snapshotMeta reads under metaMu.RLock ✅ Race-free
rebuildSorted reads slot.Start without metaMu — only called under routesMu; concurrent snapshotMeta reads the old published table ✅ Safe
drainRetiredSlots in-place filter: keep := retired[:0]; range header captures original len; writes can't overtake reads ✅ Correct Go idiom
Final drain on grace-expiry tick before entry is dropped ✅
clearTail / clearPruneTail zero dropped backing-array entries — *routeSlot pointers don't linger ✅ GC-safe
drainRetiredSlots runs before advancePendingPrunes in same lock — orphaned bucket drained with full MemberRoutes before prune fires ✅ Attribution correct
cancelPendingPrune clears stale prune entries on re-registration, preventing deferred prune from removing a live route from MemberRoutes ✅
Re-registration after last-member orphan: new bucket created; old orphaned bucket drains to grace then GC'd; cancelled prune is harmless ✅
bucketStillReferenced checks post-delete next.virtualForRoute to detect orphan ✅ Correct
sort.SliceStable after live + retired drain — Start-monotone output across all row sources ✅
ringBuffer.Range deep-copies via cloneColumn; snapshotMeta clones Start/End/MemberRoutes ✅
All non-positive option fields (Step, HistoryColumns, MaxTrackedRoutes, MaxMemberRoutesPerSlot) fall back to defaults ✅
appendDrainedRow guards keyLen > 0 / valueLen > 0 before cast — no negative-to-large-uint64 overflow ✅
BenchmarkObserveHit / BenchmarkObserveMiss pin zero-alloc hot-path claim ✅
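For readers following the table above, a minimal, self-contained sketch of the hot-path shape it describes (one atomic pointer load, one map lookup, at most two atomic adds). All type and field names here are illustrative stand-ins, not the actual sampler.go definitions:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type Op int

const (
	OpRead Op = iota
	OpWrite
)

// Illustrative stand-ins for the real keyviz types.
type routeSlot struct{ reads, writes, readBytes, writeBytes atomic.Uint64 }
type routeTable struct{ slots map[uint64]*routeSlot }
type MemSampler struct{ table atomic.Pointer[routeTable] }

// Observe sketches the lock-free hot path: typed-nil receivers and
// unknown routes are no-ops; otherwise count + bytes are added atomically.
func (s *MemSampler) Observe(routeID uint64, op Op, keyLen, valueLen int) {
	if s == nil {
		return // typed-nil receiver: disabled sampler costs one branch
	}
	t := s.table.Load()
	if t == nil {
		return
	}
	slot, ok := t.slots[routeID]
	if !ok {
		return // unknown route: dropped, never blocks the caller
	}
	byteCount := uint64(0)
	if keyLen > 0 {
		byteCount += uint64(keyLen)
	}
	if valueLen > 0 {
		byteCount += uint64(valueLen)
	}
	switch op {
	case OpRead:
		slot.reads.Add(1)
		if byteCount > 0 {
			slot.readBytes.Add(byteCount)
		}
	case OpWrite:
		slot.writes.Add(1)
		if byteCount > 0 {
			slot.writeBytes.Add(byteCount)
		}
	}
}

func main() {
	s := &MemSampler{}
	s.table.Store(&routeTable{slots: map[uint64]*routeSlot{1: {}}})
	s.Observe(1, OpWrite, 3, 5)
	slot := s.table.Load().slots[1]
	fmt.Println(slot.writes.Load(), slot.writeBytes.Load()) // 1 8
}
```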

Non-blocking notes for future PRs

1. Interface nil vs typed nil [sampler.go:309]

if s == nil fires only for a typed nil receiver. The coordinator wiring PR must initialize the disabled default as:

sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)

Not an interface nil. No change needed in this PR.
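The typed-nil vs interface-nil distinction above can be demonstrated in isolation. The types below are minimal stand-ins, not the real keyviz definitions:

```go
package main

import "fmt"

type Sampler interface{ Observe(routeID uint64) }

type MemSampler struct{}

// Nil-safe method: a typed-nil *MemSampler receiver is a no-op.
func (s *MemSampler) Observe(routeID uint64) {
	if s == nil {
		return
	}
	// ... hot-path counting would go here ...
}

func main() {
	var typedNil Sampler = (*MemSampler)(nil) // disabled-but-safe default
	typedNil.Observe(1)                       // no-op, no panic

	var ifaceNil Sampler // interface nil: calling ifaceNil.Observe would panic
	fmt.Println(typedNil == nil, ifaceNil == nil) // false true
}
```

The interface value holding a typed nil is itself non-nil, which is exactly why the `if s == nil` receiver check inside the method still fires while the interface dispatch succeeds.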

2. Final Flush on graceful shutdown [flusher.go:8]

RunFlusher correctly documents "the final tick is not executed." The coordinator / main.go wiring PR should call s.Flush() once after RunFlusher returns to harvest the in-progress step before process exit.

3. rebuildSorted stability [sampler.go:741]

rebuildSorted uses sort.Slice (unstable) while Flush uses sort.SliceStable. For the rare case of two routes with identical Start bytes, row ordering within a column could differ from the live-table order. Not a practical concern in a key-range system where Start bytes are unique.


Ready to merge once the bytes → byteCount rename is applied (or skipped as acceptable style).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41ab078d14

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread keyviz/sampler.go Outdated
s.routesMu.Lock()
defer s.routesMu.Unlock()

s.cancelPendingPrune(routeID)


P2 Badge Keep deferred prune when route rejoins as individual

RegisterRoute unconditionally calls cancelPendingPrune(routeID) before it knows whether the route will rejoin a virtual bucket or become an individual slot. When a removed virtual member is re-registered after capacity frees up (len(next.slots) < MaxTrackedRoutes), that cancellation suppresses the only cleanup path for the old bucket membership, so MemberRoutes can retain a route that no longer contributes to the bucket, causing persistent metadata misattribution and duplicate route IDs across rows during churn.


Comment thread keyviz/sampler.go Outdated
bucket := findVirtualBucket(next.sortedSlots, start)
if bucket == nil {
bucket = &routeSlot{
RouteID: routeID,


P2 Badge Decouple virtual bucket RouteID from real route IDs

A new aggregate bucket stores RouteID as the first folded real route ID, but that field is never reassigned as members are removed or changed. If that original route ID is later registered as an individual route, flush output can contain both an aggregate row and an individual row with the same RouteID, which makes row identity ambiguous and breaks per-route interpretation of the matrix under route churn.


Codex round-11 P2 #1: cancelPendingPrune ran unconditionally at the
top of RegisterRoute. When a removed virtual member rejoined as an
individual slot (capacity freed up), the cancel suppressed the only
cleanup path for the old bucket's MemberRoutes — leaving the routeID
listed on the bucket forever despite no longer contributing traffic.
Replace with cancelPendingPruneFor(bucket, routeID), called only
after we know the route is rejoining the same bucket; prunes against
other buckets (or where the route lands as an individual slot) fire
normally and clean up the old bucket's metadata.

Codex round-11 P2 #2: a freshly created virtual bucket stamped its
RouteID with the first folded real route ID. If that real route was
later removed and re-registered as an individual slot, flushed
columns could show two rows with the same RouteID — one aggregate,
one individual. Add an atomic.Uint64 virtualIDCounter on MemSampler
and hand out synthetic IDs from the high end of uint64
(MaxUint64, MaxUint64-1, …). The synthetic space cannot collide with
real route IDs (assigned from the low end by the coordinator), so
row identity stays unambiguous under churn.

Claude bot nit: rename the local `bytes` variable in Observe to
`byteCount` so it no longer shadows the new `bytes` stdlib import.

Tests: TestVirtualBucketRouteIDIsSynthetic + TestRejoinAsIndividualLetsBucketPruneFire.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 4m 32s


Code Review — feat/keyviz-sampler

  • Reading CLAUDE.md and repo context
  • Reading all changed files at HEAD (25061d6)
  • Analyzing changes since last reviewed commit (41ab078)
  • Checking correctness, concurrency safety, and design
  • Posting final review

Verdict: LGTM — ready to merge.

Full read of all five files at HEAD (25061d6 keyviz: synthetic virtual bucket IDs + scoped prune cancel). Both Codex round-11 P2 issues are cleanly addressed. All prior findings from previous review rounds remain resolved. Notes below.


What the latest commit resolves

The two Codex P2 issues from 41ab078 are fixed:

P2-A: Decouple virtual bucket RouteID from real route IDs

Previously, a new virtual bucket was stamped with RouteID: routeID (the first folded real RouteID). If that same real route was later re-registered as an individual slot, it could appear on two rows in the same column — once as an aggregate row, once as an individual row — making row identity ambiguous for matrix consumers.

Fix: nextVirtualBucketID() returns a synthetic ID from the high end of uint64, counting down. Since real IDs are assigned from the low end (coordinator convention), the two spaces cannot collide in any practical deployment.

func (s *MemSampler) nextVirtualBucketID() uint64 {
    return s.virtualIDCounter.Add(^uint64(0)) // subtract 1; counter starts at MaxUint64+1
}

TestVirtualBucketRouteIDIsSynthetic pins this contract. ✓

P2-B: Keep deferred prune when route rejoins as individual

Previously, RegisterRoute called cancelPendingPrune(routeID) unconditionally before knowing whether the route would become individual or virtual. When a removed virtual member rejoined as an individual slot (capacity freed up), this cancelled the deferred prune for the OLD bucket, leaving the route ID in that bucket's MemberRoutes forever.

Fix: The cancel is now cancelPendingPruneFor(bucket, routeID) — both bucket pointer AND routeID must match. It's only called from the virtual/coarsening code path (after bucket is known), so the individual re-registration path leaves old-bucket prunes intact.

TestRejoinAsIndividualLetsBucketPruneFire pins this. ✓


Correctness spot-checks on current HEAD

Area Verdict
Hot path: single atomic.Pointer.Load + map lookup + at most 2 atomic.Add Lock-free ✓
cancelPendingPruneFor called only from virtual/coarsening path; individual path leaves pending prunes intact ✓
New bucket gets synthetic ID from high end of uint64; real IDs assigned from low end by coordinator No-collision ✓
cancelPendingPruneFor(newBucket, routeID) when brand-new bucket is created: no prune for new bucket exists → no-op; old-bucket prune still queued ✓
cancelPendingPruneFor acquires retiredMu while caller holds routesMu; Flush acquires retiredMu independently — same lock ordering as before No deadlock ✓
foldIntoBucket mutates shared slot under metaMu.Lock; snapshotMeta reads under metaMu.RLock Race-free ✓
drainRetiredSlots runs before advancePendingPrunes in same retiredMu lock; orphaned bucket drained with full MemberRoutes before prune fires Attribution correct ✓
clearTail / clearPruneTail zero dropped entries; cancelPendingPruneFor also calls clearPruneTail GC-safe ✓
sort.SliceStable after live + retired drain Start-monotone ✓
ringBuffer.Range deep-copies via cloneColumn; snapshotMeta clones Start/End/MemberRoutes ✓
All non-positive option fields fall back to defaults ✓
BenchmarkObserveHit / BenchmarkObserveMiss pin zero-alloc hot-path claim ✓

Minor findings (non-blocking)

1. nextVirtualBucketID comment could be clearer [sampler.go:616]

return s.virtualIDCounter.Add(^uint64(0)) // subtract 1; counter starts at MaxUint64+1

"Counter starts at MaxUint64+1" is technically correct (MaxUint64+1 ≡ 0 mod 2⁶⁴, which is atomic.Uint64's initial value), but readers who don't immediately see the uint64 wrap may be confused. A one-liner clarification:

// atomic.Uint64 starts at 0; first call returns MaxUint64, second MaxUint64-1, etc.


2. TestRejoinAsIndividualLetsBucketPruneFire tests a vacuous case [sampler_test.go:725]

The test adds no traffic to bucket B before removing route 3. Because appendDrainedRow skips idle (all-zero) slots, bucket B never emits a row in any snapshot column. The assertion if r.Aggregate && memberRoutesContain(r.MemberRoutes, 3) can never find a row to assert on — it passes trivially with OR without the cancelPendingPruneFor scoping fix.

To make this test actually detect the regression, add an Observe call on route 3 BEFORE removing it, so B has non-zero counters and produces a row during the grace drain cycle:

// After RegisterRoute(3) folds into B:
s.Observe(3, OpRead, 0, 0)   // now B has pending counts
// Then remove, advance, flush as-is...

Without this, the test still serves as a crash/race regression guard, but doesn't verify the prune-fires-for-individual-rejoin logic.


3. Interface nil vs typed nil — coordinator wiring note [sampler.go:317]

Pre-existing note: if s == nil in Observe/Flush only fires for a typed nil ((*MemSampler)(nil)), not an uninitialized Sampler interface. The coordinator wiring PR should initialise the disabled default as:

sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)

No change needed here.

4. Final flush on graceful shutdown [flusher.go:9]

Pre-existing note: RunFlusher doesn't execute the final tick. The coordinator / main.go wiring PR should call s.Flush() once after RunFlusher returns to harvest the in-progress step.


Test coverage (complete list at HEAD)

Test Pins
TestRetiredSlotGracePeriod Wall-clock grace, late-writer simulation
TestRemoveVirtualMemberPruneDeferred Prune deferred inside grace window
TestRemoveLastVirtualMemberHarvestsBucket Orphaned bucket drainage
TestFlushSortsMixedLiveAndRetiredRows Row ordering with retired slots
TestRegisterFoldLowerStartReorders sortedSlots after Start lowered
TestSnapshotReturnsDeepCopy Deep-copy isolation
TestRegisterDoesNotRaceFlushOnVirtualBucket -race regression for metaMu path
TestMemberRoutesCappedAtConfiguredCap MemberRoutes payload bound
TestRetiredTailClearedAfterDrop GC retention fix
TestNonPositiveOptionsFallBackToDefaults Negative options clamped
TestStepAccessor Step() for coordinator wiring
TestReRegisterDuringPruneGraceCancelsPrune Same-bucket cancel on re-register
TestVirtualBucketRouteIDIsSynthetic Synthetic ID, no real-ID collision
TestRejoinAsIndividualLetsBucketPruneFire Individual rejoin doesn't suppress old-bucket prune
BenchmarkObserveHit / BenchmarkObserveMiss Zero-alloc hot-path claim

Ready to merge. The two items worth carrying forward to the coordinator wiring PR are finding #2 (test robustness) and the typed-nil initialization note (#3).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25061d6876


Comment thread keyviz/sampler.go Outdated
Comment on lines +380 to +384
RouteID: routeID,
Start: cloneBytes(start),
End: cloneBytes(end),
}
next.slots[routeID] = slot


P2 Badge Reuse retired slot when re-registering same route ID

When a route is removed, RemoveRoute keeps its old slot in retiredSlots for grace-draining, but this RegisterRoute branch always allocates a brand-new slot for the same routeID. If that ID is re-registered before grace expires (e.g., Observe(1) -> RemoveRoute(1) -> RegisterRoute(1) -> Observe(1) -> Flush()), Flush drains both the new live slot and the retired old slot, producing two rows with the same RouteID in one column and splitting counts across them. This breaks row identity for consumers that expect one row per route and makes per-route totals ambiguous during churn/resync flows.


Codex round-12 P2: Re-registering a removed RouteID inside the grace
window allocated a brand-new *routeSlot. Flush then drained both the
new live slot AND the old retired slot in the same column,
producing two rows with the same RouteID and splitting counts
between them. Add reclaimRetiredSlot — when RegisterRoute lands a
real (non-aggregate) routeID and a retired slot with that ID is
queued, pull it off retiredSlots and reuse it. The old counters and
any in-flight late-Observe writes converge with new traffic on the
same slot, and Flush emits a single row.
TestReRegisterIndividualReusesRetiredSlot asserts one row per
RouteID per column with correctly accumulated counts.
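The reclaim step described above can be sketched as follows. The type shapes (`routeSlot`, the retired list) and the in-place removal with a zeroed tail follow the commit message, but fields and signatures are stand-ins, not the actual sampler.go code:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the types named in the commit message.
type routeSlot struct {
	RouteID   uint64
	Aggregate bool
	reads     uint64
}

type retiredEntry struct{ slot *routeSlot }

type MemSampler struct {
	retiredMu    sync.Mutex
	retiredSlots []retiredEntry
}

// reclaimRetiredSlot pulls a queued non-aggregate slot with a matching
// RouteID off the retired list so RegisterRoute can reuse it, keeping
// one row per RouteID per flushed column.
func (s *MemSampler) reclaimRetiredSlot(routeID uint64) *routeSlot {
	s.retiredMu.Lock()
	defer s.retiredMu.Unlock()
	for i, r := range s.retiredSlots {
		if r.slot.Aggregate || r.slot.RouteID != routeID {
			continue
		}
		slot := r.slot
		copy(s.retiredSlots[i:], s.retiredSlots[i+1:])
		last := len(s.retiredSlots) - 1
		s.retiredSlots[last] = retiredEntry{} // zero dropped tail so no *routeSlot lingers
		s.retiredSlots = s.retiredSlots[:last]
		return slot
	}
	return nil
}

func main() {
	s := &MemSampler{retiredSlots: []retiredEntry{
		{&routeSlot{RouteID: 7, reads: 3}},
	}}
	got := s.reclaimRetiredSlot(7)
	fmt.Println(got != nil && got.reads == 3, len(s.retiredSlots)) // true 0
}
```

The preserved counters on the returned slot are what lets pre-removal and post-rejoin traffic converge onto a single flushed row.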

Claude bot nits:
- nextVirtualBucketID comment now spells out the atomic.Uint64
  wrap-around explicitly instead of the cryptic "MaxUint64+1" form.
- TestRejoinAsIndividualLetsBucketPruneFire was vacuously true —
  the bucket had no traffic so it never emitted a row, making the
  prune assertion unreachable. Restructure with two virtual members
  (so removing one leaves the bucket alive, not orphaned), drive
  traffic through the surviving member, and flush twice past grace
  so the post-prune MemberRoutes lands in a snapshot row.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 4m 25s


Code Review

  • Read CLAUDE.md and repo context
  • Read all changed files at HEAD (4c60906)
  • Analyzed latest commit
  • Verified all prior findings remain resolved

Verdict: LGTM — ready to merge.

Full read of all five files at HEAD (4c60906 keyviz: reuse retired slot on re-registration + test polish). The Codex round-12 P2 is cleanly addressed and the previously-vacuous test is now substantive. All prior findings from earlier review rounds remain resolved.


What the latest commit resolves

Codex round-12 P2: Duplicate rows on same-RouteID re-registration within grace window

The problem: RemoveRoute(routeID) placed the slot in retiredSlots for grace-drain. If RegisterRoute(routeID) was called before grace expired, a brand-new *routeSlot was allocated. Flush then drained both the new live slot AND the still-queued retired slot, emitting two rows with the same RouteID in one column and splitting counts between them.

The fix: reclaimRetiredSlot(routeID) is called at the top of the individual-tracking branch in RegisterRoute. It scans retiredSlots under retiredMu.Lock(), removes the first non-aggregate slot whose RouteID matches, zeroes the dropped tail (clearTail), and returns the slot for reuse. The metadata fields (Start/End) are refreshed under metaMu.Lock() to match the new registration; counters are preserved. Flush now sees one slot in the live table and no retired entry, emitting a single row with accumulated counts from both the pre-removal and post-rejoin Observe calls.


Correctness analysis of reclaimRetiredSlot

Invariant Status
Lock ordering preserved: routesMu → retiredMu; Flush holds retiredMu alone ✓ No deadlock
Slot removed from retiredSlots before Start/End update under metaMu.Lock() — no concurrent Flush iteration reads the reclaimed slot during the update window ✓ Race-free
Reclaimed slot not yet published in new table during metadata update — no concurrent snapshotMeta call possible ✓ Safe
appendSorted reads slot.Start after metaMu.Unlock() — correct updated value used for insertion position ✓
!r.slot.Aggregate guard prevents matching a synthetic virtual-bucket ID against a coincidental real-route ID collision ✓
copyRouteTable(cur) starts from current next.sortedSlots which excludes the reclaimed slot (removed by prior RemoveRoute) — appendSorted inserts at the correct new position ✓
MemberRoutes on reclaimed slot is nil (individual slots never have MemberRoutes) — no stale membership after reuse ✓
clearTail in reclaimRetiredSlot zeroes the dropped backing-array entry — no *routeSlot pointer lingers via reused capacity ✓ GC-safe

TestReRegisterIndividualReusesRetiredSlot validity:
The test calls Observe(1, OpRead, …) before RemoveRoute(1) and again after RegisterRoute(1), then flushes once. It asserts exactly one row for RouteID=1 in the column with Reads=2. Without the fix, the test would see two rows (one live with 1 read, one retired drain with 1 read) and fail the count != 1 check. The test is correctly structured and exercises the exact defect path. ✓

TestRejoinAsIndividualLetsBucketPruneFire fix:
The prior version was vacuous — the bucket had no traffic before RemoveRoute(3), so appendDrainedRow never emitted it and the assertion was unreachable. The restructured test:

  1. Folds routes 2 and 3 into a bucket (so route 2 keeps the bucket alive after route 3 is removed)
  2. Calls Observe(2) and Observe(3) before removals, giving the bucket non-zero counters
  3. Frees capacity by removing route 1, removes route 3 (queues prune), re-registers route 3 as individual
  4. Advances past grace, drives traffic through route 2 between two flushes
  5. On the second flush the bucket emits a row with post-prune MemberRoutes

The assertion !memberRoutesContain(agg.MemberRoutes, 3) is now reachable and meaningful. ✓


Full correctness spot-checks (HEAD)

Area Verdict
Hot path: atomic.Pointer.Load + map lookup + ≤2 atomic.Add Lock-free ✓
COW publish under routesMu; readers see either old or new table atomically ✓
foldIntoBucket mutates shared slot under metaMu.Lock(); snapshotMeta reads under metaMu.RLock() Race-free ✓
rebuildSorted reads slot.Start without metaMu — only under routesMu; concurrent snapshotMeta reads the old published table Safe ✓
drainRetiredSlots in-place filter — standard keep := retired[:0]; range captures original len Correct ✓
reclaimRetiredSlot removes entry before clearTail — no off-by-one, no retained pointer ✓
cancelPendingPruneFor called only in virtual/coarsening path; individual path (reclaimRetiredSlot) does not cancel unrelated prunes ✓
Virtual bucket IDs from high end of uint64; real IDs from low end — no collision space ✓
drainRetiredSlots before advancePendingPrunes in same retiredMu lock — orphaned bucket drained with full MemberRoutes before prune fires ✓
sort.SliceStable after live + retired drain — monotone Start across all row sources ✓
ringBuffer.Range deep-copies via cloneColumn; snapshotMeta clones Start/End/MemberRoutes ✓
All non-positive options fall back to defaults (Step, HistoryColumns, MaxTrackedRoutes, MaxMemberRoutesPerSlot) ✓
appendDrainedRow keyLen/valueLen guards prevent negative-to-uint64 overflow ✓
BenchmarkObserveHit / BenchmarkObserveMiss pin zero-alloc hot-path claim ✓

Non-blocking notes for future PRs (no change needed here)

1. Interface nil vs typed nil [sampler.go:317]

if s == nil { return } in Observe/Flush/etc. fires only for a typed nil receiver. If the coordinator declares var s keyviz.Sampler without assigning, s.Observe(…) panics at the interface dispatch. The coordinator wiring PR should initialize the disabled default as:

sampler keyviz.Sampler = (*keyviz.MemSampler)(nil)

2. Final Flush on graceful shutdown [flusher.go]

RunFlusher doesn't execute the final tick. The main.go / coordinator wiring PR should call s.Flush() once after RunFlusher returns to harvest the in-progress step before process exit.


Test coverage at HEAD (full list)

Test Pins
TestRetiredSlotGracePeriod Wall-clock grace, late-writer simulation
TestRemoveVirtualMemberPruneDeferred Prune deferred inside grace window
TestRemoveLastVirtualMemberHarvestsBucket Orphaned bucket drainage
TestFlushSortsMixedLiveAndRetiredRows Row ordering with retired slots
TestRegisterFoldLowerStartReorders sortedSlots after Start lowered
TestSnapshotReturnsDeepCopy Deep-copy isolation
TestRegisterDoesNotRaceFlushOnVirtualBucket -race regression for metaMu path
TestMemberRoutesCappedAtConfiguredCap MemberRoutes payload bound
TestRetiredTailClearedAfterDrop GC retention fix
TestNonPositiveOptionsFallBackToDefaults Negative options clamped
TestStepAccessor Step() for coordinator wiring
TestReRegisterDuringPruneGraceCancelsPrune Same-bucket cancel on re-register
TestVirtualBucketRouteIDIsSynthetic Synthetic ID, no real-ID collision
TestRejoinAsIndividualLetsBucketPruneFire Individual rejoin doesn't suppress old-bucket prune (now non-vacuous)
TestReRegisterIndividualReusesRetiredSlot New — single row per RouteID on re-registration within grace
BenchmarkObserveHit / BenchmarkObserveMiss Zero-alloc hot-path claim

Ready for the coordinator wiring PR.

@bootjp bootjp enabled auto-merge April 25, 2026 19:04
@bootjp bootjp merged commit 30f9049 into main Apr 25, 2026
8 checks passed
@bootjp bootjp deleted the feat/keyviz-sampler branch April 25, 2026 19:07
bootjp added a commit that referenced this pull request Apr 26, 2026
…645)

## Summary

- Adds a `WithSampler(s keyviz.Sampler) *ShardedCoordinator` option
(mirroring `WithLeaseReadObserver`) plus a single `observeMutation` call
inside `groupMutations`.
- Each resolved `(RouteID, mutation)` pair produces one
`sampler.Observe(routeID, OpWrite, len(Key), len(Value))` call before
the mutation is grouped by `GroupID`. Reads do not reach this path;
transactional dispatch reuses `groupMutations` so it gets wired
automatically.
- `DelPrefix` is intentionally not observed —
`dispatchDelPrefixBroadcast` broadcasts to every shard rather than
resolving a single `RouteID`, so per-route attribution is out of scope
for this slice.
- Nil-safety: an interface-nil `c.sampler` is guarded at the call site;
the `keyviz.MemSampler` contract also tolerates a typed-nil receiver, so
a disabled sampler costs one branch off the hot path.

Implements task §5.1 from `docs/admin_ui_key_visualizer_design.md`
(split out of the keyviz design originally landed in #639).
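A minimal sketch of the option + guarded call-site shape described above. `WithSampler` and `observeMutation` names come from this summary; every other type and signature here is an illustrative stand-in:

```go
package main

import "fmt"

type Op int

const OpWrite Op = iota

type Sampler interface {
	Observe(routeID uint64, op Op, keyLen, valueLen int)
}

type ShardedCoordinator struct{ sampler Sampler }

// WithSampler mirrors the functional-option style the summary describes.
func (c *ShardedCoordinator) WithSampler(s Sampler) *ShardedCoordinator {
	c.sampler = s
	return c
}

type mutation struct{ Key, Value []byte }

// observeMutation: one guarded call per resolved (RouteID, mutation) pair.
func (c *ShardedCoordinator) observeMutation(routeID uint64, m mutation) {
	if c.sampler == nil { // interface-nil guard at the call site
		return
	}
	c.sampler.Observe(routeID, OpWrite, len(m.Key), len(m.Value))
}

type countSampler struct{ calls int }

func (s *countSampler) Observe(uint64, Op, int, int) { s.calls++ }

func main() {
	cs := &countSampler{}
	c := (&ShardedCoordinator{}).WithSampler(cs)
	c.observeMutation(1, mutation{Key: []byte("k"), Value: []byte("v")})
	(&ShardedCoordinator{}).observeMutation(2, mutation{}) // no sampler: safe no-op
	fmt.Println(cs.calls) // 1
}
```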

## Test plan

- [x] `TestShardedCoordinatorObservesEveryDispatchedMutation` —
cross-shard Put batch, verifies one Observe per element with
engine-resolved RouteID, OpWrite, and exact keyLen / valueLen.
- [x] `TestShardedCoordinatorWithoutSamplerStaysSafe` — dispatches
succeed with no `WithSampler` call (interface-nil) and with a typed-nil
`*MemSampler`.
- [x] `go test -race -count=1 ./kv/... ./keyviz/...` clean.
- [x] `golangci-lint run ./kv/...` clean.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Optional sampling for mutation dispatches: when enabled, each routed
mutation records routing, operation kind, and key/value size metrics; no
effect when disabled.

* **Tests**
* Added tests covering sampler integration, cross-shard dispatch
observations, and safe behavior when no sampler or a typed-nil sampler
is configured.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 26, 2026
## Summary

Phase 2-B of the Key Visualizer design
(`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into
the existing `web/admin/` SPA rather than building a parallel dashboard.
Doc + implementation in one PR (doc commit first, per CLAUDE.md
design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling
`/admin/api/v1/keyviz/matrix` with series picker (writes / reads /
write_bytes / read_bytes), row-budget input (clamped at 1024), and off /
5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue —
spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate,
route_count, route_ids (with truncation indicator). `start` / `end`
decode through a printable-or-hex preview so binary keys do not render
as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in
`lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208
kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and
its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 /
#646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed
status, lands with this PR. Parent design §12 phase table is split into
2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out,
open).

## Five-lens self-review

1. **Data loss** — n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; single browser tab polling a
single handler. Sampler concurrency was already covered by Phase 2-A
tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget
at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower
bound (sampler flush is 1 s).
4. **Data consistency** — SPA renders whatever the handler returns;
consistency guarantees come from the existing leader-issued counters in
the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go
build ./internal/admin/...` clean (embed glob unaffected); `go test
./internal/admin/...` clean. Manual verification documented in the
design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to
`internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot
routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget
input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — handler is currently node-local. Phase 2-C will
add a cross-node admin RPC; this PR will pick up the aggregate view
automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet
built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.