Skip to content

feat(kv): wire keyviz.Sampler into ShardedCoordinator dispatch path#645

Merged
bootjp merged 3 commits intomainfrom
feat/keyviz-coordinator-wiring
Apr 26, 2026
Merged

feat(kv): wire keyviz.Sampler into ShardedCoordinator dispatch path#645
bootjp merged 3 commits intomainfrom
feat/keyviz-coordinator-wiring

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 25, 2026

Summary

  • Adds a WithSampler(s keyviz.Sampler) *ShardedCoordinator option (mirroring WithLeaseReadObserver) plus a single observeMutation call inside groupMutations.
  • Each resolved (RouteID, mutation) pair produces one sampler.Observe(routeID, OpWrite, len(Key), len(Value)) call before the mutation is grouped by GroupID. Reads do not reach this path; transactional dispatch reuses groupMutations so it gets wired automatically.
  • DelPrefix is intentionally not observed — dispatchDelPrefixBroadcast broadcasts to every shard rather than resolving a single RouteID, so per-route attribution is out of scope for this slice.
  • Nil-safety: an interface-nil c.sampler is guarded at the call site; the keyviz.MemSampler contract also tolerates a typed-nil receiver, so a disabled sampler costs one branch off the hot path.

Implements task §5.1 from docs/admin_ui_key_visualizer_design.md (split out of the keyviz design originally landed in #639).

Test plan

  • TestShardedCoordinatorObservesEveryDispatchedMutation — cross-shard Put batch, verifies one Observe per element with engine-resolved RouteID, OpWrite, and exact keyLen / valueLen.
  • TestShardedCoordinatorWithoutSamplerStaysSafe — dispatches succeed with no WithSampler call (interface-nil) and with a typed-nil *MemSampler.
  • go test -race -count=1 ./kv/... ./keyviz/... clean.
  • golangci-lint run ./kv/... clean.

Summary by CodeRabbit

  • New Features

    • Optional sampling for mutation dispatches: when enabled, each routed mutation records routing, operation kind, and key/value size metrics; no effect when disabled.
  • Tests

    • Added tests covering sampler integration, cross-shard dispatch observations, and safe behavior when no sampler or a typed-nil sampler is configured.

Wires the keyviz.Sampler from PR #639 into ShardedCoordinator
(design doc §5.1):

- Adds an unexported `sampler keyviz.Sampler` field plus
  `WithSampler(s keyviz.Sampler) *ShardedCoordinator` post-
  construction option, mirroring WithLeaseReadObserver.
- groupMutations now calls `c.observeMutation(routeID, mut)` once
  per resolved (RouteID, mutation) pair before grouping by
  GroupID. Reads are not dispatched through this path; all calls
  use OpWrite.
- observeMutation guards against an interface-nil sampler at the
  call site; the keyviz contract also tolerates a typed-nil
  *MemSampler, so disabled keyviz wires through to a no-op without
  branching deep in the hot path.
- DelPrefix is intentionally not observed: dispatchDelPrefixBroadcast
  doesn't resolve a single RouteID (it broadcasts to every shard),
  and the design treats per-route attribution there as out of scope.

Tests:
- TestShardedCoordinatorObservesEveryDispatchedMutation: cross-shard
  Put batch, verifies one Observe per element with the engine's
  resolved RouteID, OpWrite, and exact key/value lengths.
- TestShardedCoordinatorWithoutSamplerStaysSafe: dispatches
  successfully both with no WithSampler call (interface-nil) and
  with a typed-nil *MemSampler.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 50 minutes and 24 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 50 minutes and 24 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b4d8762-511f-40e0-9b51-819d4ebb76d9

📥 Commits

Reviewing files that changed from the base of the PR and between 2897546 and 0c89321.

📒 Files selected for processing (2)
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_sampler_test.go
📝 Walkthrough

Walkthrough

Adds optional keyviz sampling to ShardedCoordinator: a sampler keyviz.Sampler field plus WithSampler method. During grouped mutation dispatch each routed mutation triggers a nil-safe observeMutation that calls sampler.Observe with the resolved RouteID, write op kind, and key/value lengths.

Changes

Cohort / File(s) Summary
ShardedCoordinator Sampler Integration
kv/sharded_coordinator.go
Adds sampler keyviz.Sampler field and WithSampler method; invokes nil-safe observeMutation for each routed mutation to call sampler.Observe(routeID, keyviz.OpWrite, keyLen, valueLen) before dispatch.
Sampler Wiring Tests
kv/sharded_coordinator_sampler_test.go
Adds recordingSampler test spy and tests: verifies one observation per dispatched mutation with correct RouteID, op kind, and key/value sizes; verifies nil and typed-nil sampler safety.
Sampler docs
keyviz/sampler.go
Clarifies Sampler documentation to require implementations tolerate typed-nil receivers and describes hot-path usage by the coordinator (nil-safety contract).

Sequence Diagram(s)

sequenceDiagram
    autonumber
    rect rgba(220,240,255,0.5)
    participant Client
    participant Coordinator as ShardedCoordinator
    participant Dist as Distribution/Router
    participant Sampler as keyviz.Sampler
    participant Shard as ShardGroup
    end

    Client->>Coordinator: Dispatch(mutations)
    Coordinator->>Dist: groupMutations -> resolve RouteID per mutation
    Dist-->>Coordinator: routed groups (RouteID, elems)
    loop per routed mutation
        Coordinator->>Sampler: observeMutation(RouteID, OpWrite, keyLen, valueLen)
        alt sampler is nil
            Sampler--xCoordinator: no-op (early return)
        else sampler present
            Sampler-->>Coordinator: Observe recorded
        end
    end
    Coordinator->>Shard: send grouped dispatches to target shards
    Shard-->>Coordinator: responses
    Coordinator-->>Client: aggregated response
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 In corridors of routes I hop,
I count each key before I stop.
A nibble here, a nibble there,
Nil-safe notes float through the air.
Hooray—two writes, recorded with care! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding keyviz.Sampler wiring to ShardedCoordinator's dispatch path via WithSampler method and observeMutation calls during mutation grouping.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/keyviz-coordinator-wiring

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
kv/sharded_coordinator_sampler_test.go (1)

103-142: Optional: tighten the nil-safety contract assertion.

Both subtests only assert require.NoError. The PR description states that disabled samplers produce zero observations; the interface-nil case can't be checked without a recorder, but you could at least assert require.Nil(t, coord.sampler) (or a getter) for the "no WithSampler call" subtest to pin that the field stays unset rather than getting silently initialized somewhere. Minor — the panic-free assertion already covers the main contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/sharded_coordinator_sampler_test.go` around lines 103 - 142, The test
TestShardedCoordinatorWithoutSamplerStaysSafe should assert the sampler field
stays unset for the "no WithSampler call" case; after constructing coord via
tc.opt(NewShardedCoordinator(...)) add an assertion that coord.sampler is nil
(or call an exported getter if sampler is unexported) to ensure the coordinator
was not silently initialized with a sampler; leave the typed-nil subtest
unchanged because it cannot be checked without a recorder.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@kv/sharded_coordinator_sampler_test.go`:
- Around line 103-142: The test TestShardedCoordinatorWithoutSamplerStaysSafe
should assert the sampler field stays unset for the "no WithSampler call" case;
after constructing coord via tc.opt(NewShardedCoordinator(...)) add an assertion
that coord.sampler is nil (or call an exported getter if sampler is unexported)
to ensure the coordinator was not silently initialized with a sampler; leave the
typed-nil subtest unchanged because it cannot be checked without a recorder.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f93f4220-e034-4bf1-b611-502700acc929

📥 Commits

Reviewing files that changed from the base of the PR and between 29011bb and d449057.

📒 Files selected for processing (2)
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_sampler_test.go

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 2m 15s —— View job


Code Review

  • Gather context and read changed files
  • Review kv/sharded_coordinator.go changes
  • Review kv/sharded_coordinator_sampler_test.go
  • Check architecture alignment and design conventions
  • Post review feedback

Overall this is clean, focused, and well-scoped. The hot-path properties are preserved, and the test coverage is solid. A few issues to address:


Issue 1 — WithSampler silently relies on a nil-receiver contract that isn't documented in the interface (medium)

WithLeaseReadObserver normalizes typed-nil to untyped nil via normalizeLeaseObserver (which uses reflection), so c.leaseObserver == nil is a reliable guard for any LeaseReadObserver implementation. WithSampler does not do this:

// kv/sharded_coordinator.go:169
func (c *ShardedCoordinator) WithSampler(s keyviz.Sampler) *ShardedCoordinator {
    c.sampler = s  // typed-nil *MemSampler is NOT normalized to untyped nil
    return c
}

When a typed-nil *MemSampler is passed, c.sampler == nil returns false, so observeMutation falls through to c.sampler.Observe(...). This only works safely because MemSampler.Observe starts with if s == nil { return }. That nil-receiver guard is an implementation detail of MemSampler, not a contract of the Sampler interface.

The Sampler interface doc says "nil-safe contract is documented per-method" but Observe's godoc doesn't actually mention the nil-receiver requirement:

// keyviz/sampler.go:61-67
type Sampler interface {
    // Observe records a single request against a route. Op identifies
    // the counter family. ...
    Observe(routeID uint64, op Op, keyLen, valueLen int)
}

A future Sampler implementor that doesn't add a nil-receiver guard will panic when passed as a typed-nil to WithSampler. Either:

  • Add nil-receiver documentation to Sampler.Observe explicitly, or
  • Apply the same normalizeSampler (reflect-based) pattern as normalizeLeaseObserver in WithSampler so the guard at the call site is self-sufficient.

Fix this →


Issue 2 — observeMutation comment explains the WHAT, not the WHY (minor)

Per CLAUDE.md conventions: "Default to writing no comments. Only add one when the WHY is non-obvious." The 4-line comment on observeMutation at kv/sharded_coordinator.go:978–984 primarily describes what the function does (already evident from the code) rather than a non-obvious constraint. The important non-obvious points are: (a) reads never reach this path, and (b) the nil guard avoids a branch on the hot path. Those two lines are worth keeping; the rest is redundant with the code.


Issue 3 — Test assertion gap for "no WithSampler call" case (minor, echoing CodeRabbit)

TestShardedCoordinatorWithoutSamplerStaysSafe verifies that dispatch doesn't panic, but for the first subcase (no WithSampler call) it's possible to assert the structural invariant too. Since sampler is unexported you can't directly assert coord.sampler == nil, but you could verify no observations were recorded by asserting the count on a shared recordingSampler remains zero (similar to how the first test uses a recordingSampler explicitly). The typed-nil case can't be checked this way, as the PR notes, but the "no option" case can.


Design note — Route registration is required for MemSampler.Observe to record data in production

Observe in MemSampler silently drops calls for unregistered RouteIDs (documented in the sampler pkg). The test uses recordingSampler which has no route table, so the route-registration gap doesn't surface there. In production, route registration in MemSampler (via RegisterRoute) must happen via the distribution watcher before observations land — otherwise every call to observeMutation is a no-op. If that follow-up wiring isn't yet tracked, it might be worth adding a TODO comment to WithSampler noting the dependency, or logging it in the design doc as an open milestone.


Positive notes

  • The observeMutation nil-interface guard (c.sampler == nil) is allocation-free and correct — the hot path stays clean when keyviz is disabled.
  • Calling observeMutation after engine.GetRoute ensures the RouteID is valid; observing pre-route-resolution or on failed routes would produce junk data.
  • groupMutations is shared by both the direct dispatch and the transactional path (txnLogs), so transactional writes are automatically observed without duplicating the call.
  • recordingSampler with a sync.Mutex and snapshot() copy is the right pattern for a test double on this hot path.
  • The cross-shard assertion in TestShardedCoordinatorObservesEveryDispatchedMutation correctly verifies route.RouteID via engine.GetRoute rather than hard-coding expected values.

Round-1 review fixes for PR #645:

- Claude bot Issue 1: WithSampler stores the interface value as
  supplied; the guard at the call site only checks interface-nil,
  not typed-nil. Document the nil-receiver requirement on the
  Sampler interface so a future implementor cannot land a
  panic-on-nil Observe and silently break dispatch.
- Claude bot Issue 2: trim observeMutation comment to the two
  non-obvious points (reads bypass this path; the early return
  keeps the disabled hot path branch-only). Per CLAUDE.md, comments
  for the "what" duplicate the code.
- Claude / CodeRabbit nit: TestShardedCoordinatorWithoutSamplerStaysSafe
  now asserts c.sampler stays the zero interface value for the
  "no WithSampler call" subcase, so a future refactor that silently
  initialises the field would fail the test.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 1m 43s —— View job


Code Review

  • Gather context and read changed files
  • Pass 1: Data loss
  • Pass 2: Concurrency / distributed failures
  • Pass 3: Performance
  • Pass 4: Data consistency
  • Pass 5: Test coverage

This is a clean, well-scoped change. The issues from the previous review have all been resolved:

  • Interface nil-receiver contract documentedkeyviz/sampler.go:58-75 now explicitly states in both the interface godoc and Observe's method doc that implementations must be nil-receiver-safe. This is the correct approach since it makes the requirement a binding contract rather than an implementation detail of MemSampler.
  • observeMutation comment explains WHY — the comment now reads "reads never reach this path; the early return keeps the disabled-keyviz hot path allocation-free." — captures the non-obvious invariants rather than restating the code.
  • coord.sampler nil assertion addedTestShardedCoordinatorWithoutSamplerStaysSafe:141-143 now asserts require.Nil(t, coord.sampler) for the "no WithSampler call" subtest, guarding against silent initialisation.

Five-pass findings

Pass 1 — Data loss: No issues. observeMutation is fire-and-forget on the sampling side; a failed Observe (silently dropped for unregistered routes) has no bearing on the storage path. WithSampler stores a reference and doesn't touch any write path.

Pass 2 — Concurrency / distributed failures: No issues. WithSampler follows the same builder pattern as WithLeaseReadObserver — it is expected to be called at startup before any concurrent Dispatch. The c.sampler field is written once and then read-only on the hot path, so no synchronisation is needed. recordingSampler in tests correctly uses sync.Mutex.

Pass 3 — Performance: No issues. The nil check (c.sampler == nil) at kv/sharded_coordinator.go:978 is a single branch, zero-allocation, and correctly placed so it short-circuits before the interface dispatch. When a sampler is wired, MemSampler.Observe is lock-free (atomic.Pointer load + map lookup + two atomic adds) as documented in the package header. len(mut.Key) / len(mut.Value) are O(1).

Pass 4 — Data consistency: No issues. observeMutation is called after engine.GetRoute succeeds (kv/sharded_coordinator.go:995), so the RouteID fed to the sampler always matches the RouteID used for actual routing — consistent attribution. Calling observe before grouped[route.GroupID] = append(...) means a dispatch failure (Raft proposal rejected) will have already been counted; this is expected/acceptable for a best-effort heatmap sampler. txnLogs reuses groupMutations so transactional writes are automatically covered.

Pass 5 — Test coverage: Mostly solid, with two minor gaps:

  1. Delete mutations are not explicitly tested. Both Put and Delete go through groupMutations, so the wiring is identical, but a table-driven entry with Op: Delete in TestShardedCoordinatorObservesEveryDispatchedMutation would verify that len(mut.Value) == 0 passes through correctly and that OpWrite is still used (delete is not a read). Low risk but a trivial addition.

  2. Transactional path is not directly tested. txnLogs calls groupMutations, which now contains the observe call. The unit test only exercises the non-txn Dispatch path. This is a minor gap; given the shared code path the contract is effectively the same, but a test comment or a single txn subtest would lock it down.

Neither gap is a blocker. The cross-shard assertion in TestShardedCoordinatorObservesEveryDispatchedMutation correctly resolves the expected RouteID via engine.GetRoute rather than hard-coding it — that's the right approach.


Design note — Production route-registration dependency

MemSampler.Observe silently drops calls for unregistered RouteIDs (as documented). In production, RegisterRoute must be called before the coordinator publishes a new RouteID — i.e., the distribution watcher must wire route registration before the first dispatch. If that follow-up wiring isn't yet tracked, it's worth recording as an open milestone in the design doc or as a TODO on WithSampler.


Overall: approve. The sampling hot path is correct, nil-safe, and allocation-free when disabled. The two test coverage gaps are minor and don't block merging.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
kv/sharded_coordinator.go (1)

978-985: LGTM — clean, nil-safe single call site.

observeMutation correctly guards the interface-nil case, and typed-nil is delegated to the implementation per the keyviz.Sampler contract (see keyviz/sampler.go:68-75 and MemSampler.Observe lines 324-327). Placing the call before grouping by GroupID makes the per-mutation attribution unambiguous.

One thing worth being explicit about (in a follow-up doc tweak, not a blocker): observation fires before the Raft commit, so failed dispatches are still counted in the heatmap. That's almost certainly what you want for a workload heatmap, but it's worth pinning in the comment so a future contributor doesn't "fix" it by moving the call after router.Commit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/sharded_coordinator.go` around lines 978 - 985, Update the comment above
the observeMutation method in ShardedCoordinator to explicitly state that
sampler.Observe is called before the Raft commit/router.Commit and therefore
observations will include failed/aborted dispatches (they are counted in the
heatmap); reference the observeMutation function and the call to
c.sampler.Observe so future readers understand the intentional ordering and its
implications for workload attribution.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@kv/sharded_coordinator_sampler_test.go`:
- Around line 87-96: The test currently calls engine.GetRoute(elem.Key) but
production routes via engine.GetRoute(routeKey(...)), so update the assertion to
resolve the route using the same transform: call
engine.GetRoute(routeKey(elem.Key)) and use the returned route.RouteID in the
sampleCall comparison (keeping sampleCall, ops.Elems, and calls[i] as before) to
ensure the test mirrors production routing behavior.

---

Nitpick comments:
In `@kv/sharded_coordinator.go`:
- Around line 978-985: Update the comment above the observeMutation method in
ShardedCoordinator to explicitly state that sampler.Observe is called before the
Raft commit/router.Commit and therefore observations will include failed/aborted
dispatches (they are counted in the heatmap); reference the observeMutation
function and the call to c.sampler.Observe so future readers understand the
intentional ordering and its implications for workload attribution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5156ae81-8500-4932-8e02-54f15984b10f

📥 Commits

Reviewing files that changed from the base of the PR and between d449057 and 2897546.

📒 Files selected for processing (3)
  • keyviz/sampler.go
  • kv/sharded_coordinator.go
  • kv/sharded_coordinator_sampler_test.go
✅ Files skipped from review due to trivial changes (1)
  • keyviz/sampler.go

Comment thread kv/sharded_coordinator_sampler_test.go
Round-2 review fixes for PR #645:

CodeRabbit nit on observeMutation: spell out the pre-commit ordering
intent so a future contributor doesn't "fix" it by moving the call
after router.Commit. The heatmap is meant to reflect offered load
(useful for planning capacity); counting only committed writes
would hide hotspots that show up as repeated proposal failures.

CodeRabbit nit on TestShardedCoordinatorObservesEveryDispatchedMutation:
production routes via engine.GetRoute(routeKey(mut.Key)) but the
test was calling engine.GetRoute(elem.Key) directly. Both inputs
("b", "x") happen to be identity through normalizeRouteKey today,
but a future change to internal-prefix handling could silently
diverge the test from production. Mirror the production transform.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 1m 53s —— View job


Code Review

  • Gather context from previous reviews
  • Read current state of changed files
  • Review keyviz/sampler.go changes
  • Review kv/sharded_coordinator.go changes
  • Review kv/sharded_coordinator_sampler_test.go
  • Post review feedback

All issues from the two previous review rounds have been addressed in commit 0c89321. This is ready to merge.

What was fixed since the last review

  • Pre-commit semantics documentedobserveMutation's comment now explicitly states observations are pre-commit and intentionally reflect offered load ("the heatmap reflects offered load, not just committed writes") — the concern CodeRabbit flagged about a future contributor moving the call is locked down.
  • routeKey() in test assertionengine.GetRoute(routeKey(elem.Key)) at kv/sharded_coordinator_sampler_test.go:89 now mirrors production's routing transform.

Five-pass findings

Pass 1 — Data loss: No issues. observeMutation is a side-effect-only sampling call; any silent drop (unregistered route in MemSampler) has zero bearing on the Raft write path.

Pass 2 — Concurrency: No issues. c.sampler is written once in WithSampler before concurrent use. recordingSampler in tests correctly guards with sync.Mutex. observeMutation is read-only on the field.

Pass 3 — Performance: No issues. The c.sampler == nil guard at kv/sharded_coordinator.go:984 is a single zero-allocation branch. MemSampler.Observe is atomic-only (no mutex on the hot path). Placement before grouped[route.GroupID] = append(...) means no extra work for keys that fail routing.

Pass 4 — Data consistency: No issues. The RouteID fed to the sampler comes directly from engine.GetRoute(routeKey(mut.Key)) — the same route that drives actual grouping. Attribution is always coherent with dispatch.

Pass 5 — Test coverage: Mostly solid. Two minor gaps remain (unchanged from the second review), neither is a blocker:

  1. No explicit Del mutation subtest. Del goes through groupMutations identically to Put, producing len(mut.Value) == 0 and OpWrite — correct behavior, low risk, but a single table-driven entry would lock it down.
  2. Transactional path not directly covered. txnLogs calls groupMutations which now carries the observe call; the wiring is identical but not independently tested.

Design note — Production route registration

MemSampler.Observe silently drops calls for unregistered RouteIDs. In production, the distribution watcher must call RegisterRoute before a route becomes active. This follow-up wiring isn't currently tracked in the design doc or as a TODO on WithSampler. Worth recording before the production wiring PR.

Positive notes

  • The Sampler interface godoc (keyviz/sampler.go:58–75) now carries a first-class nil-receiver-safe contract — future implementors get the requirement upfront rather than discovering it at runtime.
  • observeMutation method boundary is clean: one concern, one nil guard, one Observe call.
  • recordingSampler.snapshot() uses a copy to avoid holding the lock during assertions — correct pattern.
  • wantNilField logic in TestShardedCoordinatorWithoutSamplerStaysSafe correctly distinguishes interface-nil (assertable) from typed-nil (not assertable without a recorder).

Verdict: approve. The implementation is correct, nil-safe, allocation-free on the disabled path, and the interface contract is clearly documented.

@bootjp bootjp merged commit 5b3169f into main Apr 26, 2026
8 checks passed
@bootjp bootjp deleted the feat/keyviz-coordinator-wiring branch April 26, 2026 09:11
bootjp added a commit that referenced this pull request Apr 26, 2026
…nator + admin) (#647)

## Summary

End-to-end keyviz wiring on each node binary. Stacked on top of #645
(coordinator `WithSampler`) and #646 (admin server `RegisterSampler`) —
those two PRs are merged into this branch via merge commits, so this
PR's incremental diff is `main.go` + `main_keyviz_test.go`.

- Three new flags: `--keyvizEnabled` (off by default — opt-in),
`--keyvizStep`, `--keyvizMaxTrackedRoutes`,
`--keyvizMaxMemberRoutesPerSlot`. Defaults pull from `keyviz.Default*`
so flag help and code stay in sync.
- `buildKeyVizSampler()` returns `*keyviz.MemSampler` when enabled,
`nil` otherwise — the single decision point for "keyviz on/off."
- `seedKeyVizRoutes()` copies the engine's startup route catalogue into
the sampler. Route-watch propagation post-startup is a follow-up (Phase
3 in the design doc).
- `startKeyVizFlusher()` runs `RunFlusher` in the existing errgroup and
calls one final `Flush()` after ctx fires so the in-progress step is
harvested at graceful shutdown.
- Coordinator: chained `.WithSampler(...)` onto the existing
`.WithLeaseReadObserver(...)` call so the dispatch hot path observes
mutations.
- AdminServer: `setupAdminService` now takes the `*MemSampler` and only
calls `RegisterSampler` when it's non-nil — operators with keyviz
disabled get `codes.Unavailable` on `GetKeyVizMatrix` instead of a
spurious empty success.

Implements `docs/admin_ui_key_visualizer_design.md` §5 / §10. After this
lands the heatmap is fully usable end-to-end.

## Test plan

- [x] `TestBuildKeyVizSamplerHonorsEnabledFlag` — flag on/off contract.
- [x] `TestSeedKeyVizRoutesCopiesEngineCatalogue` — `engine.Stats()`
seed path.
- [x] `TestSeedKeyVizRoutesNoOpOnNilSampler` — disabled-sampler safety.
- [x] `TestStartKeyVizFlusherReturnsAfterCancel` — graceful shutdown
drains the pre-cancel counters via the final `Flush`.
- [x] `go build .`, `go vet .`, `golangci-lint run ./...` clean.
- [x] `go test -race -count=1 -run
'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' .`
passes.

## Merge order

1. Merge #645 (coordinator wiring).
2. Merge #646 (admin server wiring).
3. Rebase this branch on `main` (the merge commits in this PR collapse
to no-ops) and merge.
bootjp added a commit that referenced this pull request Apr 26, 2026
…646)

## Summary

- Implements `AdminServer.GetKeyVizMatrix` against the proto declared in
`proto/admin.proto` (already generated; no proto edits).
- New narrow `KeyVizSampler` interface in the adapter package (just
`Snapshot(from, to time.Time) []keyviz.MatrixColumn`), so production
wires `*keyviz.MemSampler` while tests pass an in-memory fake.
- `AdminServer.RegisterSampler` mirrors `RegisterGroup`. Without it,
`GetKeyVizMatrix` returns `codes.Unavailable` so callers can distinguish
"keyviz disabled on this node" from "no data yet" (which is a successful
empty response).
- Pivots the column-major `MatrixColumn` slice into the row-major proto
layout: one `KeyVizRow` per `RouteID` with values aligned to a parallel
`column_unix_ms` slice. `KeyVizSeries` selection picks the matching
per-row counter; `UNSPECIFIED` defaults to `Reads`.
- `bucket_id` encodes `route:<id>` for individual slots and
`virtual:<syntheticID>` for aggregate buckets. Aggregate rows carry
`MemberRoutes` verbatim through `route_ids` and `route_count`.

Implements the read-side half of
`docs/admin_ui_key_visualizer_design.md` §5.2 / §6. The dispatch-side
`Observe` wiring is in #645.

## Test plan

- [x] `TestGetKeyVizMatrixReturnsUnavailableWhenSamplerNotRegistered` —
verifies `codes.Unavailable` when no sampler registered.
- [x] `TestGetKeyVizMatrixPivotsColumnsToRows` — two-column / two-route
fixture, verifies the missing-row-becomes-zero contract.
- [x] `TestGetKeyVizMatrixSeriesSelection` — table-driven across all
five enum values including `UNSPECIFIED` defaulting to `Reads`.
- [x] `TestGetKeyVizMatrixEncodesAggregateBucket` — virtual bucket
layout (`bucket_id` prefix, `aggregate=true`, `route_ids`,
`route_count`).
- [x] `go test -race -count=1 -run TestGetKeyVizMatrix ./adapter/...`
clean.
- [x] `golangci-lint run ./adapter/...` clean.
bootjp added a commit that referenced this pull request Apr 26, 2026
## Summary

Phase 2-B of the Key Visualizer design
(`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into
the existing `web/admin/` SPA rather than building a parallel dashboard.
Doc + implementation in one PR (doc commit first, per CLAUDE.md
design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling
`/admin/api/v1/keyviz/matrix` with series picker (writes / reads /
write_bytes / read_bytes), row-budget input (clamped at 1024), and off /
5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue —
spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate,
route_count, route_ids (with truncation indicator). `start` / `end`
decode through a printable-or-hex preview so binary keys do not render
as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in
`lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208
kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and
its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 /
#646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed
status, lands with this PR. Parent design §12 phase table is split into
2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out,
open).

## Five-lens self-review

1. **Data loss** — n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; single browser tab polling a
single handler. Sampler concurrency was already covered by Phase 2-A
tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget
at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower
bound (sampler flush is 1 s).
4. **Data consistency** — SPA renders whatever the handler returns;
consistency guarantees come from the existing leader-issued counters in
the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go
build ./internal/admin/...` clean (embed glob unaffected); `go test
./internal/admin/...` clean. Manual verification documented in the
design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to
`internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot
routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget
input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — handler is currently node-local. Phase 2-C will
add a cross-node admin RPC; this PR will pick up the aggregate view
automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet
built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant