Skip to content

feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix#646

Merged
bootjp merged 4 commits intomainfrom
feat/keyviz-admin-server
Apr 26, 2026
Merged

feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix#646
bootjp merged 4 commits intomainfrom
feat/keyviz-admin-server

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 25, 2026

Summary

  • Implements AdminServer.GetKeyVizMatrix against the proto declared in proto/admin.proto (already generated; no proto edits).
  • New narrow KeyVizSampler interface in the adapter package (just Snapshot(from, to time.Time) []keyviz.MatrixColumn), so production wires *keyviz.MemSampler while tests pass an in-memory fake.
  • AdminServer.RegisterSampler mirrors RegisterGroup. Without it, GetKeyVizMatrix returns codes.Unavailable so callers can distinguish "keyviz disabled on this node" from "no data yet" (which is a successful empty response).
  • Pivots the column-major MatrixColumn slice into the row-major proto layout: one KeyVizRow per RouteID with values aligned to a parallel column_unix_ms slice. KeyVizSeries selection picks the matching per-row counter; UNSPECIFIED defaults to Reads.
  • bucket_id encodes route:<id> for individual slots and virtual:<syntheticID> for aggregate buckets. Aggregate rows carry MemberRoutes verbatim through route_ids and route_count.

Implements the read-side half of docs/admin_ui_key_visualizer_design.md §5.2 / §6. The dispatch-side Observe wiring is in #645.

Test plan

  • TestGetKeyVizMatrixReturnsUnavailableWhenSamplerNotRegistered — verifies codes.Unavailable when no sampler registered.
  • TestGetKeyVizMatrixPivotsColumnsToRows — two-column / two-route fixture, verifies the missing-row-becomes-zero contract.
  • TestGetKeyVizMatrixSeriesSelection — table-driven across all five enum values including UNSPECIFIED defaulting to Reads.
  • TestGetKeyVizMatrixEncodesAggregateBucket — virtual bucket layout (bucket_id prefix, aggregate=true, route_ids, route_count).
  • go test -race -count=1 -run TestGetKeyVizMatrix ./adapter/... clean.
  • golangci-lint run ./adapter/... clean.

Implements the `GetKeyVizMatrix` admin gRPC RPC declared in
`proto/admin.proto`, completing the read-side of the key visualizer
flow (design doc §5.2 / §6).

- New `KeyVizSampler` interface in adapter (Snapshot only) so tests
  can pass a deterministic in-memory fake while production wires
  `*keyviz.MemSampler` directly.
- `AdminServer.RegisterSampler` mirrors `RegisterGroup`. Without it,
  GetKeyVizMatrix returns codes.Unavailable so callers can
  distinguish "keyviz disabled on this node" from "no data yet"
  (which is a successful empty response).
- Pivots the column-major MatrixColumn slice into the row-major proto
  layout: one KeyVizRow per RouteID with values aligned to a parallel
  column_unix_ms slice. KeyVizSeries selection picks the matching
  per-row counter; UNSPECIFIED defaults to Reads.
- bucket_id encodes "route:<id>" for individual slots and
  "virtual:<syntheticID>" for aggregate buckets. Aggregate rows carry
  MemberRoutes verbatim through route_ids and route_count.

Tests:
- TestGetKeyVizMatrixReturnsUnavailableWhenSamplerNotRegistered
- TestGetKeyVizMatrixPivotsColumnsToRows — two-column / two-route
  fixture, verifies the missing-row-becomes-zero contract.
- TestGetKeyVizMatrixSeriesSelection — table-driven across all five
  enum values including UNSPECIFIED defaulting to Reads.
- TestGetKeyVizMatrixEncodesAggregateBucket — virtual bucket layout.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 29 minutes and 28 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 29 minutes and 28 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da4a0727-664e-4620-8aba-c734df67d7ca

📥 Commits

Reviewing files that changed from the base of the PR and between 29011bb and 82dd4c4.

📒 Files selected for processing (4)
  • adapter/admin_grpc.go
  • adapter/admin_grpc_keyviz_test.go
  • keyviz/sampler.go
  • keyviz/sampler_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/keyviz-admin-server

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33d21818be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread adapter/admin_grpc.go Outdated
Comment on lines +536 to +538
cols := sampler.Snapshot(from, to)
pickValue := matrixSeriesPicker(req.GetSeries())
return matrixToProto(cols, pickValue), nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Respect requested rows limit in GetKeyVizMatrix

GetKeyVizMatrix never reads req.GetRows(), and always returns all sampled rows via matrixToProto(cols, pickValue). In deployments with many tracked routes, a caller requesting a compact matrix (for example rows=256) can still receive thousands of rows, which breaks the API’s row-budget contract and can create very large response payloads for the admin UI.

Useful? React with 👍 / 👎.

Comment thread adapter/admin_grpc.go Outdated
Start: append([]byte(nil), mr.Start...),
End: append([]byte(nil), mr.End...),
Aggregate: mr.Aggregate,
RouteCount: uint64(len(mr.MemberRoutes)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve true aggregate route count when building KeyVizRow

newKeyVizRowFrom sets route_count to len(mr.MemberRoutes), but MemberRoutes is intentionally capped in the sampler (MaxMemberRoutesPerSlot) while extra routes still contribute to the bucket counters. For capped aggregate buckets, this underreports contributors and leaves route_ids_truncated false, so consumers cannot tell that membership was truncated and may make incorrect drill-down decisions.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 41s —— View job


Code Review: feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix

  • Read changed files
  • Read keyviz package for context
  • Read proto definition and design doc
  • Five-lens review
  • Post final review

The overall shape is clean — a narrow KeyVizSampler interface, correct RLock-fetch-unlock pattern for sampler access, proper deep-copy semantics in newKeyVizRowFrom, and well-targeted tests for the happy paths. Two correctness issues need addressing before this lands.


P1 — req.GetRows() row-budget contract is violated

GetKeyVizMatrix ignores req.GetRows() entirely:

// adapter/admin_grpc.go:538
cols := sampler.Snapshot(from, to)
pickValue := matrixSeriesPicker(req.GetSeries())
return matrixToProto(cols, pickValue), nil  // req.GetRows() never read

The design doc (§5.2, line 199) says:

The API's rows parameter is a target, not a guarantee. The server walks the route list in lexicographic order and greedily merges adjacent routes until the row count fits.

And line 299:

API endpoints cap to − from at 7 days and rows at 1024 to bound server work.

A deployment tracking thousands of routes (within MaxTrackedRoutes = 10_000) can serve a response orders of magnitude larger than the caller requested. The UI typically sends rows=256 for a compact heat-map view; today it receives all rows unconditionally.

Fix needed: After sortKeyVizRowsByStart, if req.GetRows() > 0 && uint32(len(resp.Rows)) > req.GetRows(), greedily merge adjacent low-activity rows (the design's stated algorithm) or, at minimum, truncate to the budget and document the truncation until the full merge logic lands. The design spec says this is the server's responsibility.

Fix this →


P2 — route_ids_truncated is never set; route_count underreports under the cap

newKeyVizRowFrom sets RouteCount = len(mr.MemberRoutes), but MemberRoutes is already capped at MaxMemberRoutesPerSlot (default 256) inside keyviz.foldIntoBucket:

// keyviz/sampler.go:446-448
if !memberRoutesContains(bucket.MemberRoutes, routeID) &&
    len(bucket.MemberRoutes) < s.opts.MaxMemberRoutesPerSlot {
    bucket.MemberRoutes = append(bucket.MemberRoutes, routeID)
}

So for a bucket with 500 folded routes, mr.MemberRoutes carries at most 256 entries, RouteCount reports 256 (not 500), and RouteIdsTruncated stays false. The admin UI design doc says:

the drawer explicitly says the row is a coarsened virtual bucket and lists the constituent routes or the truncated route count

Consumers can't distinguish "this bucket has exactly 256 members" from "it has ≥256 members and the rest were dropped."

The keyviz.MatrixRow struct carries no true-total field today, so newKeyVizRowFrom can't fix this unilaterally. The proper fix is a two-part change:

  1. Add a TotalMemberCount uint64 (or MemberTruncated bool) field to keyviz.MatrixRow and populate it in appendDrainedRow / Flush — set to the real member count (tracked separately in routeSlot) vs. the capped MemberRoutes length.
  2. In newKeyVizRowFrom, wire it:
    row.RouteCount = mr.TotalMemberCount
    row.RouteIdsTruncated = mr.MemberTruncated

A stop-gap using len(mr.MemberRoutes) >= DefaultMaxMemberRoutesPerSlot is heuristic and inaccurate when the cap is customised. That path should be avoided.

Fix this →


Minor issues

Lock reuse for unrelated fields (groupsMu).
Using the same sync.RWMutex for groups, now, and sampler is documented in the comment and is correct for the current call pattern (sampler is set once at startup). No change needed, but worth tracking if RegisterSampler is ever called dynamically.

label, lineage_id, soft_columns, sample_roles are always empty.
These are Phase 3 features (design doc §3 / persistence + lineage). Intentional for this milestone — fine as long as it's noted in the design-doc lifecycle marker when Phase 3 lands.

Missing test cases:

  • No test for req.Rows > 0 (the budget path) — even a test asserting current "rows ignored" behaviour would lock it down until the fix lands, per the CLAUDE.md defect-test convention.
  • No test for from_unix_ms == to_unix_ms (empty window) — ringBuffer.Range returns nil for this case (lo == hi) so matrixToProto returns a valid empty response; a test pinning that is cheap.
  • No test for MemberRoutes truncation scenario to pin RouteIdsTruncated behaviour once the P2 fix lands.

Correctness of unixMsToTime with negative input.
The proto declares from_unix_ms / to_unix_ms as int64, so a client can pass a negative ms (pre-epoch). time.UnixMilli handles negative values correctly and the ring buffer will return an empty range. No bug, but a doc comment noting "negative values are treated as pre-epoch timestamps, not as unbounded" would prevent future confusion.


What works well

  • The KeyVizSampler interface is correctly narrow — production wires *keyviz.MemSampler, tests use fakeKeyVizSampler.
  • Unavailable vs. "successful empty response" distinction for disabled keyviz is the right design choice.
  • Deep copies of Start, End, and RouteIds in newKeyVizRowFrom correctly prevent aliasing with the sampler's internal state.
  • The RLock-only pattern for the sampler field (fetch pointer → release lock → call through pointer) is correct; MemSampler.Snapshot carries its own historyMu.
  • sortKeyVizRowsByStart provides a deterministic secondary sort by BucketId when routes share the same Start key.
  • Tests cover the four non-trivial paths (unavailable, pivot correctness, all five series enum values, aggregate bucket encoding) with clear fixture helpers.

Round-1 review fixes for PR #646:

Codex P1: GetKeyVizMatrix never read req.GetRows() and always
returned every row, breaking the documented row-budget contract for
deployments with many tracked routes. Add applyKeyVizRowBudget — when
budget > 0, sort by per-row activity total (sum of values across
all columns of the requested series) and truncate to the top-N
before the final Start-order sort.

Codex P2: route_count was set to len(MemberRoutes), but MemberRoutes
is intentionally capped at MaxMemberRoutesPerSlot in the sampler
while extra routes still drive the bucket counters. Capped buckets
under-reported their contributors. Add MemberRoutesTotal to the
keyviz routeSlot/MatrixRow surface, increment it in
foldIntoBucket / RegisterRoute / bucket creation regardless of the
visible cap, and decrement in pruneMemberRoute. The adapter now
surfaces it as route_count and flips route_ids_truncated when the
visible list is shorter than the total so consumers can tell their
drill-down list is partial.

Tests:
- TestGetKeyVizMatrixHonorsRowsBudget — 4 routes, rows=2 returns
  top-2 by activity, sorted by Start.
- TestGetKeyVizMatrixSurfacesRouteCountTruncation — visible cap=2,
  total=9 → route_count=9, route_ids_truncated=true.
- TestGetKeyVizMatrixEncodesAggregateBucket updated for the new
  MemberRoutesTotal field on the test fixture.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 5m 2s —— View job


Code Review: feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix

  • Read changed files
  • Read keyviz package for context
  • Read proto definition and design doc
  • Five-lens review
  • Post final review

Both issues from the prior round have been resolved. P1 (req.GetRows() ignored) and P2 (route_ids_truncated / RouteCount underreporting) are fixed with clean implementations and regression tests. Below is a fresh pass over the new code.


P1 — Row budget: truncation vs. merge (design §5.5 divergence)

applyKeyVizRowBudget implements truncation — it sorts by activity and keeps the top-N rows, silently dropping the rest:

// adapter/admin_grpc.go:611-618
func applyKeyVizRowBudget(rows []*pb.KeyVizRow, budget int) []*pb.KeyVizRow {
    if budget <= 0 || len(rows) <= budget {
        return rows
    }
    sort.Slice(rows, func(i, j int) bool {
        return rowActivityTotal(rows[i]) > rowActivityTotal(rows[j])
    })
    return rows[:budget]
}

Design doc §5.5 specifies a different algorithm:

"The server walks the route list in lexicographic order and greedily merges adjacent routes until the row count fits. Merge priority: lowest total activity across the requested window, so hotspots stay un-merged and visible."

The behavioral difference matters for the UI: truncation creates key-space gaps (ranges that appear entirely absent), while the specified merge produces a continuous heatmap where low-activity adjacent ranges become coarser aggregate buckets. A user looking at the heatmap with rows=50 over 100 routes would see 50 blank rows where routes 51–100 were dropped instead of seeing them coarsened. The test TestGetKeyVizMatrixHonorsRowsBudget pins the current truncation behavior, which is clearly intentional.

If this is a deliberate Phase-1 simplification, it should be:

  1. Documented as a known divergence from the design spec in the PR description and/or the design doc lifecycle marker.
  2. Tracked for Phase-2 completion (a follow-up issue or the design doc's TODO list).

Fix this →


P2 — No unit test for MemberRoutesTotal in keyviz package

TestMemberRoutesCappedAtConfiguredCap (keyviz/sampler_test.go:892) exercises the cap but doesn't assert MemberRoutesTotal:

// Current test only checks MemberRoutes length and counter values,
// not MemberRoutesTotal — the new field has no sampler-level test.
if len(agg.MemberRoutes) > 3 { ... }
if agg.Reads != 8 { ... }
// MemberRoutesTotal never inspected

Per CLAUDE.md: "New/changed branches must have unit tests (table-driven, co-located *_test.go)." The adapter-level tests (TestGetKeyVizMatrixSurfacesRouteCountTruncation) cover the proto wiring through a fake, but don't exercise the actual foldIntoBucketMemberRoutesTotal increment path. A test in keyviz/sampler_test.go should verify:

  • MemberRoutesTotal equals len(MemberRoutes) when under cap
  • MemberRoutesTotal continues to grow past cap while MemberRoutes is capped
  • MemberRoutesTotal is decremented by pruneMemberRoute only when the route was visible

Fix this →


Minor Issues

newKeyVizRowFrom uses first-column metadata, not latest. The pivot loop calls newKeyVizRowFrom(mr, len(cols)) only for the first column where a RouteID appears. For a virtual bucket that gained members over the window, MemberRoutesTotal in the proto row reflects the bucket's state at the start of the window, not the end. For RouteIdsTruncated, this could show false on an early column and true on a later one — the proto response will show whichever came first. Acceptable for Phase 1, but worth noting: if precision matters, the pivot could merge per-column metadata and take the latest value.

No upper-bound validation for rows. Design doc §5.5 notes "API endpoints cap rows at 1024 to bound server work." The implementation applies no such cap before sorting. With MaxTrackedRoutes=10000 the sort is bounded but the proto spec isn't enforced. A simple if req.GetRows() > 1024 { req.Rows = 1024 } (or a codes.InvalidArgument response) would match the spec.

MemberRoutesTotal drift for past-cap churn. pruneMemberRoute correctly documents that it only decrements MemberRoutesTotal for routes visible in MemberRoutes. A route folded past the cap that later retires and re-registers will increment MemberRoutesTotal again without a matching decrement, causing slow drift upward. The current comment is accurate but doesn't bound the drift. For high-churn deployments with many routes compressing into a single virtual bucket this could eventually cause RouteIdsTruncated=true even when the actual active route count fits within MaxMemberRoutesPerSlot. Documenting the bound (or resetting MemberRoutesTotal during slot reclaim if the foldIntoBucket path hasn't been called yet) would help.


What Works Well

  • Both prior P1/P2 issues are resolved. applyKeyVizRowBudget is correctly wired to req.GetRows(), and MemberRoutesTotal is correctly tracked through foldIntoBucket, drained in appendDrainedRow, and surfaced via newKeyVizRowFromRouteCount / RouteIdsTruncated.
  • Lock discipline is consistent. MemberRoutesTotal is always accessed under metaMu (write in foldIntoBucket / pruneMemberRoute / RegisterRoute; read in snapshotMeta). The RLock-fetch-unlock pattern for the sampler field remains correct.
  • cloneColumn correctly propagates MemberRoutesTotal via struct value copy (rows[i] = row) — no manual copy needed since it's a uint64.
  • newKeyVizRowFrom fallback for pre-MemberRoutesTotal data (!mr.Aggregate && total == 0 → total = 1) is a safe compatibility shim.
  • Five new tests cover the four key behaviors: Unavailable sentinel, pivot correctness, all five series enum values, aggregate bucket encoding, truncation signal (RouteIdsTruncated), and the rows budget cap. Table-driven structure and isolated fixtures are clean.
  • foldIntoBucket correctly separates the total-counter increment from the visible-list append — the two concerns are now independent, which was the core correctness gap in the prior round.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 237a0902d2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread keyviz/sampler.go Outdated
Comment on lines +462 to +464
if !memberRoutesContains(bucket.MemberRoutes, routeID) {
bucket.MemberRoutesTotal++
if len(bucket.MemberRoutes) < s.opts.MaxMemberRoutesPerSlot {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid double-counting hidden members on virtual-route rejoin

foldIntoBucket increments MemberRoutesTotal whenever the route ID is not in MemberRoutes, but over-cap members are intentionally never added to MemberRoutes. If one of those hidden routes is removed and then re-registered into the same bucket within the prune grace window, RegisterRoute calls foldIntoBucket before canceling the pending prune, so the same logical member is counted twice and route_count drifts upward even though the bucket membership did not grow.

Useful? React with 👍 / 👎.

Comment thread keyviz/sampler.go
Comment on lines +741 to +742
if removed && bucket.MemberRoutesTotal > 0 {
bucket.MemberRoutesTotal--
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Decrement aggregate totals when pruning capped-out members

pruneMemberRoute only decrements MemberRoutesTotal when the route ID is present in the visible MemberRoutes slice. For members folded past MaxMemberRoutesPerSlot, the ID is absent from that slice, so removals never reduce MemberRoutesTotal; after grace expiry the bucket can report a permanently inflated route_count (and keep route_ids_truncated true) even when those routes are no longer mapped to the bucket.

Useful? React with 👍 / 👎.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Round-2 review fixes for PR #646 (Codex P2 ×2):

P2 #1: foldIntoBucket dedup only consulted the visible MemberRoutes
slice, so re-registering a past-cap (hidden) routeID inside the
prune grace incremented MemberRoutesTotal a second time —
route_count drifted upward on remove/re-register churn even though
the bucket membership didn't grow.

P2 #2: pruneMemberRoute only decremented MemberRoutesTotal for
routes visible in MemberRoutes. Past-cap members were absent from
that list, so removals never reduced the total — route_count stayed
permanently inflated and route_ids_truncated stayed permanently
true even after the route was fully gone.

Add a `hiddenMembers map[uint64]struct{}` set on routeSlot. Past-cap
folds land there instead of being silently dropped from accounting:
- foldIntoBucket dedups against both MemberRoutes and hiddenMembers
  before incrementing.
- pruneMemberRoute checks hiddenMembers when the routeID isn't in
  the visible list, deletes from the set (releasing the map when it
  empties), and decrements the total.

Refactor: extract addMemberToBucket helper so foldIntoBucket stays
under the cyclop budget.

Tests:
- TestPastCapMemberRejoinDoesNotInflateTotal — Remove + Register a
  hidden member → MemberRoutesTotal stays the same, no drift.
- TestPastCapMemberPruneDecrementsTotal — Remove + grace expiry of a
  hidden member → MemberRoutesTotal decrements after the prune fires
  on the second post-grace flush.
- Updated TestMemberRoutesCappedAtConfiguredCap to assert
  MemberRoutesTotal=8 alongside the existing visible-cap=3 check.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 4m 8s —— View job


Code Review: feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix (Round 3)

  • Read changed files
  • Read design doc and proto for spec compliance
  • Five-lens review
  • Post final review

Round 2's two P1/P2 fixes are in and structurally correct. Two Codex-flagged bugs in keyviz/sampler.go (lines 464 and 742) remain open and are the main new findings this round.


P1 — MemberRoutesTotal double-counts on hidden-route re-registration (keyviz/sampler.go:462)

Codex flagged this at line 464 and it is correct. The bug path:

  1. Route X registers over the MaxTrackedRoutes cap → foldIntoBucket is called → MemberRoutesTotal++ (X is not in MemberRoutes, so the guard memberRoutesContains is false) → MemberRoutesTotal = N.
  2. RemoveRoute(X) — deletes X from virtualForRoute, queues a memberPrune(bucket=B, routeID=X).
  3. RegisterRoute(X) within the grace window — X is gone from virtualForRoute, so RegisterRoute proceeds.
  4. findVirtualBucket finds B again → foldIntoBucket(B, X) is called again.
  5. memberRoutesContains(B.MemberRoutes, X) is false (X was never visible, cap was exceeded) → MemberRoutesTotal++ → N+1.
  6. cancelPendingPruneFor(B, X) correctly cancels the prune.

Result: MemberRoutesTotal is N+1 even though only N distinct routes ever folded in. Each remove-and-rejoin within the grace window for a hidden route inflates the counter by 1. For a deployment where routes register and deregister frequently against a full bucket, RouteIdsTruncated becomes permanently true, and route_count drifts monotonically upward.

The root cause is that foldIntoBucket has no way to detect "this route was already counted but is not visible in MemberRoutes". The fix options:

Option A (preferred): Before calling foldIntoBucket, check whether a memberPrune is pending for (bucket, routeID) — if so, this is a re-registration of an already-counted member. Skip the MemberRoutesTotal++ in foldIntoBucket in that case. The existing cancelPendingPruneFor already covers the route staying visible; it just needs a companion "was this already counted?" signal. In practice, cancelPendingPruneFor is called immediately after foldIntoBucket, so the check can be done in RegisterRoute before the foldIntoBucket call and passed as a parameter:

alreadyCounted := s.hasPendingPruneFor(bucket, routeID)
s.foldIntoBucket(next, bucket, routeID, start, end, alreadyCounted)

Option B (simpler): Add a hiddenMembers map[uint64]struct{} to routeSlot tracking past-cap route IDs. foldIntoBucket checks this set before incrementing; pruneMemberRoute removes from the set so the decrement can be applied. This also fixes P2 below.

Fix this →


P2 — pruneMemberRoute never decrements MemberRoutesTotal for past-cap members (keyviz/sampler.go:728)

Codex flagged this at line 742 and is correct. When a hidden (past-cap) route's grace window expires, advancePendingPrunes calls pruneMemberRoute(bucket, routeID). Inside:

// keyviz/sampler.go:732-743
filtered := bucket.MemberRoutes[:0]
removed := false
for _, m := range bucket.MemberRoutes {
    if m == routeID { removed = true; continue }
    filtered = append(filtered, m)
}
bucket.MemberRoutes = filtered
if removed && bucket.MemberRoutesTotal > 0 {
    bucket.MemberRoutesTotal--
}

For a past-cap route, routeID is never in MemberRoutesremoved stays falseMemberRoutesTotal is not decremented. The comment on the function acknowledges this: "routes pruned past the visible cap stay in the total because we don't track individual past-cap members."

Combined with P1, the result is that for any deployment where routes churn past the cap, MemberRoutesTotal grows without bound. After O(churn) remove/re-register cycles, RouteIdsTruncated is permanently true and route_count is meaningless.

Both P1 and P2 share the same root fix: track hidden members so prune can decrement. The hiddenMembers set approach (Option B above) handles both in one change.

Fix this →


Minor Issues

TestMemberRoutesCappedAtConfiguredCap does not assert MemberRoutesTotal (keyviz/sampler_test.go:892).
The test checks len(agg.MemberRoutes) <= 3 and agg.Reads == 8 but never reads agg.MemberRoutesTotal. Per CLAUDE.md, new fields need test coverage. Adding if agg.MemberRoutesTotal != 8 { t.Fatalf(...) } would both pin the correct behavior and would have caught P1 if the re-registration scenario had been exercised.

Design doc §4.1 says series default is writes, code defaults to Reads.

Design doc:

series | enum(reads,writes,readBytes,writeBytes) | writes | Selects which counter is returned

matrixSeriesPicker (adapter/admin_grpc.go:563):

case pb.KeyVizSeries_KEYVIZ_SERIES_UNSPECIFIED, pb.KeyVizSeries_KEYVIZ_SERIES_READS:
    return func(r keyviz.MatrixRow) uint64 { return r.Reads }

The code defaults UNSPECIFIED to Reads; the design doc says the default is writes. One of them needs to be updated. The PR description says "UNSPECIFIED defaults to Reads" so the intent is clear — the design doc is stale. The doc should be updated to say reads, or the proto comment for KEYVIZ_SERIES_UNSPECIFIED should be updated to reflect the actual behavior.

No upper-bound enforcement on rows (design §4.1 says cap is 1024).
int(req.GetRows()) is passed directly to applyKeyVizRowBudget with no validation. A client sending rows=0 means "no cap" (fine), but the design doc mentions a server-side cap of 1024 to bound work. With MaxTrackedRoutes=10_000 the sort is O(N log N) but still proportional to route count, not to the requested rows. Consider returning codes.InvalidArgument for rows > 1024 or silently clamping.

applyKeyVizRowBudget divergence from §5.5 still needs design-doc annotation (carried from Round 2).
The test TestGetKeyVizMatrixHonorsRowsBudget explicitly pins the truncation behavior ("top-N by activity, then sort by Start"), which differs from §5.5's "lexicographic walk + greedy merge". This is an intentional Phase-1 simplification, but the design doc's lifecycle marker (*_proposed_*.md*_partial_*.md) should call out which parts of §5.5 have shipped vs. are deferred, so a future reader doesn't assume the merge algorithm is implemented.


What Works Well

  • The RegisterSampler / KeyVizSampler interface boundary is clean. GetKeyVizMatrix correctly RLocks, fetches the pointer, releases, then calls through — no lock held during Snapshot.
  • applyKeyVizRowBudget correctly applies AFTER pivot and BEFORE sort, so the top-N rows are selected by activity then displayed in key order.
  • newKeyVizRowFrom correctly falls back to RouteCount=1 for pre-MemberRoutesTotal individual slots — a safe compatibility shim.
  • Deep copies in fakeKeyVizSampler.Snapshot mirror the real sampler's contract; tests are isolated and table-driven.
  • Both prior P1/P2 (rows-budget, RouteIdsTruncated) are correctly wired through the proto response.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 5m 51s —— View job


Code Review: Round 3 — feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix

  • Read latest commit diff (e4b11af)
  • Read changed files (keyviz/sampler.go, adapter/admin_grpc.go, *_test.go)
  • Check prior round issues (P1 truncation vs. merge, P2 sampler test, Codex issues)
  • Five-lens review pass
  • Post final review

Both Codex round-2 P2 issues are resolved with clean correctness. One issue from round 2 remains open and undocumented.


What the latest commit (e4b11af) fixes — confirmed correct

Codex P2 #1 — Double-count on rejoin of hidden member

addMemberToBucket now deduplicates against both MemberRoutes and hiddenMembers before incrementing MemberRoutesTotal. A remove+re-register for a past-cap route within the prune grace no longer inflates the count:

// keyviz/sampler.go:487-504
func addMemberToBucket(bucket *routeSlot, routeID uint64, visibleCap int) {
    if memberRoutesContains(bucket.MemberRoutes, routeID) {
        return  // visible dedup
    }
    if bucket.hiddenMembers != nil {
        if _, ok := bucket.hiddenMembers[routeID]; ok {
            return  // hidden dedup — the fix
        }
    }
    // ...
    bucket.MemberRoutesTotal++
}

Codex P2 #2 — Prune never decrementing for hidden members

pruneMemberRoute now falls through to hiddenMembers when the ID isn't in MemberRoutes, deletes it, and decrements MemberRoutesTotal. The null-out of the map when it empties (bucket.hiddenMembers = nil) keeps the nil-check fast on the happy path.

Lock disciplinehiddenMembers is always accessed under bucket.metaMu.Lock() (writes in addMemberToBucket / pruneMemberRoute) and bucket.metaMu.RLock() (reads in snapshotMeta). The field is never exposed through MatrixRow, only MemberRoutesTotal is surfaced — correct level of abstraction.

cancelPendingPruneFor interaction is correct — when a hidden route rejoins the SAME bucket, the prune is cancelled and addMemberToBucket finds the ID in hiddenMembers and returns without a double-increment. When it joins a DIFFERENT bucket, the original bucket's prune fires as intended and decrements correctly.

reclaimRetiredSlot safety — the reclaim path only reclaims !Aggregate slots (individual routes). Individual slots never have hiddenMembers set, so the fields reset in RegisterRoute (Start, End, MemberRoutesTotal = 1) are sufficient.

TestsTestPastCapMemberRejoinDoesNotInflateTotal, TestPastCapMemberPruneDecrementsTotal, and the updated TestMemberRoutesCappedAtConfiguredCap (asserting MemberRoutesTotal=8) provide exact regression coverage for both defects. The test structure (minimal sampler options, direct Flush/Advance control) matches the rest of the package.


P1 — Row budget divergence from §5.5 still undocumented (round-2 carry-over)

applyKeyVizRowBudget truncates to the top-N rows by activity, silently dropping the rest:

// adapter/admin_grpc.go:611-618
func applyKeyVizRowBudget(rows []*pb.KeyVizRow, budget int) []*pb.KeyVizRow {
    if budget <= 0 || len(rows) <= budget {
        return rows
    }
    sort.Slice(rows, func(i, j int) bool {
        return rowActivityTotal(rows[i]) > rowActivityTotal(rows[j])
    })
    return rows[:budget]
}

Design doc §5.5 specifies a different algorithm: walk in lexicographic order, greedily merging adjacent low-activity rows until the count fits (hotspots stay un-merged and visible). Truncation creates key-space gaps; merge preserves continuity.

If this is an intentional Phase-1 simplification, it needs to be documented. The PR description and design doc currently don't mention this divergence. Before this lands, one of the following should happen:

  • Preferred: Update the design doc lifecycle marker from *_proposed_*.md to *_partial_*.md and note "row-budget truncation only (§5.5 merge deferred to Phase 2)."
  • Acceptable: Add a comment in applyKeyVizRowBudget naming the divergence and linking the TODO — e.g. // NOTE: design §5.5 specifies merge-adjacent; this is simpler truncation for Phase 1.

The TestGetKeyVizMatrixHonorsRowsBudget test is fine as-is — it pins the current truncation behavior, which is the right approach while the merge is deferred.

Fix this →


Minor Issues

No upper-bound cap on req.GetRows() (still from round 2): Design doc §5.5 specifies max 1024. applyKeyVizRowBudget applies no server-side cap, so a client sending Rows=100000 gets a full sort of all sampled rows before truncation. With MaxTrackedRoutes=10000 this is bounded, but the proto contract isn't enforced. A simple guard at the top of GetKeyVizMatrix (if rows > 1024 { rows = 1024 }) or a codes.InvalidArgument response would close this.

hiddenMembers-only aggregate visible list (new edge case): When all visible routes are pruned (their MemberRoutes entries removed by deferred prunes) while hidden routes remain active — possible during the grace window — snapshotMeta returns members = nil, membersTotal = N > 0. This produces RouteIds=[], RouteCount=N, RouteIdsTruncated=true. Technically correct (counts are accurate), but visually odd: the drawer shows "N routes (truncated)" with zero visible IDs. This is a transient state bounded by the grace window and requires specific route-churn patterns. Not a blocker, but worth a comment in snapshotMeta noting the transient empty-visible-list case.


What works well across all three rounds

  • Both prior P1/P2 and Codex round-2 issues are resolved. The hiddenMembers approach is the right fix — minimal per-slot memory cost (nil map on non-aggregate slots), O(1) dedup for past-cap members, correct prune accounting.
  • MemberRoutesTotal invariant (len(MemberRoutes) + len(hiddenMembers)) is maintained correctly across all mutation paths: fold, prune, cancel, reclaim.
  • Lock discipline is consistent throughout all three commits. No lock promotions, no deadlock paths.
  • Test quality — all three commits include targeted regression tests for every defect identified in review. This matches the CLAUDE.md convention precisely.
  • applyKeyVizRowBudget + sortKeyVizRowsByStart ordering — budget is applied before the final Start-order sort, so the final slice is both activity-filtered and deterministically ordered. Correct.

bootjp added a commit that referenced this pull request Apr 26, 2026
…nator + admin) (#647)

## Summary

End-to-end keyviz wiring on each node binary. Stacked on top of #645
(coordinator `WithSampler`) and #646 (admin server `RegisterSampler`) —
those two PRs are merged into this branch via merge commits, so this
PR's incremental diff is `main.go` + `main_keyviz_test.go`.

- Three new flags: `--keyvizEnabled` (off by default — opt-in),
`--keyvizStep`, `--keyvizMaxTrackedRoutes`,
`--keyvizMaxMemberRoutesPerSlot`. Defaults pull from `keyviz.Default*`
so flag help and code stay in sync.
- `buildKeyVizSampler()` returns `*keyviz.MemSampler` when enabled,
`nil` otherwise — the single decision point for "keyviz on/off."
- `seedKeyVizRoutes()` copies the engine's startup route catalogue into
the sampler. Route-watch propagation post-startup is a follow-up (Phase
3 in the design doc).
- `startKeyVizFlusher()` runs `RunFlusher` in the existing errgroup and
calls one final `Flush()` after ctx fires so the in-progress step is
harvested at graceful shutdown.
- Coordinator: chained `.WithSampler(...)` onto the existing
`.WithLeaseReadObserver(...)` call so the dispatch hot path observes
mutations.
- AdminServer: `setupAdminService` now takes the `*MemSampler` and only
calls `RegisterSampler` when it's non-nil — operators with keyviz
disabled get `codes.Unavailable` on `GetKeyVizMatrix` instead of a
spurious empty success.

Implements `docs/admin_ui_key_visualizer_design.md` §5 / §10. After this
lands the heatmap is fully usable end-to-end.

## Test plan

- [x] `TestBuildKeyVizSamplerHonorsEnabledFlag` — flag on/off contract.
- [x] `TestSeedKeyVizRoutesCopiesEngineCatalogue` — `engine.Stats()`
seed path.
- [x] `TestSeedKeyVizRoutesNoOpOnNilSampler` — disabled-sampler safety.
- [x] `TestStartKeyVizFlusherReturnsAfterCancel` — graceful shutdown
drains the pre-cancel counters via the final `Flush`.
- [x] `go build .`, `go vet .`, `golangci-lint run ./...` clean.
- [x] `go test -race -count=1 -run
'TestBuildKeyVizSampler|TestSeedKeyVizRoutes|TestStartKeyVizFlusher' .`
passes.

## Merge order

1. Merge #645 (coordinator wiring).
2. Merge #646 (admin server wiring).
3. Rebase this branch on `main` (the merge commits in this PR collapse
to no-ops) and merge.
…gence

Round-3 review polish for PR #646 (Claude bot):

- Default series swapped from Reads to Writes for KEYVIZ_SERIES_UNSPECIFIED
  to match design doc §4.1, which spec's writes as the default. The
  read-sampling path is intentionally Phase 2 (no reads will land in
  the matrix until then) so defaulting to a permanently-zero series
  was actively misleading.
- Add keyVizRowBudgetCap = 1024 + clampRowBudget. Design §4.1 caps
  rows at 1024 to bound server work; pathological clients asking for
  more get the cap, not an error. Regression test
  TestGetKeyVizMatrixClampsRowsBudgetToCap stages 1029 rows and
  verifies the response carries 1024.
- Document the design §5.5 divergence on applyKeyVizRowBudget:
  truncation is a Phase-1 simplification of the spec'd
  "lexicographic walk + greedy merge" so future contributors don't
  treat the truncation behavior as the contract.
- Update TestGetKeyVizMatrixSeriesSelection's UNSPECIFIED case to
  expect Writes instead of Reads.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 44s —— View job


Code Review (Round 4): feat(adapter): expose keyviz heatmap via AdminServer.GetKeyVizMatrix

  • Read changed files (latest state — commit 82dd4c4)
  • Read design doc and proto for spec compliance
  • Five-lens review
  • Post final review

All three prior rounds' P1/P2 issues are resolved. The latest commit (82dd4c4 — "clamp rows to design cap + writes default + doc divergence") addressed the three Round 3 minor issues: clampRowBudget + keyVizRowBudgetCap = 1024, UNSPECIFIED → Writes, and the applyKeyVizRowBudget §5.5 divergence comment. Two small documentation bugs were introduced in the same commit and are the only new findings.


P1 — Stale GetKeyVizMatrix function docstring (adapter/admin_grpc.go:519)

The function comment contradicts the implementation:

// counter; KEYVIZ_SERIES_UNSPECIFIED defaults to Reads.

The series was switched to Writes in this commit, and matrixSeriesPicker's own docstring is already correct ("falls through to Writes per design doc §4.1"). The GetKeyVizMatrix docstring was not updated. An API consumer reading only the public function comment would configure their client to expect Reads on the default enum value and then observe Writes — opposite of what's documented.

Fix: Change line 519 to counter; KEYVIZ_SERIES_UNSPECIFIED defaults to Writes.

Fix this →


Minor — Stale test docstring (adapter/admin_grpc_keyviz_test.go:108)

// TestGetKeyVizMatrixSeriesSelection pins the request.Series →
// MatrixRow counter mapping including the UNSPECIFIED → Reads default.

The test case immediately below says {"unspecified defaults to writes", ...} and asserts want: 22 (Writes). The function docstring was not updated when the test case expectation was fixed. A reader adding a new series variant would see the comment say "Reads" and trust it over the test body.

Fix: Change line 108 to // MatrixRow counter mapping including the UNSPECIFIED → Writes default.

Fix this →


Minor — //nolint:gosec conflicts with CLAUDE.md policy (adapter/admin_grpc_keyviz_test.go:225)

idx := uint64(i + 1) //nolint:gosec // i is bounded by keyVizRowBudgetCap+5

CLAUDE.md: "Avoid //nolint — refactor instead." The gosec flag (G115 integer overflow on type conversion) is trivially avoided by changing the operand order so the conversion operates on a bounded type, not on the result of addition:

idx := uint64(i) + 1

i is bounded to [0, keyVizRowBudgetCap+4] by the range loop; uint64(i) is always safe, and the +1 is unsigned arithmetic. No //nolint needed.

Fix this →


Five-lens summary

  1. Data loss — No new paths. applyKeyVizRowBudget is documented as intentional Phase-1 truncation. hiddenMembers + pruneMemberRoute correctly maintain MemberRoutesTotal across all churn paths.
  2. Concurrency — No new paths. RLock-fetch-unlock for sampler, metaMu guards on all hiddenMembers access, and cancelPendingPruneFor pairing with routesMu remain correct.
  3. PerformanceclampRowBudget is a single comparison; no new allocations in the hot path.
  4. Data consistency — P1 above: the stale "defaults to Reads" docstring is the only consistency issue (documentation vs. actual behavior). The implementation itself is correct.
  5. Test coverageTestGetKeyVizMatrixClampsRowsBudgetToCap correctly pins the 1024 cap. TestMemberRoutesCappedAtConfiguredCap now asserts MemberRoutesTotal == 8. Both prior regression tests (TestPastCapMemberRejoinDoesNotInflateTotal, TestPastCapMemberPruneDecrementsTotal) are well-structured and cover the key correctness invariants.

What works well across all four rounds

  • Four rounds of defects addressed, all with regression tests. Every P1/P2 raised in rounds 1–3 has a corresponding test pinning the corrected behavior. The repo's CLAUDE.md defect-test convention is satisfied throughout.
  • hiddenMembers fix is clean: nil-on-empty map, correct metaMu discipline, pruneMemberRoute handles both visible and hidden routes, cancelPendingPruneFor correctly cancels prunes for rejoining hidden members.
  • clampRowBudget + keyVizRowBudgetCap correctly enforces the §4.1 server-side bound; rows=0 is preserved as "no cap."
  • applyKeyVizRowBudget comment clearly documents the §5.5 Phase-1 divergence so future contributors understand truncation is intentional.
  • Pivot logic (matrixToProtoapplyKeyVizRowBudgetsortKeyVizRowsByStart) is in the right order: activity-based budget first, then deterministic key-order output.

@bootjp bootjp merged commit 34651ad into main Apr 26, 2026
8 checks passed
@bootjp bootjp deleted the feat/keyviz-admin-server branch April 26, 2026 09:48
bootjp added a commit that referenced this pull request Apr 26, 2026
Bridges the in-process keyviz.MemSampler into the in-node admin
HTTP listener (Layer C) so the SPA can render the heatmap without
an extra hop through the gRPC Admin service. The SPA was merged in
PR #649 but has no KeyViz page yet — this PR is the backend half;
the SPA page is task B in the design's Phase 2 follow-up.

- New KeyVizSource interface in internal/admin (Snapshot only) so
  *keyviz.MemSampler satisfies it without a wrapper. Tests pass an
  in-memory fake.
- KeyVizHandler serves GET /admin/api/v1/keyviz/matrix. Query
  parameters: series (default writes per design §4.1),
  from_unix_ms, to_unix_ms, rows (capped at keyVizRowBudgetCap=1024
  per design §4.1).
- Returns 503 keyviz_disabled when the source is nil so the SPA
  renders an explicit "feature off" state instead of an empty matrix.
- Pivots column-major MatrixColumn into row-major JSON, mirroring
  the adapter (gRPC) handler from PR #646. The duplication is
  intentional for now — extracting a shared pivot helper is a
  future cleanup.
- ServerDeps.KeyViz field; buildAPIMux always registers the route
  (even when source is nil) so 503 is served instead of 404.
- main.go threads *keyviz.MemSampler through startAdminFromFlags →
  startAdminServer → buildAdminHTTPServer; keyvizSourceFromSampler
  converts the typed-nil to interface-nil so the handler's "is
  keyviz disabled" check fires correctly.

Tests:
- TestKeyVizHandlerReturnsServiceUnavailableWhenNoSource
- TestKeyVizHandlerRejectsNonGet
- TestKeyVizHandlerPivotsMatrix (two-column, two-route fixture)
- TestKeyVizHandlerSeriesParam (table-driven across all four enums)
- TestKeyVizHandlerSeriesParamRejectsUnknown (400 invalid_query)
- TestKeyVizHandlerHonorsRowsBudget
- TestKeyVizHandlerEncodesAggregateBucket
bootjp added a commit that referenced this pull request Apr 26, 2026
Bridges the in-process keyviz.MemSampler into the in-node admin
HTTP listener (Layer C) so the SPA can render the heatmap without
an extra hop through the gRPC Admin service. The SPA was merged in
PR #649 but has no KeyViz page yet — this PR is the backend half;
the SPA page is task B in the design's Phase 2 follow-up.

- New KeyVizSource interface in internal/admin (Snapshot only) so
  *keyviz.MemSampler satisfies it without a wrapper. Tests pass an
  in-memory fake.
- KeyVizHandler serves GET /admin/api/v1/keyviz/matrix. Query
  parameters: series (default writes per design §4.1),
  from_unix_ms, to_unix_ms, rows (capped at keyVizRowBudgetCap=1024
  per design §4.1).
- Returns 503 keyviz_disabled when the source is nil so the SPA
  renders an explicit "feature off" state instead of an empty matrix.
- Pivots column-major MatrixColumn into row-major JSON, mirroring
  the adapter (gRPC) handler from PR #646. The duplication is
  intentional for now — extracting a shared pivot helper is a
  future cleanup.
- ServerDeps.KeyViz field; buildAPIMux always registers the route
  (even when source is nil) so 503 is served instead of 404.
- main.go threads *keyviz.MemSampler through startAdminFromFlags →
  startAdminServer → buildAdminHTTPServer; keyvizSourceFromSampler
  converts the typed-nil to interface-nil so the handler's "is
  keyviz disabled" check fires correctly.

Tests:
- TestKeyVizHandlerReturnsServiceUnavailableWhenNoSource
- TestKeyVizHandlerRejectsNonGet
- TestKeyVizHandlerPivotsMatrix (two-column, two-route fixture)
- TestKeyVizHandlerSeriesParam (table-driven across all four enums)
- TestKeyVizHandlerSeriesParamRejectsUnknown (400 invalid_query)
- TestKeyVizHandlerHonorsRowsBudget
- TestKeyVizHandlerEncodesAggregateBucket
bootjp added a commit that referenced this pull request Apr 26, 2026
)

## Summary

Bridges the in-process `keyviz.MemSampler` into the in-node admin HTTP
listener (Layer C, `internal/admin`) so the SPA can render the heatmap
without an extra hop through the gRPC Admin service. The SPA was merged
in #649 but has no KeyViz page yet — this PR is the backend half; the
SPA page will follow as a separate slice.

- New `KeyVizSource` interface in `internal/admin` (Snapshot only) so
`*keyviz.MemSampler` satisfies it without a wrapper. Tests pass an
in-memory fake.
- `KeyVizHandler` serves `GET /admin/api/v1/keyviz/matrix`. Query
parameters: `series` (default `writes` per design §4.1), `from_unix_ms`,
`to_unix_ms`, `rows` (capped at `keyVizRowBudgetCap=1024` per design
§4.1).
- Returns `503 keyviz_disabled` when the source is nil so the SPA
renders an explicit "feature off" state instead of an empty matrix.
- Pivots column-major `MatrixColumn` into row-major JSON, mirroring the
adapter (gRPC) handler from PR #646. The duplication is intentional for
now — extracting a shared pivot helper is a future cleanup.
- `ServerDeps.KeyViz` field; `buildAPIMux` always registers the route
(even when source is nil) so `503` is served instead of `404`.
- `main.go` threads `*keyviz.MemSampler` through `startAdminFromFlags` →
`startAdminServer` → `buildAdminHTTPServer`; `keyvizSourceFromSampler`
converts the typed-nil to interface-nil so the handler's "is keyviz
disabled" check fires correctly.

## Test plan

- [x] `TestKeyVizHandlerReturnsServiceUnavailableWhenNoSource` — 503 +
`keyviz_disabled` code.
- [x] `TestKeyVizHandlerRejectsNonGet` — 405.
- [x] `TestKeyVizHandlerPivotsMatrix` — two-column, two-route fixture;
verifies missing-row-becomes-zero contract.
- [x] `TestKeyVizHandlerSeriesParam` — table-driven across all four enum
values.
- [x] `TestKeyVizHandlerSeriesParamRejectsUnknown` — 400 `invalid_query`
for typo'd series.
- [x] `TestKeyVizHandlerHonorsRowsBudget` — `?rows=2` returns top-2 by
activity, sorted by Start.
- [x] `TestKeyVizHandlerEncodesAggregateBucket` — virtual-bucket layout
with `route_count` from `MemberRoutesTotal` and `route_ids_truncated`
flag.
- [x] `go test ./internal/admin/... .` clean.
- [x] `golangci-lint run . ./internal/admin/...` clean.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added `/admin/api/v1/keyviz/matrix` API endpoint for retrieving key
visualization matrix data
* Supports filtering by metric type: reads, writes, read_bytes, and
write_bytes
  * Supports time-range filtering with Unix millisecond precision
  * Supports row limiting and activity-based sorting
  * Returns 503 status when key visualization sampling is disabled

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 26, 2026
## Summary

Phase 2-B of the Key Visualizer design
(`docs/admin_ui_key_visualizer_design.md`): integrate the heatmap into
the existing `web/admin/` SPA rather than building a parallel dashboard.
Doc + implementation in one PR (doc commit first, per CLAUDE.md
design-doc-first workflow).

- New page at `/keyviz` — canvas heatmap polling
`/admin/api/v1/keyviz/matrix` with series picker (writes / reads /
write_bytes / read_bytes), row-budget input (clamped at 1024), and off /
5 s / 30 s auto-refresh.
- Cold cells (value 0) render as the page background, not a faint blue —
spotting actually-cold routes stays the dominant visual signal.
- Row-detail flyout on hover: bucket_id, start, end, aggregate,
route_count, route_ids (with truncation indicator). `start` / `end`
decode through a printable-or-hex preview so binary keys do not render
as mojibake.
- Dependency-free: hand-rolled five-stop colour ramp in
`lib/colorRamp.ts`. No d3, no ECharts. Bundle grew from ~155 kB to ~208
kB (raw), 64 kB gzipped.

Backend is unchanged. The handler at `/admin/api/v1/keyviz/matrix` and
its sampler wiring already shipped under Phase 2-A (PRs #639 / #645 /
#646 / #647 / #651 / #660 / #661 / #672).

## Design

`docs/design/2026_04_27_proposed_keyviz_spa_integration.md` — proposed
status, lands with this PR. Parent design §12 phase table is split into
2-A (server, shipped) / 2-B (SPA, this PR) / 2-C (cluster fan-out,
open).

## Five-lens self-review

1. **Data loss** — n/a; SPA is read-only against an existing handler.
2. **Concurrency / distributed** — n/a; single browser tab polling a
single handler. Sampler concurrency was already covered by Phase 2-A
tests.
3. **Performance** — Canvas + per-cell `fillRect` runs under §10 budget
at 1024 × 500. Auto-refresh defaults off; 5-second cadence is the lower
bound (sampler flush is 1 s).
4. **Data consistency** — SPA renders whatever the handler returns;
consistency guarantees come from the existing leader-issued counters in
the sampler.
5. **Test coverage** — `tsc -b --noEmit` clean; `vite build` clean; `go
build ./internal/admin/...` clean (embed glob unaffected); `go test
./internal/admin/...` clean. Manual verification documented in the
design doc §5.

## Test plan

- [x] `npm run lint` (`tsc -b --noEmit`) — clean
- [x] `npm run build` (Vite) — clean, output goes to
`internal/admin/dist`
- [x] `go build ./internal/admin/...` — clean
- [x] `go test ./internal/admin/...` — clean
- [ ] Manual: `make run` + `make client`, navigate to `/keyviz`, see hot
routes light up red within ~5 s of write traffic
- [ ] Manual: series picker swaps the displayed counter; row-budget
input clamps at 1024; auto-refresh polls without flicker

## Out of scope

- **Cluster fan-out** — handler is currently node-local. Phase 2-C will
add a cross-node admin RPC; this PR will pick up the aggregate view
automatically once that ships.
- **Drill-down per-route sparkline** — Phase 3.
- **Routes / Raft Groups correlation** — Phase 1 SPA pages not yet
built; correlation lands when those pages do.
- **`localStorage` for series / rows / refresh** — punt to follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant