Skip to content
Merged
Changes from 1 commit
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
a6270d3
docs: propose KeyViz adapter / namespace labels (follow-up)
bootjp Apr 27, 2026
3c14f29
docs: address PR #694 round-1 (Claude bot + Gemini)
bootjp Apr 27, 2026
ce8f95b
docs: address PR #694 round-2 (Claude bot)
bootjp Apr 27, 2026
ae43bfe
docs: address PR #694 round-3 (Claude bot)
bootjp Apr 28, 2026
2750f4c
docs: address PR #694 round-4 (Claude bot + CodeRabbit)
bootjp Apr 28, 2026
9e1dcf1
Merge branch 'main' into docs/keyviz-adapter-labels
bootjp Apr 28, 2026
3ae3417
docs: address PR #694 round-5 (Codex P1 + P2)
bootjp Apr 28, 2026
3793366
docs: address PR #694 round-5 (Claude bot moderate items)
bootjp Apr 28, 2026
e9769e1
docs: address PR #694 round-6 (Claude bot + Codex)
bootjp Apr 28, 2026
ab0162c
docs: address PR #694 round-7 (Claude bot minor + Codex P2)
bootjp Apr 28, 2026
38e25bf
docs: address PR #694 round-8 (Claude bot: PR-B L0 + Sampler interface)
bootjp Apr 28, 2026
8d6d290
docs: address PR #694 round-8 (Claude bot: gRPC matrixToProto pivot-key)
bootjp Apr 28, 2026
5b83bd1
docs: address PR #694 round-9 (Codex P2 + carry-over)
bootjp Apr 28, 2026
bda27ca
docs: address PR #694 round-9 minor (Claude bot: AllLabels constants …
bootjp Apr 28, 2026
71a24f4
docs: address PR #694 round-10 (Codex P1+P2 + Claude minor)
bootjp Apr 28, 2026
5a14dd6
docs: address PR #694 round-11 (Claude moderate + Codex P2 rolling-up…
bootjp Apr 28, 2026
545b1a1
docs: address PR #694 round-12 (Claude bot: rollout-gate refinements)
bootjp Apr 28, 2026
1c18d2e
Merge branch 'main' into docs/keyviz-adapter-labels
bootjp Apr 28, 2026
f6c4dd7
docs: address PR #694 round-13 (Copilot review)
bootjp Apr 28, 2026
de7dad5
docs: address PR #694 round-14 (Codex P2 §6 contradiction)
bootjp Apr 28, 2026
292918d
Merge branch 'main' into docs/keyviz-adapter-labels
bootjp Apr 28, 2026
88742d4
Merge branch 'main' into docs/keyviz-adapter-labels
bootjp Apr 28, 2026
14f7b98
docs: address PR #694 round-15 (Copilot 2nd pass)
bootjp Apr 28, 2026
6ee4146
docs: address PR #694 round-16 (Copilot 3rd pass)
bootjp Apr 28, 2026
cee5460
docs: address PR #694 round-17 (Codex P1 + Copilot 4th pass)
bootjp Apr 28, 2026
0e4db28
docs: address PR #694 round-20 (Copilot 6th pass)
bootjp Apr 28, 2026
7b99d65
docs: address PR #694 round-21 (CodeRabbit nitpick)
bootjp Apr 28, 2026
c763753
docs: address PR #694 round-22 (Copilot 7th pass)
bootjp Apr 29, 2026
e5c7e39
docs: address PR #694 round-23 (CodeRabbit minor)
bootjp Apr 29, 2026
795c504
docs: address PR #694 round-23 (Copilot 8th pass)
bootjp Apr 29, 2026
09c1c04
docs: address PR #694 round-24 (Claude bot — final fixes)
bootjp Apr 29, 2026
788b685
docs: address PR #694 round-25 (Claude R25 + Codex P2)
bootjp Apr 29, 2026
e7b0c13
docs: address PR #694 round-26 (Codex P2)
bootjp Apr 29, 2026
44d7cd4
docs: address PR #694 round-27 (Claude R27)
bootjp Apr 29, 2026
c788a96
docs: address PR #694 round-28 (Codex P1 #2 stale divisor)
bootjp Apr 29, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
269 changes: 269 additions & 0 deletions docs/design/2026_04_28_proposed_keyviz_adapter_labels.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,269 @@
---
status: proposed
phase: keyviz / follow-up
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Front-matter phase values in other KeyViz design docs are short, machine-friendly tokens like 2-B / 2-C (see docs/design/2026_04_27_proposed_keyviz_spa_integration.md:1-6 and 2026_04_27_proposed_keyviz_cluster_fanout.md:1-6). phase: keyviz / follow-up is the only variant with spaces and /, which may be harder to grep/sort consistently; consider aligning it to the existing phase scheme (e.g. 2-C+ or similar).

Suggested change
phase: keyviz / follow-up
phase: 2-C+

Copilot uses AI. Check for mistakes.
parent_design: docs/admin_ui_key_visualizer_design.md
date: 2026-04-28
---

# KeyViz adapter / namespace labels

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design doc uses only a YAML frontmatter block; however docs/design/README.md asks new design docs to include a Status/Author/Date header under the title. Consider adding that header (or updating the README in a separate change) so readers get consistent metadata across design docs.

Suggested change
Status: proposed
Author: TBD
Date: 2026-04-28

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +10
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc uses a YAML front-matter block (lines 1–7), but docs/design/README.md specifies the standard design-doc metadata as a header under the title (Status/Author/Date, see README.md:40–53). To stay consistent with the documented convention, please convert this metadata to the standard header format (or update the README if YAML front-matter is now the intended standard).

Suggested change
---
status: proposed
phase: 2-C+
parent_design: docs/admin_ui_key_visualizer_design.md
author: bootjp
date: 2026-04-28
---
# KeyViz adapter / namespace labels
# KeyViz adapter / namespace labels
Status: proposed
Phase: 2-C+
Parent design: `docs/admin_ui_key_visualizer_design.md`
Author: bootjp
Date: 2026-04-28

Copilot uses AI. Check for mistakes.
## 1. Background

Phase 2-A through 2-C ship a fully functional KeyViz heatmap, but
the smallest unit of attribution is a **Raft route** — a contiguous
key range owned by one group. When multiple adapters share a route
(the default single-group config has every adapter writing into a
single `[-∞, +∞)` route), the heatmap shows one row with the
combined traffic and the operator cannot tell whether the spike
came from DynamoDB, Redis, S3, SQS, or RawKV.

A user observed this in production: 6-node cluster, fan-out
returning a single row at `route:1` with `Total = 378` writes,
covering all five adapters indistinguishably.

The hotspot-shard-split design and the multi-group startup flags
(`--raftRedisMap` / `--raftDynamoMap` / `--raftS3Map` /
`--raftSqsMap`) already give operators a way to split traffic
across distinct Raft groups so per-route attribution becomes
per-adapter attribution. But that is an **operational** workaround:
single-group deployments — the most common shape for first-time
operators and small clusters — still get the all-traffic-in-one-row
view.

This proposal adds an **independent label dimension** to the
sampler so a single Raft group can still surface per-adapter
breakdown in the heatmap.

## 2. Goals and non-goals

### 2.1 Goals

- Attribution **inside a single route**: a row that today reads
`route:1, total=378` should optionally split into sub-rows like
`route:1 / dynamo`, `route:1 / s3`, `route:1 / redis`, …
- Zero hot-path penalty when labels are **not** configured: a
default deployment continues to behave exactly as today.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

§2.1 claims "Zero hot-path penalty when labels are not configured", but §4.1 later notes the slots map key widens from uint64 to {uint64, Label}, which adds extra hash/equality work even when callers always pass the empty label. Consider rewording the goal to match the actual expected impact (e.g., "no additional lookups/allocations" / "minimal penalty"), or explicitly proposing a dual representation so the legacy unlabeled path keeps a map[uint64]*routeSlot fast path when labels are disabled.

Suggested change
- Zero hot-path penalty when labels are **not** configured: a
default deployment continues to behave exactly as today.
- Minimal hot-path penalty when labels are **not** configured: a
default deployment preserves current semantics and should incur
no additional lookups or allocations, with only minimal extra
hash/equality work if the implementation widens the in-memory
slot key to include the label.

Copilot uses AI. Check for mistakes.
- Adapter-side wiring is the natural place to set labels (every
adapter already has the dispatch entry into
`ShardedCoordinator.Observe…` — see `kv/sharded_coordinator.go`).
No global key-prefix table on the operator side.
- Cluster fan-out merges per-(route, label) cells, not just
per-route — operators see the same per-adapter breakdown across
the whole cluster.

### 2.2 Non-goals (deferred)

- **Free-form custom labels** (per-table, per-bucket, per-queue).
This proposal stops at one level: the adapter family. A
follow-up can add a second dimension (e.g. `dynamo / users`,
`dynamo / orders`) once the wire format settles.
- **Persistence**. Labels are in-memory like the rest of the
Phase 2 sampler.
- **Per-key-byte attribution**. Sub-route classification is done
at the dispatch entry (the adapter knows its identity); we do
not attempt to classify by inspecting the key bytes.
- **Backwards-incompatible wire-format changes**. The label is
added as an optional field; old SPAs against new servers keep
working.

## 3. Surface

The user-visible delta is one extra label on every heatmap row.
Default config emits the empty label (legacy behaviour); when
adapters set labels, the heatmap splits a single route into one
row per (route, label) pair:

```
Today (single group, no labels):
Row 0 route:1 Total 378

After (adapters tag their traffic):
Row 0 route:1 / dynamo Total 142
Row 1 route:1 / redis Total 200
Row 2 route:1 / sqs Total 18
Row 3 route:1 / s3 Total 11
Row 4 route:1 / rawkv Total 7
```
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Outdated

## 4. Options for label propagation

Four ways to get the label from the adapter to the sampler. Pick
one based on hot-path cost vs. plumbing weight.

### 4.1 (Recommended) Per-Observe label string

Extend the sampler signature:

```go
Observe(routeID uint64, op Op, keyLen, valueLen int, label string)
```

Empty `label` → existing behaviour, single row per route. Adapters
set their own constant string (`"dynamo"`, `"redis"`, …) at the
dispatch site they already own.

Cost on hot path: one extra map lookup per Observe — slot is now
keyed by `(routeID, label)` instead of just `routeID`. The map
key is a struct of `{uint64, string}`; the `label` is a small
interned constant (`"dynamo"`), so allocation should be zero.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The proposal states that Observe will use a map keyed by (routeID, label) and that RegisterRoute will be a "no-op" for labels. However, Observe is designed to be a lockless hot-path operation that relies on a pre-populated routeTable. If RegisterRoute does not pre-register the (routeID, label) combinations, Observe will fail to find a slot. If Observe were to create slots dynamically, it would require a mutex or a copy-on-write operation, violating the "zero hot-path penalty" goal. Please clarify how labeled slots are populated into the routeTable while maintaining a lockless Observe path.

Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc says the hot path cost is "one extra map lookup per Observe" when switching from tbl.slots[routeID] to tbl.slots[slotKey{routeID,label}], but the current implementation already does a single slots lookup (with an occasional fallback lookup into virtualForRoute on miss). Consider rewording to reflect the real change: wider key hashing/comparison cost (and possibly a higher miss rate if labels aren't pre-created), rather than an additional lookup on the hit path.

Suggested change
Cost on hot path: one extra map lookup per Observe — slot is now
keyed by `(routeID, label)` instead of just `routeID`. The map
key is a struct of `{uint64, string}`; the `label` is a small
interned constant (`"dynamo"`), so allocation should be zero.
Cost on hot path: still one `slots` map lookup per `Observe` on
the hit path, but keyed by `(routeID, label)` instead of just
`routeID`. That means somewhat wider key hashing/comparison for a
`{uint64, string}` map key, and potentially a higher miss rate if
labeled slots are not pre-created. The `label` is a small interned
constant (`"dynamo"`), so allocation should be zero.

Copilot uses AI. Check for mistakes.

Storage: each (route, label) gets its own `routeSlot`. With 5
adapters and a 1024-route budget, the worst case is 5120 slots
per node — still well below the existing `MaxTrackedRoutes` cap
when operators raise it for label use.

### 4.2 Per-adapter sampler instance

Wire one `*MemSampler` per adapter, each with a fixed label. The
admin handler queries every sampler and concatenates the results.

- Pro: zero hot-path code change in the existing sampler.
- Con: every adapter gets its own ring buffer, history, and
retention machinery. Memory is N× higher and per-route
metadata duplicates across samplers.

### 4.3 Per-key-prefix taxonomy (operator-configured)

Static `{prefix → label}` map registered at startup. Sampler
classifies each key at Observe time by prefix-matching.

- Pro: no adapter wiring; works with any caller that goes through
`ShardedCoordinator`.
- Con: prefix-match per Observe is a hot-path cost, and the
taxonomy is a new operator-facing config the design has been
careful to avoid.

### 4.4 Hash the adapter into the route catalog

Make `distribution.Route` carry an adapter label and route by
adapter at the catalog layer. The sampler stays single-keyed.

- Pro: solves attribution at the catalog level, where it actually
belongs.
- Con: the catalog is the wrong place for this — adapters share
Raft groups by design, and forcing `Route` to carry adapter
identity bakes a different separation into the route topology.
Reverses the multi-group startup-flag story.

**Recommendation: Option 4.1.** Lowest plumbing weight, smallest
hot-path delta (one map lookup), and the label originates where
it is most naturally available (the adapter's dispatch entry).
The `routeSlot` map shape changes from `map[uint64]*routeSlot` to
`map[slotKey]*routeSlot` with `slotKey = {uint64, string}`.
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recommendation sentence reintroduces the earlier corrected wording: calling the change a "hot-path delta (one map lookup)" can be read as adding an extra lookup, but §4.1 states lookup count is unchanged and only the key shape widens. Consider rephrasing to explicitly say the hot path remains a single lookup (with wider key) to avoid confusion.

Suggested change
**Recommendation: Option 4.1.** Lowest plumbing weight, smallest
hot-path delta (one map lookup), and the label originates where
it is most naturally available (the adapter's dispatch entry).
The `routeSlot` map shape changes from `map[uint64]*routeSlot` to
`map[slotKey]*routeSlot` with `slotKey = {uint64, string}`.
**Recommendation: Option 4.1.** Lowest plumbing weight, and the
label originates where it is most naturally available (the
adapter's dispatch entry). The hot path remains a single map
lookup; only the lookup key widens. The `routeSlot` map shape
changes from `map[uint64]*routeSlot` to `map[slotKey]*routeSlot`
with `slotKey = {uint64, string}`.

Copilot uses AI. Check for mistakes.

## 5. Wire format extension

`MatrixRow` (Go) and `KeyVizRow` (proto + JSON):

```diff
type MatrixRow struct {
RouteID uint64
+ Label string // optional; empty for legacy unlabelled traffic
Start, End []byte
}
```

The SPA reads the label and renders rows as
`route:<id> / <label>` when label is non-empty; legacy rows
(empty label) render as `route:<id>` exactly like today.

`bucket_id` formatting in the wire shape stays as-is; the label
is a separate field. This keeps the `route_ids` / `aggregate`
semantics from §5 of the parent design unchanged.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The design suggests that bucket_id formatting stays as-is (e.g., route:1) and the label is a separate field. This creates a problem for the GetRouteDetail(bucket_id) RPC and the SPA's row identification, as bucket_id is no longer unique per heatmap row. This change presents a significant operational risk for rolling upgrades and system migration. Per repository guidelines, please detail potential mitigation strategies (e.g., a temporary 'bridge' or 'proxy' mode) and strategies for zero-downtime cutovers to avoid service interruption.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers to avoid service interruption.


Forward compatibility: an old SPA against a new server ignores
the unknown `label` field and shows just the route as it always
did. A new SPA against an old server sees no `label` field and
falls back to the legacy formatting. Both directions are
non-breaking.

## 6. Aggregator merge changes

The fan-out aggregator's per-cell merge key gains the label:

- Phase 2-C (current): `(bucketID, raftGroupID, leaderTerm,
windowStart)` per design `2026_04_27_proposed_keyviz_cluster_fanout.md`
§4.
- With labels: `(bucketID, label, raftGroupID, leaderTerm,
windowStart)`.

Reads still sum, writes still max-with-conflict; nothing about
the merge rules changes other than the key-tuple width.

## 7. Implementation plan

| PR | Scope |
|---|---|
| **PR-A** | Land this design doc. |
| **PR-B** | Sampler API extension: `Observe(... label string)`, `RegisterRoute` no-op (label is per-Observe, not per-route). `MatrixRow.Label`. Update the existing tests to pass empty labels. |
| **PR-C** | Adapter wiring: each adapter sets its own label at the dispatch entry into `ShardedCoordinator.Observe…`. Default constants live in one file (`adapter/keyviz_labels.go`) so a future audit lists every label in one place. |
| **PR-D** | Wire-format extension: proto + JSON `label` field; SPA renders `route:N / <label>` when present. |
| **PR-E** | Aggregator merge key gains `label`. Mirrors PR-3 of the Phase 2-C+ track. |

Comment thread
coderabbitai[bot] marked this conversation as resolved.
PRs B and C are independent of the wire format; the heatmap will
keep showing one row per route until D ships, but the per-label
counts are already accumulating in the sampler so the wire
extension is "switch on the field".

## 8. Five-lens checklist

1. **Data loss** — n/a; per-Observe label is metadata. The
existing "no counts lost across flush" invariant
(`keyviz/sampler_test.go`) extends straightforwardly with the
label dimension; the `(routeID, label)` slot is still atomic-
add updated like the current `routeID` slot.
2. **Concurrency / distributed** — slot-key change is contained
in the routesMu COW path; the hot-path Load + map lookup keeps
the same shape (one lookup, one atomic add). Burst test
updates: parametrise on (route, label) instead of just route.
3. **Performance** — One extra map lookup per Observe via the
wider key. `BenchmarkObserveParallel` already pins the hot-path
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Performance lens says "One extra map lookup per Observe" here, but §4.1 explicitly notes the lookup count is unchanged and only the key widens. These two statements contradict; please make them consistent (e.g., describe the cost as a wider hash/equality on the same single lookup, not an additional lookup).

Suggested change
3. **Performance** — One extra map lookup per Observe via the
wider key. `BenchmarkObserveParallel` already pins the hot-path
3. **Performance** — The hot path still does one map lookup per
Observe; the only extra work is hashing/equality on the wider
key. `BenchmarkObserveParallel` already pins the hot-path

Copilot uses AI. Check for mistakes.
cost; the new bench should land within run-to-run variance of
the current 4 ns/op. If it doesn't, the design is wrong and we
fall back to Option 4.2.
4. **Data consistency** — Cluster fan-out merge gains a tuple
field; the dedup invariant (per-cell, per-(route, label,
group, term, window)) still holds. Old SPA against new server
sees the label-collapsed view; new SPA against old server
sees the legacy view; both are coherent.
5. **Test coverage** — New test categories:
- `Observe(label="dynamo")` and `Observe(label="redis")`
against the same routeID produce two distinct rows in
`Snapshot`.
- Empty `label` matches no other rows (legacy behaviour pinned).
- Burst test: many goroutines hitting the same route with
different labels — exact-counting invariant must hold per
(route, label).
- Aggregator merge: same route, two labels, two nodes —
each label dedupes correctly without bleeding into the
other.

## 9. Open questions

1. **Should the label be hierarchical** (`dynamo / users`) from
day one, or restricted to a single segment now and extended
later? Proposal: single segment now (cheapest sampler change),
extend with a `/`-delimited convention later if adapters want
sub-tenant attribution.
2. **Label allocation discipline** — who owns the canonical label
set? Proposal: `adapter/keyviz_labels.go` exports
`LabelDynamo`, `LabelRedis`, `LabelS3`, `LabelSQS`,
`LabelRawKV` constants. Adapters refer to these; nothing
stops a future adapter from inventing its own label, but the
review burden of adding to the central file catches accidental
variants like `"DynamoDB"` vs `"dynamo"`.
3. **Should the aggregator collapse same-route different-label
rows for operators who don't want the breakdown?** Proposal:
no — the SPA already lets operators pick which row to
examine; the wire form should always carry the breakdown so
the data is queryable.

## 10. Out of scope (explicit deferrals)

- Per-table / per-bucket / per-queue / per-Redis-DB sub-labels.
- Operator-configurable label taxonomy.
- Persistence of labelled rows (Phase 3 covers persistence
generally; labels ride along once the persistence path lands).
- Adapter-aware splitting of routes (`SplitRange` triggered by
adapter-label hotspots) — that is a Phase 3+ idea.
Loading