Skip to content

Commit 08b2fb4

Browse files
authored
fix(admin): forward session cookie on KeyViz fan-out so peers do not 401 (#692)
## Summary **Hotfix for the screenshot a user shared after enabling fan-out** — the heatmap collapsed to "1 of 6 nodes responded" with every peer returning `401 missing session cookie`. **Root cause**: PR #686 shipped fan-out with the design assumption (#685 §3) that peers would accept anonymous calls on a private network. But the receiving side runs the same `SessionAuth` middleware as the browser path, so anonymous fan-out calls are rejected with 401. The "anonymous on a private network" framing was wrong: the implementation didn't put a bypass on the peer side, so anonymous calls never went through end-to-end. **Fix**: forward the inbound user's cookies on every peer call. Cluster nodes already share `--adminSessionSigningKey` for HA, so a cookie minted on node A is verifiable on node B — no new infrastructure needed. ## What changed - `KeyVizFanout.Run` gains a `cookies []*http.Cookie` parameter; nil preserves the legacy behaviour for tests / future non-browser callers. - `KeyVizHandler` passes `r.Cookies()` so a SPA poll forwards `admin_session` (and `admin_csrf`, harmless on a GET) transparently. - `TestKeyVizFanoutRunForwardsCookies` pins that both cookies reach the peer verbatim. - Design doc §3 rewritten: documents the cookie-forwarding scheme + the operator-trust assumption that `--keyvizFanoutNodes` points at trusted hosts only. Phase 2-C+ will replace cookie forwarding with a dedicated inter-node token to decouple browser-session expiry from inter-node call validity. ## Five-lens self-review 1. **Data loss** — n/a; auth path change. 2. **Concurrency / distributed** — Cookies are passed by reference into per-peer goroutines but the slice is read-only inside `fetchPeer` (each goroutine `req.AddCookie`s independently). No shared mutation. 3. **Performance** — Adds N `req.AddCookie` calls per fan-out request. Trivial. 4. **Data consistency** — Auth-only change; merge semantics unchanged. 5. **Test coverage** — New `TestKeyVizFanoutRunForwardsCookies` asserts both cookies reach the peer with the original values. Existing fan-out tests pass `nil` and continue to exercise the unauthenticated path the existing code handled (peer 401s, request reports `ok=false`). ## Test plan - [x] `go test -race -count=1 ./internal/admin/...` — clean - [x] `golangci-lint run ./internal/admin/...` — clean - [ ] Manual: 6-node cluster with shared `ELASTICKV_ADMIN_SESSION_SIGNING_KEY`, `--keyvizFanoutNodes` set on every node. Heatmap should now show "6 of 6 nodes" instead of "1 of 6". ## Trust / threat model note Cookie forwarding is safe when (a) every peer is operator-configured and trusted, and (b) the network is private. **Do NOT point `--keyvizFanoutNodes` at an untrusted host** — the user's admin session would be replayed there. The design doc §3 update is explicit about this. Phase 2-C+ removes this footgun by switching to a dedicated inter-node token. Closes the production-impacting symptom from screenshot review.
2 parents edaaa64 + 3c42378 commit 08b2fb4

5 files changed

Lines changed: 221 additions & 31 deletions

File tree

docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md

Lines changed: 39 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -132,22 +132,45 @@ elastickv \
132132
network); the HTTP client uses `http://`. A follow-up will
133133
introduce `--keyvizFanoutTLS` once the rest of the admin path
134134
has TLS too.
135-
- **Auth (Phase 2-C MVP)**: the fan-out path is anonymous — the
136-
aggregator issues unauthenticated GETs to peers. This is only
137-
acceptable on a fully-private intra-cluster network, which is
138-
the contract `--keyvizFanoutNodes` documents. **Do NOT enable
139-
fan-out across an untrusted network until Phase 2-C+ ships
140-
proper auth.** Per-call timeout: 2 s default, override via
141-
`--keyvizFanoutTimeout`.
142-
- **Auth (Phase 2-C+)**: a follow-up extends the fan-out path with
143-
a short-lived signed token derived from the existing admin
144-
session-signing key (`ELASTICKV_ADMIN_SESSION_SIGNING_KEY`).
145-
Pre-shared inter-node token, NOT a replay of the browser's
146-
session cookie — re-using the cookie would couple browser session
147-
TTL to inter-node call validity, and a compromised browser
148-
session would gain peer-call authority. The two paths are
149-
intentionally distinct. The earlier draft of this section
150-
conflated them; this paragraph supersedes it.
135+
- **Auth (Phase 2-C MVP)**: the aggregator forwards a whitelist of
136+
the inbound user's cookies (`admin_session` + `admin_csrf` only)
137+
on every peer call. The peer's `SessionAuth` middleware verifies
138+
`admin_session` against its own `--adminSessionSigningKey`;
139+
cluster nodes already share that key for HA, so a cookie minted
140+
on node A is verifiable on node B without any new infrastructure.
141+
Unrelated cookies the browser may carry (analytics, feature
142+
flags, other-app sessions on the same domain) are dropped at the
143+
fan-out boundary so they are not leaked across the internal
144+
network (Gemini security-medium on PR #692).
145+
- **Recursion guard**: peer requests carry an `X-Admin-Fanout-Peer`
146+
header. The receiving handler short-circuits its own fan-out
147+
when this header is set; without the guard, a symmetric
148+
configuration (every node lists every other node) would generate
149+
O(N²) HTTP calls per browser poll as each peer recursively
150+
fanned out. (Claude bot P1 on PR #692.)
151+
- The earlier draft of this paragraph said "anonymous on a
152+
private network" — that was wrong: the receiving side enforces
153+
session auth, so anonymous calls are rejected with 401 and the
154+
cluster heatmap collapses to "1 of N nodes responded". The
155+
cookie-forwarding scheme above is what actually works.
156+
- **Trust model**: forwarding a session cookie to peers is safe
157+
when (a) every peer is operator-configured and trusted, and
158+
(b) the network is private (cookies are HttpOnly but ride
159+
plaintext HTTP for now). `--keyvizFanoutNodes` is the
160+
operator's explicit trust list. **Do NOT point
161+
`--keyvizFanoutNodes` at an untrusted host** — the user's
162+
admin session would be replayed there.
163+
- Per-call timeout: 2 s default, override via
164+
`--keyvizFanoutTimeout`.
165+
- **Auth (Phase 2-C+)**: a follow-up replaces cookie forwarding
166+
with a short-lived **inter-node** token derived from
167+
`ELASTICKV_ADMIN_SESSION_SIGNING_KEY`. Pre-shared, NOT a replay
168+
of the browser cookie — re-using the cookie couples browser
169+
session TTL to inter-node call validity, and a compromised
170+
browser session gains peer-call authority. The follow-up
171+
decouples the two paths so the inter-node call survives a
172+
browser-session expiry and a compromised cookie doesn't extend
173+
laterally.
151174

152175
## 4. Merge rules
153176

internal/admin/keyviz_fanout.go

Lines changed: 48 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,14 @@ const keyVizFanoutDefaultTimeout = 2 * time.Second
3434
// misbehaving peer flood operator logs.
3535
const keyVizPeerErrorBodyLimit = 512
3636

37+
// keyVizFanoutPeerHeader marks a request as originating from another
38+
// node's fan-out aggregator. The receiving handler skips its own
39+
// fan-out step when this header is set, breaking the
40+
// every-node-fans-to-every-node recursion that would otherwise turn
41+
// a single browser poll into O(N²) peer HTTP calls in a symmetric
42+
// cluster (Claude bot P1 on PR #692).
43+
const keyVizFanoutPeerHeader = "X-Admin-Fanout-Peer"
44+
3745
// keyVizPeerResponseBodyLimit caps the JSON body we are willing to
3846
// decode from a peer. A misbehaving or compromised peer that streams
3947
// gigabytes back at us would otherwise pin a goroutine on
@@ -190,6 +198,25 @@ func (f *KeyVizFanout) WithTimeout(d time.Duration) *KeyVizFanout {
190198
return f
191199
}
192200

201+
// attachAdminCookies forwards only the admin session and CSRF
202+
// cookies to a peer request. We whitelist rather than passing every
203+
// inbound cookie through: a logged-in operator may have unrelated
204+
// cookies for other services on the same domain (analytics, feature
205+
// flags, dev-tooling sessions), and the fan-out should not blast
206+
// those across the cluster's internal network. The peer's
207+
// SessionAuth middleware only inspects admin_session, and the CSRF
208+
// double-submit cookie pairs with the X-Admin-CSRF header (which
209+
// fan-out doesn't send because all peer calls are GETs); the cookie
210+
// is forwarded for parity with browser-issued requests but not
211+
// load-bearing. (Gemini security-medium on PR #692.)
212+
func attachAdminCookies(req *http.Request, cookies []*http.Cookie) {
213+
for _, c := range cookies {
214+
if c.Name == sessionCookieName || c.Name == csrfCookieName {
215+
req.AddCookie(c)
216+
}
217+
}
218+
}
219+
193220
// peerResult is the per-peer outcome the goroutine pool collects
194221
// before the synchronous merge phase. Either matrix is non-nil or
195222
// err is non-nil; never both.
@@ -205,9 +232,17 @@ type peerResult struct {
205232
// cluster (peers empty) Run returns local with a Fanout block that
206233
// reports Expected=1, Responded=1.
207234
//
235+
// cookies are attached to every peer request so the receiving node's
236+
// SessionAuth middleware sees a valid admin session. Production
237+
// passes the inbound request's cookies; nil disables cookie
238+
// forwarding (peers will 401 unless they have their own bypass).
239+
// All cluster nodes must share the same --adminSessionSigningKey for
240+
// the cookie minted by node A to be verifiable on node B; the
241+
// existing HA setup already requires this.
242+
//
208243
// Run never returns an error: peer-level failures surface in the
209244
// FanoutResult; aggregation is best-effort.
210-
func (f *KeyVizFanout) Run(ctx context.Context, params keyVizParams, local KeyVizMatrix) KeyVizMatrix {
245+
func (f *KeyVizFanout) Run(ctx context.Context, params keyVizParams, local KeyVizMatrix, cookies []*http.Cookie) KeyVizMatrix {
211246
if f == nil || len(f.peers) == 0 {
212247
merged := local
213248
merged.Fanout = &FanoutResult{
@@ -218,7 +253,7 @@ func (f *KeyVizFanout) Run(ctx context.Context, params keyVizParams, local KeyVi
218253
return merged
219254
}
220255

221-
results := f.fetchPeersParallel(ctx, params)
256+
results := f.fetchPeersParallel(ctx, params, cookies)
222257

223258
matrices := []KeyVizMatrix{local}
224259
statuses := []FanoutNodeStatus{{Node: f.selfName(), OK: true}}
@@ -260,7 +295,7 @@ func countOK(statuses []FanoutNodeStatus) int {
260295
return n
261296
}
262297

263-
func (f *KeyVizFanout) fetchPeersParallel(ctx context.Context, params keyVizParams) []peerResult {
298+
func (f *KeyVizFanout) fetchPeersParallel(ctx context.Context, params keyVizParams, cookies []*http.Cookie) []peerResult {
264299
// Cap per-peer wall time so a single slow node cannot hold the
265300
// SPA poll open beyond the configured timeout. The parent
266301
// context is preserved as the cancellation root so an early
@@ -274,15 +309,15 @@ func (f *KeyVizFanout) fetchPeersParallel(ctx context.Context, params keyVizPara
274309
wg.Add(1)
275310
go func(i int, peer string) {
276311
defer wg.Done()
277-
matrix, err := f.fetchPeer(callCtx, peer, params)
312+
matrix, err := f.fetchPeer(callCtx, peer, params, cookies)
278313
results[i] = peerResult{node: peer, matrix: matrix, err: err}
279314
}(i, peer)
280315
}
281316
wg.Wait()
282317
return results
283318
}
284319

285-
func (f *KeyVizFanout) fetchPeer(ctx context.Context, peer string, params keyVizParams) (*KeyVizMatrix, error) {
320+
func (f *KeyVizFanout) fetchPeer(ctx context.Context, peer string, params keyVizParams, cookies []*http.Cookie) (*KeyVizMatrix, error) {
286321
target, err := buildKeyVizPeerURL(peer, params)
287322
if err != nil {
288323
return nil, pkgerrors.Wrap(err, "build peer url")
@@ -292,6 +327,14 @@ func (f *KeyVizFanout) fetchPeer(ctx context.Context, peer string, params keyViz
292327
return nil, pkgerrors.Wrap(err, "new request")
293328
}
294329
req.Header.Set("Accept", "application/json")
330+
// Mark this request as a peer fan-out call so the receiving
331+
// handler does not recursively fan out to every other peer —
332+
// without this header, a symmetric cluster (every node lists
333+
// every other node) generates O(N²) peer HTTP calls per
334+
// browser poll. The check on the receiving side is in
335+
// KeyVizHandler.ServeHTTP. (Claude bot P1 on PR #692.)
336+
req.Header.Set(keyVizFanoutPeerHeader, "1")
337+
attachAdminCookies(req, cookies)
295338
resp, err := f.client.Do(req)
296339
if err != nil {
297340
return nil, pkgerrors.Wrap(err, "peer request")

internal/admin/keyviz_fanout_test.go

Lines changed: 74 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ import (
66
"net/http"
77
"net/http/httptest"
88
"strings"
9+
"sync"
910
"testing"
1011
"time"
1112

@@ -152,6 +153,71 @@ func TestMergeKeyVizMatricesDistinctRowsPreserveOrder(t *testing.T) {
152153
require.Equal(t, "route:9", merged.Rows[2].BucketID)
153154
}
154155

156+
// TestKeyVizFanoutRunForwardsCookies pins the auth bug fix that
157+
// brought all peers up from 401 to 200: the inbound user's session
158+
// cookies are forwarded on every peer request so the receiving
159+
// node's SessionAuth middleware sees a valid principal. Without
160+
// this, peers reject every fan-out call with 401 missing-session-
161+
// cookie and the cluster heatmap collapses to "1 of N nodes
162+
// responded".
163+
//
164+
// Pins both halves of the cookie contract:
165+
// - admin_session and admin_csrf are forwarded with their values
166+
// verbatim.
167+
// - Unrelated cookies present on the inbound request are
168+
// **dropped** rather than leaked to peer nodes (Gemini
169+
// security-medium on PR #692).
170+
// - The peer call carries the X-Admin-Fanout-Peer header so the
171+
// receiving handler can short-circuit its own fan-out (Claude
172+
// bot P1 on PR #692; the receiving check is in
173+
// KeyVizHandler.ServeHTTP and is exercised by
174+
// TestKeyVizHandlerSkipsFanoutForPeerCall in
175+
// keyviz_handler_test.go).
176+
func TestKeyVizFanoutRunForwardsCookies(t *testing.T) {
177+
t.Parallel()
178+
var seenRequests []*http.Request
179+
var seenMu sync.Mutex
180+
peer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
181+
if !strings.HasSuffix(r.URL.Path, "/admin/api/v1/keyviz/matrix") {
182+
http.NotFound(w, r)
183+
return
184+
}
185+
seenMu.Lock()
186+
seenRequests = append(seenRequests, r.Clone(context.Background()))
187+
seenMu.Unlock()
188+
w.Header().Set("Content-Type", "application/json")
189+
w.WriteHeader(http.StatusOK)
190+
_ = json.NewEncoder(w).Encode(KeyVizMatrix{Series: keyVizSeriesReads})
191+
}))
192+
defer peer.Close()
193+
194+
f := NewKeyVizFanout("self:8080", []string{peer.URL}).WithHTTPClient(peer.Client())
195+
cookies := []*http.Cookie{
196+
{Name: "admin_session", Value: "session-token-abc"},
197+
{Name: "admin_csrf", Value: "csrf-token-def"},
198+
{Name: "unrelated_app_session", Value: "should-not-leak"},
199+
}
200+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads}, cookies)
201+
require.NotNil(t, merged.Fanout, "fan-out result must be present")
202+
require.Len(t, merged.Fanout.Nodes, 2, "self + 1 peer = 2 nodes (Claude bot P2 on PR #692)")
203+
require.True(t, merged.Fanout.Nodes[1].OK)
204+
205+
seenMu.Lock()
206+
defer seenMu.Unlock()
207+
require.Len(t, seenRequests, 1)
208+
got := seenRequests[0].Cookies()
209+
names := make(map[string]string, len(got))
210+
for _, c := range got {
211+
names[c.Name] = c.Value
212+
}
213+
require.Equal(t, "session-token-abc", names["admin_session"], "session cookie must be forwarded verbatim")
214+
require.Equal(t, "csrf-token-def", names["admin_csrf"], "csrf cookie forwarded; harmless on a GET but kept for parity")
215+
require.NotContains(t, names, "unrelated_app_session",
216+
"unrelated cookies must be dropped; only admin_session/admin_csrf are whitelisted")
217+
require.Equal(t, "1", seenRequests[0].Header.Get(keyVizFanoutPeerHeader),
218+
"peer marker header must be set so the receiver short-circuits its own fan-out")
219+
}
220+
155221
// TestKeyVizFanoutRunSinglePeerOK exercises the end-to-end happy
156222
// path: one peer responds with a parseable matrix; the aggregator
157223
// merges it with the local view and reports both nodes ok.
@@ -176,7 +242,7 @@ func TestKeyVizFanoutRunSinglePeerOK(t *testing.T) {
176242
},
177243
}
178244

179-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, local)
245+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, local, nil)
180246
require.Equal(t, []uint64{10}, merged.Rows[0].Values, "reads must sum across local + peer")
181247
require.NotNil(t, merged.Fanout)
182248
require.Equal(t, 2, merged.Fanout.Expected)
@@ -209,7 +275,7 @@ func TestKeyVizFanoutRunPeerHTTPError(t *testing.T) {
209275
},
210276
}
211277

212-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, local)
278+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, local, nil)
213279
require.Equal(t, []uint64{3}, merged.Rows[0].Values, "5xx peer must not perturb local counts")
214280
require.NotNil(t, merged.Fanout)
215281
require.Equal(t, 2, merged.Fanout.Expected)
@@ -238,7 +304,7 @@ func TestKeyVizFanoutRunPeerTimeout(t *testing.T) {
238304
WithTimeout(50 * time.Millisecond)
239305

240306
start := time.Now()
241-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads})
307+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads}, nil)
242308
require.Less(t, time.Since(start), 1*time.Second, "fan-out must not wait beyond its per-peer timeout")
243309
require.NotNil(t, merged.Fanout)
244310
require.Equal(t, 1, merged.Fanout.Responded)
@@ -258,7 +324,7 @@ func TestKeyVizFanoutRunNoPeers(t *testing.T) {
258324
{BucketID: "route:1", Values: []uint64{99}},
259325
},
260326
}
261-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesWrites, rows: 1024}, local)
327+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesWrites, rows: 1024}, local, nil)
262328
require.Equal(t, []uint64{99}, merged.Rows[0].Values)
263329
require.Equal(t, 1, merged.Fanout.Expected)
264330
require.Equal(t, 1, merged.Fanout.Responded)
@@ -322,7 +388,7 @@ func TestKeyVizFanoutRunPeerOrder(t *testing.T) {
322388
defer second.Close()
323389

324390
f := NewKeyVizFanout("self:8080", []string{first.URL, second.URL}).WithHTTPClient(first.Client())
325-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, matrix)
391+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, matrix, nil)
326392
require.Equal(t, []FanoutNodeStatus{
327393
{Node: "self:8080", OK: true},
328394
{Node: first.URL, OK: true},
@@ -369,7 +435,7 @@ func TestKeyVizFanoutRunPeerExceedsBodyLimit(t *testing.T) {
369435
WithResponseBodyLimit(testCap)
370436

371437
start := time.Now()
372-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads})
438+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads}, nil)
373439
require.Less(t, time.Since(start), 5*time.Second, "decode must respect the size cap and complete promptly")
374440
require.NotNil(t, merged.Fanout)
375441
require.Equal(t, 2, merged.Fanout.Expected)
@@ -416,7 +482,7 @@ func TestKeyVizFanoutRunPeerNearCapSucceedsWithWarning(t *testing.T) {
416482
WithHTTPClient(peer.Client()).
417483
WithResponseBodyLimit(testCap)
418484

419-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads})
485+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads}, nil)
420486
require.NotNil(t, merged.Fanout)
421487
require.Len(t, merged.Fanout.Nodes, 2)
422488
require.True(t, merged.Fanout.Nodes[0].OK, "self always reports ok")
@@ -458,7 +524,7 @@ func TestKeyVizFanoutRunPeerOverlargeBody(t *testing.T) {
458524
f := NewKeyVizFanout("self:8080", []string{peer.URL}).WithHTTPClient(peer.Client())
459525

460526
start := time.Now()
461-
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads})
527+
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads}, nil)
462528
require.Less(t, time.Since(start), 5*time.Second, "decode must respect the size cap and complete promptly")
463529
require.NotNil(t, merged.Fanout)
464530
require.True(t, merged.Fanout.Nodes[1].OK, "in-cap response must succeed; cap is 64 MiB and the synthetic body is well under that")

internal/admin/keyviz_handler.go

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -177,8 +177,18 @@ func (h *KeyVizHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
177177
cols := h.source.Snapshot(params.from, params.to)
178178
matrix := pivotKeyVizColumns(cols, params.series, params.rows)
179179
matrix.GeneratedAt = h.now()
180-
if h.fanout != nil {
181-
matrix = h.fanout.Run(r.Context(), params, matrix)
180+
if h.fanout != nil && r.Header.Get(keyVizFanoutPeerHeader) == "" {
181+
// Forward the inbound user's cookies so the peer's SessionAuth
182+
// middleware sees a valid principal. Cluster nodes share
183+
// --adminSessionSigningKey for HA, so a cookie minted on this
184+
// node verifies on any peer. nil cookies make peers 401,
185+
// which surfaces as ok=false in the per-node status.
186+
//
187+
// Skip the fan-out when keyVizFanoutPeerHeader is present:
188+
// the request is itself a peer call and recursing would
189+
// generate O(N²) HTTP calls per browser poll on a symmetric
190+
// cluster (Claude bot P1 on PR #692).
191+
matrix = h.fanout.Run(r.Context(), params, matrix, r.Cookies())
182192
}
183193
w.Header().Set("Content-Type", "application/json; charset=utf-8")
184194
w.Header().Set("Cache-Control", "no-store")

internal/admin/keyviz_handler_test.go

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -450,3 +450,51 @@ func TestKeyVizHandlerAggregateFallbackWhenTotalZero(t *testing.T) {
450450
require.True(t, matrix.Rows[0].Aggregate)
451451
require.Equal(t, uint64(2), matrix.Rows[0].RouteCount, "fallback to len(MemberRoutes)")
452452
}
453+
454+
// TestKeyVizHandlerSkipsFanoutForPeerCall pins the recursion guard
455+
// added on PR #692 (Claude bot P1): when the inbound request carries
456+
// the X-Admin-Fanout-Peer header, the handler must serve the local
457+
// view without invoking its own KeyVizFanout — otherwise a
458+
// symmetric cluster (every node lists every peer) would generate
459+
// O(N²) HTTP calls per browser poll. The test uses a recording
460+
// fanout that fails the test if Run is ever called.
461+
func TestKeyVizHandlerSkipsFanoutForPeerCall(t *testing.T) {
462+
t.Parallel()
463+
src := &fakeKeyVizSource{cols: []keyviz.MatrixColumn{
464+
{At: time.Unix(1_700_000_000, 0), Rows: []keyviz.MatrixRow{
465+
{RouteID: 1, Start: []byte("a"), End: []byte("b"), Writes: 5},
466+
}},
467+
}}
468+
// A real *KeyVizFanout pointed at a peer URL that, if dialled,
469+
// would record the call. We assert the call-count stays at zero
470+
// because the handler's recursion guard short-circuits before
471+
// Run runs the parallel-fetch step.
472+
var peerHits int
473+
peer := httptest.NewServer(http.HandlerFunc(func(_ http.ResponseWriter, _ *http.Request) {
474+
peerHits++
475+
}))
476+
defer peer.Close()
477+
478+
fanout := NewKeyVizFanout("self", []string{peer.URL}).WithHTTPClient(peer.Client())
479+
h := NewKeyVizHandler(src).
480+
WithClock(func() time.Time { return time.Unix(1_700_000_000, 0).UTC() }).
481+
WithFanout(fanout)
482+
srv := httptest.NewServer(h)
483+
defer srv.Close()
484+
485+
req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, srv.URL, nil)
486+
require.NoError(t, err)
487+
req.Header.Set(keyVizFanoutPeerHeader, "1")
488+
resp, err := http.DefaultClient.Do(req)
489+
require.NoError(t, err)
490+
defer resp.Body.Close()
491+
492+
require.Equal(t, http.StatusOK, resp.StatusCode)
493+
var matrix KeyVizMatrix
494+
require.NoError(t, json.NewDecoder(resp.Body).Decode(&matrix))
495+
require.Len(t, matrix.Rows, 1, "local matrix served directly")
496+
require.Nil(t, matrix.Fanout,
497+
"peer-marked request must not produce a Fanout block — fan-out was skipped, no aggregator metadata to surface")
498+
require.Equal(t, 0, peerHits,
499+
"recursion guard violated: handler dialled a peer despite X-Admin-Fanout-Peer being set")
500+
}

0 commit comments

Comments
 (0)