Skip to content

Commit 3c42378

Browse files
committed
fix(admin): KeyViz fan-out recursion guard + cookie whitelist (PR #692 r1)
Three items from round-1 review on PR #692: - Claude bot P1 / recursion storm: with cookie forwarding now letting peer calls succeed, a symmetric --keyvizFanoutNodes configuration (every node lists every other) would generate O(N^2) HTTP calls per browser poll as each peer recursively fanned out. Add a marker header X-Admin-Fanout-Peer that the fetcher sets on every peer request; KeyVizHandler.ServeHTTP short-circuits its own fan-out when the header is present. TestKeyVizHandlerSkipsFanoutForPeerCall pins this with a peer stub that fails the test if dialled. - Gemini security-medium / cookie whitelist: forward only admin_session and admin_csrf to peers, not every inbound cookie. Operators may have unrelated cookies on the same domain (analytics, feature flags, other-app sessions); blasting those across the internal network needlessly widens the data exposure. attachAdminCookies helper centralises the whitelist. TestKeyVizFanoutRunForwardsCookies extended to assert unrelated_app_session is dropped while admin_session and admin_csrf pass through verbatim. - Claude bot P2 / nil-guard the test: TestKeyVizFanoutRunForwardsCookies now require.NotNil(merged.Fanout) and require.Len(Nodes, 2) before indexing Nodes[1], so a regression that produced a nil Fanout block surfaces with a precise assertion failure rather than a generic nil-pointer panic. Design doc 3 updated with the whitelist + recursion-guard behavior; the Phase 2-C MVP now matches what the code actually does.
1 parent 92bbd15 commit 3c42378

5 files changed

Lines changed: 129 additions & 22 deletions

File tree

docs/design/2026_04_27_proposed_keyviz_cluster_fanout.md

Lines changed: 14 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -132,12 +132,22 @@ elastickv \
132132
network); the HTTP client uses `http://`. A follow-up will
133133
introduce `--keyvizFanoutTLS` once the rest of the admin path
134134
has TLS too.
135-
- **Auth (Phase 2-C MVP)**: the aggregator forwards the inbound
136-
user's `admin_session` cookie (and any other cookies the request
137-
carries) on every peer call. The peer's `SessionAuth` middleware
138-
verifies the cookie against its own `--adminSessionSigningKey`;
135+
- **Auth (Phase 2-C MVP)**: the aggregator forwards a whitelist of
136+
the inbound user's cookies (`admin_session` + `admin_csrf` only)
137+
on every peer call. The peer's `SessionAuth` middleware verifies
138+
`admin_session` against its own `--adminSessionSigningKey`;
139139
cluster nodes already share that key for HA, so a cookie minted
140140
on node A is verifiable on node B without any new infrastructure.
141+
Unrelated cookies the browser may carry (analytics, feature
142+
flags, other-app sessions on the same domain) are dropped at the
143+
fan-out boundary so they are not leaked across the internal
144+
network (Gemini security-medium on PR #692).
145+
- **Recursion guard**: peer requests carry an `X-Admin-Fanout-Peer`
146+
header. The receiving handler short-circuits its own fan-out
147+
when this header is set; without the guard, a symmetric
148+
configuration (every node lists every other node) would generate
149+
O(N²) HTTP calls per browser poll as each peer recursively
150+
fanned out. (Claude bot P1 on PR #692.)
141151
- The earlier draft of this paragraph said "anonymous on a
142152
private network" — that was wrong: the receiving side enforces
143153
session auth, so anonymous calls are rejected with 401 and the

internal/admin/keyviz_fanout.go

Lines changed: 35 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,14 @@ const keyVizFanoutDefaultTimeout = 2 * time.Second
3434
// misbehaving peer flood operator logs.
3535
const keyVizPeerErrorBodyLimit = 512
3636

37+
// keyVizFanoutPeerHeader marks a request as originating from another
38+
// node's fan-out aggregator. The receiving handler skips its own
39+
// fan-out step when this header is set, breaking the
40+
// every-node-fans-to-every-node recursion that would otherwise turn
41+
// a single browser poll into O(N²) peer HTTP calls in a symmetric
42+
// cluster (Claude bot P1 on PR #692).
43+
const keyVizFanoutPeerHeader = "X-Admin-Fanout-Peer"
44+
3745
// keyVizPeerResponseBodyLimit caps the JSON body we are willing to
3846
// decode from a peer. A misbehaving or compromised peer that streams
3947
// gigabytes back at us would otherwise pin a goroutine on
@@ -190,6 +198,25 @@ func (f *KeyVizFanout) WithTimeout(d time.Duration) *KeyVizFanout {
190198
return f
191199
}
192200

201+
// attachAdminCookies forwards only the admin session and CSRF
202+
// cookies to a peer request. We whitelist rather than passing every
203+
// inbound cookie through: a logged-in operator may have unrelated
204+
// cookies for other services on the same domain (analytics, feature
205+
// flags, dev-tooling sessions), and the fan-out should not blast
206+
// those across the cluster's internal network. The peer's
207+
// SessionAuth middleware only inspects admin_session, and the CSRF
208+
// double-submit cookie pairs with the X-Admin-CSRF header (which
209+
// fan-out doesn't send because all peer calls are GETs); the cookie
210+
// is forwarded for parity with browser-issued requests but not
211+
// load-bearing. (Gemini security-medium on PR #692.)
212+
func attachAdminCookies(req *http.Request, cookies []*http.Cookie) {
213+
for _, c := range cookies {
214+
if c.Name == sessionCookieName || c.Name == csrfCookieName {
215+
req.AddCookie(c)
216+
}
217+
}
218+
}
219+
193220
// peerResult is the per-peer outcome the goroutine pool collects
194221
// before the synchronous merge phase. Either matrix is non-nil or
195222
// err is non-nil; never both.
@@ -300,18 +327,14 @@ func (f *KeyVizFanout) fetchPeer(ctx context.Context, peer string, params keyViz
300327
return nil, pkgerrors.Wrap(err, "new request")
301328
}
302329
req.Header.Set("Accept", "application/json")
303-
// Forward the inbound user's admin session cookie so the peer's
304-
// SessionAuth middleware sees a valid principal. Without this
305-
// the peer rejects every fan-out request with 401 — the missing
306-
// piece in the Phase 2-C MVP design (#685 §3 said "anonymous on
307-
// a private network" but the receiving side still enforces
308-
// session auth, so anonymous calls don't actually go through).
309-
// Forwarding the cookie works because all admin nodes share
310-
// --adminSessionSigningKey for HA; a cookie minted by node A is
311-
// verifiable on node B.
312-
for _, c := range cookies {
313-
req.AddCookie(c)
314-
}
330+
// Mark this request as a peer fan-out call so the receiving
331+
// handler does not recursively fan out to every other peer —
332+
// without this header, a symmetric cluster (every node lists
333+
// every other node) generates O(N²) peer HTTP calls per
334+
// browser poll. The check on the receiving side is in
335+
// KeyVizHandler.ServeHTTP. (Claude bot P1 on PR #692.)
336+
req.Header.Set(keyVizFanoutPeerHeader, "1")
337+
attachAdminCookies(req, cookies)
315338
resp, err := f.client.Do(req)
316339
if err != nil {
317340
return nil, pkgerrors.Wrap(err, "peer request")

internal/admin/keyviz_fanout_test.go

Lines changed: 26 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -160,17 +160,30 @@ func TestMergeKeyVizMatricesDistinctRowsPreserveOrder(t *testing.T) {
160160
// this, peers reject every fan-out call with 401 missing-session-
161161
// cookie and the cluster heatmap collapses to "1 of N nodes
162162
// responded".
163+
//
164+
// Pins both halves of the cookie contract:
165+
// - admin_session and admin_csrf are forwarded with their values
166+
// verbatim.
167+
// - Unrelated cookies present on the inbound request are
168+
// **dropped** rather than leaked to peer nodes (Gemini
169+
// security-medium on PR #692).
170+
// - The peer call carries the X-Admin-Fanout-Peer header so the
171+
// receiving handler can short-circuit its own fan-out (Claude
172+
// bot P1 on PR #692; the receiving check is in
173+
// KeyVizHandler.ServeHTTP and is exercised by
174+
// TestKeyVizHandlerSkipsFanoutForPeerCall in
175+
// keyviz_handler_test.go).
163176
func TestKeyVizFanoutRunForwardsCookies(t *testing.T) {
164177
t.Parallel()
165-
var seenCookies [][]*http.Cookie
178+
var seenRequests []*http.Request
166179
var seenMu sync.Mutex
167180
peer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
168181
if !strings.HasSuffix(r.URL.Path, "/admin/api/v1/keyviz/matrix") {
169182
http.NotFound(w, r)
170183
return
171184
}
172185
seenMu.Lock()
173-
seenCookies = append(seenCookies, r.Cookies())
186+
seenRequests = append(seenRequests, r.Clone(context.Background()))
174187
seenMu.Unlock()
175188
w.Header().Set("Content-Type", "application/json")
176189
w.WriteHeader(http.StatusOK)
@@ -182,19 +195,27 @@ func TestKeyVizFanoutRunForwardsCookies(t *testing.T) {
182195
cookies := []*http.Cookie{
183196
{Name: "admin_session", Value: "session-token-abc"},
184197
{Name: "admin_csrf", Value: "csrf-token-def"},
198+
{Name: "unrelated_app_session", Value: "should-not-leak"},
185199
}
186200
merged := f.Run(context.Background(), keyVizParams{series: keyVizSeriesReads, rows: 1024}, KeyVizMatrix{Series: keyVizSeriesReads}, cookies)
201+
require.NotNil(t, merged.Fanout, "fan-out result must be present")
202+
require.Len(t, merged.Fanout.Nodes, 2, "self + 1 peer = 2 nodes (Claude bot P2 on PR #692)")
187203
require.True(t, merged.Fanout.Nodes[1].OK)
204+
188205
seenMu.Lock()
189206
defer seenMu.Unlock()
190-
require.Len(t, seenCookies, 1)
191-
got := seenCookies[0]
207+
require.Len(t, seenRequests, 1)
208+
got := seenRequests[0].Cookies()
192209
names := make(map[string]string, len(got))
193210
for _, c := range got {
194211
names[c.Name] = c.Value
195212
}
196213
require.Equal(t, "session-token-abc", names["admin_session"], "session cookie must be forwarded verbatim")
197-
require.Equal(t, "csrf-token-def", names["admin_csrf"], "all inbound cookies forwarded; CSRF is innocuous on a GET")
214+
require.Equal(t, "csrf-token-def", names["admin_csrf"], "csrf cookie forwarded; harmless on a GET but kept for parity")
215+
require.NotContains(t, names, "unrelated_app_session",
216+
"unrelated cookies must be dropped; only admin_session/admin_csrf are whitelisted")
217+
require.Equal(t, "1", seenRequests[0].Header.Get(keyVizFanoutPeerHeader),
218+
"peer marker header must be set so the receiver short-circuits its own fan-out")
198219
}
199220

200221
// TestKeyVizFanoutRunSinglePeerOK exercises the end-to-end happy

internal/admin/keyviz_handler.go

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -177,12 +177,17 @@ func (h *KeyVizHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
177177
cols := h.source.Snapshot(params.from, params.to)
178178
matrix := pivotKeyVizColumns(cols, params.series, params.rows)
179179
matrix.GeneratedAt = h.now()
180-
if h.fanout != nil {
180+
if h.fanout != nil && r.Header.Get(keyVizFanoutPeerHeader) == "" {
181181
// Forward the inbound user's cookies so the peer's SessionAuth
182182
// middleware sees a valid principal. Cluster nodes share
183183
// --adminSessionSigningKey for HA, so a cookie minted on this
184184
// node verifies on any peer. nil cookies make peers 401,
185185
// which surfaces as ok=false in the per-node status.
186+
//
187+
// Skip the fan-out when keyVizFanoutPeerHeader is present:
188+
// the request is itself a peer call and recursing would
189+
// generate O(N²) HTTP calls per browser poll on a symmetric
190+
// cluster (Claude bot P1 on PR #692).
186191
matrix = h.fanout.Run(r.Context(), params, matrix, r.Cookies())
187192
}
188193
w.Header().Set("Content-Type", "application/json; charset=utf-8")

internal/admin/keyviz_handler_test.go

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -450,3 +450,51 @@ func TestKeyVizHandlerAggregateFallbackWhenTotalZero(t *testing.T) {
450450
require.True(t, matrix.Rows[0].Aggregate)
451451
require.Equal(t, uint64(2), matrix.Rows[0].RouteCount, "fallback to len(MemberRoutes)")
452452
}
453+
454+
// TestKeyVizHandlerSkipsFanoutForPeerCall pins the recursion guard
455+
// added on PR #692 (Claude bot P1): when the inbound request carries
456+
// the X-Admin-Fanout-Peer header, the handler must serve the local
457+
// view without invoking its own KeyVizFanout — otherwise a
458+
// symmetric cluster (every node lists every peer) would generate
459+
// O(N²) HTTP calls per browser poll. The test uses a recording
460+
// fanout that fails the test if Run is ever called.
461+
func TestKeyVizHandlerSkipsFanoutForPeerCall(t *testing.T) {
462+
t.Parallel()
463+
src := &fakeKeyVizSource{cols: []keyviz.MatrixColumn{
464+
{At: time.Unix(1_700_000_000, 0), Rows: []keyviz.MatrixRow{
465+
{RouteID: 1, Start: []byte("a"), End: []byte("b"), Writes: 5},
466+
}},
467+
}}
468+
// A real *KeyVizFanout pointed at a peer URL that, if dialled,
469+
// would record the call. We assert the call-count stays at zero
470+
// because the handler's recursion guard short-circuits before
471+
// Run runs the parallel-fetch step.
472+
var peerHits int
473+
peer := httptest.NewServer(http.HandlerFunc(func(_ http.ResponseWriter, _ *http.Request) {
474+
peerHits++
475+
}))
476+
defer peer.Close()
477+
478+
fanout := NewKeyVizFanout("self", []string{peer.URL}).WithHTTPClient(peer.Client())
479+
h := NewKeyVizHandler(src).
480+
WithClock(func() time.Time { return time.Unix(1_700_000_000, 0).UTC() }).
481+
WithFanout(fanout)
482+
srv := httptest.NewServer(h)
483+
defer srv.Close()
484+
485+
req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, srv.URL, nil)
486+
require.NoError(t, err)
487+
req.Header.Set(keyVizFanoutPeerHeader, "1")
488+
resp, err := http.DefaultClient.Do(req)
489+
require.NoError(t, err)
490+
defer resp.Body.Close()
491+
492+
require.Equal(t, http.StatusOK, resp.StatusCode)
493+
var matrix KeyVizMatrix
494+
require.NoError(t, json.NewDecoder(resp.Body).Decode(&matrix))
495+
require.Len(t, matrix.Rows, 1, "local matrix served directly")
496+
require.Nil(t, matrix.Fanout,
497+
"peer-marked request must not produce a Fanout block — fan-out was skipped, no aggregator metadata to surface")
498+
require.Equal(t, 0, peerHits,
499+
"recursion guard violated: handler dialled a peer despite X-Admin-Fanout-Peer being set")
500+
}

0 commit comments

Comments
 (0)