Skip to content

Commit 689f2e4

Browse files
committed
fix(admin): default-group-scoped leader probe + RFC-compliant 405
Address findings from PR #689 round-1 review: - [Codex P1] Scope newAdminLeaderProbe to the **default Raft group** via coordinate.IsLeader / coordinate.VerifyLeader instead of the prior "any local runtime" loop. Admin write paths (kv/sharded_coordinator.go, adapter/sqs_admin.go) all key off the default group, so a node leading only a non-default group could return 200 here while still rejecting admin writes -- breaking the endpoint's stated load-balancer contract. Routing through the coordinator keeps the healthz answer aligned with what the AdminForward proxy and SQS admin write path treat as authoritative. newAdminLeaderProbe(runtimes []*raftGroupRuntime) becomes newAdminLeaderProbe(coordinate kv.Coordinator). The change drops the per-runtime engine.VerifyLeader fan-out and the adminLeaderProbeTimeout constant -- coordinate.VerifyLeader is the same ReadIndex round-trip lease reads use, with the engine's own bounded deadline, so an outer 2 s timeout is no longer needed. - [Claude / Gemini] Add the RFC 7231 §6.5.5 Allow: GET, HEAD header to every router-served 405 response (healthz, healthz/leader, asset, SPA). Pulled the body shape into a writeMethodNotAllowed helper so the four call sites stay uniform; load balancers and synthetic-monitor tools now see the same Allow header the S3 and DynamoDB /healthz/leader handlers already advertise. New TestRouter_405_AllowHeader sweeps every 405-emitting path so a future handler that bypasses the helper would regress this test. - [Claude / Gemini] Drop the redundant rt.leader == nil check inside serveLeaderHealth. dispatch() already short-circuits the nil case to rt.notFound before this handler is ever invoked; the in-body check was dead code that misled readers about the precondition. Documented the precondition explicitly on the function. - [Gemini] Rewrite the misleading "ordering before /admin/healthz" comment in classify(). Both healthz paths are exact-match equality checks; their relative order does not matter for correctness. The load-bearing rule is that BOTH must run before the SPA prefix branch -- now spelled out that way. Self-review (CLAUDE.md 5 lenses): 1. Data loss -- None. Read-only handlers; no Raft proposals touched. 2. Concurrency -- The probe now serialises through the coordinator's default-group view, which is the same path lease reads use. Behaviour during leadership change is unchanged: VerifyLeader's ReadIndex returns an error, the probe returns false, the LB takes the node out of rotation. 3. Performance -- Cheaper than the prior per-runtime loop in the common single-group case (one IsLeader / one VerifyLeader vs N per-runtime Status checks plus M VerifyLeader calls). 4. Data consistency -- coordinate.VerifyLeader is the same ReadIndex round-trip the lease-read path uses; the load balancer can now key off the same liveness signal admin writes do. 5. Test coverage -- new TestRouter_405_AllowHeader. Existing six /admin/healthz/leader tests still cover happy / 503 / HEAD / POST / nil-probe / SPA-no-swallow. The redundant-nil-guard removal is covered by TestRouter_HealthzLeader_NilProbeReturns404 (still passes via dispatch()'s nil check). Body-text parity nit: the PR description's "identical responses to S3 and DynamoDB" is technically true vs DynamoDB but not S3 (S3 omits the trailing newline; this admin path matches DynamoDB). Not worth a behaviour change -- LBs key off status codes, not body strings -- but worth noting in case a future PR aligns the three.
1 parent 5f50223 commit 689f2e4

3 files changed

Lines changed: 89 additions & 39 deletions

File tree

internal/admin/router.go

Lines changed: 32 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -150,9 +150,11 @@ func (rt *Router) classify(p string) routeKind {
150150
if k, ok := classifyAssets(p); ok {
151151
return k
152152
}
153-
// /admin/healthz/leader must be checked before /admin/healthz so
154-
// the longer path does not fall through the equality check and
155-
// resolve to the SPA fallback.
153+
// Both /admin/healthz and /admin/healthz/leader are exact-match
154+
// equality checks — relative ordering between them does not affect
155+
// correctness, but BOTH must run before the SPA prefix branch
156+
// below; otherwise the catch-all "anything under /admin/" resolves
157+
// these paths to index.html.
156158
if p == pathHealthzLeader {
157159
return routeHealthzLeader
158160
}
@@ -214,9 +216,27 @@ func (rt *Router) dispatch(k routeKind) http.Handler {
214216
return rt.notFound
215217
}
216218

219+
// allowGetHead is the canonical Allow value for read-only handlers
220+
// (healthz / SPA / static assets). RFC 7231 §6.5.5 requires every 405
221+
// to advertise the supported methods; load balancers and synthetic-
222+
// monitor tools key off this header to discover the right verbs
223+
// without scraping the body. Mirrors the value the S3 and DynamoDB
224+
// /healthz/leader handlers already set
225+
// (adapter/s3.go:2404, adapter/dynamodb.go:399).
226+
const allowGetHead = "GET, HEAD"
227+
228+
// writeMethodNotAllowed centralises the read-only 405 response shape
229+
// for router-served paths. The Allow header is set BEFORE
230+
// writeJSONError because writeJSONError calls w.WriteHeader, after
231+
// which header mutations would silently no-op.
232+
func writeMethodNotAllowed(w http.ResponseWriter) {
233+
w.Header().Set("Allow", allowGetHead)
234+
writeJSONError(w, http.StatusMethodNotAllowed, "method_not_allowed", "only GET or HEAD supported")
235+
}
236+
217237
func (rt *Router) serveHealth(w http.ResponseWriter, r *http.Request) {
218238
if r.Method != http.MethodGet && r.Method != http.MethodHead {
219-
writeJSONError(w, http.StatusMethodNotAllowed, "method_not_allowed", "only GET or HEAD supported")
239+
writeMethodNotAllowed(w)
220240
return
221241
}
222242
w.Header().Set("Content-Type", "text/plain; charset=utf-8")
@@ -233,13 +253,17 @@ func (rt *Router) serveHealth(w http.ResponseWriter, r *http.Request) {
233253
// /healthz/leader endpoints (adapter/s3.go:serveS3LeaderHealthz,
234254
// adapter/dynamodb.go:serveDynamoLeaderHealthz) so an upstream load
235255
// balancer that probes any of the three sees identical semantics.
256+
//
257+
// Precondition: rt.leader is non-nil. dispatch() short-circuits the
258+
// nil case to rt.notFound before this handler is ever invoked, so a
259+
// belt-and-braces nil check inside this body would be dead code.
236260
func (rt *Router) serveLeaderHealth(w http.ResponseWriter, r *http.Request) {
237261
if r.Method != http.MethodGet && r.Method != http.MethodHead {
238-
writeJSONError(w, http.StatusMethodNotAllowed, "method_not_allowed", "only GET or HEAD supported")
262+
writeMethodNotAllowed(w)
239263
return
240264
}
241265
status, body := http.StatusOK, "ok\n"
242-
if rt.leader == nil || !rt.leader.IsVerifiedLeader() {
266+
if !rt.leader.IsVerifiedLeader() {
243267
status, body = http.StatusServiceUnavailable, "not leader\n"
244268
}
245269
w.Header().Set("Content-Type", "text/plain; charset=utf-8")
@@ -258,7 +282,7 @@ func (rt *Router) serveAsset(w http.ResponseWriter, r *http.Request) {
258282
// the file body for a write request) or surface as a confusing
259283
// 404 — neither matches the API contract for assets.
260284
if r.Method != http.MethodGet && r.Method != http.MethodHead {
261-
writeJSONError(w, http.StatusMethodNotAllowed, "method_not_allowed", "only GET or HEAD supported")
285+
writeMethodNotAllowed(w)
262286
return
263287
}
264288
if rt.static == nil {
@@ -335,7 +359,7 @@ func (rt *Router) serveSPA(w http.ResponseWriter, r *http.Request) {
335359
// /admin/something returned a JSON 404 with a nil static and a
336360
// JSON 405 with a populated static — same URL, different answer.
337361
if r.Method != http.MethodGet && r.Method != http.MethodHead {
338-
writeJSONError(w, http.StatusMethodNotAllowed, "method_not_allowed", "only GET or HEAD supported")
362+
writeMethodNotAllowed(w)
339363
return
340364
}
341365
if rt.static == nil {

internal/admin/router_test.go

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -119,12 +119,47 @@ func TestRouter_HealthzLeader_HeadOmitsBody(t *testing.T) {
119119

120120
// TestRouter_HealthzLeader_RejectsPost guards the method allowlist:
121121
// only GET / HEAD are accepted. Mirrors TestRouter_HealthzRejectsPost.
122+
// Also asserts the Allow: GET, HEAD header required by RFC 7231
123+
// §6.5.5 — load balancers and synthetic-monitor tools key off this
124+
// header to discover supported verbs.
122125
func TestRouter_HealthzLeader_RejectsPost(t *testing.T) {
123126
r := NewRouterWithLeaderProbe(nil, nil, LeaderProbeFunc(func() bool { return true }))
124127
req := httptest.NewRequest(http.MethodPost, "/admin/healthz/leader", strings.NewReader(""))
125128
rec := httptest.NewRecorder()
126129
r.ServeHTTP(rec, req)
127130
require.Equal(t, http.StatusMethodNotAllowed, rec.Code)
131+
require.Equal(t, "GET, HEAD", rec.Header().Get("Allow"))
132+
}
133+
134+
// TestRouter_405_AllowHeader sweeps every router-served path that
135+
// rejects non-GET/HEAD verbs and asserts each emits the
136+
// RFC-required Allow header. Locks down the writeMethodNotAllowed
137+
// invariant — a future handler that bypasses the helper would
138+
// regress this test rather than silently shipping a non-compliant
139+
// 405.
140+
func TestRouter_405_AllowHeader(t *testing.T) {
141+
t.Parallel()
142+
cases := []struct {
143+
name string
144+
path string
145+
}{
146+
{"healthz", "/admin/healthz"},
147+
{"healthz_leader", "/admin/healthz/leader"},
148+
{"asset", "/admin/assets/app.js"},
149+
{"spa", "/admin/somewhere"},
150+
}
151+
r := NewRouterWithLeaderProbe(nil, newTestStatic(), LeaderProbeFunc(func() bool { return true }))
152+
for _, tc := range cases {
153+
t.Run(tc.name, func(t *testing.T) {
154+
t.Parallel()
155+
req := httptest.NewRequest(http.MethodPost, tc.path, strings.NewReader(""))
156+
rec := httptest.NewRecorder()
157+
r.ServeHTTP(rec, req)
158+
require.Equal(t, http.StatusMethodNotAllowed, rec.Code)
159+
require.Equal(t, "GET, HEAD", rec.Header().Get("Allow"),
160+
"path %s missing RFC 7231 Allow header on 405", tc.path)
161+
})
162+
}
128163
}
129164

130165
// TestRouter_HealthzLeader_NilProbeReturns404 locks down the

main_admin.go

Lines changed: 22 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -135,55 +135,46 @@ func startAdminFromFlags(
135135
if err != nil {
136136
return errors.Wrap(err, "build admin leader forwarder")
137137
}
138-
leaderProbe := newAdminLeaderProbe(runtimes)
138+
leaderProbe := newAdminLeaderProbe(coordinate)
139139
_, err = startAdminServer(ctx, lc, eg, cfg, staticCreds, clusterSrc, tablesSrc, bucketsSrc, queuesSrc, forwarder, leaderProbe, keyvizSampler, keyvizFanoutCfg, buildVersion())
140140
return err
141141
}
142142

143-
// adminLeaderProbeTimeout caps the per-runtime VerifyLeader ReadIndex
144-
// round-trip the /admin/healthz/leader handler triggers. Keep it short:
145-
// a load balancer that probes every few seconds wants a fast "yes / no"
146-
// even when the cluster is mid-election; surfacing 503 on a stalled
147-
// ReadIndex is the right answer for routing.
148-
const adminLeaderProbeTimeout = 2 * time.Second
149-
150143
// newAdminLeaderProbe builds the LeaderProbe consumed by
151144
// /admin/healthz/leader. It mirrors the verified-leader pattern S3 and
152145
// DynamoDB use on their own /healthz/leader endpoints
153146
// (adapter/s3.go:isVerifiedS3Leader,
154-
// adapter/dynamodb.go:isVerifiedDynamoLeader): a cheap Status check
147+
// adapter/dynamodb.go:isVerifiedDynamoLeader): a cheap IsLeader check
155148
// short-circuits non-leaders, and only nodes claiming leadership pay
156149
// the ReadIndex round-trip that confirms the claim is still valid.
157150
//
158-
// "Leader of any local Raft group" is treated as "yes" so a multi-group
159-
// deployment that distributes leadership across nodes still surfaces a
160-
// useful answer to a load balancer that wants to skip the
161-
// AdminForward proxy hop. In the common single-group / co-located
162-
// deployment, this collapses to "is this node THE leader".
151+
// Crucially the probe is scoped to the **default Raft group** (via
152+
// coordinate.IsLeader / coordinate.VerifyLeader) — the same group the
153+
// admin write paths key off (kv/sharded_coordinator.go), which the
154+
// AdminForward proxy and the SQS admin write path
155+
// (adapter/sqs_admin.go) both treat as authoritative. An earlier
156+
// design returned true on any local-group leadership; in a multi-group
157+
// deployment that could surface 200 on a node leading only a non-
158+
// default group while admin writes there still 503'd or forwarded.
159+
// Codex P1 on PR #689 caught this; using the coordinator keeps the
160+
// healthz contract aligned with the actual admin-write leader.
163161
//
164-
// Returns nil for an empty runtimes slice so the router answers
162+
// Returns nil when no coordinator is wired so the router answers
165163
// /admin/healthz/leader with the standard JSON 404 (matches the
166164
// "feature off" pattern Tables / Buckets / Queues already use).
167-
func newAdminLeaderProbe(runtimes []*raftGroupRuntime) admin.LeaderProbe {
168-
if len(runtimes) == 0 {
165+
func newAdminLeaderProbe(coordinate kv.Coordinator) admin.LeaderProbe {
166+
if coordinate == nil {
169167
return nil
170168
}
171169
return admin.LeaderProbeFunc(func() bool {
172-
for _, rt := range runtimes {
173-
if rt == nil || rt.engine == nil {
174-
continue
175-
}
176-
if rt.engine.Status().State != raftengine.StateLeader {
177-
continue
178-
}
179-
ctx, cancel := context.WithTimeout(context.Background(), adminLeaderProbeTimeout)
180-
err := rt.engine.VerifyLeader(ctx)
181-
cancel()
182-
if err == nil {
183-
return true
184-
}
170+
if !coordinate.IsLeader() {
171+
return false
185172
}
186-
return false
173+
// VerifyLeader is the same ReadIndex round-trip lease reads
174+
// use; under the hood it carries an engine-bounded deadline,
175+
// so a stalled cluster surfaces 503 here on its own without
176+
// the probe needing an outer timeout.
177+
return coordinate.VerifyLeader() == nil
187178
})
188179
}
189180

0 commit comments

Comments
 (0)