
admin: AdminForward Phase 3 — register gRPC server + wire follower-side bridge#648

Merged
bootjp merged 4 commits into main from feat/admin-forward-bridge
Apr 26, 2026

Conversation

@bootjp
Owner

@bootjp bootjp commented Apr 26, 2026

Phase 3 of Task #26 / AdminForward design 3.3. Closes the wiring loop opened by #635 (leader-side dispatcher) and #644 (follower-side client + handler integration): the admin HTTP DynamoHandler now hands ErrTablesNotLeader writes off to the in-process LeaderForwarder, and the leader-side gRPC AdminForward service is registered on every node so the forwarded RPC actually lands somewhere.

Summary

  • admin.ServerDeps.Forwarder (3315bf4) — new optional field on the high-level admin server composition surface. admin.NewServer calls WithLeaderForwarder when set; nil leaves the existing 503 + Retry-After:1 fallback untouched. Single-node / leader-only deployments are unaffected.
  • main_admin_forward.go (f3e529c) — production bridge between kv.GRPCConnCache and the admin layer's typed PBAdminForwardClient (via pb.NewAdminForwardClient). Includes buildLeaderForwarder, roleStoreFromFlags, adminForwardServerDeps, and a readyForRegistration gate.
  • Runtime wiring (19f6b70) — runtimeServerRunner.start() now creates dynamoServer BEFORE startRaftServers, so the leader-side ForwardServer can use it as its TablesSource. startRaftServers accepts the new adminForwardServerDeps bundle and delegates to registerAdminForwardServer. startServers constructs a single *kv.GRPCConnCache and forwards it (with the Coordinator) to startAdminFromFlags, which in turn builds the production LeaderForwarder via buildAdminLeaderForwarder and passes it through ServerDeps.Forwarder.
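
The nil-vs-wired Forwarder contract described above can be sketched as follows. All types here are hypothetical stand-ins for the real ones in internal/admin, whose signatures may differ:

```go
package main

import "fmt"

// Hypothetical mirrors of the admin-package types named in this PR;
// the real signatures live in internal/admin and may differ.
type LeaderForwarder interface {
	Forward(table string) (status int, err error)
}

// ServerDeps models the composition surface: Forwarder is optional.
type ServerDeps struct {
	Forwarder LeaderForwarder // nil keeps the 503 + Retry-After: 1 fallback
}

type dynamoHandler struct {
	forwarder LeaderForwarder
}

// handleNotLeader models the follower-side branch taken when
// ErrTablesNotLeader surfaces: hand off to the forwarder when one is
// wired, otherwise fall back to 503 + Retry-After: 1.
func (h *dynamoHandler) handleNotLeader(table string) (status int, retryAfter string) {
	if h.forwarder == nil {
		return 503, "1"
	}
	code, err := h.forwarder.Forward(table)
	if err != nil {
		return 503, "1"
	}
	return code, ""
}

// stubForwarder stands in for the production LeaderForwarder and
// replays a leader-side 201.
type stubForwarder struct{}

func (stubForwarder) Forward(string) (int, error) { return 201, nil }

func main() {
	// Single-node / leader-only deployment: no forwarder wired.
	deps := ServerDeps{}
	h := &dynamoHandler{forwarder: deps.Forwarder}
	code, retry := h.handleNotLeader("orders")
	fmt.Println(code, retry) // 503 1

	// Follower with the bridge wired through ServerDeps.Forwarder.
	deps.Forwarder = stubForwarder{}
	h = &dynamoHandler{forwarder: deps.Forwarder}
	code, _ = h.handleNotLeader("orders")
	fmt.Println(code) // 201
}
```

The point of the sketch: the fallback path is the zero value, so existing deployments that never set Forwarder keep their current behaviour without any flag.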

What is NOT in this PR

  • Rolling-upgrade compatibility flag (criterion 5) — still deferred behind a cluster-version bump.
  • Election-period retry loop on the follower's bridge — the handler returns 503 + Retry-After:1, the SPA / curl re-issues, and the bridge dials again on the next attempt. Inline retry inside the handler would just hide the latency from operators.

Test plan

  • go build ./...
  • go vet ./...
  • golangci-lint run (main + admin packages: 0 issues)
  • go test ./internal/admin/ -count=1 -race — full admin suite passes (existing 21 forward tests + 1 new TestServer_ServerDepsForwarderIsWired)
  • go test . -count=1 -race — main package passes (4 new bridge / role-store / readyForRegistration tests)
  • adapter package times out at 240s on this branch AND on main — pre-existing flake unrelated to this PR (verified by checking out main and running the same suite)
  • End-to-end smoke against a 3-node cluster — needs the local Jepsen runner; blocked on the next PR or a manual run

Acceptance criteria coverage

#  Criterion                        This PR
1  Leader direct write              ✓ (in main since #634)
2  Follower forwards transparently  ✓ — wiring complete; both leader-side gRPC service AND follower-side bridge are now plumbed
3  Election-period 503 + retry      ✓ — handler still returns 503 + Retry-After:1 when no leader is known; the SPA / client retries and the next attempt re-dials
4  Leader demotes stale full role   ✓ (in main since #635)
5  Rolling-upgrade compat flag      ⏳ deferred (cluster-version bump)
6  forwarded_from in audit log      ✓ (in main since #635 — bridge now stamps --raftId onto every forwarded request)

Self-review (5 lenses)

  1. Data loss: No FSM / Raft / Pebble path changes; admin writes still go through the same AdminCreateTable / AdminDeleteTable Raft proposal once they reach the leader.
  2. Concurrency: runner.start() reorder confirmed safe — startDynamoDBServer only depends on coordinate + shardStore (already constructed); raft TCP listeners are independent of the dynamo HTTP listener. Worst case during the 100 ms-or-so reorder window is the same "no leader yet" 503 the old order would have produced if a request landed before raft converged.
  3. Performance: One additional &kv.GRPCConnCache{} allocation per process; one closure for the resolver. No hot-path changes.
  4. Data consistency: Forwarded requests re-validate the principal at the leader (criterion 4, already shipped); commit-ts ordering is unchanged because the leader's AdminCreateTable path does its own HLC.Next.
  5. Test coverage: 4 new main-package tests + 1 new admin-package server-level test. TestServer_ServerDepsForwarderIsWired is the regression test for "future refactor drops the forwarder before it reaches the dynamo handler" — exactly the failure mode this PR's wiring change makes possible.

Summary by CodeRabbit

  • New Features

    • Added leader-forwarding capability for admin requests sent to follower nodes
    • Implemented role-based access control for admin API operations
    • Enhanced DynamoDB request routing to automatically direct operations to leader nodes
  • Tests

    • Added comprehensive integration and unit tests validating leader-forwarding paths

bootjp added 3 commits April 26, 2026 06:05
Phase 3 of AdminForward (design 3.3) needs the production main.go path
to plug a LeaderForwarder into the Dynamo handler so follower-side
admin writes transparently forward to the leader. The handler-level
WithLeaderForwarder option already shipped in PR #644; this commit
exposes it via the high-level admin.NewServer composition surface so
the wiring lives in main.go without reaching past the public API.

A nil Forwarder leaves the existing behaviour intact (no-forwarder
fallback path: ErrTablesNotLeader → 503 + Retry-After: 1), so
single-node and leader-only deployments are unaffected.

The integration test wires a stub forwarder through ServerDeps and
confirms that a follower-side CreateTable replays the leader's 201 +
payload verbatim — guarding against a future refactor that drops the
forwarder before it reaches the Dynamo handler.
Phase 3 of AdminForward (design 3.3): the production path needs to
construct an admin.LeaderForwarder that the dynamo HTTP handler can
invoke when ErrTablesNotLeader bubbles up. The forwarder, in turn,
needs a way to reach the leader over gRPC and a way to discover the
leader's address.

This commit lands the wiring infrastructure in main_admin_forward.go
without touching main.go's runtime layout yet:

- adminForwardConnFactory bridges kv.GRPCConnCache to
  admin.GRPCConnFactory. The cache hands back *grpc.ClientConn; the
  admin layer wants a typed PBAdminForwardClient.
  pb.NewAdminForwardClient(conn) is the generated adapter.

- buildLeaderForwarder is the production constructor: resolver pulls
  coordinate.RaftLeader() (the default group's leader address), the
  factory above wraps the connection cache, and the local nodeID is
  stamped onto every forwarded request so the leader's audit log
  carries the trace (criterion 6).

- roleStoreFromFlags builds the access-key → role index from the same
  flag values admin.Config.RoleIndex consumes. The gRPC ForwardServer
  registration in startRaftServers can construct it without waiting
  for the (later) startAdminFromFlags admin-config parse.

- adminForwardServerDeps + readyForRegistration is the gate
  startRaftServers will use to decide whether to register the gRPC
  service at all. A nil TablesSource (cluster-only build) or a nil
  RoleStore (admin auth disabled) means the registered service would
  500 every forwarded call, so silently skipping is the correct
  behaviour.

Three sets of unit tests cover the constructors' input validation,
the conn-factory's empty-address path, and the readyForRegistration
truth table — all deterministic, no Raft or TCP required.
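
The roleStoreFromFlags helper described above might look roughly like this. Types and precedence are hypothetical; note the read-only bias on overlapping keys, since overlap rejection is deferred to the HTTP-path validation:

```go
package main

import "fmt"

// Hypothetical sketch of the access-key → role index; the real
// roleStoreFromFlags in main_admin_forward.go may differ in types
// and precedence handling.
type role int

const (
	roleReadOnly role = iota
	roleFull
)

func roleStoreFromFlags(fullKeys, readOnlyKeys []string) map[string]role {
	store := make(map[string]role, len(fullKeys)+len(readOnlyKeys))
	for _, k := range fullKeys {
		store[k] = roleFull
	}
	// Later assignment wins: an overlapping key ends up read-only
	// here. Overlap *rejection* is owned by the HTTP path's
	// config validation, as noted above.
	for _, k := range readOnlyKeys {
		store[k] = roleReadOnly
	}
	return store
}

func main() {
	s := roleStoreFromFlags([]string{"alice", "ops"}, []string{"bob", "ops"})
	fmt.Println(s["alice"] == roleFull)   // true
	fmt.Println(s["bob"] == roleReadOnly) // true
	fmt.Println(s["ops"] == roleReadOnly) // true: overlap biased read-only
}
```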

The runtime wiring (runner.start reorder + pb.RegisterAdminForwardServer
in startRaftServers + ServerDeps.Forwarder plumb-through) ships in
follow-up commits in this same PR.
Phase 3 of AdminForward (design 3.3). Closes the wiring loop: the
admin HTTP dynamo handler can now hand ErrTablesNotLeader writes off
to the in-process LeaderForwarder, and the leader-side gRPC
AdminForward service is registered on every node so the forwarded
RPC actually lands somewhere.

Changes:

- runtimeServerRunner.start() now constructs the DynamoDBServer
  BEFORE startRaftServers. The previous order — raft listeners first,
  dynamoServer afterwards — was fine when no service needed the
  dynamoServer at registration time, but the leader-side gRPC
  ForwardServer needs it as its TablesSource. Both servers listen on
  different addresses; the dynamo HTTP listener accepting traffic
  before raft TCP listeners are up is no different from the existing
  startup-race semantics.

- startRaftServers gains an adminForwardServerDeps parameter and
  delegates registration to registerAdminForwardServer in
  main_admin_forward.go. The bundle's readyForRegistration gate
  decides silently whether to register at all (cluster-only builds
  with no dynamoServer, or builds with admin auth disabled, skip it).

- runtimeServerRunner gains a roleStore field, populated in
  startServers from the same flag values admin.Config.RoleIndex
  consumes inside startAdminFromFlags. The two paths build their
  own access-key→role index from the same input strings; cost is
  microseconds at startup.

- startServers constructs a single *kv.GRPCConnCache and forwards
  it (along with the Coordinator) to startAdminFromFlags. The
  cache is shared with any future bridge that dials the leader's
  gRPC ports.

- startAdminFromFlags accepts coordinate + connCache, builds a
  production LeaderForwarder via buildAdminLeaderForwarder, and
  passes it through ServerDeps.Forwarder. Single-node / leader-only
  builds (nil connCache or nil coordinate) intentionally fall back
  to the no-forwarder path — the dynamo handler still surfaces 503
  + Retry-After:1 in those cases.

This commit lights up acceptance criterion 2 (follower forwards
transparently) end-to-end and tightens criterion 3's election-period
behaviour: a follower with a configured forwarder hands the request
to the leader instead of bouncing the SPA with a synthetic 503.

Existing main_admin_test.go callers updated to pass nil for the
new LeaderForwarder argument; the tests already exercise the
no-forwarder path.

@bootjp
Owner Author

bootjp commented Apr 26, 2026

@claude review

@coderabbitai

coderabbitai Bot commented Apr 26, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 36572595-5545-4961-a36b-b00e31e934b2

📥 Commits

Reviewing files that changed from the base of the PR and between 19f6b70 and 964ecd4.

📒 Files selected for processing (3)
  • main.go
  • main_admin.go
  • main_admin_forward_test.go
📝 Walkthrough

Walkthrough

The PR adds leader-forwarding support to the admin API by introducing a Forwarder dependency field to enable request delegation to the leader node. It establishes an admin role store for access control, implements a shared gRPC connection cache for efficiency, and refactors the startup sequence to wire these components through the admin server initialization path.

Changes

Cohort / File(s) Summary
Admin Server Core
internal/admin/server.go, internal/admin/server_test.go
Added optional Forwarder field to ServerDeps; refactored dynamo handler construction to conditionally inject the forwarder via WithLeaderForwarder. New integration test verifies forwarder is wired and request forwarding works as expected.
Admin Forwarding Infrastructure
main_admin_forward.go, main_admin_forward_test.go
New module establishes leader-forwarding infrastructure including adminForwardConnFactory for gRPC client translation, buildLeaderForwarder for constructing the forwarder with coordinator and connection cache, adminForwardServerDeps bundling tables and role store with conditional registration gating, and roleStoreFromFlags to map access keys to role permissions. Comprehensive test coverage validates error cases, happy paths, and conditional logic.
Startup and Wiring
main.go, main_admin.go
Refactored startup sequence to pre-parse role store from admin access-key flags and instantiate process-wide gRPC connection cache. startAdminFromFlags now accepts coordinator and cache, constructs the forwarder, and injects it into admin server dependencies. startRaftServers updated to accept admin-forward dependencies and register the leader-side AdminForward gRPC service with conditional gating.
Test Updates
main_admin_test.go
Updated startAdminServer(...) call sites to pass nil for the new forwarder parameter, adjusting test invocations to match extended function signatures.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant AdminHTTP as Admin HTTP Server
    participant Forwarder
    participant GRPCCache as gRPC Conn Cache
    participant Leader
    
    Client->>AdminHTTP: POST dynamo create-table request
    AdminHTTP->>AdminHTTP: Check ServerDeps.Forwarder != nil
    alt Forwarder configured
        AdminHTTP->>Forwarder: Forward request (TableName, etc.)
        Forwarder->>GRPCCache: Get or dial leader connection
        GRPCCache->>Leader: Establish/reuse gRPC connection
        Leader-->>GRPCCache: Return *grpc.ClientConn
        GRPCCache-->>Forwarder: Return client connection
        Forwarder->>Leader: Call AdminForward.CreateTable
        Leader-->>Forwarder: Response
        Forwarder-->>AdminHTTP: Return forwarded response
    else No forwarder (fallback)
        AdminHTTP->>AdminHTTP: Return 503 or retry locally
    end
    AdminHTTP-->>Client: HTTP response (201/503)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Poem

🐰 A forwarder hops through the gRPC lane,
With roles and caches reducing the strain,
When followers call out to leaders on high,
Requests take flight—no more "503"!
Our admin dreams soar, no fear to abstain!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: implementing Phase 3 of AdminForward by registering the gRPC server and wiring the follower-side bridge, which aligns with the primary objectives of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 6s —— View job


PR Review: AdminForward Phase 3


Overall the wiring is clean and the design rationale is well documented. Three issues worth addressing before merge, one of which is a genuine resource leak, and two minor quality items.


Issue 1 — GRPCConnCache is never closed (resource leak)

main.go:668

connCache := &kv.GRPCConnCache{}

kv.GRPCConnCache.Close() closes all cached *grpc.ClientConn handles (kv/grpc_conn_cache.go:92). The connCache created here is passed into startAdminFromFlags → buildAdminLeaderForwarder → buildLeaderForwarder and then held for the lifetime of the process — but it is never closed when the server shuts down. Every address the follower dials accumulates an open HTTP/2 connection that is never released.

The fix is to call connCache.Close() when the process context is done — e.g. register a shutdown hook alongside the errgroup:

connCache := &kv.GRPCConnCache{}
in.eg.Go(func() error {
    <-in.ctx.Done()
    return connCache.Close()
})



Issue 2 — //nolint:nilnil violates CLAUDE.md and suppresses a linter that isn't even enabled

main_admin.go:137

return nil, nil //nolint:nilnil // explicit "no forwarder" signal — ...

CLAUDE.md: "Avoid //nolint — refactor instead." Beyond the convention violation, nilnil does not appear in .golangci.yaml's linters.enable list, so this directive suppresses nothing — it is pure dead weight. The refactor is straightforward: change the return type of buildAdminLeaderForwarder to not return an error on the "disabled" path, or use an explicit sentinel. The cleanest option given the existing callers is to return nil, nil without the directive (the linter won't flag it) and remove the comment. The doc comment on the function already explains the nil-means-disabled contract.



Issue 3 — buildAdminLeaderForwarder nil-gate path is untested

TestBuildLeaderForwarder_RejectsMissingDeps tests buildLeaderForwarder directly. buildAdminLeaderForwarder (the wrapper in main_admin.go) has an additional early-return gate:

if coordinate == nil || connCache == nil {
    return nil, nil
}

This branch — the "forwarding disabled" path that the comment describes as the correct no-op for single-node deployments — has no coverage. A future refactor that accidentally drops this guard would silently pass a nil coordinate into buildLeaderForwarder, which would then crash. A small table test covering (nil, cache, "n1"), (coord, nil, "n1"), and (nil, nil, "n1") would lock this down.
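
A sketch of such a table test, written against a stand-in buildAdminLeaderForwarder (the real signature is in main_admin.go and may differ):

```go
package main

import "fmt"

// Stand-ins for the real types; buildAdminLeaderForwarder's actual
// signature lives in main_admin.go and may differ.
type coordinator struct{}
type connCache struct{}
type forwarder struct{}

// Stand-in reproducing the nil-gate under review: missing coordinate
// or connCache means "forwarding disabled", not an error.
func buildAdminLeaderForwarder(c *coordinator, cc *connCache, nodeID string) (*forwarder, error) {
	if c == nil || cc == nil {
		return nil, nil // single-node / leader-only build: fall back to 503
	}
	return &forwarder{}, nil
}

func main() {
	cases := []struct {
		name  string
		coord *coordinator
		cache *connCache
		want  bool // non-nil forwarder expected?
	}{
		{"nil coordinator", nil, &connCache{}, false},
		{"nil cache", &coordinator{}, nil, false},
		{"both nil", nil, nil, false},
		{"both set", &coordinator{}, &connCache{}, true},
	}
	for _, tc := range cases {
		f, err := buildAdminLeaderForwarder(tc.coord, tc.cache, "n1")
		if (f != nil) != tc.want || err != nil {
			panic(tc.name)
		}
		fmt.Printf("%s: forwarder=%v\n", tc.name, f != nil)
	}
}
```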


Minor observations (no blocking)

roleStoreFromFlags overlap semantics (main_admin_forward.go:102): The comment correctly notes that overlap validation is deferred to Config.Validate on the HTTP path. However, if startAdminFromFlags fails due to overlap, the gRPC ForwardServer has already been registered on the Raft gRPC port with a RoleReadOnly-biased role store. In practice the process exits on startup failure, so this is not exploitable — but it is worth a note in the comment (or a TODO) so future readers understand the ordering dependency.

Two newDynamoTablesSource wrappers over the same server (main.go:1275 and main_admin.go:119): Both calls wrap the same *adapter.DynamoDBServer. The two resulting TablesSource values are handed to different layers (gRPC ForwardServer vs HTTP admin handler), which is intentional. No action needed, but the redundancy is slightly surprising on first read.

Startup order comment (main.go:1265): The comment says the DynamoDB → Raft reorder is safe because the same 503 would have been returned before. This is accurate, but it's worth being explicit that this also changes observability: a health-check system that connects to the DynamoDB port will now see it become reachable before Raft is converged, rather than both becoming reachable together.


Test coverage

TestServer_ServerDepsForwarderIsWired is the right regression anchor for the wiring change — it exercises the full stack from NewServer through the forwarder hand-off. The four new main-package tests are well-scoped. The gap noted above (Issue 3) is the only missing case.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
main_admin.go (1)

128-146: Two-stage validation has a small inconsistency — minor, can defer.

buildAdminLeaderForwarder chooses between three outcomes for missing inputs:

  • coordinate == nil or connCache == nil → (nil, nil) (silent fallback to 503 + Retry-After:1)
  • nodeID == "" → explicit error
  • otherwise → delegate to buildLeaderForwarder, which itself validates coordinate/connCache again and would also catch nodeID == "" via admin.NewGRPCForwardClient

The pre-check on nodeID is only useful when coordinate and connCache are non-nil (otherwise the early-return wins and the empty nodeID is silently tolerated). Production callers always pass non-nil collaborators so this is unreachable in practice, but it makes the error contract slightly fuzzy: "is empty nodeID a hard failure or a fallback signal?" depends on the order of checks.

If you want to make the contract crisper, validate nodeID first (matching upstream resolveRuntimeInputs's "raftId is required" guarantee) so the fallback path is only ever taken because the wiring is intentionally incomplete. Not a blocker.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main_admin.go` around lines 128 - 146, The function buildAdminLeaderForwarder
currently early-returns (nil, nil) when coordinate or connCache are nil but only
checks nodeID afterwards, creating an inconsistent contract; change the
validation order so nodeID is validated first (mirroring resolveRuntimeInputs'
"raftId is required" guarantee) and return an explicit error if nodeID == ""
before checking coordinate/connCache, then keep the existing
fallback-to-(nil,nil) behavior for missing coordinate or connCache and finally
delegate to buildLeaderForwarder when all inputs are present.
main_admin_forward.go (1)

95-120: Acknowledged divergence from admin.Config.RoleIndex is fine, but worth documenting in one place.

The comment explains why roleStoreFromFlags runs without overlap-rejection (admin.Config.Validate owns that), and the test fixture covers the four quadrants. The divergence becomes interesting only if admin.Config.RoleIndex adds new behavior (e.g., normalization, deduplication) — keep this helper in lockstep, or wire admin.Config.RoleIndex() into both paths once the config has been parsed (see the related comment in main.go).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main_admin_forward.go` around lines 95 - 120, The comment notes a maintained
divergence between roleStoreFromFlags and admin.Config.RoleIndex that could
drift if RoleIndex adds behavior; update the code and docs so they stay in sync:
either (A) add a clear single-source comment near roleStoreFromFlags referencing
admin.Config.RoleIndex and the related main.go note (so future editors know to
keep them in lockstep), or (B) refactor so the code path that has a parsed
config uses admin.Config.RoleIndex directly (e.g., call admin.Config.RoleIndex
from startAdminFromFlags / the place that has parsed config instead of
duplicating logic in roleStoreFromFlags), and keep roleStoreFromFlags only as
the minimal flag-only fallback with a comment stating it intentionally differs.
Ensure references to roleStoreFromFlags, admin.Config.RoleIndex and
startAdminFromFlags are present in the comment or refactor to prevent future
divergence.
main_admin_forward_test.go (1)

111-125: Optional: make TestAdminForwardServerDeps_ReadyForRegistration table-driven.

The other tests in this file are already table-driven; this one expresses four cases as separate require lines. Folding them into a cases := []struct{...} table keeps the style consistent with the rest of the file.

♻️ Suggested table-driven form
-func TestAdminForwardServerDeps_ReadyForRegistration(t *testing.T) {
-	require.False(t, adminForwardServerDeps{}.readyForRegistration())
-	require.False(t, adminForwardServerDeps{tables: dummyTablesSource{}}.readyForRegistration())
-	require.False(t, adminForwardServerDeps{roles: admin.MapRoleStore{}}.readyForRegistration())
-	require.True(t, adminForwardServerDeps{
-		tables: dummyTablesSource{},
-		roles:  admin.MapRoleStore{},
-	}.readyForRegistration())
-}
+func TestAdminForwardServerDeps_ReadyForRegistration(t *testing.T) {
+	cases := []struct {
+		name string
+		deps adminForwardServerDeps
+		want bool
+	}{
+		{"both nil", adminForwardServerDeps{}, false},
+		{"only tables", adminForwardServerDeps{tables: dummyTablesSource{}}, false},
+		{"only roles", adminForwardServerDeps{roles: admin.MapRoleStore{}}, false},
+		{"both set", adminForwardServerDeps{tables: dummyTablesSource{}, roles: admin.MapRoleStore{}}, true},
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			require.Equal(t, tc.want, tc.deps.readyForRegistration())
+		})
+	}
+}

As per coding guidelines: "Co-locate test files in the same package and use table-driven test patterns."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main_admin_forward_test.go` around lines 111 - 125, Refactor
TestAdminForwardServerDeps_ReadyForRegistration into a table-driven test: create
a slice of cases each containing a name, an adminForwardServerDeps value (using
dummyTablesSource{} and admin.MapRoleStore{} as needed) and the expected bool,
then loop over cases and run t.Run(case.name, func(t *testing.T) {
require.Equal(t, case.expected, case.deps.readyForRegistration()) }). Keep
references to adminForwardServerDeps, readyForRegistration, dummyTablesSource,
and admin.MapRoleStore so the test logic remains identical but follows the
file's table-driven style.
main.go (2)

952-998: Optional: tag the forward logger with a more specific component name.

forwardLogger := slog.Default().With(slog.String("component", "admin")) shares the same component label as the admin HTTP listener (buildAdminHTTPServer does the same). When operators tail logs to debug a follower→leader bridge issue, they can't distinguish the HTTP and gRPC sides by component alone.

-	forwardLogger := slog.Default().With(slog.String("component", "admin"))
+	forwardLogger := slog.Default().With(slog.String("component", "admin-forward"))

As per coding guidelines: "Use structured slog logging with stable keys".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.go` around lines 952 - 998, The forwardLogger in startRaftServers
currently reuses the generic component label ("admin"), which makes HTTP vs gRPC
forward logs indistinguishable; change the forwardLogger initialization
(forwardLogger) to use a more specific structured key/value (for example
component "admin_forward" or "admin_forward_grpc" and optionally include
rt.spec.id/group id) so logs from registerAdminForwardServer are clearly
separable from buildAdminHTTPServer logs; ensure the new logger is passed
unchanged to registerAdminForwardServer and update any related tests or log
expectations that rely on the old component key.

662-662: Role index is parsed twice from the same flags — risk of drift.

roleStoreFromFlags(parseCSV(*adminFullAccessKeys), parseCSV(*adminReadOnlyAccessKeys)) builds the same access-key→role map that admin.Config.RoleIndex() will produce later inside buildAdminHTTPServer. The comment in roleStoreFromFlags acknowledges that overlap-rejection is only enforced on the HTTP-path Validate() call. Practically:

  • An overlapping key (in both --adminFullAccessKeys and --adminReadOnlyAccessKeys) ends up as RoleReadOnly in the gRPC ForwardServer's role store, while admin.Config.Validate() rejects the same configuration and aborts startup.
  • Between leader-side gRPC registration (in runner.start()) and the later validation failure in startAdminFromFlags, the ForwardServer is briefly serving with the "wrong" mapping. The window is short and the process aborts, but a forwarded write that races into that window could be authorized inconsistently with what the operator declared.

Consider building the role store once (e.g., parse admin.Config first, then derive roleStoreFromFlags from Config.RoleIndex()) so both paths share the validated map.
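
A minimal sketch of that validate-then-derive ordering, with hypothetical stand-ins for admin.Config (the real Validate/RoleIndex API may differ):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for admin.Config; the real package's
// RoleIndex/Validate signatures may differ.
type role int

const (
	roleReadOnly role = iota
	roleFull
)

type adminConfig struct {
	FullAccessKeys     []string
	ReadOnlyAccessKeys []string
}

// Validate rejects keys present in both lists, so any role store
// derived afterwards is known to be overlap-free.
func (c adminConfig) Validate() error {
	ro := make(map[string]bool, len(c.ReadOnlyAccessKeys))
	for _, k := range c.ReadOnlyAccessKeys {
		ro[k] = true
	}
	for _, k := range c.FullAccessKeys {
		if ro[k] {
			return errors.New("access key in both full and read-only lists: " + k)
		}
	}
	return nil
}

// RoleIndex builds the access-key → role map from validated input,
// giving both the HTTP and gRPC paths one shared source of truth.
func (c adminConfig) RoleIndex() map[string]role {
	idx := make(map[string]role)
	for _, k := range c.FullAccessKeys {
		idx[k] = roleFull
	}
	for _, k := range c.ReadOnlyAccessKeys {
		idx[k] = roleReadOnly
	}
	return idx
}

func main() {
	cfg := adminConfig{FullAccessKeys: []string{"alice"}, ReadOnlyAccessKeys: []string{"bob"}}
	if err := cfg.Validate(); err != nil {
		panic(err) // abort startup before any gRPC service is registered
	}
	idx := cfg.RoleIndex() // single validated map shared by both paths
	fmt.Println(len(idx))  // 2
}
```

Validating before registration closes the race window described above: an overlapping key aborts startup before the ForwardServer could ever serve with the biased mapping.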

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.go` at line 662, Currently parseCSV is used twice producing a role map
in roleStoreFromFlags that can differ from admin.Config.RoleIndex(), causing a
brief inconsistent authorization window; instead parse and validate the admin
config first (the same flow used by startAdminFromFlags/admin.Config.Validate),
then derive the role store from admin.Config.RoleIndex() and pass that to the
gRPC ForwardServer/runner.start; update the call site that currently does
roleStoreFromFlags(parseCSV(...), parseCSV(...)) to obtain the role index from
the validated admin.Config and use that single source (remove the duplicate
parseCSV usage and ensure overlap rejection from admin.Config.Validate() applies
before runner.start).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@main.go`:
- Around line 657-668: The AdminForward gRPC path is being prepared regardless
of the --adminEnabled flag because roleStore (via roleStoreFromFlags/parseCSV)
and connCache are created unconditionally and
adminForwardServerDeps.readyForRegistration() doesn't check adminEnabled; fix by
gating the creation/registration of the forward/server deps on the adminEnabled
flag: either stop building/attaching roleStore/connCache to forwardDeps before
startRaftServers (or wrap the call site that constructs forwardDeps) when
*adminEnabled is false, or add an explicit check for *adminEnabled inside
registerAdminForwardServer (or inside
adminForwardServerDeps.readyForRegistration()) so registerAdminForwardServer is
a no-op unless admin is enabled; reference roleStoreFromFlags, parseCSV,
connCache, startRaftServers and registerAdminForwardServer to locate the
changes.
- Line 668: The connCache (kv.GRPCConnCache) created inside startServers() is
never closed on shutdown; call connCache.Close() during process cleanup by
either returning the connCache from startServers() so run() can register it with
the cleanup stack, or register a shutdown goroutine inside startServers() that
runs on context cancellation (e.g., using the errgroup's eg.Go() or a ctx.Done()
watcher) and invokes connCache.Close(); update startServers() and run() to
propagate and register the cleanup accordingly so all grpc.ClientConn entries
are closed on shutdown.

---

Nitpick comments:
In `@main_admin_forward_test.go`:
- Around lines 111-125: Refactor TestAdminForwardServerDeps_ReadyForRegistration
into a table-driven test: build a slice of cases, each holding a name, an
adminForwardServerDeps value (using dummyTablesSource{} and admin.MapRoleStore{}
as needed), and the expected bool; then loop over the cases with
t.Run(case.name, func(t *testing.T) { require.Equal(t, case.expected,
case.deps.readyForRegistration()) }). Keep the references to
adminForwardServerDeps, readyForRegistration, dummyTablesSource, and
admin.MapRoleStore so the test logic stays identical while following the
file's table-driven style.
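A minimal, self-contained version of that table-driven shape, with stand-in types in place of the real dummyTablesSource / admin.MapRoleStore and a plain loop in place of t.Run / require:

```go
package main

import "fmt"

// Stand-ins mirroring the names in the review comment.
type dummyTablesSource struct{}
type mapRoleStore map[string]string

type adminForwardServerDeps struct {
	tables *dummyTablesSource
	roles  mapRoleStore
}

func (d adminForwardServerDeps) readyForRegistration() bool {
	return d.tables != nil && d.roles != nil
}

func main() {
	// The table-driven layout the nitpick asks for; with the real testing
	// package each case body would run under t.Run(tc.name, ...).
	cases := []struct {
		name string
		deps adminForwardServerDeps
		want bool
	}{
		{"both nil", adminForwardServerDeps{}, false},
		{"roles only", adminForwardServerDeps{roles: mapRoleStore{}}, false},
		{"tables only", adminForwardServerDeps{tables: &dummyTablesSource{}}, false},
		{"both set", adminForwardServerDeps{tables: &dummyTablesSource{}, roles: mapRoleStore{}}, true},
	}
	for _, tc := range cases {
		got := tc.deps.readyForRegistration()
		fmt.Printf("%s: got=%v want=%v\n", tc.name, got, tc.want)
	}
}
```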

In `@main_admin_forward.go`:
- Around lines 95-120: The comment notes a maintained divergence between
roleStoreFromFlags and admin.Config.RoleIndex that could drift if RoleIndex
gains behavior. Keep the code and docs in sync: either (A) add a clear
single-source comment near roleStoreFromFlags referencing
admin.Config.RoleIndex and the related main.go note, so future editors know
to keep them in lockstep, or (B) refactor so the code path that already has a
parsed config calls admin.Config.RoleIndex directly (e.g., from
startAdminFromFlags) instead of duplicating the logic in roleStoreFromFlags,
keeping roleStoreFromFlags only as the minimal flag-only fallback with a
comment stating it intentionally differs. Ensure roleStoreFromFlags,
admin.Config.RoleIndex, and startAdminFromFlags are referenced in the comment
or refactor to prevent future divergence.

In `@main_admin.go`:
- Around lines 128-146: buildAdminLeaderForwarder currently early-returns
(nil, nil) when coordinate or connCache is nil but only checks nodeID
afterwards, creating an inconsistent contract. Validate nodeID first
(mirroring resolveRuntimeInputs' "raftId is required" guarantee) and return
an explicit error when nodeID == "" before checking coordinate/connCache;
keep the existing fallback-to-(nil, nil) behavior for a missing coordinate or
connCache, and finally delegate to buildLeaderForwarder when all inputs are
present.

In `@main.go`:
- Around lines 952-998: The forwardLogger in startRaftServers reuses the
generic component label ("admin"), which makes HTTP and gRPC forward logs
indistinguishable. Initialize forwardLogger with a more specific structured
key/value (for example, component "admin_forward" or "admin_forward_grpc",
optionally including rt.spec.id / group id) so logs from
registerAdminForwardServer are clearly separable from buildAdminHTTPServer
logs. Pass the new logger unchanged to registerAdminForwardServer and update
any tests or log expectations that rely on the old component key.
- Line 662: parseCSV is currently used twice, producing a role map in
roleStoreFromFlags that can differ from admin.Config.RoleIndex() and opening
a brief inconsistent-authorization window. Instead, parse and validate the
admin config first (the same flow used by startAdminFromFlags /
admin.Config.Validate), derive the role store from admin.Config.RoleIndex(),
and pass that to the gRPC ForwardServer / runner.start. Update the call site
that currently does roleStoreFromFlags(parseCSV(...), parseCSV(...)) to
obtain the role index from the validated admin.Config, removing the duplicate
parseCSV usage and ensuring the overlap rejection from admin.Config.Validate()
applies before runner.start.
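The single-source flow that comment describes can be sketched as follows. The adminConfig type, its field names, and the Validate/RoleIndex bodies here are illustrative stand-ins for the real admin.Config API; only "validate first, then derive one role index" is taken from the comment:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Minimal stand-in for admin.Config; fields and method bodies are assumptions.
type adminConfig struct {
	FullAccessKeys []string
	ReadOnlyKeys   []string
}

// Validate rejects a key listed in both roles, the overlap rejection that
// must run before any server starts.
func (c adminConfig) Validate() error {
	for _, k := range c.FullAccessKeys {
		for _, r := range c.ReadOnlyKeys {
			if k == r {
				return errors.New("access key listed in both roles: " + k)
			}
		}
	}
	return nil
}

// RoleIndex builds the single authoritative key-to-role map that both the
// HTTP path and the gRPC ForwardServer would consume.
func (c adminConfig) RoleIndex() map[string]string {
	idx := make(map[string]string)
	for _, k := range c.FullAccessKeys {
		idx[k] = "full"
	}
	for _, k := range c.ReadOnlyKeys {
		idx[k] = "readonly"
	}
	return idx
}

func main() {
	cfg := adminConfig{
		FullAccessKeys: strings.Split("alice,bob", ","),
		ReadOnlyKeys:   strings.Split("carol", ","),
	}
	if err := cfg.Validate(); err != nil {
		panic(err)
	}
	fmt.Println(cfg.RoleIndex()["alice"], cfg.RoleIndex()["carol"]) // full readonly
}
```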
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7cf20a65-0c07-49c8-8f5e-2a235ce2504c

📥 Commits

Reviewing files that changed from the base of the PR and between 29011bb and 19f6b70.

📒 Files selected for processing (7)
  • internal/admin/server.go
  • internal/admin/server_test.go
  • main.go
  • main_admin.go
  • main_admin_forward.go
  • main_admin_forward_test.go
  • main_admin_test.go


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 19f6b70a25

```go
// runner.start() below, before startAdminFromFlags has a chance
// to parse the admin config) can re-use the same access-key map
// the HTTP path will eventually populate.
roleStore := roleStoreFromFlags(parseCSV(*adminFullAccessKeys), parseCSV(*adminReadOnlyAccessKeys))
```

P1: Gate AdminForward registration on adminEnabled

This wires roleStore from access-key flags unconditionally, so even when -adminEnabled=false a non-empty -adminFullAccessKeys/-adminReadOnlyAccessKeys still makes forwardDeps.readyForRegistration() true and startRaftServers registers AdminForward. In that mode the admin surface is expected to be off, but this commit re-exposes table write paths over gRPC (the forward server only re-checks access-key role mapping), so templated deployments that keep key flags set while toggling adminEnabled can still process admin mutations.


Four findings from Claude / Codex / CodeRabbit on PR #648:

1) **P1 / Major**: AdminForward gRPC registration was gated only on
   adminForwardServerDeps.readyForRegistration() (tables + roles both
   non-nil), neither of which referenced *adminEnabled. A node started
   with -adminEnabled=false but with admin access keys still configured
   would have registered the leader-side gRPC AdminForward service and
   accepted forwarded admin writes — re-exposing the table-mutation
   surface a follower-direct admin call could reach. Fix: build
   roleStore + connCache only when *adminEnabled. With admin disabled,
   forwardDeps.roles stays nil and registerAdminForwardServer's
   readyForRegistration gate skips the gRPC bind entirely.

2) **Resource leak (Claude Issue 1, CodeRabbit Minor)**: connCache was
   created in startServers() but never closed. Every address dialled
   accumulates an open HTTP/2 connection that is never released.
   Fix: shutdown goroutine that drains the cache via Close() on
   in.ctx.Done(), wired through the existing errgroup.

3) **Lint hygiene (Claude Issue 2)**: //nolint:nilnil suppresses a
   linter that is not enabled in .golangci.yaml AND violates
   CLAUDE.md's "Avoid //nolint — refactor instead" rule. The directive
   was pure dead weight. Fix: drop the directive; the doc comment on
   the function and the inline comment already explain the
   nil-means-disabled contract.

4) **Coverage gap (Claude Issue 3)**: buildAdminLeaderForwarder's
   nil-coordinate / nil-cache short-circuit was untested. A future
   refactor that drops the guard would silently pass a nil
   collaborator into buildLeaderForwarder, which would either crash
   on the nil deref or build a forwarder that panics on the first
   request. Fix: TestBuildAdminLeaderForwarder_NilGateReturnsNoForwarder
   covers all three nil combinations + the empty-nodeID error path,
   plus a happy-path companion.
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@bootjp
Owner Author

bootjp commented Apr 26, 2026

@claude review

964ecd4 addresses all four findings from the previous review pass:

  • Codex P1 / CodeRabbit Major: roleStore + connCache are now built only when *adminEnabled is true. With admin disabled, forwardDeps.roles stays nil and registerAdminForwardServer's gate skips the gRPC bind entirely — the AdminForward service is no longer reachable when the HTTP admin surface is off.
  • Claude Issue 1 / CodeRabbit Minor: shutdown goroutine drains connCache via Close() on context cancellation; gRPC connections no longer leak.
  • Claude Issue 2: dropped //nolint:nilnil (the linter isn't enabled and CLAUDE.md disallows nolint directives); replaced the inline justification with a clear comment.
  • Claude Issue 3: TestBuildAdminLeaderForwarder_NilGateReturnsNoForwarder covers all three nil combinations + the empty-nodeID error path.

@claude

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 1m 54s.


PR Review: AdminForward Phase 3 — follow-up on 964ecd4

  • Gathering context
  • Reading changed files (main.go, main_admin.go, main_admin_forward.go, main_admin_forward_test.go)
  • Verifying all four previous findings are addressed
  • Synthesizing findings

All four issues from the previous review pass are correctly addressed. Details below.


Finding 1 — adminEnabled gate (Codex P1 / CodeRabbit Major) ✅

roleStore and connCache are now declared as zero-value nils and only allocated inside if *adminEnabled (main.go:671). With admin disabled:

  • forwardDeps.roles stays nil
  • readyForRegistration() returns false
  • registerAdminForwardServer is a no-op

The gRPC AdminForward service is no longer reachable on nodes where --adminEnabled=false, regardless of what access keys are configured. Fix is correct and minimal.


Finding 2 — connCache leak (Claude Issue 1 / CodeRabbit Minor) ✅

main.go:682-689 registers a shutdown goroutine on the existing errgroup:

```go
cache := connCache
in.eg.Go(func() error {
    <-in.ctx.Done()
    if err := cache.Close(); err != nil {
        return errors.Wrap(err, "close admin gRPC connection cache")
    }
    return nil
})
```

The cache := connCache capture is defensive good practice (guards against hypothetical future re-assignment of connCache). The errgroup propagation on Close() error is the right behavior — a failed clean shutdown is a signal worth surfacing. Pattern matches every other shutdown hook in this file.


Finding 3 — //nolint:nilnil directive (Claude Issue 2) ✅

main_admin.go:136-143: directive removed, replaced with an explanatory inline comment that correctly notes the linter isn't enabled in .golangci.yaml. The return nil, nil path is intentional and the comment plus the function-level doc are sufficient documentation.


Finding 4 — nil-gate coverage gap (Claude Issue 3) ✅

TestBuildAdminLeaderForwarder_NilGateReturnsNoForwarder (main_admin_forward_test.go:127) is a properly table-driven test covering:

  • nil coordinator → (nil, nil)
  • nil conn cache → (nil, nil)
  • both nil → (nil, nil)
  • complete deps + empty nodeID → error containing "--raftId is required"

TestBuildAdminLeaderForwarder_HappyPathReturnsForwarder covers the success path. Both tests follow the style already established by TestBuildLeaderForwarder_RejectsMissingDeps.


Remaining open items (all non-blocking, carried over from CodeRabbit nitpicks)

These were marked as style/nitpick in the previous review and were not part of the four resolved issues. They can be addressed separately or ignored:

  1. forwardLogger component label (main.go:984): still "admin", not "admin-forward". Cosmetic — HTTP and gRPC-forward logs are indistinguishable by component in structured log queries.

  2. Dual-parse of access keys (main.go:672 vs admin.Config.RoleIndex() later in startAdminFromFlags): the overlap-validation race window is now narrowed to adminEnabled=true deployments only, so the blast radius is smaller. Still present, but the code comments explicitly acknowledge the divergence and its limits (main_admin_forward.go:111–116).

  3. TestAdminForwardServerDeps_ReadyForRegistration not table-driven: four require.X lines vs the table style used elsewhere in the file. Functionally identical; style-only.

  4. Validation order in buildAdminLeaderForwarder: coordinate == nil || connCache == nil checked before nodeID == "". The mixed contract (nil inputs → silent fallback, empty nodeID → hard error) is documented but slightly surprising. Non-blocking.


Verdict: the four blocking issues are resolved correctly. The remaining items are cosmetic; merging as-is is reasonable.

@bootjp bootjp merged commit 41c7f0f into main Apr 26, 2026
8 checks passed
@bootjp bootjp deleted the feat/admin-forward-bridge branch April 26, 2026 09:12
bootjp added a commit that referenced this pull request Apr 26, 2026
P2 slice 1 of
[docs/design/2026_04_24_proposed_admin_dashboard.md](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_04_24_proposed_admin_dashboard.md).
Read-only S3 admin endpoints land first so the SPA's S3List and S3Detail
pages stop hitting 404. Write paths (POST/PUT/DELETE/ACL) ship in slice
2 together with AdminForward integration; until then they reply 405.

## Summary

- **`*adapter.S3Server.AdminListBuckets` / `AdminDescribeBucket`** —
SigV4-bypass read methods. Share `loadBucketMetaAt` + the
metadata-prefix scan with the SigV4 path, so a SigV4 `listBuckets` and
the admin dashboard cannot drift.
- **`internal/admin/buckets_source.go`** — `BucketsSource` interface +
`BucketSummary` DTO + sentinel errors (`ErrBucketsForbidden` /
`ErrBucketsNotLeader` / `ErrBucketsNotFound` /
`ErrBucketsAlreadyExists`). `ErrBucketsForbidden` is wired in this
slice; the others are reserved for slice 2 to keep the error vocabulary
additive.
- **`internal/admin/s3_handler.go`** — `S3Handler` with `handleList`
(paginated) + `handleDescribe` (404 for missing). Sub-paths under
`/buckets/{name}/` (the future `/acl`) return 404 here so a SPA bug
pointed at this build cannot accidentally hit the describe path with a
`"{name}/acl"` string.
- **Shared `list_pagination.go`** — centralises base64url cursor + limit
parsing/clamping so the Dynamo and S3 handlers cannot diverge on
validation rules. Drops the now-redundant `parseDynamoListLimit` /
`decodeDynamoNextToken` / `encodeDynamoNextToken` from
`dynamo_handler.go` in favour of `parseListPaginationParams` /
`decodeListNextToken` / `encodeListNextToken`.
- **`apiRouteTable` in `server.go`** — bundles the precomposed
middleware chains and dispatches by URL prefix. `buildAPIMux`'s body
went from 13 cyclomatic branches to 6, leaving headroom for the next
resource family (SQS queues, etc.) to land without another refactor.
- **Production wiring** — `main_s3.go`'s `startS3Server` now returns
`*adapter.S3Server`; `runtimeServerRunner` stores it;
`startAdminFromFlags` accepts it and threads `bucketsBridge` →
`admin.BucketsSource` → `ServerDeps.Buckets`.
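The shared pagination contract above can be sketched as follows. Only the base64url cursor and the silent 1000-item clamp come from this PR's description; the helper names here (parseListLimit, the default limit of 100) are illustrative stand-ins, not the real `parseListPaginationParams` signatures:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
)

const (
	defaultListLimit = 100  // illustrative default, not from the PR
	maxListLimit     = 1000 // the PR states a silent clamp at 1000 per page
)

// parseListLimit clamps an optional ?limit= query value into [1, maxListLimit].
func parseListLimit(raw string) int {
	if raw == "" {
		return defaultListLimit
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultListLimit
	}
	if n > maxListLimit {
		return maxListLimit // silent clamp on oversize
	}
	return n
}

// encodeListNextToken / decodeListNextToken round-trip the continuation key
// as base64url, so the wire cursor stays opaque and URL-safe.
func encodeListNextToken(key string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(key))
}

func decodeListNextToken(tok string) (string, error) {
	b, err := base64.RawURLEncoding.DecodeString(tok)
	return string(b), err
}

func main() {
	fmt.Println(parseListLimit(""), parseListLimit("5000")) // 100 1000
	tok := encodeListNextToken("bucket-042")
	key, _ := decodeListNextToken(tok)
	fmt.Println(tok, key)
}
```

A round-trip test over the cursor encoding, like the one this PR mentions, is what pins the wire format against accidental changes.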

## What is NOT in this PR

- `AdminCreateBucket` / `AdminPutBucketAcl` / `AdminDeleteBucket` on the
adapter and their HTTP/AdminForward counterparts. Slice 2.
- `RoleStore` plumb-through for S3. Slice 2 (read-only is fine for any
authenticated session today; the write paths will need it).
- `forwarded_from` audit trail for S3 admin writes. Slice 2 — needs the
`pb.AdminOperation` enum extended.
- The merge-freeze-acceptable changes to `proto/admin_forward.proto`.
Slice 2.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (admin + main + adapter packages: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race` — 13 new handler tests
pass
- [x] `go test . -count=1 -race` — main package passes
- [x] `go test -run "TestS3Server_Admin" ./adapter/ -count=1 -race` — 4
new adapter tests pass
- [ ] `go test ./adapter/` times out at 120s due to a pre-existing flake
(verified earlier on PR #648 against `main` — unrelated to this branch)
- [ ] End-to-end smoke against a 3-node cluster — needs the next slice +
the SPA PR (#649) to merge before there's anything for the SPA to render

## Acceptance criteria coverage (Section 4.1)

| Endpoint | This PR |
|---|---|
| `GET /admin/api/v1/s3/buckets` | ✓ |
| `POST /admin/api/v1/s3/buckets` | ⏳ slice 2 |
| `GET /admin/api/v1/s3/buckets/{name}` | ✓ |
| `PUT /admin/api/v1/s3/buckets/{name}/acl` | ⏳ slice 2 |
| `DELETE /admin/api/v1/s3/buckets/{name}` | ⏳ slice 2 |

## Self-review (5 lenses)

1. **Data loss**: No FSM / Raft / Pebble path changes.
`AdminListBuckets` / `AdminDescribeBucket` are read-only and use the
same `ScanAt` / `loadBucketMetaAt` as the SigV4 path.
2. **Concurrency**: No new shared state. The handler holds only the
immutable `BucketsSource` interface; the bridge holds only the immutable
`*S3Server`. `pinReadTS` is the same pattern the SigV4 path uses for
snapshot stability.
3. **Performance**: One additional metadata scan per `GET
/admin/api/v1/s3/buckets`, identical to the SigV4 listBuckets.
Pagination caps at 1000 buckets per page (silent clamp on oversize). No
hot-path changes.
4. **Data consistency**: Read-only; no commit-ts ordering risk. Admin
reads use the same lease-read window as the SigV4 path.
5. **Test coverage**: 13 handler tests + 4 adapter tests + the existing
dynamo/cluster test surfaces still pass after the shared-helper
refactor. The pagination cursor round-trip test pins the wire format so
a future cursor-encoding change cannot ship without breaking the test.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added S3 bucket admin API endpoints to list all buckets and retrieve
detailed bucket information, including creation timestamps.
* Implemented pagination support with cursor-based next_token for
browsing large bucket collections.
* Added proper authorization checks and error handling for admin bucket
operations (403 for forbidden, 404 for not found, 500 for errors).

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 26, 2026
Per `docs/design/README.md`'s lifecycle convention, the admin dashboard
design doc is now "partial":

- **P1** (DynamoDB CRUD + AdminForward) — shipped via #634, #635, #644,
#648
- **P2** (S3 buckets list/create/delete/ACL + DescribeTable) — shipped
via #658, with #669 + #673 in flight
- **P3** (React SPA + embed) — shipped via #649, #650
- **P4** (TLS / role / CSRF / operator docs) — TLS, role, CSRF are
already live in P1; operator docs in #674

Independent of the in-flight slice 2 PRs (#669/#673) and the docs PR
(#674) — this rename only reflects what is already on main today, plus
an "Implementation status" table mapping each phase to the PR it landed
in.

## What this PR changes

- `git mv` the design doc from `2026_04_24_proposed_admin_dashboard.md`
to `2026_04_24_partial_admin_dashboard.md` so its history follows
- Add an "Implementation status" header table indexing each phase to the
PRs that landed it
- List the outstanding open items so future readers know what is still
owed against the original proposal:
- AdminForward acceptance criterion 5 (rolling-upgrade compat flag) —
deferred
  - S3 object browser — explicitly out of scope per Section 2 Non-goals
  - TLS cert hot-reload — restart-to-rotate is the documented model

When the rolling-upgrade flag lands, the doc gets renamed once more to
`2026_04_24_implemented_admin_dashboard.md` per the README's lifecycle
convention.
bootjp added a commit that referenced this pull request Apr 26, 2026
P2 slice 2b: a follower-side S3 admin write (POST /buckets, PUT
/buckets/{name}/acl, DELETE /buckets/{name}) now hands off to the leader
transparently, completing the same end-to-end forwarding contract Dynamo
writes received in #644 + #648.

**Stacked on #669 (P2 slice 2a).** Once #669 merges, this rebases
cleanly onto main.

## Summary

- **Proto** — three new `ADMIN_OP` enum values (`CREATE_BUCKET` /
`DELETE_BUCKET` / `PUT_BUCKET_ACL`) appended after the Dynamo block so
existing wire integers stay stable. Regenerated with the pinned protoc
29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.
- **Leader-side** — `ForwardServer.WithBucketsSource` lets deployments
wire the S3 dispatcher optionally; Dynamo-only builds keep it nil and
the new operations return 501. Three new dispatch arms mirror the
leader-direct HTTP path's payload contract (NUL-byte rejection, 64 KiB
limit, `DisallowUnknownFields`, trailing-token rejection, slash-in-name
rejection). `forwardBucketsErrorResponse` mirrors
`forwardErrorResponse`: `ErrBucketsForbidden` / `NotLeader` / `NotFound`
/ `AlreadyExists` / `NotEmpty` + `*ValidationError` each map to the same
HTTP status `writeBucketsError` produces, so forwarded and leader-direct
responses are byte-identical from the SPA's view.
- **Follower-side** — `LeaderForwarder` interface gains
`ForwardCreateBucket` / `ForwardDeleteBucket` / `ForwardPutBucketAcl`.
`ForwardPutBucketAcl` carries both the bucket name (URL path) and the
new ACL (request body) in one JSON payload — same approach
`ForwardDeleteTable` takes for the table name.
- **Handler integration** — `S3Handler.WithLeaderForwarder` +
`tryForwardCreateBucket` / `tryForwardPutBucketAcl` /
`tryForwardDeleteBucket` gated on `ErrBucketsNotLeader && forwarder !=
nil`. `writeForwardResult` / `writeForwardFailure` mirror the Dynamo
handler's pattern (nosniff + Cache-Control:no-store + Retry-After:1 on
503).
- **Wiring** — `runtimeServerRunner.start()` now creates
`*adapter.S3Server` before `startRaftServers` so the leader-side
`ForwardServer` registration sees both adapters.
`adminForwardServerDeps.buckets` is wired through
`registerAdminForwardServer`. `ServerDeps.Forwarder` now plumbs into
`S3Handler` too.
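The status mapping `forwardBucketsErrorResponse` mirrors can be sketched as a sentinel switch. The sentinel names and HTTP statuses follow the PR text; everything else (function name, error messages) is illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Stand-ins for internal/admin's sentinel error vocabulary.
var (
	ErrBucketsForbidden     = errors.New("forbidden")
	ErrBucketsNotLeader     = errors.New("not leader")
	ErrBucketsNotFound      = errors.New("not found")
	ErrBucketsAlreadyExists = errors.New("already exists")
	ErrBucketsNotEmpty      = errors.New("not empty")
)

type ValidationError struct{ Msg string }

func (e *ValidationError) Error() string { return e.Msg }

// bucketsErrorStatus maps a source-side error onto the HTTP status the
// leader-direct path produces, so forwarded and leader-direct responses
// stay identical from the SPA's point of view.
func bucketsErrorStatus(err error) int {
	var ve *ValidationError
	switch {
	case errors.Is(err, ErrBucketsForbidden):
		return http.StatusForbidden // 403
	case errors.Is(err, ErrBucketsNotLeader):
		return http.StatusServiceUnavailable // 503, served with Retry-After: 1
	case errors.Is(err, ErrBucketsNotFound):
		return http.StatusNotFound // 404
	case errors.Is(err, ErrBucketsAlreadyExists), errors.Is(err, ErrBucketsNotEmpty):
		return http.StatusConflict // 409
	case errors.As(err, &ve):
		return http.StatusBadRequest // 400 invalid_request
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(bucketsErrorStatus(ErrBucketsNotLeader))          // 503
	fmt.Println(bucketsErrorStatus(&ValidationError{"bad name"})) // 400
}
```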

## What is NOT in this PR

- Rolling-upgrade compatibility flag (criterion 5) — still deferred
behind a cluster-version bump.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (admin + main + adapter + proto: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race` — full admin suite
passes (19 new tests across forward_server / forward_client / s3_handler
integration)
- [x] `go test . -count=1 -race` — main package passes
- [x] Targeted adapter tests pass
- [ ] Full `./adapter/` times out due to a pre-existing flake (verified
earlier on PRs #648 / #658 / #669 — unrelated to this branch)
- [ ] End-to-end smoke against a 3-node cluster — manual exercise after
#669 + this land

## Acceptance criteria coverage (Section 3.3.2 + 4.1)

| # | Criterion | Status |
|---|---|---|
| 1 | Leader direct write | ✓ for S3 (#669 / slice 2a) |
| 2 | Follower forwards transparently | ✅ for S3 (this PR) |
| 3 | Election-period 503 + retry | ✅ for S3 (this PR — same fallback as Dynamo) |
| 4 | Leader demotes stale full role | ✅ for S3 (existing `validatePrincipal` runs before bucket dispatch) |
| 5 | Rolling-upgrade compat flag | ⏳ deferred (cluster-version bump) |
| 6 | `forwarded_from` in audit log | ✅ for S3 (this PR — leader-side audit log carries it) |

## Self-review (5 lenses)

1. **Data loss**: No FSM / Raft / Pebble path changes. The leader-side
dispatcher reuses the same `BucketsSource` interface that the
leader-direct HTTP path uses; the only new code path is the JSON
envelope + principal re-validation.
2. **Concurrency**: New methods on `gRPCForwardClient` reuse the
existing `forward()` helper, so connection-cache reuse and
`ErrLeaderUnavailable` signalling behave identically across resource
types. The `runtimeServerRunner.start()` reorder is safe — each adapter
listens on its own address and the raft TCP listeners are independent.
3. **Performance**: One additional gRPC round-trip on follower-side
admin writes; identical to Dynamo's path.
4. **Data consistency**: Leader-side dispatcher applies the same strict
body decode the leader-direct HTTP path does (NUL-byte rejection,
trailing-token rejection, slash-in-name rejection).
`forwardBucketsErrorResponse` produces byte-identical envelopes to
`writeBucketsError` so forwarded and leader-direct responses are
indistinguishable.
5. **Test coverage**: 19 new tests — 9 forward-server (happy paths, 501
no-source, bad-JSON, already-exists, not-empty, slash-in-name,
missing-acl, payload-too-large sweep), 4 forward-client (3 happy-path
payload shapes + ErrLeaderUnavailable for all three), 6
handler-integration (forwarded create/delete/put-acl happy +
leader-unavailable + transport-error + 3-axis non-NotLeader gate sweep).
bootjp added a commit that referenced this pull request Apr 26, 2026
P2 slice 2a of
[docs/design/2026_04_24_proposed_admin_dashboard.md](https://github.com/bootjp/elastickv/blob/main/docs/design/2026_04_24_proposed_admin_dashboard.md).
Ships the S3 admin write endpoints so the SPA's S3 modals stop receiving
405. Slice 2b will plumb AdminForward so a follower can hand these
writes off to the leader transparently.

## Summary

- **`*adapter.S3Server.AdminCreateBucket` / `AdminPutBucketAcl` /
`AdminDeleteBucket`** — SigV4-bypass write methods with three in-method
guards: principal must be `AdminRoleFull`, the local node must be the
verified S3 leader, and bucket-name / ACL must pass the existing
validators. `AdminCreateBucket` reuses the same atomic bucket-meta + ACL
+ generation-key txn the SigV4 path does — no new code path through the
storage layer. `AdminDeleteBucket` rejects non-empty buckets (the
dashboard cannot force recursive delete by design).
- **`internal/admin` write surface** — `BucketsSource` gains the three
write methods + `CreateBucketRequest` / `PutBucketACLRequest` types with
the documented JSON shapes. `S3Handler.serveCollection` +
`servePerBucket` route POST/PUT/DELETE through dedicated handlers with
`principalForWrite` re-validating the role on every request against the
live `MapRoleStore`.
- **Strict body decoder** — `decodeAdminS3JSONBody` is generic over the
request type, applies `DisallowUnknownFields`, rejects NUL bytes,
rejects trailing tokens, and caps at 64 KiB (matches design 4.4). Used
by both POST and PUT.
- **`writeBucketsError`** translates the source-side sentinels into the
design's HTTP statuses: 403 forbidden / 503 + Retry-After:1
leader_unavailable / 404 not_found / 409 already_exists / 409
bucket_not_empty / 400 invalid_request via `*ValidationError`.
- **Bridge** — `bucketsBridge` gains write methods running through
`translateAdminBucketsError`, mirroring `translateAdminTablesError` on
the Dynamo side. Leader-churn sentinels from the kv coordinator route to
`admin.ErrBucketsNotLeader` so the SPA's retry contract stays intact.

## What is NOT in this PR

- AdminForward integration for S3 admin writes — slice 2b.
- Rolling-upgrade compatibility flag (criterion 5) — still deferred
behind a cluster-version bump.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (admin + main + adapter packages: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race` — 19 new handler tests
pass
- [x] `go test . -count=1 -race` — main package passes
- [x] `go test -run "TestS3Server_Admin" ./adapter/ -count=1 -race` — 10
new adapter tests pass
- [ ] Full `go test ./adapter/` times out at 120s due to a pre-existing
flake (verified earlier on PRs #648 / #658 against `main` — unrelated to
this branch)
- [ ] End-to-end smoke against a 3-node cluster — slice 2b first, then a
manual exercise

## Acceptance criteria coverage (Section 4.1)

| Endpoint | This PR |
|---|---|
| `GET /admin/api/v1/s3/buckets` | ✓ (#658) |
| `GET /admin/api/v1/s3/buckets/{name}` | ✓ (#658) |
| `POST /admin/api/v1/s3/buckets` | ✅ |
| `PUT /admin/api/v1/s3/buckets/{name}/acl` | ✅ |
| `DELETE /admin/api/v1/s3/buckets/{name}` | ✅ |

## Self-review (5 lenses)

1. **Data loss**: `AdminCreateBucket` reuses `s.coordinator.Dispatch`
with the same `OperationGroup` shape as the SigV4 path — bucket meta +
generation key in one txn. No new FSM / Pebble / Raft path.
`AdminDeleteBucket`'s "must be empty" guard is a SnapshotAt scan + size
check identical to the SigV4 path.
2. **Concurrency**: Writes go through `retryS3Mutation` which already
handles transient mid-dispatch leader churn. The leader check is
`isVerifiedS3Leader` — same primitive the SigV4 path uses. Role gate is
re-evaluated against the live `MapRoleStore` on every request, so a key
downgrade picked up between login and write is enforced immediately.
3. **Performance**: One additional load-bucket-meta read on PutACL /
Delete. No hot-path changes; admin writes are operator-rate, not
data-plane-rate.
4. **Data consistency**: `AdminCreateBucket` writes (BucketMetaKey,
BucketGenerationKey) atomically. `AdminPutBucketAcl` mutates only
`meta.Acl` and re-encodes the entire BucketMeta — generation is
preserved so existing object references stay valid. `AdminDeleteBucket`
removes only BucketMetaKey (BucketGenerationKey is left behind, matching
the SigV4 path's behaviour — a future re-create gets a fresh
generation).
5. **Test coverage**: 29 new tests (19 admin-package + 10 adapter-level)
covering happy paths, role gates, leader checks, validation rejections,
all four sentinel error mappings, and the cross-method missing-principal
401. The existing `TestS3Handler_DescribeBucket_SubpathReturns404` was
superseded by two more precise tests now that `/acl` is a real
sub-resource.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* S3 bucket admin endpoints: create, update ACL, and delete with
enforced write authorization, input validation, leader-forwarding, and
clear HTTP error mappings.
* **Documentation**
* New admin docs covering dashboard config, TLS/role semantics, audit
logging, and troubleshooting for admin operations.
* **Tests**
* Extensive end-to-end and unit tests for bucket lifecycle, forwarding,
auth/validation, and error scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 28, 2026
Per docs/design/README.md's lifecycle convention. The original
P1–P4 plan has fully shipped:

  - P1 (admin skeleton + Dynamo + AdminForward) — #634/#635/#644/#648
  - P2 (S3 endpoints incl. write paths and AdminForward integration)
        — #658 / #669 / #673 / #695 (TOCTOU safety net)
  - P3 (React SPA + embed) — #649 / #650
  - P4 (TLS / role / CSRF / operator doc / deployment runbook
        / scripts/rolling-update.sh admin support) — #674 / #669
        / #678

The AdminDeleteBucket TOCTOU caught during PR #669 review (the
last "in-flight" item that kept the doc at _partial_) is fully
resolved by the safety-net design landed in #695.

What changed:

  - git mv 2026_04_24_partial_admin_dashboard.md →
            2026_04_24_implemented_admin_dashboard.md (history
            follows the rename)
  - Header Status line: "Partial" → "Implemented", explanation
    updated to reflect the post-fix state and the rationale for
    promotion.
  - "Last updated" bumped to 2026-04-28 with the rename trigger.
  - Section heading "Outstanding open items" → "Out-of-scope
    follow-ups" — the remaining three entries (criterion 5,
    object browser, TLS hot-reload) are not in-flight work; they
    are deferred-at-design or Non-goal items. The TOCTOU bullet
    is removed (resolved) and replaced with a one-line
    cross-link to the safety-net design + admin_deployment.md
    §4.6 contract.
  - Removed the closing "rename trigger" sentence — we just did
    the rename.
  - Status table: P2 row now lists #695 alongside #658/#669/#673
    so a future reader can find the TOCTOU fix from the index.
  - Cross-references updated everywhere the old filename
    appeared:
      docs/admin.md (header link + Cross-references)
      docs/admin_deployment.md (header link + final cross-ref)
      docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md
        (Background section pointer)
      internal/admin/config.go (Section 7.1 reference comment)

No code changes other than the comment-only filename refresh in
config.go.
bootjp added a commit that referenced this pull request Apr 28, 2026
## Summary

Promote the admin dashboard design doc from `_partial_` →
`_implemented_` per `docs/design/README.md`'s lifecycle convention. PR
#695 landed the TOCTOU safety-net fix (the last in-flight item that kept
the doc at `_partial_`), so the original P1–P4 plan is now fully
shipped:

| Phase | Landed via |
|---|---|
| P1 (admin skeleton + Dynamo + AdminForward) | #634 / #635 / #644 / #648 |
| P2 (S3 endpoints + writes + AdminForward S3 + TOCTOU fix) | #658 / #669 / #673 / **#695** |
| P3 (React SPA + embed) | #649 / #650 |
| P4 (TLS / role / CSRF / operator doc / runbook / script wiring) | #674 / #669 / #678 |

The remaining three items in the doc move from "Outstanding open items"
(in-flight) to **"Out-of-scope follow-ups"** (deferred-at-design or
Non-goal):

- AdminForward criterion 5 — rolling-upgrade flag, deferred behind a
cluster-version bump that doesn't exist yet
- S3 object browser — Non-goal per §2.2
- TLS cert hot-reload — out of scope per `docs/admin.md`

## Changes

- `git mv` partial → implemented (history follows the rename)
- Status line / Last-updated / status-table / Out-of-scope section
content reflects the promotion
- Cross-references updated in all 4 referencing files: `docs/admin.md`,
`docs/admin_deployment.md`,
`docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md`,
`internal/admin/config.go` (comment-only)

## Test plan

- [x] No code changes other than a comment-only filename refresh in
`config.go`
- [x] `go build ./...` passes
- [x] `golangci-lint run ./internal/admin/...` — 0 issues
- [x] `grep -rn "2026_04_24_partial\|2026_04_24_proposed_admin"` returns
nothing — no stale references