admin: S3 bucket write endpoints (P2 slice 2a)#669
Conversation
⛔ Files ignored due to path filters (2)
📒 Files selected for processing (20)
📝 Walkthrough

Adds S3 admin write capabilities: bucket creation, ACL updates, and deletion across adapter, handler, admin contract, forwarding, server wiring, startup, proto, tests, and docs. New sentinel errors and request DTOs are introduced; leader-forwarding and per-request role revalidation are wired end-to-end.
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant Handler as S3Handler
    participant RoleStore
    participant Forwarder as LeaderForwarder
    participant BucketsSrc as BucketsSource
    participant Adapter as S3 Adapter
    participant KV as KV Store
    Client->>Handler: POST /s3/buckets {name, acl}
    Handler->>RoleStore: validate write permission for principal
    RoleStore-->>Handler: role (allow/deny)
    alt Not leader locally
        Handler->>Forwarder: ForwardCreateBucket(principal, {name,acl})
        Forwarder-->>Handler: leader response / ErrLeaderUnavailable
    else Leader or local
        Handler->>BucketsSrc: AdminCreateBucket(principal, {name,acl})
        BucketsSrc->>Adapter: AdminCreateBucket(ctx, principal, name, acl)
        Adapter->>KV: read exists / validate
        alt exists
            Adapter-->>BucketsSrc: ErrAdminBucketAlreadyExists
        else
            Adapter->>KV: OperationGroup write bucket metadata & generation
            KV-->>Adapter: success
            Adapter-->>BucketsSrc: AdminBucketSummary
        end
        BucketsSrc-->>Handler: result / error
    end
    Handler-->>Client: 201/409/503/4xx based on outcome
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (warning)
@claude review
Claude finished @bootjp's task in 3m 39s

Code Review: admin S3 bucket write endpoints (P2 slice 2a)
The overall implementation is solid and mirrors the Dynamo side well. That said, I found a few things worth addressing:

Findings
| # | Severity | File | Area |
|---|---|---|---|
| 1 | Medium | internal/admin/server.go:197 | Stale route table comment |
| 2 | Medium | internal/admin/s3_handler_test.go | `*ValidationError` handler path untested |
| 3 | Minor | internal/admin/s3_handler.go | Audit log omits failed write attempts |
| 4 | Minor | internal/admin/s3_handler.go:440 | `MaxBytesReader(nil, ...)` undocumented |
| 5 | Minor | internal/admin/s3_handler.go:469 | Misleading error text for unknown fields |
The core write paths (AdminCreateBucket, AdminPutBucketAcl, AdminDeleteBucket), the three-guard pattern (role → leader → validation), and the bridge translation are well-structured and consistent with the Dynamo precedent. Items 1 and 2 are the most worth fixing before merge; the rest can go into a follow-up.
Code Review
This pull request introduces administrative write operations for S3 buckets, including creation, deletion, and ACL updates. It adds new methods to the S3Server adapter, updates the BucketsSource interface, and extends the S3Handler to support these operations with mandatory role re-validation. The implementation is supported by comprehensive unit tests, error translation logic, and a bridge connecting the admin API to the adapter. I have no feedback to provide.
…axBytes nil doc)

Three actionable findings from claude review on 91a48ef:

1. **Medium: stale route-table comment in `buildAPIMux`** (`internal/admin/server.go`): the comment block listing the routes only mentioned the two GET S3 endpoints. Added the three new write endpoints (POST /buckets, DELETE /buckets/{name}, PUT /buckets/{name}/acl) so a future reader does not have to reverse-engineer the routing from `servePerBucket`.
2. **Medium: missing handler-level `*ValidationError` test** (`s3_handler_test.go`): `writeBucketsError`'s `errors.As(err, &verr)` arm was exercised end-to-end via the bridge translation but had no direct handler-level coverage. Added `TestS3Handler_WriteEndpoints_ValidationErrorReturns400` with three sub-cases (create / put_acl / delete) injecting a `*ValidationError` directly into the stub source and asserting 400 + `invalid_request` + the message. Locks the arm down so a future refactor that drops the `errors.As` branch fails CI immediately.
3. **Minor: `MaxBytesReader(nil, ...)` undocumented** (`s3_handler.go`): added an explanatory comment at the call site. Passing nil is safe because `MaxBytesReader`'s only writer use is the connection-close optimisation (a comma-ok type assertion that no-ops on nil); the cap itself is enforced regardless. We pass nil because the helper takes only `io.Reader` so it remains usable from non-HTTP entry points (the upcoming AdminForward decoder reuses the same shape). The 413 path still works via `errAdminS3BodyTooLarge` matching on `*http.MaxBytesError`.

Findings #3 (failed-write audit log) and #5 (DisallowUnknownFields error message) were noted as pre-existing patterns shared with the Dynamo side; tracking them separately rather than diverging on this slice.
@claude review ddc3b9b addresses the three actionable findings:
Findings #3 (failed-write audit) and #5 (DisallowUnknownFields error text) noted as pre-existing patterns shared with the Dynamo side — tracking separately to avoid diverging on this slice.
Claude finished @bootjp's task in 49s

Re-review: admin S3 bucket write endpoints (P2 slice 2a)
All three actionable findings from the previous review have been correctly addressed.

Finding #1 (Medium) — Route-table comment: ✅ Fixed
Finding #2 (Medium) — ValidationError handler coverage: ✅ Fixed
Finding #4 (Minor) — MaxBytesReader nil comment: ✅ Fixed
Findings #3 and #5 (failed-write audit omission and DisallowUnknownFields error text) acknowledged as pre-existing patterns shared with the Dynamo side — reasonable to defer and track together with that side rather than diverge here. No regressions found. The test file ends at 731 lines, covering all sentinel error paths, role gates, method rejections, body-decode edge cases, and the now-complete `*ValidationError` coverage.
Slice 2b of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md, sections 3.3.2 + 4.1): a follower-side S3 admin write (POST /buckets, PUT /buckets/{name}/acl, DELETE /buckets/{name}) now hands off to the leader transparently, completing the same end-to-end forwarding contract Dynamo writes received in #644 +

Stacked on #669 (P2 slice 2a). Once #669 merges, this rebases cleanly onto main.
Proto (proto/admin_forward.proto):
- Three new ADMIN_OP enum values (CREATE_BUCKET / DELETE_BUCKET /
PUT_BUCKET_ACL) appended after the Dynamo block so existing
wire-format integers stay stable. Regenerated with the pinned
protoc 29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.
Leader-side ForwardServer (internal/admin/forward_server.go):
- WithBucketsSource lets deployments wire the S3 dispatcher
optionally — Dynamo-only builds keep the receiver nil and the
three new operations return 501 NotImplemented.
- Three new dispatch arms: handleCreateBucket / handleDeleteBucket /
handlePutBucketAcl. Each one mirrors the leader-direct HTTP
path's payload contract (NUL-byte rejection, 64 KiB limit,
DisallowUnknownFields, trailing-token rejection, slash-in-name
rejection) so a hostile follower cannot smuggle a payload past
validations the leader-direct path enforces.
- forwardBucketsErrorResponse mirrors forwardErrorResponse on the
Dynamo side: ErrBucketsForbidden / NotLeader / NotFound /
AlreadyExists / NotEmpty + ValidationError each map to the same
HTTP status the leader-direct writeBucketsError produces, so
forwarded and leader-direct responses are byte-for-byte
indistinguishable from the SPA's view.
- isStructuredSourceError extended to recognise the bucket
sentinels so they are NOT logged at LevelError on the leader.
- Forward's switch was extracted into dispatchForward to keep the
parent function under cyclop's 10-branch ceiling as the operation
enum grew.
Follower-side LeaderForwarder (internal/admin/forward_client.go):
- Interface gains ForwardCreateBucket / ForwardDeleteBucket /
ForwardPutBucketAcl. PutBucketAcl carries both the bucket name
(from the URL path) and the new ACL (from the request body) in
one JSON payload — same approach handleDeleteBucket takes for
the bucket name.
- gRPCForwardClient methods reuse the existing forward() helper
for transport, so connection-cache reuse and ErrLeaderUnavailable
signalling behave identically across resource types.
Handler integration (internal/admin/s3_handler.go):
- New `forwarder LeaderForwarder` field + WithLeaderForwarder method.
- handleCreate / handlePutACL / handleDelete now consult tryForward*
helpers when the source returned ErrBucketsNotLeader; the helpers
are gated on `errors.Is(err, ErrBucketsNotLeader) && forwarder != nil`
so a leader-direct rejection (already-exists, not-found, etc.) is
never re-applied at the leader.
- writeForwardResult / writeForwardFailure mirror the Dynamo handler's
pattern: nosniff + Cache-Control:no-store + Retry-After:1 on 503.
ErrLeaderUnavailable does NOT log at LevelError (elections are
routine); transport errors do log so operators can investigate.
Wiring (main.go + main_admin_forward.go):
- adminForwardServerDeps gains a `buckets` field; readyForRegistration
still requires only TablesSource + RoleStore (so cluster-only or
Dynamo-only builds keep registering Dynamo forwarding without S3).
- runtimeServerRunner.start() now creates *adapter.S3Server BEFORE
startRaftServers (in addition to dynamoServer) so the leader-side
ForwardServer registration sees both adapters. The reorder is safe:
each adapter listens on its own address and the raft TCP listeners
are independent.
- ServerDeps.Forwarder now plumbs through buildS3HandlerForDeps too,
so the follower's S3Handler picks up the same LeaderForwarder
instance the Dynamo handler does.
Tests:
- 9 forward-server tests covering the three new bucket operations:
happy path / no-BucketsSource→501 / bad-JSON 400 / already-exists
409 / not-empty 409 / slash-in-name 400 / missing-acl 400 /
payload-too-large 413 (sweep over all three ops).
- 4 forward-client tests covering ForwardCreateBucket /
ForwardDeleteBucket / ForwardPutBucketAcl happy-path payload
shapes + ErrLeaderUnavailable on no-leader.
- stubLeaderForwarder gains 3 bucket-side forward methods so
existing dynamo tests still satisfy the LeaderForwarder
interface, and the new stub fields let bucket-handler tests
verify the forward arguments.
- 6 handler integration tests on S3Handler.tryForward*: forwarded
create / delete / put-acl happy paths (replay leader's status
+ payload + content-type), forwarder ErrLeaderUnavailable → 503
+ Retry-After, transport-error → 503 + no leakage, and a 3-axis
gate sweep proving the forwarder is NOT invoked on
AlreadyExists / Forbidden / generic source errors.
Closes design 3.3.2 acceptance criteria 2 (transparent forwarding)
+ 6 (forwarded_from in audit log) for S3 admin writes; criterion 3
(election-period 503 + retry) is also live for S3 because the
existing tryForward helpers reuse the same fallback paths.
Two findings from claude review #675:

1. **P2 was marked ✅ shipped but #669 / #673 are still in flight.** The PR description correctly noted slices 2a/2b haven't merged, but the table cell read "✅ shipped". Future readers reading the table alone would conclude main has the S3 write paths today, which is wrong. Fixed: P2 is now 🟡 partial with the in-flight PRs explicitly called out, mirroring the P4 row's shape.
2. **P1 row claimed criteria 1–6 but criterion 5 is listed in the outstanding items as deferred.** The two cells contradicted each other. Updated P1 to "criteria 1–4 + 6 (criterion 5 deferred — see outstanding items)" so the table is consistent with the outstanding-items section below.

The cosmetic notes (Last-updated field, verbose Status header) are left for a future cleanup; they are not factually wrong, only non-conforming to the README convention. Calling them out as a follow-up rather than baking the precedent into this doc.
Actionable comments posted: 3
🧹 Nitpick comments (2)
internal/admin/s3_handler.go (1)
309-313: Minor: `body.ACL` is forwarded un-trimmed even though validation rejects whitespace-only input.

`strings.TrimSpace(body.ACL) == ""` rejects blank/whitespace input, but the un-trimmed `body.ACL` is what flows into the source. The adapter's `adminCanonicalACL` trims again, so functionally this works, but `" public-read "` from the SPA reaches the audit log line at 322 with the surrounding whitespace preserved (`slog.String("acl", body.ACL)`), which makes downstream log filtering noisier than necessary. Trimming once at the boundary keeps the audit field canonical.

♻️ Trim once at the handler boundary

```diff
-	if strings.TrimSpace(body.ACL) == "" {
+	body.ACL = strings.TrimSpace(body.ACL)
+	if body.ACL == "" {
 		writeJSONError(w, http.StatusBadRequest, "invalid_body", "acl is required")
 		return
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/admin/s3_handler.go` around lines 309-313: Trim body.ACL once at the handler boundary and use the trimmed value everywhere: e.g., compute acl := strings.TrimSpace(body.ACL), validate acl == "" (return the same invalid_body error), and then pass acl into h.source.AdminPutBucketAcl and into the audit slog.String("acl", ...) so the audit log and source receive the canonical ACL; you can still keep adminCanonicalACL doing further normalization if needed but stop forwarding the untrimmed body.ACL.

internal/admin/buckets_source.go (1)
22-50: Minor: `PutBucketACLRequest` DTO defined but not used by the source interface.

`AdminPutBucketAcl` takes `name, acl string` directly while `AdminCreateBucket` takes `CreateBucketRequest`. The new `PutBucketACLRequest` type is only used by the handler's body decoder. This is functionally fine, but a future "owner override" extension (referenced in the type's own doc comment) would need to update the interface signature anyway, so deferring the typed DTO until then is also a valid choice. Worth a one-line doc note clarifying that `PutBucketACLRequest` lives on the wire side only.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/admin/buckets_source.go` around lines 22-50: The PutBucketACLRequest DTO is defined but not used by the source interface; update the codebase by adding a one-line doc comment on the PutBucketACLRequest type clarifying it is a wire-only/request-body DTO (used by the handler's body decoder) and not part of the BucketsSource interface contract, and note that AdminPutBucketAcl currently accepts name, acl string and would be the place to change the signature if an "owner override" extension is added later; this keeps the intent clear while leaving AdminPutBucketAcl and AdminCreateBucket unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@adapter/s3_admin.go`:
- Around line 369-403: AdminDeleteBucket reads a one-entry probe of
s3keys.ObjectManifestPrefixForBucket(...) via s.store.ScanAt(...) then issues a
delete via s.coordinator.Dispatch(...) but does not supply the scanned manifest
keys as ReadKeys, so OCC won't detect concurrent PutObject races; fix by
extracting the keys from kvs returned by ScanAt and populate the
OperationGroup's ReadKeys slice before calling coordinator.Dispatch in the
retryS3Mutation closure (reference functions/values: AdminDeleteBucket,
retryS3Mutation, loadBucketMetaAt, s.store.ScanAt, s.coordinator.Dispatch,
s3keys.ObjectManifestPrefixForBucket, s3keys.BucketMetaKey) so the OCC validator
will consider those manifest keys and trigger a retry on conflict.
- Around line 342-355: AdminDeleteBucket has a TOCTOU: after scanning
ObjectManifestPrefixForBucket at readTS it dispatches a delete of BucketMetaKey
without including the scanned range in the OperationGroup.ReadKeys, so
concurrent inserts can be orphaned; fix by collecting the scanned range keys
(the range returned by the scan over ObjectManifestPrefixForBucket) and include
them in the OperationGroup passed to s.coordinator.Dispatch (or, alternatively,
re-scan at StartTS before deleting) so OCC validation detects conflicts; update
the retryS3Mutation / s.coordinator.Dispatch call in AdminDeleteBucket (and
likewise in deleteBucket in s3.go) to populate ReadKeys with the scanned range.
In `@internal/admin/s3_handler.go`:
- Around line 351-376: S3Handler.principalForWrite currently skips checking the
JWT-embedded role when a live RoleStore is present, causing behavioral
divergence from DynamoHandler.principalForWrite; change principalForWrite to
first enforce the JWT role gate (check principal.Role.AllowsWrite()) regardless
of h.roles, then if h.roles != nil re-resolve the live role via
h.roles.LookupRole(principal.AccessKey) and require role.AllowsWrite() as well,
and finally return an AuthPrincipal containing the live role
(AuthPrincipal{AccessKey: principal.AccessKey, Role: role}) when both checks
pass; also update test setup (newS3HandlerForTest) or add tests to call
WithRoleStore so the h.roles != nil path is exercised.
---
Nitpick comments:
In `@internal/admin/buckets_source.go`:
- Around line 22-50: The PutBucketACLRequest DTO is defined but not used by the
source interface; update the codebase by adding a one-line doc comment on the
PutBucketACLRequest type clarifying it is a wire-only/request-body DTO (used by
the handler's body decoder) and not part of the BucketsSource interface
contract, and note that AdminPutBucketAcl currently accepts name, acl string and
would be the place to change the signature if an "owner override" extension is
added later; this keeps the intent clear while leaving AdminPutBucketAcl and
AdminCreateBucket unchanged.
In `@internal/admin/s3_handler.go`:
- Around line 309-313: Trim body.ACL once at the handler boundary and use the
trimmed value everywhere: e.g., compute acl := strings.TrimSpace(body.ACL),
validate acl == "" (return the same invalid_body error), and then pass acl into
h.source.AdminPutBucketAcl and into the audit slog.String("acl", ...) so the
audit log and source receive the canonical ACL; you can still keep
adminCanonicalACL doing further normalization if needed but stop forwarding the
untrimmed body.ACL.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6b38f8ff-8066-45eb-9dda-c712067729e1
📒 Files selected for processing (7)
adapter/s3_admin.go, adapter/s3_admin_test.go, internal/admin/buckets_source.go, internal/admin/s3_handler.go, internal/admin/s3_handler_test.go, internal/admin/server.go, main_admin.go
```go
	_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
		IsTxn:   true,
		StartTS: startTS,
		Elems: []*kv.Elem[kv.OP]{
			{Op: kv.Put, Key: s3keys.BucketMetaKey(name), Value: body},
		},
	})
	return errors.WithStack(err)
})
if err != nil {
	return err //nolint:wrapcheck // sentinel errors propagate as-is.
}
return nil
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Locate the kv.OperationGroup struct definition and check how Dispatch
# treats a zero CommitTS when IsTxn=true.
ast-grep --pattern 'type OperationGroup[$_] struct {
  $$$
}'
# Find the Dispatch implementation that consumes OperationGroup.
rg -nP --type=go -C5 'func.*Dispatch\s*\(.*OperationGroup'
# Find any other call sites that set IsTxn=true without CommitTS to see
# whether this is a known idiom or a regression.
rg -nPU --type=go -C3 'IsTxn:\s*true' -g '!**/*_test.go'
```

Repository: bootjp/elastickv

🏁 Script executed:

```shell
#!/bin/bash
# Find OperationGroup struct definition
rg -nP 'type OperationGroup\[' --type=go -A 10
# Find validateOperationGroup function
rg -nP 'func.*validateOperationGroup' --type=go -A 20
# Check the Dispatch implementation in coordinator.go more closely
sed -n '240,280p' kv/coordinator.go
```

🏁 Script executed:

```shell
#!/bin/bash
# Find AdminDeleteBucket implementation with full context
sed -n '361,403p' adapter/s3_admin.go
# Check if there's any range-read/read-intent tracking in OCC logic
rg -nP 'ReadKeys|read.*intent' kv/coordinator.go -A 3 -B 1
# Check transaction validation/conflict detection logic
rg -nP 'validateRead|checkConflict' kv/ --type=go -A 10 | head -60
```

🏁 Script executed:

```shell
#!/bin/bash
# Check AdminPutBucketAcl (around line 312-355) for ReadKeys
sed -n '312,355p' adapter/s3_admin.go
# Check retryS3Mutation implementation
rg -nP 'func.*retryS3Mutation' adapter/s3_admin.go -A 30
```

🏁 Script executed:

```shell
#!/bin/bash
# Find retryS3Mutation implementation
rg -nP 'func.*retryS3Mutation' adapter/s3.go -A 35
# Also check if AdminDeleteBucket passes ReadKeys for range scans
rg -nP 'ReadKeys.*=|ReadKeys:' adapter/s3_admin.go
# Check if there's any conflict detection specifically for AdminDeleteBucket
rg -nP 'ObjectManifestPrefix|retryS3Mutation' adapter/s3.go -B 2 -A 2
```

🏁 Script executed:

```shell
#!/bin/bash
# Check if any other code passes ReadKeys for range scans
rg -nP 'ReadKeys.*=|ReadKeys:' adapter/ --type=go -B 2 | head -80
# Specifically check AdminCreateBucket to see if it uses ReadKeys
sed -n '259,310p' adapter/s3_admin.go
# Check dispatchTxn to understand how ReadKeys affect OCC
sed -n '801,870p' kv/coordinator.go | head -60
```

🏁 Script executed:

```shell
#!/bin/bash
# Check if s3.go deleteBucket (regular, not admin) passes ReadKeys
sed -n '638,680p' adapter/s3.go
# Check if any S3 code passes ReadKeys for range scans
rg -nP 'ScanAt.*ReadKeys|ReadKeys.*ScanAt' adapter/s3.go adapter/s3_admin.go
# Look for patterns of range scan + dispatch in S3
rg -nP 'ScanAt' adapter/s3.go -A 10 | head -100
```

🏁 Script executed:

```shell
#!/bin/bash
# Check the test coordinator that tests bucket delete races
sed -n '745,800p' adapter/s3_test.go
# Check containsObjectManifestMutation function
rg -nP 'func.*containsObjectManifestMutation' adapter/ --type=go -A 8
# Verify whether AdminDeleteBucket's dispatch is covered by the race test
rg -nP 'AdminDeleteBucket|bucketDeleteRaceCoordinator' adapter/s3_test.go adapter/s3_admin_test.go
```
Remove the CommitTS concern; verify the range-scan TOCTOU issue in AdminDeleteBucket.
The CommitTS asymmetry is intentional. The OperationGroup struct (kv/transcoder.go:29-30) explicitly documents that CommitTS is optional: "Coordinators choose one automatically when this is zero." The validateOperationGroup function does not validate CommitTS, and this pattern is used throughout the codebase (SQS, Redis, DynamoDB adapters all omit it when unset). No comment is needed.
However, AdminDeleteBucket has a genuine TOCTOU issue: it scans ObjectManifestPrefixForBucket at readTS (lines 370–385), finds no objects, then dispatches a delete of BucketMetaKey with StartTS but no ReadKeys. If a concurrent write inserts an object after the scan but before the delete commits, the object becomes orphaned. The retryS3Mutation loop only retries on conflicts to BucketMetaKey itself, not to keys within the scanned range.
Compare to SQS code (e.g., adapter/sqs_catalog.go:656–661), which correctly passes ReadKeys after range scans to enable OCC conflict detection. The same issue exists in s3.go:deleteBucket() (lines 638–680).
Pass the scanned range as ReadKeys to enable OCC validation, or re-scan at transaction StartTS before deleting to close the window.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@adapter/s3_admin.go` around lines 342 - 355, AdminDeleteBucket has a TOCTOU:
after scanning ObjectManifestPrefixForBucket at readTS it dispatches a delete of
BucketMetaKey without including the scanned range in the
OperationGroup.ReadKeys, so concurrent inserts can be orphaned; fix by
collecting the scanned range keys (the range returned by the scan over
ObjectManifestPrefixForBucket) and include them in the OperationGroup passed to
s.coordinator.Dispatch (or, alternatively, re-scan at StartTS before deleting)
so OCC validation detects conflicts; update the retryS3Mutation /
s.coordinator.Dispatch call in AdminDeleteBucket (and likewise in deleteBucket
in s3.go) to populate ReadKeys with the scanned range.
```go
err := s.retryS3Mutation(ctx, func() error {
	readTS := s.readTS()
	startTS := s.txnStartTS(readTS)
	readPin := s.pinReadTS(readTS)
	defer readPin.Release()

	meta, exists, err := s.loadBucketMetaAt(ctx, name, readTS)
	if err != nil {
		return errors.WithStack(err)
	}
	if !exists || meta == nil {
		return ErrAdminBucketNotFound
	}
	start := s3keys.ObjectManifestPrefixForBucket(name, meta.Generation)
	kvs, err := s.store.ScanAt(ctx, start, prefixScanEnd(start), 1, readTS)
	if err != nil {
		return errors.WithStack(err)
	}
	if len(kvs) > 0 {
		return ErrAdminBucketNotEmpty
	}
	_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
		IsTxn:   true,
		StartTS: startTS,
		Elems: []*kv.Elem[kv.OP]{
			{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
		},
	})
	return errors.WithStack(err)
})
if err != nil {
	return err //nolint:wrapcheck // sentinel errors propagate as-is.
}
return nil
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check how retryS3Mutation handles dispatch errors and whether it
# retries OCC conflicts on range reads.
ast-grep --pattern 'func ($_ *S3Server) retryS3Mutation($$$) $_ {
  $$$
}'
# Look for OCC / read-intent tracking inside kv coordinator.
rg -nP --type=go -C3 'readPin|ReadIntent|ReadSet|range.*read'
# See whether the SigV4 deleteBucket path uses the same pattern (so we
# can scope this concern as pre-existing or new).
rg -nP --type=go -C5 'deleteBucket\s*\(' -g '!**/*_test.go' -g '!**/s3_admin*.go'
```

🏁 Script executed:

```shell
#!/bin/bash
# Find isRetryableS3MutationErr implementation
rg -nP --type=go -A10 'func isRetryableS3MutationErr' -g '!**/*_test.go'
# Check what error types are retryable (look for OCC/write conflict handling)
rg -nP --type=go 'WriteConflict|OCC.*conflict|retryable' -g '!**/*_test.go' -C2
# Verify: does AdminDeleteBucket need to track ScanAt as ReadKeys?
rg -nP --type=go -B5 -A10 'func.*AdminDeleteBucket' adapter/s3_admin.go
```

🏁 Script executed:

```shell
#!/bin/bash
# Check kv.OperationGroup definition and ReadKeys field
rg -nP --type=go -A5 'type OperationGroup' kv/
# Search for tests of AdminDeleteBucket or deleteBucket with concurrent writes
rg -nP --type=go 'TestAdmin.*DeleteBucket|Test.*deleteBucket' -g '**/*_test.go'
# Check if any S3 path passes keys from a range scan as ReadKeys
rg -nP --type=go -B10 'ReadKeys.*Scan' adapter/s3.go
```

🏁 Script executed:

```shell
#!/bin/bash
# Find the Dispatch signature to understand ReadKeys parameter
rg -nP --type=go 'func.*Dispatch.*ctx.*error' kv/ -A8
# Check if ReadKeys is in pb.Request
rg -nP --type=go 'ReadKeys' kv/ -B2 -A2 | head -100
```

🏁 Script executed:

```shell
#!/bin/bash
# Search for existing tests of AdminDeleteBucket or deleteBucket with concurrent mutations
rg -nP --type=go 'AdminDeleteBucket|deleteBucket' adapter/s3_admin_test.go adapter/s3_test.go 2>/dev/null | head -50
# Check if there are any tests at all for these functions
fd -e _test.go -type f | xargs rg -l 'AdminDeleteBucket|deleteBucket' 2>/dev/null
```

🏁 Script executed:

```shell
#!/bin/bash
# Read the test that checks for non-empty bucket
sed -n '287,320p' adapter/s3_admin_test.go
```

🏁 Script executed:

```shell
#!/bin/bash
# Verify there are no race-condition tests for AdminDeleteBucket/deleteBucket
rg -nP --type=go 'concurrent|race|TOCTOU' adapter/s3_admin_test.go adapter/s3_test.go
```
Verify the OCC read-set handling for the range scan in AdminDeleteBucket.
The concern is valid: AdminDeleteBucket reads the object-manifest prefix at readTS (limit 1 — a "is the bucket empty?" probe) and dispatches a delete on s3keys.BucketMetaKey(name) without passing the scanned manifest keys as ReadKeys to the OCC validator. The two operations touch disjoint key spaces:
- read set: range over s3keys.ObjectManifestPrefixForBucket(name, meta.Generation)
- write set: single point key s3keys.BucketMetaKey(name)

The OCC layer validates only explicit point keys in the ReadKeys field (see store.ApplyMutations — it iterates the readKeys slice). A concurrent PutObject that inserts a new manifest key between readTS and the delete's commit will not conflict; the delete commits anyway, leaving orphaned objects under a now-deleted bucket meta. The retryS3Mutation retry loop only activates if a conflict is detected, which will not occur.
This race exists in both AdminDeleteBucket and the SigV4 deleteBucket path (same pattern at adapter/s3.go:638), making it a pre-existing concern. To close the gap, either populate ReadKeys with the scanned manifest keys, or explicitly document that the design accepts this orphan race.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@adapter/s3_admin.go` around lines 369 - 403, AdminDeleteBucket reads a
one-entry probe of s3keys.ObjectManifestPrefixForBucket(...) via
s.store.ScanAt(...) then issues a delete via s.coordinator.Dispatch(...) but
does not supply the scanned manifest keys as ReadKeys, so OCC won't detect
concurrent PutObject races; fix by extracting the keys from kvs returned by
ScanAt and populate the OperationGroup's ReadKeys slice before calling
coordinator.Dispatch in the retryS3Mutation closure (reference functions/values:
AdminDeleteBucket, retryS3Mutation, loadBucketMetaAt, s.store.ScanAt,
s.coordinator.Dispatch, s3keys.ObjectManifestPrefixForBucket,
s3keys.BucketMetaKey) so the OCC validator will consider those manifest keys and
trigger a retry on conflict.
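The populate-ReadKeys shape the prompt describes can be sketched as follows. `OperationGroup` and `KV` here are illustrative stand-ins for the repository's real types, and note the caveat raised later in the thread: point keys copied from a limit-1 probe still cannot conflict with manifest keys inserted *after* the scan.

```go
package main

import "fmt"

// Illustrative stand-ins for the elastickv types named in the review;
// the real OperationGroup / KV definitions live in the repository.
type KV struct {
	Key   []byte
	Value []byte
}

type OperationGroup struct {
	ReadKeys [][]byte // point keys the OCC validator re-checks at commit
	// ... mutations elided
}

// addScannedKeysToReadSet copies every key returned by the emptiness
// probe (plus the bucket-meta key being deleted) into the OCC read set,
// so a concurrent write to any of those exact keys between readTS and
// commit produces a conflict and a retryS3Mutation-style retry.
func addScannedKeysToReadSet(group *OperationGroup, scanned []KV, metaKey []byte) {
	group.ReadKeys = append(group.ReadKeys, metaKey)
	for _, kv := range scanned {
		group.ReadKeys = append(group.ReadKeys, kv.Key)
	}
}

func main() {
	group := &OperationGroup{}
	scanned := []KV{{Key: []byte("manifest/b1/obj1")}}
	addScannedKeysToReadSet(group, scanned, []byte("bucketmeta/b1"))
	fmt.Println(len(group.ReadKeys)) // 2
}
```

This narrows but does not close the race: a validator that only checks point keys can never detect a brand-new key landing in the scanned range, which is why the thread ultimately treats the gap as a design limitation to document.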
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ddc3b9b215
    if h.roles == nil {
        // Production wiring always sets a RoleStore; nil is a test
        // fallthrough we accept so single-handler unit tests can
        // reach the source without standing up the auth chain.
        if !principal.Role.AllowsWrite() {
Enforce JWT write role before live role lookup
The write-role check is currently gated by h.roles == nil, so in normal production wiring (where RoleStore is set) a session whose JWT role is read-only can still perform S3 writes as soon as the live role map says full. That contradicts the handler’s own “JWT freezes the role at login” contract and diverges from Dynamo’s principalForWrite behavior, making authorization semantics inconsistent across admin resources. Keep the JWT AllowsWrite check unconditional, then apply live-role revalidation.
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md Section 8: a single self-contained operator-facing reference for the admin HTTP listener. Covers:

- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo + s3, including the slice 2 write paths and the AdminForward forwarding contract)
- forwarded_from audit log shape and why it carries the follower's node ID
- Troubleshooting guide for the common failure modes operators hit during initial bring-up (missing credentials, TLS hard-error, 401 ambiguity, 503 leader_unavailable, bucket_not_empty, blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as deliverables; those are already implemented (see config.go's validateTLS / validateAccessKeyRoles, the role gates in DynamoHandler.principalForWrite + S3Handler.principalForWrite, and the CSRFDoubleSubmit middleware). This doc stitches them into a single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-surface table can describe S3 write endpoints as shipped. Once both land in main, this rebases cleanly.
Slice 2b of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md
Section 3.3.2 + 4.1): a follower-side S3 admin write
(POST /buckets, PUT /buckets/{name}/acl, DELETE /buckets/{name})
now hands off to the leader transparently, completing the same
end-to-end forwarding contract Dynamo writes received in #644 +
Stacked on #669 (P2 slice 2a). Once #669 merges, this rebases
cleanly onto main.
Proto (proto/admin_forward.proto):
- Three new ADMIN_OP enum values (CREATE_BUCKET / DELETE_BUCKET /
PUT_BUCKET_ACL) appended after the Dynamo block so existing
wire-format integers stay stable. Regenerated with the pinned
protoc 29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.
Leader-side ForwardServer (internal/admin/forward_server.go):
- WithBucketsSource lets deployments wire the S3 dispatcher
optionally — Dynamo-only builds keep the receiver nil and the
three new operations return 501 NotImplemented.
- Three new dispatch arms: handleCreateBucket / handleDeleteBucket /
handlePutBucketAcl. Each one mirrors the leader-direct HTTP
path's payload contract (NUL-byte rejection, 64 KiB limit,
DisallowUnknownFields, trailing-token rejection, slash-in-name
rejection) so a hostile follower cannot smuggle a payload past
validations the leader-direct path enforces.
- forwardBucketsErrorResponse mirrors forwardErrorResponse on the
Dynamo side: ErrBucketsForbidden / NotLeader / NotFound /
AlreadyExists / NotEmpty + ValidationError each map to the same
HTTP status the leader-direct writeBucketsError produces, so
forwarded and leader-direct responses are byte-for-byte
indistinguishable from the SPA's view.
- isStructuredSourceError extended to recognise the bucket
sentinels so they are NOT logged at LevelError on the leader.
- Forward's switch was extracted into dispatchForward to keep the
parent function under cyclop's 10-branch ceiling as the operation
enum grew.
Follower-side LeaderForwarder (internal/admin/forward_client.go):
- Interface gains ForwardCreateBucket / ForwardDeleteBucket /
ForwardPutBucketAcl. PutBucketAcl carries both the bucket name
(from the URL path) and the new ACL (from the request body) in
one JSON payload — same approach handleDeleteBucket takes for
the bucket name.
- gRPCForwardClient methods reuse the existing forward() helper
for transport, so connection-cache reuse and ErrLeaderUnavailable
signalling behave identically across resource types.
Handler integration (internal/admin/s3_handler.go):
- New `forwarder LeaderForwarder` field + WithLeaderForwarder method.
- handleCreate / handlePutACL / handleDelete now consult tryForward*
helpers when the source returned ErrBucketsNotLeader; the helpers
are gated on `errors.Is(err, ErrBucketsNotLeader) && forwarder != nil`
so a leader-direct rejection (already-exists, not-found, etc.) is
never re-applied at the leader.
- writeForwardResult / writeForwardFailure mirror the Dynamo handler's
pattern: nosniff + Cache-Control:no-store + Retry-After:1 on 503.
ErrLeaderUnavailable does NOT log at LevelError (elections are
routine); transport errors do log so operators can investigate.
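The forwarding gate described in the first bullet reduces to a small predicate; this sketch names it `shouldForward` for illustration (the sentinel definitions are stand-ins for the package's real errors).

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the package sentinels.
var (
	ErrBucketsNotLeader     = errors.New("buckets: not leader")
	ErrBucketsAlreadyExists = errors.New("buckets: already exists")
)

// shouldForward captures the gate: only a NotLeader rejection from the
// local source hands off to the leader, and only when a forwarder was
// wired in. Leader-direct rejections (already-exists, not-found, ...)
// must never be replayed at the leader.
func shouldForward(err error, forwarderWired bool) bool {
	return errors.Is(err, ErrBucketsNotLeader) && forwarderWired
}

func main() {
	fmt.Println(shouldForward(ErrBucketsNotLeader, true))     // true
	fmt.Println(shouldForward(ErrBucketsAlreadyExists, true)) // false
	fmt.Println(shouldForward(ErrBucketsNotLeader, false))    // false
}
```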
Wiring (main.go + main_admin_forward.go):
- adminForwardServerDeps gains a `buckets` field; readyForRegistration
still requires only TablesSource + RoleStore (so cluster-only or
Dynamo-only builds keep registering Dynamo forwarding without S3).
- runtimeServerRunner.start() now creates *adapter.S3Server BEFORE
startRaftServers (in addition to dynamoServer) so the leader-side
ForwardServer registration sees both adapters. The reorder is safe:
each adapter listens on its own address and the raft TCP listeners
are independent.
- ServerDeps.Forwarder now plumbs through buildS3HandlerForDeps too,
so the follower's S3Handler picks up the same LeaderForwarder
instance the Dynamo handler does.
Tests:
- 9 forward-server tests covering the three new bucket operations:
happy path / no-BucketsSource→501 / bad-JSON 400 / already-exists
409 / not-empty 409 / slash-in-name 400 / missing-acl 400 /
payload-too-large 413 (sweep over all three ops).
- 4 forward-client tests covering ForwardCreateBucket /
ForwardDeleteBucket / ForwardPutBucketAcl happy-path payload
shapes + ErrLeaderUnavailable on no-leader.
- stubLeaderForwarder gains 3 bucket-side forward methods so
existing dynamo tests still satisfy the LeaderForwarder
interface, and the new stub fields let bucket-handler tests
verify the forward arguments.
- 6 handler integration tests on S3Handler.tryForward*: forwarded
create / delete / put-acl happy paths (replay leader's status
+ payload + content-type), forwarder ErrLeaderUnavailable → 503
+ Retry-After, transport-error → 503 + no leakage, and a 3-axis
gate sweep proving the forwarder is NOT invoked on
AlreadyExists / Forbidden / generic source errors.
Closes design 3.3.2 acceptance criteria 2 (transparent forwarding)
+ 6 (forwarded_from in audit log) for S3 admin writes; criterion 3
(election-period 503 + retry) is also live for S3 because the
existing tryForward helpers reuse the same fallback paths.
Latest claude review on #675 flagged that lines 3 and 6 still described P2 as fully shipped:

- Line 3: "P1, **P2**, and P3 of the implementation plan have shipped"
- Line 6: "after P1 + **P2** + P3 landed"

But the status table I added in 7617daf correctly marks P2 as 🟡 partial (read-only slice 1 in #658 landed; write paths #669 and #673 still in flight). The header prose contradicting the table is exactly the kind of inconsistency that triggers the rename to `_partial_` in the first place — readers skimming the top of the doc would conclude P2 was done.

Rewrote the Status line and the parenthetical on Last updated to match the table, calling out P1 + P3 as fully shipped, P2 as the read-only slice only, and P4 as TLS/role/CSRF live with the operator doc landing now. Pointed readers at the per-phase status table for the breakdown. No other changes; the table itself was already correct.
Per `docs/design/README.md`'s lifecycle convention, the admin dashboard design doc is now "partial":

- **P1** (DynamoDB CRUD + AdminForward) — shipped via #634, #635, #644, #648
- **P2** (S3 buckets list/create/delete/ACL + DescribeTable) — shipped via #658, with #669 + #673 in flight
- **P3** (React SPA + embed) — shipped via #649, #650
- **P4** (TLS / role / CSRF / operator docs) — TLS, role, CSRF are already live in P1; operator docs in #674

Independent of the in-flight slice 2 PRs (#669/#673) and the docs PR (#674) — this rename only reflects what is already on main today, plus an "Implementation status" table mapping each phase to the PR it landed in.

## What this PR changes

- `git mv` the design doc from `2026_04_24_proposed_admin_dashboard.md` to `2026_04_24_partial_admin_dashboard.md` so its history follows
- Add an "Implementation status" header table indexing each phase to the PRs that landed it
- List the outstanding open items so future readers know what is still owed against the original proposal:
  - AdminForward acceptance criterion 5 (rolling-upgrade compat flag) — deferred
  - S3 object browser — explicitly out of scope per Section 2 Non-goals
  - TLS cert hot-reload — restart-to-rotate is the documented model

When the rolling-upgrade flag lands, the doc gets renamed once more to `2026_04_24_implemented_admin_dashboard.md` per the README's lifecycle convention.
@claude review
Claude finished @bootjp's task in 3m 16s — Code Review: admin S3 bucket write endpoints (P2 slice 2a)
All three items from the previous Claude round are confirmed clean in 5a67834. CodeRabbit's second-pass comment raised three new findings. Here's where each stands:
| Finding | Status | Severity | Area |
|---|---|---|---|
| TOCTOU / OCC read-set | Pre-existing, deferred ✅ | — | adapter/s3_admin.go:383-396 |
| `principalForWrite` JWT gate | Still open | Medium | internal/admin/s3_handler.go:358-375 |
| ACL audit log whitespace | Still open | Minor | internal/admin/s3_handler.go:313,322 |
Finding 2 is worth fixing before merge — it's a straightforward alignment with the Dynamo handler, plugs the production test gap, and removes the inconsistent JWT-role behavior within the same admin session. Finding 3 can be a follow-up with the Dynamo-side audit-field cleanup.
P2 slice 2b: a follower-side S3 admin write (POST /buckets, PUT
/buckets/{name}/acl, DELETE /buckets/{name}) now hands off to the leader
transparently, completing the same end-to-end forwarding contract Dynamo
writes received in #644 + #648.
**Stacked on #669 (P2 slice 2a).** Once #669 merges, this rebases
cleanly onto main.
## Summary
- **Proto** — three new `ADMIN_OP` enum values (`CREATE_BUCKET` /
`DELETE_BUCKET` / `PUT_BUCKET_ACL`) appended after the Dynamo block so
existing wire integers stay stable. Regenerated with the pinned protoc
29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.
- **Leader-side** — `ForwardServer.WithBucketsSource` lets deployments
wire the S3 dispatcher optionally; Dynamo-only builds keep it nil and
the new operations return 501. Three new dispatch arms mirror the
leader-direct HTTP path's payload contract (NUL-byte rejection, 64 KiB
limit, `DisallowUnknownFields`, trailing-token rejection, slash-in-name
rejection). `forwardBucketsErrorResponse` mirrors
`forwardErrorResponse`: `ErrBucketsForbidden` / `NotLeader` / `NotFound`
/ `AlreadyExists` / `NotEmpty` + `*ValidationError` each map to the same
HTTP status `writeBucketsError` produces, so forwarded and leader-direct
responses are byte-identical from the SPA's view.
- **Follower-side** — `LeaderForwarder` interface gains
`ForwardCreateBucket` / `ForwardDeleteBucket` / `ForwardPutBucketAcl`.
`ForwardPutBucketAcl` carries both the bucket name (URL path) and the
new ACL (request body) in one JSON payload — same approach
`ForwardDeleteTable` takes for the table name.
- **Handler integration** — `S3Handler.WithLeaderForwarder` +
`tryForwardCreateBucket` / `tryForwardPutBucketAcl` /
`tryForwardDeleteBucket` gated on `ErrBucketsNotLeader && forwarder !=
nil`. `writeForwardResult` / `writeForwardFailure` mirror the Dynamo
handler's pattern (nosniff + Cache-Control:no-store + Retry-After:1 on
503).
- **Wiring** — `runtimeServerRunner.start()` now creates
`*adapter.S3Server` before `startRaftServers` so the leader-side
`ForwardServer` registration sees both adapters.
`adminForwardServerDeps.buckets` is wired through
`registerAdminForwardServer`. `ServerDeps.Forwarder` now plumbs into
`S3Handler` too.
## What is NOT in this PR
- Rolling-upgrade compatibility flag (criterion 5) — still deferred
behind a cluster-version bump.
## Test plan
- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (admin + main + adapter + proto: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race` — full admin suite
passes (19 new tests across forward_server / forward_client / s3_handler
integration)
- [x] `go test . -count=1 -race` — main package passes
- [x] Targeted adapter tests pass
- [ ] Full `./adapter/` times out due to a pre-existing flake (verified
earlier on PRs #648 / #658 / #669 — unrelated to this branch)
- [ ] End-to-end smoke against a 3-node cluster — manual exercise after
#669 + this land
## Acceptance criteria coverage (Section 3.3.2 + 4.1)
| # | Criterion | Status |
|---|---|---|
| 1 | Leader direct write | ✅ for S3 (#669 / slice 2a) |
| 2 | Follower forwards transparently | ✅ for S3 (this PR) |
| 3 | Election-period 503 + retry | ✅ for S3 (this PR — same fallback as Dynamo) |
| 4 | Leader demotes stale full role | ✅ for S3 (existing `validatePrincipal` runs before bucket dispatch) |
| 5 | Rolling-upgrade compat flag | ⏳ deferred (cluster-version bump) |
| 6 | `forwarded_from` in audit log | ✅ for S3 (this PR — leader-side audit log carries it) |
## Self-review (5 lenses)
1. **Data loss**: No FSM / Raft / Pebble path changes. The leader-side
dispatcher reuses the same `BucketsSource` interface that the
leader-direct HTTP path uses; the only new code path is the JSON
envelope + principal re-validation.
2. **Concurrency**: New methods on `gRPCForwardClient` reuse the
existing `forward()` helper, so connection-cache reuse and
`ErrLeaderUnavailable` signalling behave identically across resource
types. The `runtimeServerRunner.start()` reorder is safe — each adapter
listens on its own address and the raft TCP listeners are independent.
3. **Performance**: One additional gRPC round-trip on follower-side
admin writes; identical to Dynamo's path.
4. **Data consistency**: Leader-side dispatcher applies the same strict
body decode the leader-direct HTTP path does (NUL-byte rejection,
trailing-token rejection, slash-in-name rejection).
`forwardBucketsErrorResponse` produces byte-identical envelopes to
`writeBucketsError` so forwarded and leader-direct responses are
indistinguishable.
5. **Test coverage**: 19 new tests — 9 forward-server (happy paths, 501
no-source, bad-JSON, already-exists, not-empty, slash-in-name,
missing-acl, payload-too-large sweep), 4 forward-client (3 happy-path
payload shapes + ErrLeaderUnavailable for all three), 6
handler-integration (forwarded create/delete/put-acl happy +
leader-unavailable + transport-error + 3-axis non-NotLeader gate sweep).
Codex P2 + coderabbitai Minor on PR #669 caught that the S3 admin handler diverged from the Dynamo path in production wiring. The previous shape was:

    if h.roles == nil { /* JWT check */ ... return principal }
    /* h.roles != nil — production: skip JWT, only live role */
    role, _ := h.roles.LookupRole(...)
    if !role.AllowsWrite() { 403 }

So a session whose JWT was minted as `read-only` could **escalate** to write capability the moment the live role index promoted that access key to `full` — exactly the kind of between-login-and-revocation window the JWT was supposed to freeze. DynamoHandler.principalForWrite (dynamo_handler.go:413) fires the JWT gate unconditionally and then layers the live role check on top, which is the safer ordering: the JWT can only narrow capability, never widen it. Aligned the S3 handler with that contract.

The fix also picks up a second case the Dynamo path already covers: a JWT minted as `full` whose live role has since been demoted to read-only or removed entirely is now rejected with 403 instead of silently sliding through.

Per CLAUDE.md "test the bug first" — `newS3HandlerForTest` never called `WithRoleStore`, so every existing test ran the `h.roles == nil` branch and the production path was unproven. Added two new tests:

- TestS3Handler_PrincipalForWrite_JWTGateFiresWithRoleStore — table sweep over all three S3 admin write endpoints (create, put_acl, delete) with WithRoleStore wired and JWT=read-only / live=full. All three must 403.
- TestS3Handler_PrincipalForWrite_LiveRoleNarrowsJWT — JWT=full + live=read-only or removed-from-index. The full live-role-narrowing branch was unproven before; this pins both demotion and revocation cases.

Also documented the AdminDeleteBucket TOCTOU that coderabbitai flagged as 🔴 / 🟠 in the same review: AdminDeleteBucket scans ObjectManifestPrefixForBucket at readTS (limit=1, an "is bucket empty?" probe) and dispatches the BucketMetaKey delete with that point key only in ReadKeys. A concurrent PutObject inserting a manifest key in the scanned prefix between readTS and commit will not conflict — the OCC validator only inspects keys appearing in ReadKeys, and there is no ReadRanges mechanism. This is a pre-existing race that adapter/s3.go:deleteBucket (SigV4 path) inherits as well — closing the gap requires either bumping BucketGenerationKey on every PutObject so it serves as an OCC token here, or extending OperationGroup with ReadRanges and teaching the FSM to validate range emptiness atomically with commit. Both are larger schema changes outside this PR's scope. Recorded the limitation as a code comment on AdminDeleteBucket so the next reader of the function knows the contract; tracking the fix as a follow-up rather than expanding the PR's surface.

go test -race ./internal/admin/ — passes. golangci-lint run ./internal/admin/... ./adapter/... — 0 issues.
@claude review

Addressed both Codex P2 + coderabbitai findings on c8165ae:
Latest commit: 71a8514.
Claude finished @bootjp's task in 1m 47s — Code Review: admin S3 bucket write endpoints (P2 slice 2a) — final pass
Both items from the previous round are correctly addressed in 71a8514.

- Finding 1 — JWT gate fix: ✅ Correct
- Finding 2 — JWT gate tests: ✅ Correct
- Finding 3 — TOCTOU documentation: ✅ Thorough
- Finding 4 — Stale design-doc reference in the TOCTOU comment (minor) ℹ️
One of these would close the gap:
This is the only remaining nit. The security fix and its test coverage are complete and correct — the PR is ready to merge once the doc reference is corrected.
Slice 2a of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md
Section 4.1): POST /admin/api/v1/s3/buckets, PUT
/admin/api/v1/s3/buckets/{name}/acl, and DELETE
/admin/api/v1/s3/buckets/{name} become reachable so the SPA's S3
write modals stop receiving 405. Slice 2b will plumb AdminForward
so a follower can hand these writes off to the leader transparently.
Adapter (`adapter/s3_admin.go`):
- `AdminCreateBucket` performs the same atomic bucket-meta + ACL +
generation-key txn the SigV4 createBucket path does, with three
in-method guards: principal must be AdminRoleFull, the local node
must be the verified S3 leader, and bucket-name + ACL must pass
the existing validators. The owner field is now populated with the
caller's access key so the audit-log breadcrumb survives storage.
- `AdminPutBucketAcl` mutates only `meta.Acl` in place, preserving
generation so existing object references remain valid.
- `AdminDeleteBucket` rejects non-empty buckets with
`ErrAdminBucketNotEmpty` (mirrors the SigV4 path's
`BucketNotEmpty`); the dashboard cannot force a recursive delete
by design.
- New sentinels: `ErrAdminBucketAlreadyExists`,
`ErrAdminBucketNotFound`, `ErrAdminBucketNotEmpty`,
`ErrAdminInvalidBucketName`, `ErrAdminInvalidACL`. Reuses the
shared `ErrAdminForbidden` and `ErrAdminNotLeader` from the
Dynamo side.
Admin package:
- `BucketsSource` interface gains `AdminCreateBucket` /
`AdminPutBucketAcl` / `AdminDeleteBucket`. New types
`CreateBucketRequest` and `PutBucketACLRequest` carry the
documented JSON shapes.
- New sentinels `ErrBucketsNotEmpty`, `ErrBucketsAlreadyExists`
(the former is new for slice 2a; the latter was reserved earlier
and is now wired).
- `S3Handler` now routes POST/PUT/DELETE through dedicated
per-method handlers with `principalForWrite` re-validating the
role on every request against the live `MapRoleStore`. The route
switch was refactored into `serveCollection` + `servePerBucket`
to keep the cyclomatic budget under control as the surface grew.
- `handleCreate` / `handlePutACL` / `handleDelete` log a structured
`admin_audit` line on success matching the Dynamo handler's shape.
- `decodeAdminS3JSONBody` is the strict shared decoder
(`DisallowUnknownFields`, NUL-byte rejection, trailing-token
rejection, 64 KiB cap) used by both POST and PUT.
- `writeBucketsError` translates the source-side sentinels into
the design's HTTP statuses (403 / 503+Retry-After:1 / 404 / 409 /
400 invalid_request via `*ValidationError`).
Bridge (`main_admin.go`):
- `bucketsBridge` gains write methods that call the adapter's
`Admin*` methods and run their errors through
`translateAdminBucketsError` — the same pattern as
`translateAdminTablesError`, with the leader-churn fallback
routing kv-internal sentinels to `admin.ErrBucketsNotLeader` so
the SPA's retry contract stays intact.
Tests:
- 19 admin-package tests covering: create happy path / read-only
rejected / 409 already-exists / 503 not-leader / 400 invalid JSON
shapes (5 sub-cases) / NUL-byte / 413 oversize / put_acl happy +
read-only + missing field + 404 + non-PUT method-not-allowed /
delete happy + read-only + 409 not-empty + 404 + 503 not-leader /
cross-method missing-principal 401.
- 10 adapter-level tests against the in-memory MVCC store covering
the same surfaces from the storage side: happy path round-trips
through describe, ACL default to private, role / leader / ACL /
bucket-name validation rejections, AlreadyExists / NotFound /
NotEmpty paths.
- Existing `TestS3Handler_DescribeBucket_SubpathReturns404` was
superseded by two more precise tests:
`TestS3Handler_PerBucket_UnknownSubpathReturns404` (any non-/acl
sub-path 404s) and `TestS3Handler_PerBucket_AclSubpathRejectsGet`
(the now-valid /acl path returns 405 on GET).
…axBytes nil doc)

Three actionable findings from claude review on 91a48ef:

1) **Medium: stale route-table comment in `buildAPIMux`** — `internal/admin/server.go`: the comment block listing the routes only mentioned the two GET S3 endpoints. Added the three new write endpoints (POST /buckets, DELETE /buckets/{name}, PUT /buckets/{name}/acl) so a future reader does not have to reverse-engineer the routing from `servePerBucket`.

2) **Medium: missing handler-level `*ValidationError` test** — `s3_handler_test.go`: `writeBucketsError`'s `errors.As(err, &verr)` arm was exercised end-to-end via the bridge translation but had no direct handler-level coverage. Added `TestS3Handler_WriteEndpoints_ValidationErrorReturns400` — three sub-cases (create / put_acl / delete) injecting a `*ValidationError` directly into the stub source and asserting 400 + `invalid_request` + the message. Locks the arm down so a future refactor that drops the `errors.As` branch fails CI immediately.

3) **Minor: `MaxBytesReader(nil, ...)` undocumented** — `s3_handler.go`: added an explanatory comment at the call site. Passing nil is safe because `MaxBytesReader`'s only writer use is the connection-close optimisation (a comma-ok type assertion that no-ops on nil); the cap itself is enforced regardless. We pass nil because the helper takes only `io.Reader` so it remains usable from non-HTTP entry points (the upcoming AdminForward decoder reuses the same shape). The 413 path still works via `errAdminS3BodyTooLarge` matching on `*http.MaxBytesError`.

Findings #3 (failed-write audit log) and #5 (DisallowUnknownFields error message) were noted as pre-existing patterns shared with the Dynamo side; tracking them separately rather than diverging on this slice.
+ Retry-After, transport-error → 503 + no leakage, and a 3-axis
gate sweep proving the forwarder is NOT invoked on
AlreadyExists / Forbidden / generic source errors.
Closes design 3.3.2 acceptance criteria 2 (transparent forwarding)
+ 6 (forwarded_from in audit log) for S3 admin writes; criterion 3
(election-period 503 + retry) is also live for S3 because the
existing tryForward helpers reuse the same fallback paths.
Three findings from claude + Gemini review on 70213e0:

1. **Issue 1 — `logUnexpectedSourceError` slog key was "table"**
   When called for bucket operations the field key was `"table"` but the value was a bucket name. Log queries on `table=` would find spurious bucket-error entries; queries on `bucket=` would miss the audit lines entirely. Renamed the parameter and the slog key to `resource` so the same forensic query works for both resource families.
2. **Gemini security-high + Claude Issue 2 — validation divergence**
   `handleCreateBucket` only checked `strings.TrimSpace(name) == ""` while the HTTP path's `validateCreateBucketRequest` also rejects whitespace-padded names like `" bucket "`. The forward path would have accepted them, then hit the adapter's `validateS3BucketName` with a less actionable error message — different SPA behaviour depending on whether the request was leader-direct or follower-forwarded. Fix: call `validateCreateBucketRequest(body)` in `handleCreateBucket` exactly like `decodeCreateTableRequest` is shared between the table-side handlers.
3. **Issue 3 — only CREATE_BUCKET tested for nil-BucketsSource → 501**
   `DELETE_BUCKET` and `PUT_BUCKET_ACL` had identical `if s.buckets == nil` guards but no coverage. Replaced `TestForwardServer_CreateBucket_NoBucketsSourceReturns501` with a table-driven `TestForwardServer_BucketOps_NoBucketsSourceReturns501` sweeping all three operations. A future op added without the nil guard fails CI immediately.

Plus a new `TestForwardServer_CreateBucket_RejectsWhitespacePaddedName` that pins the validation-parity fix from #2.

Rebased onto the latest `feat/admin-s3-writes` (which now carries the slice 2a review fixes) so the stack stays clean.
Codex on PR #673 caught that the `readyForRegistration()` predicate gated registration on `tables != nil`, which meant a cluster started with `--dynamoAddr` empty but S3 enabled never registered the AdminForward gRPC service at all. Followers attempting to forward S3 admin writes (CREATE_BUCKET / DELETE_BUCKET / PUT_BUCKET_ACL) hit gRPC `Unimplemented` and surfaced as 503 to the SPA, instead of transparently reaching the leader. The S3 write path was advertised as "follower-forwarded" by P2 slice 2b but actually only worked when Dynamo was also configured.

The fix is symmetric: relax the predicate to require RoleStore plus **at least one** source (TablesSource or BucketsSource) and have the dispatcher reject ops for the missing surface with the same 501 it already returns when the inverse source is nil.

Per the CLAUDE.md "test the bug first" convention, the regression test lands before the fix: `TestForwardServer_DynamoOps_NoTablesSourceReturns501` — symmetric with the existing `TestForwardServer_BucketOps_NoBucketsSourceReturns501`. With a nil source, CREATE_TABLE used to slip past the source nil-deref because validation rejected the payload first; DELETE_TABLE panicked on `s.source.AdminDeleteTable`. Both now return 501. Plus three `readyForRegistration` assertions:

- buckets-only deployment registers (the regression case)
- missing-roles + buckets-only does NOT register
- tables-only / both-sources continue to register

While in there, the per-op nil-source guards I added tipped the linter's dupl detector. Refactored the dispatch path to keep forward_server tidy:

1. **`checkOpAvailability`**: pulls the per-op nil-source switch out of `dispatchForward` so the latter stays under the cyclomatic ceiling. One place to update when a future op joins the enum, instead of five.
2. **`decodeNamedPayload`**: the `{"name": "..."}` JSON shape is shared between Dynamo handleDelete and S3 handleDeleteBucket (NUL guard, DisallowUnknownFields, `dec.More()`, empty-name reject, slash reject — every guard had a paired comment citing the original review that introduced it). Centralising the helper keeps that audit trail in one place; the opLabel parameter ("delete" / "delete-bucket") preserves the path-specific wording in error responses.
3. **`auditDeleteSuccess`**: handleDelete and handleDeleteBucket each emitted a 7-line `admin_audit` block that differed only in the operation label and the `table` vs `bucket` field name. Folded into one helper so a future delete-shaped op automatically picks up the same field set.
4. **`notImplementedForwardResponse`**: deduplicates the 501 construction across the five Dynamo + S3 op cases.

No behaviour change to any of the existing successful paths; all forward_server tests pass unchanged. Tests added: the S3-only registration assertion + the new symmetric DynamoOps_NoTablesSourceReturns501 sweep. `go test -race ./internal/admin/` passes; `golangci-lint run ./internal/admin/...` reports 0 issues.
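The decode contract itemised for `decodeNamedPayload` can be sketched as a standalone function — an approximation under the guards listed above, not the real helper; error wording and the function name are illustrative:

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// decodeNamed sketches the shared {"name": "..."} decode contract:
// size cap, NUL-byte rejection, unknown-field rejection, trailing-token
// rejection, empty-name rejection, slash rejection.
func decodeNamed(payload []byte) (string, error) {
	const limit = 64 << 10 // 64 KiB cap
	if len(payload) > limit {
		return "", errors.New("payload too large")
	}
	if bytes.IndexByte(payload, 0) >= 0 {
		return "", errors.New("NUL byte in payload")
	}
	var body struct {
		Name string `json:"name"`
	}
	dec := json.NewDecoder(bytes.NewReader(payload))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&body); err != nil {
		return "", err
	}
	if dec.More() { // trailing-token rejection
		return "", errors.New("trailing data after JSON body")
	}
	if strings.TrimSpace(body.Name) == "" {
		return "", errors.New("name must not be empty")
	}
	if strings.Contains(body.Name, "/") {
		return "", errors.New("name must not contain '/'")
	}
	return body.Name, nil
}

func main() {
	name, err := decodeNamed([]byte(`{"name":"photos"}`))
	fmt.Println(name, err)
}
```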
Codex P2 + coderabbitai Minor on PR #669 caught that the S3 admin handler diverged from the Dynamo path in production wiring. The previous shape was:

```go
if h.roles == nil {
	/* JWT check */
	...
	return principal
}
/* h.roles != nil — production: skip JWT, only live role */
role, _ := h.roles.LookupRole(...)
if !role.AllowsWrite() {
	// 403
}
```

So a session whose JWT was minted as `read-only` could **escalate** to write capability the moment the live role index promoted that access key to `full` — exactly the kind of between-login-and-revocation window the JWT was supposed to freeze. `DynamoHandler.principalForWrite` (dynamo_handler.go:413) fires the JWT gate unconditionally and then layers the live role check on top, which is the safer ordering: the JWT can only narrow capability, never widen it. Aligned the S3 handler with that contract. The fix also picks up a second case the Dynamo path already covers: a JWT minted as `full` whose live role has since been demoted to read-only or removed entirely is now rejected with 403 instead of silently sliding through.

Per CLAUDE.md "test the bug first" — `newS3HandlerForTest` never called `WithRoleStore`, so every existing test ran the `h.roles == nil` branch and the production path was unproven. Added two new tests:

- `TestS3Handler_PrincipalForWrite_JWTGateFiresWithRoleStore` — table sweep over all three S3 admin write endpoints (create, put_acl, delete) with WithRoleStore wired and JWT=read-only / live=full. All three must 403.
- `TestS3Handler_PrincipalForWrite_LiveRoleNarrowsJWT` — JWT=full + live=read-only or removed-from-index. The full live-role-narrowing branch was unproven before; this pins both demotion and revocation cases.

Also documented the AdminDeleteBucket TOCTOU that coderabbitai flagged as 🔴 / 🟠 in the same review: AdminDeleteBucket scans ObjectManifestPrefixForBucket at readTS (limit=1, an "is bucket empty?" probe) and dispatches the BucketMetaKey delete with that point key only in ReadKeys. A concurrent PutObject inserting a manifest key in the scanned prefix between readTS and commit will not conflict — the OCC validator only inspects keys appearing in ReadKeys, and there is no ReadRanges mechanism. This is a pre-existing race that adapter/s3.go:deleteBucket (the SigV4 path) inherits as well — closing the gap requires either bumping BucketGenerationKey on every PutObject so it serves as an OCC token here, or extending OperationGroup with ReadRanges and teaching the FSM to validate range emptiness atomically with commit. Both are larger schema changes outside this PR's scope. Recorded the limitation as a code comment on AdminDeleteBucket so the next reader of the function knows the contract; tracking the fix as a follow-up rather than expanding the PR's surface.

`go test -race ./internal/admin/` passes; `golangci-lint run ./internal/admin/... ./adapter/...` reports 0 issues.
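The narrowing contract described above — JWT gate first, live role layered on top — reduces to a small predicate. The role type and function here are illustrative stand-ins, not the handler's real types:

```go
package main

import "fmt"

// Role is a minimal stand-in for the handler's role type.
type Role int

const (
	RoleNone     Role = iota // no live role (revoked / not in index)
	RoleReadOnly
	RoleFull
)

// allowWrite sketches the corrected ordering: the JWT gate fires
// unconditionally, then the live role is layered on top. The JWT can
// only narrow capability, never widen it — a read-only JWT never gains
// write even if the live index promotes the key, and a full JWT whose
// live role was demoted or revoked is rejected too.
func allowWrite(jwtRole, liveRole Role) bool {
	if jwtRole != RoleFull {
		return false // JWT gate: capability frozen at login
	}
	return liveRole == RoleFull // live role can only narrow further
}

func main() {
	fmt.Println(allowWrite(RoleReadOnly, RoleFull)) // escalation blocked
	fmt.Println(allowWrite(RoleFull, RoleReadOnly)) // demotion honoured
	fmt.Println(allowWrite(RoleFull, RoleFull))
}
```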
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md Section 8: a single self-contained operator-facing reference for the admin HTTP listener. Covers:

- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo + s3, including the slice 2 write paths and the AdminForward forwarding contract)
- forwarded_from audit log shape and why it carries the follower's node ID
- Troubleshooting guide for the common failure modes operators hit during initial bring-up (missing credentials, TLS hard-error, 401 ambiguity, 503 leader_unavailable, bucket_not_empty, blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as deliverables; those are already implemented (see config.go's validateTLS / validateAccessKeyRoles, the role gates in DynamoHandler.principalForWrite + S3Handler.principalForWrite, and the CSRFDoubleSubmit middleware). This doc stitches them into a single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-surface table can describe S3 write endpoints as shipped. Once both land in main, this rebases cleanly.
Three substantive findings + one nit from claude review on #674:

1. **HS256 key size (high)**: the doc said "≥ 32 raw bytes / 44 base64 chars" but the validator in internal/admin/config.go enforces exactly 64 raw bytes. A 32-byte key fails startup with a confusing error. Updated to "exactly 64 raw bytes — 88 base64 chars (standard padding) or 86 (RawURLEncoding)". Operators copying the doc into key-generation tooling now produce a key that actually validates.
2. **Audit log sample (moderate)**: showed method=POST / path=... / status=201 / duration=8.2ms — none of those fields are emitted. The actual slog entries use operation= plus a resource key (bucket= or table=). Replaced the example with two real shapes (leader-direct and forwarded), so an operator building a log parser against the documented contract finds matches.
3. **Cross-reference link (moderate)**: line 249 used the _partial_ filename for the design doc, but #675 (the rename PR) hasn't landed yet. Point to _proposed_ (matches main today) with an inline note that the link will follow once #675 lands. The intro paragraph at line 5 already used _proposed_, so this removes the inconsistency claude flagged.
4. **"data-plane" label nit**: /admin/api/v1/cluster is an admin-plane endpoint, not data-plane. "data-plane" implied the DynamoDB / S3 / Redis ports. Replaced with "admin".

The cosmetic Last-updated / Status verbosity items on #675 are tracked separately.
Two findings from the latest claude review on #674:

1. **Audit log shapes (Bug A)**: the previous example claimed `admin_audit ... operation=delete_table table=orders` was a leader-direct sample, but `DynamoHandler.handleDelete` (dynamo_handler.go:379-396) does not emit any handler-level admin_audit — that shape only appears via `ForwardServer` on the forwarded path. It also omitted the `Audit` middleware line (middleware.go:206-232), which wraps every non-GET admin request with method= / path= / status= / remote= / duration=. An operator building a parser against the doc would have missed half of the actual lines. Rewrote the section as three explicit shapes — Audit middleware (always), S3Handler op line (only emits from s3_handler.go:299/333/355), ForwardServer op line (covers both Dynamo and S3 forwarded paths). Removed the misleading `delete_table` leader-direct sample and replaced the Dynamo examples with forwarded-only ones, which is where they actually originate. Added a one-sentence note that a single request typically produces two lines (op-specific + middleware) so log parsers know to treat the key set as a union.
2. **Login 401 vs 403 (Codex P2)**: the troubleshooting entry said "both produce 401" but auth_handler.go:336-345 returns 403 forbidden when the credentials match but the access key is not in `-adminFullAccessKeys` / `-adminReadOnlyAccessKeys`. Split into two entries: 401 invalid_credentials (bad access_key or secret) and 403 forbidden (valid SigV4 creds, but no admin role assignment), with the remediation specific to each case.

No functional changes. Doc-only.
Two Codex P2 findings on the latest review of #674:

1. **Cookie Secure attribute (line 55 + the TLS topology section)**: I wrote that `-adminAllowPlaintextNonLoopback` "enables the dashboard to mint cookies without the `Secure` attribute". That conflates two flags. In the actual wiring (auth_handler.go:114 + auth_handler.go:458-468 + main_admin.go:597), `secureCookie` is `!opts.InsecureCookie`, which comes from `-adminAllowInsecureDevCookie`, **not** `-adminAllowPlaintextNonLoopback`. The latter only bypasses the startup TLS guard (config.go:112). Operationally this is the difference between "fails closed" and "broken end-to-end": an operator who reads the old doc and sets only `-adminAllowPlaintextNonLoopback` gets a listener that accepts plaintext, but the dashboard still mints `Secure=true` cookies that the browser refuses to send back over the plaintext channel — login appears to succeed and every subsequent request 401s, exactly the bug Codex was guarding against. Rewrote the line-55 entry to scope `-adminAllowPlaintextNonLoopback` to the listener TLS guard only, and added the cross-reference to `-adminAllowInsecureDevCookie` for the cookie attribute. Updated the loopback section to drop the misleading "Secure regardless" claim and explain when a loopback rig actually needs the dev-cookie flag. Added a paragraph at the bottom of "Discouraged: plaintext non-loopback" calling out that a working plaintext rig needs both flags, and that the two are deliberately separate so misconfiguration fails closed on either axis instead of silently downgrading both at once.
2. **SPA Retry-After claim (line 172)**: I wrote that "the SPA's API client honours Retry-After and re-issues the request once." Codex caught that web/admin/src/api/client.ts:75 does a single `fetch` call with no retry path, and useApi.ts has no 503 retry logic either. The doc was setting an operator expectation the code does not meet — under leader election a user would see the 503 surface directly and conclude "intermittent failure" instead of re-issuing the action. Rewrote the section to describe the current behaviour accurately: the server emits `Retry-After: 1`, but the SPA surfaces the 503 directly and the operator must re-issue. Kept the header documented because a future client (or an external operator script driving the JSON API) can still honour it — it is part of the server's contract.

No code changes; doc-only. The cookie wiring claim was verified against auth_handler.go and main_admin.go; the SPA claim was verified against web/admin/src/api/client.ts and useApi.ts.
Codex P2 on the latest review of #674: the audit-log section said the Audit middleware "is emitted for every non-GET/HEAD/OPTIONS request that reaches the admin mux on this node, regardless of which handler served it." That overstates coverage:

- `/auth/login` is wired through `publicAuth`, not `protect` — the middleware can't identify the actor without a pre-existing session, so it is intentionally skipped. AuthService emits `admin_audit action=login` directly (auth_handler.go:432).
- `/auth/logout` is wired through `protectNoAudit` — the middleware is omitted to avoid emitting two audit lines per logout (a generic one plus the action-specific one); the specific line is the one operators want (auth_handler.go:442).
- The `protect` chain is `BodyLimit → SessionAuth → Audit → CSRFDoubleSubmit → handler` (server.go:247-255). Audit only fires AFTER SessionAuth accepts the session, so an unauthenticated write attempt (missing / expired / invalid cookie) gets 401'd before reaching Audit and produces no middleware line. CSRF-rejected requests still produce one, by design, because the actor is known by then.

An operator building log-monitoring on top of the previous text could conclude that every write attempt is guaranteed to appear in `admin_audit`, then miss CSRF-bypassing probes against an unauthenticated endpoint. Rewrote the middleware-shape section to:

- Scope the rule to the protected mux chain.
- Explain the ordering choice (Audit between SessionAuth and CSRFDoubleSubmit) so the CSRF-rejection-still-audits guarantee is preserved in the doc.
- Enumerate the two auth endpoints that bypass the middleware and what they emit instead.
- Adjust the intro paragraph: protected-chain mutations produce two audit lines, auth endpoints produce one (was: "every state-changing request typically produces two").

No behaviour change; doc-only. Verified against internal/admin/server.go:236-296 and internal/admin/auth_handler.go:432-449.
Codex P2 on the latest review of #674: my line 260 said both login and logout `admin_audit` entries include `claimed_actor`, but `auditLogout` (auth_handler.go:442-449) only emits `action`, `actor`, `remote`, and `status`. `claimed_actor` is login-only because the field exists to distinguish the typed access key from the authenticated one — logout has no claim to verify (the actor comes from the already-validated session cookie). A log pipeline that validated field presence against the previous text would misclassify every successful logout as malformed, exactly the operator-facing bug Codex flagged.

Replaced the one-line summary with two concrete sample lines — one for login (with `claimed_actor`), one for logout (without) — and a closing sentence telling parsers that `claimed_actor` is present-only-on-login. Same shape verified against auth_handler.go:432-449. No code changes; doc-only.
07bdc20 to 885eb50
Actionable comments posted: 1
🧹 Nitpick comments (5)
adapter/s3_admin_test.go (1)
**154-310**: Consider consolidating the negative-path tests into table-driven cases. The Create/PutACL/Delete suites each repeat the same scaffold (`NewMVCCStore` + `NewS3Server` + one call + `ErrorIs`) for read-only / invalid-ACL / invalid-name / missing-bucket cases. A single table per operation (or one shared table keyed on op + principal + want-err) would shrink ~150 lines to ~30 and make adding a new sentinel a one-line append.

As per coding guidelines: "Co-locate test files in the same package and use table-driven test patterns."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/s3_admin_test.go` around lines 154 - 310, consolidate the repeated negative-path tests (functions named TestS3Server_AdminCreateBucket_*, TestS3Server_AdminPutBucketAcl_*, TestS3Server_AdminDeleteBucket_*) into table-driven subtests: define a table with entries containing a name, principal generator (e.g., fullAdminBucketsPrincipal() or readOnlyAdminBucketsPrincipal()), bucket name, ACL (if applicable), any pre-test setup closure (for example to create a bucket when testing existing/duplicate cases), and expected error (ErrAdminForbidden, ErrAdminInvalidACL, ErrAdminInvalidBucketName, ErrAdminBucketNotFound, ErrAdminBucketAlreadyExists, ErrAdminBucketNotEmpty); then loop over the table with t.Run and for each case create a fresh store via store.NewMVCCStore() and server via NewS3Server(...), invoke the appropriate method (AdminCreateBucket, AdminPutBucketAcl, or AdminDeleteBucket) and assert require.ErrorIs/require.NoError as specified; keep helper functions fullAdminBucketsPrincipal(), readOnlyAdminBucketsPrincipal(), and any object-creation setup (the HTTP PUT via server.handle) to reuse within case.setup so behavior remains identical.

internal/admin/buckets_source.go (1)
**44**: Rename `AdminPutBucketAcl` to `AdminPutBucketACL` for acronym consistency with the DTO field and Go conventions.

The method name uses lowercase `Acl` while the DTO field `PutBucketACLRequest.ACL` and all field references use uppercase. Go style conventions (and `stylecheck`/`revive`) keep acronyms uniformly capitalized. The refactor is mechanical: rename the interface method in `buckets_source.go:44` and its implementations in `adapter/s3_admin.go`, `main_admin.go`, and tests. Also apply the same fix to the related `ForwardPutBucketAcl` method in the forward client.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/admin/buckets_source.go` at line 44, rename the interface method AdminPutBucketAcl to AdminPutBucketACL and update all implementations and callers (including the forward client) to use the new name; also rename ForwardPutBucketAcl to ForwardPutBucketACL to match the DTO field PutBucketACLRequest.ACL and Go acronym conventions. Locate the method declaration in the buckets source interface (AdminPutBucketAcl), update its signature to AdminPutBucketACL(ctx context.Context, principal AuthPrincipal, name, acl string) error, then change the corresponding method definitions in the S3 admin adapter (function named AdminPutBucketAcl), the main admin implementation (AdminPutBucketAcl), the forward client wrapper (ForwardPutBucketAcl), and any tests or references to call the new AdminPutBucketACL and ForwardPutBucketACL names. After renaming, run the build/tests and fix any remaining references or imports that fail due to the renamed symbols.

internal/admin/forward_server.go (2)
**147-167**: Optional: simplify `checkOpAvailability`'s tri-valued return.

The `(*pb.AdminForwardResponse, error, bool)` shape returns `error` in the middle position, and the `error` is always `nil` because `rejectForward` (and therefore `notImplementedForwardResponse`) never returns a non-nil error. The signature could be tightened to `(*pb.AdminForwardResponse, bool)` (or just return `*pb.AdminForwardResponse` and let `nil` mean "continue"), removing the dead `err` propagation in `dispatchForward`. Minor readability nit; defer if you'd rather keep the signature aligned with the gRPC method shape it's eventually reused alongside.

♻️ Sketch

```diff
-func (s *ForwardServer) checkOpAvailability(op pb.AdminOperation) (*pb.AdminForwardResponse, error, bool) {
+func (s *ForwardServer) checkOpAvailability(op pb.AdminOperation) *pb.AdminForwardResponse {
 	switch op {
 	case pb.AdminOperation_ADMIN_OP_CREATE_TABLE, pb.AdminOperation_ADMIN_OP_DELETE_TABLE:
 		if s.source == nil {
-			resp, err := notImplementedForwardResponse("DynamoDB")
-			return resp, err, false
+			resp, _ := notImplementedForwardResponse("DynamoDB")
+			return resp
 		}
 	...
 	}
-	return nil, nil, true
+	return nil
 }
```

And in `dispatchForward`:

```diff
-	if resp, err, ok := s.checkOpAvailability(op); !ok {
-		return resp, err
+	if resp := s.checkOpAvailability(op); resp != nil {
+		return resp, nil
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/admin/forward_server.go` around lines 147 - 167, The checkOpAvailability function currently returns (*pb.AdminForwardResponse, error, bool) but the error slot is always nil because notImplementedForwardResponse/rejectForward never return a non-nil error; simplify the signature to either (*pb.AdminForwardResponse, bool) or just *pb.AdminForwardResponse (where nil means "ok") and update all call sites (notably dispatchForward) to stop expecting a middle error value; change the function declaration for checkOpAvailability and adjust dispatchForward to handle the new two-valued (or single-valued) return by removing the unused err variable and using the returned AdminForwardResponse (from notImplementedForwardResponse) to short-circuit when present.
**251-295**: Consider extracting a shared strict-decode helper for the bucket payloads.

`handleCreateBucket` (251–295) and `handlePutBucketAcl` (336–385) each repeat the same 4-step preamble — size cap, NUL-byte rejection, `DisallowUnknownFields` decode, `dec.More()` trailing-token rejection — that `decodeNamedPayload` already centralises for the `{"name":"…"}` shape. Generifying `decodeNamedPayload` over the body type (e.g. a `decodeForwardedJSON[T any]`) would let both handlers drop ~15 lines apiece and guarantee future ops cannot diverge on the size/NUL/strict-decode contract. Optional given the bodies are typed differently and the duplication is fairly mechanical.

Also applies to: 336-385
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/admin/forward_server.go` around lines 251 - 295, both handleCreateBucket and handlePutBucketAcl duplicate the same 4-step strict JSON preamble (size limit, NUL-byte check, DisallowUnknownFields decode, trailing-data check); extract that into a generic helper (e.g. decodeForwardedJSON[T any](payload []byte, limit int) (T, error)) that enforces adminForwardPayloadLimit, rejects NUL bytes, uses json.Decoder with DisallowUnknownFields and verifies !dec.More(), returning either the decoded typed body or a canonical error you can map to rejectForward; then replace the inline decoding in handleCreateBucket and handlePutBucketAcl with calls to decodeForwardedJSON and keep the subsequent validateCreateBucketRequest / AdminCreateBucket call and logging unchanged.

adapter/s3_admin.go (1)
**414-420**: Add a comment explaining why `BucketGenerationKey` is not deleted.

The dispatch deletes only `BucketMetaKey(name)` while preserving `BucketGenerationKey(name)`. This is necessary to prevent manifest key namespace collisions across bucket recreations: the preserved generation counter ensures a subsequent `AdminCreateBucket` on the same name picks up the next generation via `nextBucketGenerationAt`, so any manifest key orphaned by the TOCTOU race documented above (lines 362–384) cannot collide with the recreated bucket's generation namespace.

♻️ Suggested comment

```diff
+	// Preserve BucketGenerationKey so a recreated bucket with
+	// the same name gets a fresh generation, preventing manifest
+	// namespace collisions with keys orphaned by the TOCTOU race
+	// documented above.
 	_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
 		IsTxn:   true,
 		StartTS: startTS,
 		Elems: []*kv.Elem[kv.OP]{
 			{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
 		},
 	})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@adapter/s3_admin.go` around lines 414 - 420, Add a clarifying comment in the deletion operation explaining why we only remove BucketMetaKey(name) and intentionally preserve BucketGenerationKey(name): note that keeping the generation counter prevents manifest key namespace collisions when a bucket is deleted and recreated (AdminCreateBucket uses nextBucketGenerationAt to pick a new generation), so orphaned manifest keys from the TOCTOU window cannot collide with a recreated bucket’s generation namespace; place this comment adjacent to the s.coordinator.Dispatch call that deletes BucketMetaKey and reference BucketGenerationKey, AdminCreateBucket, and nextBucketGenerationAt to make the rationale discoverable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/admin.md`:
- Around line 228-230: The audit-log example line (the `admin_audit` example)
shows the wrong request path; update the example so the path includes the S3
segment by changing `path=/admin/api/v1/buckets` to
`path=/admin/api/v1/s3/buckets` (or alternatively replace it with a Dynamo
example like `/admin/api/v1/dynamo/tables` if you intend a Dynamo example) so
the example matches the API surface declared earlier.
/ AdminCreateBucket call and logging unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 63834f9c-0607-4c5e-9e4f-3982aeca95cc
⛔ Files ignored due to path filters (2)
- proto/admin_forward.pb.go is excluded by !**/*.pb.go
- proto/admin_forward_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (17)
- adapter/s3_admin.go
- adapter/s3_admin_test.go
- docs/admin.md
- internal/admin/buckets_source.go
- internal/admin/forward_client.go
- internal/admin/forward_client_test.go
- internal/admin/forward_integration_test.go
- internal/admin/forward_server.go
- internal/admin/forward_server_test.go
- internal/admin/s3_handler.go
- internal/admin/s3_handler_test.go
- internal/admin/server.go
- main.go
- main_admin.go
- main_admin_forward.go
- main_admin_forward_test.go
- proto/admin_forward.proto
✅ Files skipped from review due to trivial changes (1)
- proto/admin_forward.proto
🚧 Files skipped from review as they are similar to previous changes (2)
- main_admin.go
- internal/admin/server.go
Operator-facing follow-up to docs/admin.md. The original operator
guide covered configuration semantics; this commit adds the
runbook side: how to actually deploy with the dashboard enabled,
how operators log in, and how to roll out role / key / TLS
changes without breaking sessions.
Three pieces, separable but mutually reinforcing:
1. **scripts/rolling-update.sh — admin-flag plumbing**
The script previously had no awareness of the admin dashboard:
none of the --admin* flags were forwarded to docker run, so an
operator wanting to enable the dashboard had to fork the
script or hand-edit the deployment manifest outside any
reviewed code path. Added a build_admin_flags helper whose
output the docker run command composes alongside the existing
s3_flags pattern:
- ADMIN_ENABLED=false → emit nothing, skip every validation,
no behaviour change for existing deployments.
- ADMIN_ENABLED=true → require ADMIN_SESSION_SIGNING_KEY_FILE
plus at least one role list, validate every referenced path
exists / is readable on the remote host BEFORE
stop_container fires, bind-mount each file read-only at the
same in-container path, and pass the corresponding --admin*
flags to the entrypoint.
The pre-flight validation is deliberate: a missing signing-key
file would otherwise hard-error the daemon at startup AFTER the
previous container was already stopped, leaving the node
unhealthy and the rollout's per-node health check exiting the
script — but at most one node has by then been touched, so
quorum survives. Failing earlier in build_admin_flags converts
that to a no-op rollout with a script-level error pointing at
the variable name.
Also enforced the pair-or-neither contract for
ADMIN_TLS_CERT_FILE / _KEY_FILE here, mirroring the daemon's
own startup validation but at the script layer so the operator
sees the variable name rather than a generic "exactly one of
cert/key" daemon message.
2. **scripts/rolling-update.env.example — ADMIN_* block**
Added the same nine variables as commented-out entries, with a
leading paragraph pointing at docs/admin.md and the new
docs/admin_deployment.md. ADMIN_ENABLED defaults to false so a
user who copies the file verbatim and only fills in the data-
plane variables gets the same behaviour as before.
3. **docs/admin_deployment.md — operator runbook**
Six sections covering pre-deployment prep (HS256 key
generation, credentials file, role lists, TLS material), the
first-time rolling-update.sh invocation, the operator login
flow + the four primary failure modes (401 / 403-login /
403-write / 503), day-2 procedures (key add/remove,
HS256 rotation with the previous-key window, TLS cert
rotation, listener bind change, disabling the dashboard), and
a failure-mode runbook for the messages an operator is most
likely to hit (the "is not loopback but TLS is not configured"
hard-error, the cookie-Secure mismatch that produces
login-succeeds-then-401-on-every-request, the rollout-window
role-flag inconsistency).
The HS256 rotation section is the one that does not exist in
admin.md: it walks through the two-key verification window
using ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE so a rotation
does not invalidate existing sessions. Without this, an
operator reading admin.md alone could not safely rotate keys.
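The two-key verification window can be illustrated with a small HMAC-SHA256 sketch (a simplification of the HS256 session-signing flow; the function names and keys here are hypothetical, not the daemon's actual API):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// sign produces an HS256-style MAC over a session payload.
func sign(key, payload []byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write(payload)
	return m.Sum(nil)
}

// verifyWithWindow accepts a token signed by the current key or, during a
// rotation window, by the previous key — so sessions issued before the
// rotation keep working until they expire naturally.
func verifyWithWindow(current, previous, payload, mac []byte) bool {
	if hmac.Equal(mac, sign(current, payload)) {
		return true
	}
	return previous != nil && hmac.Equal(mac, sign(previous, payload))
}

func main() {
	oldKey := []byte("old-signing-key")
	newKey := []byte("new-signing-key")
	session := []byte("session:alice")
	macFromOldKey := sign(oldKey, session)

	// Mid-rotation: new key is current, old key sits in the previous slot.
	fmt.Println(verifyWithWindow(newKey, oldKey, session, macFromOldKey))
	// Window closed: previous slot cleared, old sessions are rejected.
	fmt.Println(verifyWithWindow(newKey, nil, session, macFromOldKey))
}
```

In the runbook's terms, the previous slot corresponds to ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE: populate it with the outgoing key while rolling out the new one, then clear it after the session TTL has elapsed.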
Cross-linked from docs/admin.md so a reader landing on the
per-flag reference finds the runbook in one click. The
deployment doc itself points back at admin.md for per-flag
semantics so the two stay decoupled — admin.md owns the "what
each flag means" contract, admin_deployment.md owns the "what
to do as an operator" contract.
No code changes; bash syntax-check passes.
coderabbitai on PR #669 caught that the Audit-middleware sample line on docs/admin.md:229 used `path=/admin/api/v1/buckets` but the actual API surface declared earlier in the same doc is `/admin/api/v1/s3/buckets`. An operator copy-pasting this into a log-parser regex would miss every real S3 admin write. Swapped to the correct path. The Dynamo path (`/admin/api/v1/dynamo/tables`) was the other option suggested, but the surrounding section already calls out that the Audit middleware fires for the protected mux chain across all admin resources — keeping S3 here matches the doc's working example (create_bucket / put_bucket_acl / delete_bucket samples).
@claude review

Addressed the new coderabbitai 🟡 finding on docs/admin.md. The Audit-middleware sample line used `path=/admin/api/v1/buckets`; it now uses the correct `/admin/api/v1/s3/buckets` path. The other three comments showing on cadffe4 are stale re-anchorings of the JWT-role / TOCTOU findings from 2026-04-26T12:27:57Z, all addressed in 71a8514. Latest commit: 55c93c8. Doc-only.
Claude finished @bootjp's task in 43s — Code Review: admin S3 bucket write endpoints (P2 slice 2a) — final doc-fix pass
Finding — audit-log path: ✅ Fixed
The Audit-middleware sample line now uses the correct path. Stale coderabbitai re-anchors: the three coderabbitai comments still showing are stale re-anchorings of earlier findings, all previously addressed. No new findings. This is a doc-only commit and the PR is ready to merge.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55c93c8626
    if [[ "${ADMIN_ENABLED}" != "true" ]]; then
      return 0
    fi
Forward `ADMIN_*` env vars before gating admin flags
This gate runs on the remote host, but in update_one_node the SSH env assignment still does not pass any ADMIN_* variables (I checked the env ... bash -s block), so ADMIN_ENABLED falls back to its default and this branch always returns early. In practice, setting ADMIN_ENABLED=true (and related key/cert settings) in deploy.env is ignored during rollout, and nodes come up without --adminEnabled or admin key/TLS mounts.
    # node with auth disabled.
    local admin_flags=()
    local admin_volumes=()
    build_admin_flags admin_flags admin_volumes
Validate admin files before stopping the running container
The new build_admin_flags validation can exit 1 for missing admin key/cert files, but it is invoked inside run_container; the rollout flow stops the current container first and only then calls run_container (stop_container precedes run_container in this script). A bad admin path therefore turns a config typo into avoidable downtime by taking the node offline before failing. The validation step should run in preflight before stop_container.
…669) (#678)

## Summary

Follow-up to #669 (now merged). The admin-flag plumbing I added in that PR has a real bug I caught while re-reading the script: every rollout would crash on the first remote node because the `ADMIN_*` variables were never forwarded across the SSH heredoc.

## What was broken

`build_admin_flags` lives inside the remote SSH heredoc in `update_one_node` (`bash -s <<'REMOTE'`), but the `env` block that seeds the remote shell only forwarded the existing `IMAGE` / `RAFT_PORT` / `EXTRA_ENV` / etc. variables — no `ADMIN_*`. With `set -u` active on the remote, the first access of `${ADMIN_ENABLED}` inside `build_admin_flags` crashes the rollout **regardless** of whether admin is enabled (the helper is invoked unconditionally from `run_container`).

So an operator running this script after #669 with the default `ADMIN_ENABLED=false` would have seen `ADMIN_ENABLED: unbound variable` on the first node touched, leaving at most one node restarted but otherwise the cluster intact (the per-node health check would exit non-zero before moving on).

## Fix

1. **Forward all 9 `ADMIN_*` variables through `env`**, alongside the existing forwarding pattern. Path-like values (`*_FILE`, `*_KEYS`, `ADDRESS`) get `printf %q` quoting at the bottom of the local script (matches the existing `RAFT_TO_REDIS_MAP_Q` etc. pattern). The three boolean flags (`ADMIN_ENABLED`, `ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK`, `ADMIN_ALLOW_INSECURE_DEV_COOKIE`) are forwarded unquoted for readability — but only after a local validation pass that rejects anything other than the literal `true` / `false`, so the unquoted forwarding stays metacharacter-safe.
2. **Defense-in-depth `:-` defaults inside `build_admin_flags`.** Every `ADMIN_*` reference inside the helper now reads through `${VAR:-}` once at the top into a local. A future refactor that ever drops one of the forwarded variables will produce the targeted "ADMIN_* required" error instead of an opaque `unbound variable` crash with no hint at which variable.

## Test plan

- [x] `bash -n scripts/rolling-update.sh` — passes
- [x] `ADMIN_ENABLED=invalid bash scripts/rolling-update.sh` → "must be 'true' or 'false', got 'invalid'"
- [x] `ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=yes` → same validator catches it
- [x] `ADMIN_ENABLED=true` (no signing key) → reaches the remote branch where build_admin_flags aborts on the missing key
- [ ] End-to-end rollout against a 3-node staging cluster with `ADMIN_ENABLED=true` (operator to verify before merging into a production deploy.env)
- [ ] End-to-end rollout with `ADMIN_ENABLED=false` (the previously broken default path)

## Summary by CodeRabbit

* **Chores**
  * Improved validation and error reporting for deployment configuration parameters.
  * Enhanced environment variable forwarding and path handling in deployment scripts to increase reliability.
## Summary

State-refresh of `docs/design/2026_04_24_partial_admin_dashboard.md` now that all four phases of the original P1–P4 plan have shipped.

| Phase | Was | Now |
|---|---|---|
| P1 | shipped | shipped (no change) |
| P2 | partial — read-only #658 only | **shipped** (+ #669 writes + #673 forwarding) |
| P3 | shipped | shipped (no change) |
| P4 | mostly | **shipped** (+ operator doc #674, deployment runbook #669, script wiring #669+#678) |

Also added the **AdminDeleteBucket TOCTOU** to Outstanding open items — coderabbitai flagged it during PR #669 review (pre-existing race; `s3.go:deleteBucket` inherits the same shape; recorded as a code comment on `AdminDeleteBucket` for the future fix).

The doc stays at `_partial_` rather than promoting to `_implemented_` because two outstanding items remain: AdminForward acceptance criterion 5 (rolling-upgrade flag, explicitly deferred at design time) and the new TOCTOU. Updated the closing rename trigger to cover both.

## Test plan

- [x] No design changes — this is a state refresh
- [x] Status table entries cross-checked against the merge commits on main
- [x] AdminDeleteBucket TOCTOU description matches the code comment in adapter/s3_admin.go and the operator-side workaround in docs/admin_deployment.md
Per docs/design/README.md's lifecycle convention. The original P1–P4 plan has fully shipped:

- P1 (admin skeleton + Dynamo + AdminForward) — #634 / #635 / #644 / #648
- P2 (S3 endpoints incl. write paths and AdminForward integration) — #658 / #669 / #673 / #695 (TOCTOU safety net)
- P3 (React SPA + embed) — #649 / #650
- P4 (TLS / role / CSRF / operator doc / deployment runbook / scripts/rolling-update.sh admin support) — #674 / #669 / #678

The AdminDeleteBucket TOCTOU caught during PR #669 review (the last "in-flight" item that kept the doc at _partial_) is fully resolved by the safety-net design landed in #695.

What changed:

- git mv 2026_04_24_partial_admin_dashboard.md → 2026_04_24_implemented_admin_dashboard.md (history follows the rename)
- Header Status line: "Partial" → "Implemented", explanation updated to reflect the post-fix state and the rationale for promotion.
- "Last updated" bumped to 2026-04-28 with the rename trigger.
- Section heading "Outstanding open items" → "Out-of-scope follow-ups" — the remaining three entries (criterion 5, object browser, TLS hot-reload) are not in-flight work; they are deferred-at-design or Non-goal items. The TOCTOU bullet is removed (resolved) and replaced with a one-line cross-link to the safety-net design + admin_deployment.md §4.6 contract.
- Removed the closing "rename trigger" sentence — we just did the rename.
- Status table: P2 row now lists #695 alongside #658/#669/#673 so a future reader can find the TOCTOU fix from the index.
- Cross-references updated everywhere the old filename appeared: docs/admin.md (header link + Cross-references), docs/admin_deployment.md (header link + final cross-ref), docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md (Background section pointer), internal/admin/config.go (Section 7.1 reference comment)

No code changes other than the comment-only filename refresh in config.go.
## Summary

Promote the admin dashboard design doc from `_partial_` → `_implemented_` per `docs/design/README.md`'s lifecycle convention. PR #695 landed the TOCTOU safety-net fix (the last in-flight item that kept the doc at `_partial_`), so the original P1–P4 plan is now fully shipped:

| Phase | Landed via |
|---|---|
| P1 (admin skeleton + Dynamo + AdminForward) | #634 / #635 / #644 / #648 |
| P2 (S3 endpoints + writes + AdminForward S3 + TOCTOU fix) | #658 / #669 / #673 / **#695** |
| P3 (React SPA + embed) | #649 / #650 |
| P4 (TLS / role / CSRF / operator doc / runbook / script wiring) | #674 / #669 / #678 |

The remaining three items in the doc move from "Outstanding open items" (in-flight) to **"Out-of-scope follow-ups"** (deferred-at-design or Non-goal):

- AdminForward criterion 5 — rolling-upgrade flag, deferred behind a cluster-version bump that doesn't exist yet
- S3 object browser — Non-goal per §2.2
- TLS cert hot-reload — out of scope per `docs/admin.md`

## Changes

- `git mv` partial → implemented (history follows the rename)
- Status line / Last-updated / status-table / Out-of-scope section content reflects the promotion
- Cross-references updated in all 4 referencing files: `docs/admin.md`, `docs/admin_deployment.md`, `docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md`, `internal/admin/config.go` (comment-only)

## Test plan

- [x] No code changes other than a comment-only filename refresh in `config.go`
- [x] `go build ./...` passes
- [x] `golangci-lint run ./internal/admin/...` — 0 issues
- [x] `grep -rn "2026_04_24_partial\|2026_04_24_proposed_admin"` returns nothing — no stale references
P2 slice 2a of docs/design/2026_04_24_proposed_admin_dashboard.md. Ships the S3 admin write endpoints so the SPA's S3 modals stop receiving 405. Slice 2b will plumb AdminForward so a follower can hand these writes off to the leader transparently.
Summary
- `adapter.S3Server.AdminCreateBucket` / `AdminPutBucketAcl` / `AdminDeleteBucket` — SigV4-bypass write methods with three in-method guards: the principal must be `AdminRoleFull`, the local node must be the verified S3 leader, and bucket name / ACL must pass the existing validators. `AdminCreateBucket` reuses the same atomic bucket-meta + ACL + generation-key txn the SigV4 path does — no new code path through the storage layer. `AdminDeleteBucket` rejects non-empty buckets (the dashboard cannot force recursive delete by design).
- `internal/admin` write surface — `BucketsSource` gains the three write methods plus `CreateBucketRequest` / `PutBucketACLRequest` types with the documented JSON shapes. `S3Handler.serveCollection` + `servePerBucket` route POST/PUT/DELETE through dedicated handlers, with `principalForWrite` re-validating the role on every request against the live `MapRoleStore`.
- `decodeAdminS3JSONBody` is generic over the request type, applies `DisallowUnknownFields`, rejects NUL bytes, rejects trailing tokens, and caps at 64 KiB (matches design 4.4). Used by both POST and PUT.
- `writeBucketsError` translates the source-side sentinels into the design's HTTP statuses: 403 forbidden / 503 + Retry-After: 1 leader_unavailable / 404 not_found / 409 already_exists / 409 bucket_not_empty / 400 invalid_request via `*ValidationError`.
- `bucketsBridge` gains write methods running through `translateAdminBucketsError`, mirroring `translateAdminTablesError` on the Dynamo side. Leader-churn sentinels from the kv coordinator route to `admin.ErrBucketsNotLeader` so the SPA's retry contract stays intact.

What is NOT in this PR
Test plan

- `go build ./...`
- `go vet ./...`
- `golangci-lint run` (admin + main + adapter packages: 0 issues)
- `go test ./internal/admin/ -count=1 -race` — 19 new handler tests pass
- `go test . -count=1 -race` — main package passes
- `go test -run "TestS3Server_Admin" ./adapter/ -count=1 -race` — 10 new adapter tests pass
- Full `go test ./adapter/` times out at 120s due to a pre-existing flake (verified earlier on PRs "admin: AdminForward Phase 3 — register gRPC server + wire follower-side bridge" #648 / "admin: read-only S3 bucket endpoints (P2 slice 1)" #658 against `main` — unrelated to this branch)

Acceptance criteria coverage (Section 4.1)
- `GET /admin/api/v1/s3/buckets`
- `GET /admin/api/v1/s3/buckets/{name}`
- `POST /admin/api/v1/s3/buckets`
- `PUT /admin/api/v1/s3/buckets/{name}/acl`
- `DELETE /admin/api/v1/s3/buckets/{name}`

Self-review (5 lenses)
- `AdminCreateBucket` reuses `s.coordinator.Dispatch` with the same `OperationGroup` shape as the SigV4 path — bucket meta + generation key in one txn. No new FSM / Pebble / Raft path. `AdminDeleteBucket`'s "must be empty" guard is a SnapshotAt scan + size check identical to the SigV4 path.
- Writes run through `retryS3Mutation`, which already handles transient mid-dispatch leader churn. The leader check is `isVerifiedS3Leader` — same primitive the SigV4 path uses. The role gate is re-evaluated against the live `MapRoleStore` on every request, so a key downgrade picked up between login and write is enforced immediately.
- `AdminCreateBucket` writes (BucketMetaKey, BucketGenerationKey) atomically. `AdminPutBucketAcl` mutates only `meta.Acl` and re-encodes the entire BucketMeta — generation is preserved so existing object references stay valid. `AdminDeleteBucket` removes only BucketMetaKey (BucketGenerationKey is left behind, matching the SigV4 path's behaviour — a future re-create gets a fresh generation).
- `TestS3Handler_DescribeBucket_SubpathReturns404` was superseded by two more precise tests now that `/acl` is a real sub-resource.

Summary by CodeRabbit