Skip to content

admin: S3 bucket write endpoints (P2 slice 2a)#669

Merged
bootjp merged 14 commits intomainfrom
feat/admin-s3-writes
Apr 26, 2026
Merged

admin: S3 bucket write endpoints (P2 slice 2a)#669
bootjp merged 14 commits intomainfrom
feat/admin-s3-writes

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 26, 2026

P2 slice 2a of docs/design/2026_04_24_proposed_admin_dashboard.md. Ships the S3 admin write endpoints so the SPA's S3 modals stop receiving 405. Slice 2b will plumb AdminForward so a follower can hand these writes off to the leader transparently.

Summary

  • *adapter.S3Server.AdminCreateBucket / AdminPutBucketAcl / AdminDeleteBucket — SigV4-bypass write methods with three in-method guards: principal must be AdminRoleFull, the local node must be the verified S3 leader, and bucket-name / ACL must pass the existing validators. AdminCreateBucket reuses the same atomic bucket-meta + ACL + generation-key txn the SigV4 path does — no new code path through the storage layer. AdminDeleteBucket rejects non-empty buckets (the dashboard cannot force recursive delete by design).
  • internal/admin write surfaceBucketsSource gains the three write methods + CreateBucketRequest / PutBucketACLRequest types with the documented JSON shapes. S3Handler.serveCollection + servePerBucket route POST/PUT/DELETE through dedicated handlers with principalForWrite re-validating the role on every request against the live MapRoleStore.
  • Strict body decoderdecodeAdminS3JSONBody is generic over the request type, applies DisallowUnknownFields, rejects NUL bytes, rejects trailing tokens, and caps at 64 KiB (matches design 4.4). Used by both POST and PUT.
  • writeBucketsError translates the source-side sentinels into the design's HTTP statuses: 403 forbidden / 503 + Retry-After:1 leader_unavailable / 404 not_found / 409 already_exists / 409 bucket_not_empty / 400 invalid_request via *ValidationError.
  • BridgebucketsBridge gains write methods running through translateAdminBucketsError, mirroring translateAdminTablesError on the Dynamo side. Leader-churn sentinels from the kv coordinator route to admin.ErrBucketsNotLeader so the SPA's retry contract stays intact.

What is NOT in this PR

  • AdminForward integration for S3 admin writes — slice 2b.
  • Rolling-upgrade compatibility flag (criterion 5) — still deferred behind a cluster-version bump.

Test plan

Acceptance criteria coverage (Section 4.1)

Endpoint This PR
GET /admin/api/v1/s3/buckets ✓ (#658)
GET /admin/api/v1/s3/buckets/{name} ✓ (#658)
POST /admin/api/v1/s3/buckets
PUT /admin/api/v1/s3/buckets/{name}/acl
DELETE /admin/api/v1/s3/buckets/{name}

Self-review (5 lenses)

  1. Data loss: AdminCreateBucket reuses s.coordinator.Dispatch with the same OperationGroup shape as the SigV4 path — bucket meta + generation key in one txn. No new FSM / Pebble / Raft path. AdminDeleteBucket's "must be empty" guard is a SnapshotAt scan + size check identical to the SigV4 path.
  2. Concurrency: Writes go through retryS3Mutation which already handles transient mid-dispatch leader churn. The leader check is isVerifiedS3Leader — same primitive the SigV4 path uses. Role gate is re-evaluated against the live MapRoleStore on every request, so a key downgrade picked up between login and write is enforced immediately.
  3. Performance: One additional load-bucket-meta read on PutACL / Delete. No hot-path changes; admin writes are operator-rate, not data-plane-rate.
  4. Data consistency: AdminCreateBucket writes (BucketMetaKey, BucketGenerationKey) atomically. AdminPutBucketAcl mutates only meta.Acl and re-encodes the entire BucketMeta — generation is preserved so existing object references stay valid. AdminDeleteBucket removes only BucketMetaKey (BucketGenerationKey is left behind, matching the SigV4 path's behaviour — a future re-create gets a fresh generation).
  5. Test coverage: 29 new tests (19 admin-package + 10 adapter-level) covering happy paths, role gates, leader checks, validation rejections, all four sentinel error mappings, and the cross-method missing-principal 401. The existing TestS3Handler_DescribeBucket_SubpathReturns404 was superseded by two more precise tests now that /acl is a real sub-resource.

Summary by CodeRabbit

  • New Features
    • S3 bucket admin endpoints: create, update ACL, and delete with enforced write authorization, input validation, leader-forwarding, and clear HTTP error mappings.
  • Documentation
    • New admin docs covering dashboard config, TLS/role semantics, audit logging, and troubleshooting for admin operations.
  • Tests
    • Extensive end-to-end and unit tests for bucket lifecycle, forwarding, auth/validation, and error scenarios.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 48 minutes and 59 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 48 minutes and 59 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1f65a0ee-9dde-4c21-a7a7-008aad4b5335

📥 Commits

Reviewing files that changed from the base of the PR and between 07bdc20 and 55c93c8.

⛔ Files ignored due to path filters (2)
  • proto/admin_forward.pb.go is excluded by !**/*.pb.go
  • proto/admin_forward_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (20)
  • adapter/s3_admin.go
  • adapter/s3_admin_test.go
  • docs/admin.md
  • docs/admin_deployment.md
  • internal/admin/buckets_source.go
  • internal/admin/forward_client.go
  • internal/admin/forward_client_test.go
  • internal/admin/forward_integration_test.go
  • internal/admin/forward_server.go
  • internal/admin/forward_server_test.go
  • internal/admin/s3_handler.go
  • internal/admin/s3_handler_test.go
  • internal/admin/server.go
  • main.go
  • main_admin.go
  • main_admin_forward.go
  • main_admin_forward_test.go
  • proto/admin_forward.proto
  • scripts/rolling-update.env.example
  • scripts/rolling-update.sh
📝 Walkthrough

Walkthrough

Adds S3 admin write capabilities: bucket creation, ACL updates, and deletion across adapter, handler, admin contract, forwarding, server wiring, startup, proto, tests, and docs. New sentinel errors and request DTOs are introduced; leader-forwarding and per-request role revalidation are wired end-to-end.

Changes

Cohort / File(s) Summary
Adapter
adapter/s3_admin.go, adapter/s3_admin_test.go
New S3Server admin write APIs: AdminCreateBucket, AdminPutBucketAcl, AdminDeleteBucket; exported sentinel errors (ErrAdminBucketAlreadyExists, ErrAdminBucketNotFound, ErrAdminBucketNotEmpty, ErrAdminInvalidBucketName, ErrAdminInvalidACL); ACL canonicalization helper; transactional KV writes/reads; comprehensive unit tests including non-empty-bucket detection via object manifest scan.
Admin contract & proto
internal/admin/buckets_source.go, proto/admin_forward.proto
BucketsSource extended with AdminCreateBucket/AdminPutBucketAcl/AdminDeleteBucket; new DTOs (CreateBucketRequest, PutBucketACLRequest) and ErrBucketsNotEmpty; proto AdminOperation enum extended with bucket ops (CREATE_BUCKET, DELETE_BUCKET, PUT_BUCKET_ACL).
HTTP handler & server wiring
internal/admin/s3_handler.go, internal/admin/s3_handler_test.go, internal/admin/server.go
Adds POST /s3/buckets, PUT /s3/buckets/{name}/acl, DELETE /s3/buckets/{name}; per-request write authorization via RoleStore (principalForWrite), strict JSON/body decoding (64 KiB limit, NUL/unknown/trailing checks), error translation/leader-forwarding integration, and many handler tests (auth, validation, forwarding, body/size errors).
Forwarding (client/server/integration/tests)
internal/admin/forward_client.go, internal/admin/forward_client_test.go, internal/admin/forward_server.go, internal/admin/forward_server_test.go, internal/admin/forward_integration_test.go
Adds leader-forwarding methods for bucket ops to LeaderForwarder and gRPC forward client; extends ForwardServer dispatch for CREATE/DELETE/PUT_BUCKET_ACL with payload decoding, availability gating (501 when buckets source nil), error serialization, and tests validating happy/failure paths and size limits.
Bridges, startup, main wiring
main_admin.go, main.go, main_admin_forward.go, main_admin_forward_test.go
Adds bucketsBridge admin methods delegating to adapter and translating adapter errors; maps adapter sentinels to admin errors; moves S3 server startup earlier and wires BucketsSource into admin-forward dependencies; adjusts forward-server ready-for-registration checks and tests.
Docs & other
docs/admin.md
New admin documentation covering dashboard, write authorization semantics, forwarding behavior, audit logging, TLS/topology rules, and troubleshooting.

Sequence Diagram

sequenceDiagram
    actor Client
    participant Handler as S3Handler
    participant RoleStore
    participant Forwarder as LeaderForwarder
    participant BucketsSrc as BucketsSource
    participant Adapter as S3 Adapter
    participant KV as KV Store

    Client->>Handler: POST /s3/buckets {name, acl}
    Handler->>RoleStore: validate write permission for principal
    RoleStore-->>Handler: role (allow/deny)
    alt Not leader locally
        Handler->>Forwarder: ForwardCreateBucket(principal, {name,acl})
        Forwarder-->>Handler: leader response / ErrLeaderUnavailable
    else Leader or local
        Handler->>BucketsSrc: AdminCreateBucket(principal, {name,acl})
        BucketsSrc->>Adapter: AdminCreateBucket(ctx, principal, name, acl)
        Adapter->>KV: read exists / validate
        alt exists
            Adapter-->>BucketsSrc: ErrAdminBucketAlreadyExists
        else
            Adapter->>KV: OperationGroup write bucket metadata & generation
            KV-->>Adapter: success
            Adapter-->>BucketsSrc: AdminBucketSummary
        end
        BucketsSrc-->>Handler: result / error
    end
    Handler-->>Client: 201/409/503/4xx based on outcome
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I stitched a bucket, bright and neat,
ACLs trimmed tidy, transactions complete.
Leaders forward, roles keep wrongs at bay,
Objects checked before they slip away.
Hop, audit, flourish — admin dreams today.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'admin: S3 bucket write endpoints (P2 slice 2a)' directly and clearly describes the main change: implementation of S3 bucket write endpoints as part of the admin dashboard feature.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/admin-s3-writes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 39s —— View job


Code Review: admin S3 bucket write endpoints (P2 slice 2a)

Todo list

  • Understand the request
  • Read adapter/s3_admin.go
  • Read adapter/s3_admin_test.go
  • Read internal/admin/buckets_source.go
  • Read internal/admin/s3_handler.go
  • Read internal/admin/s3_handler_test.go
  • Read internal/admin/server.go
  • Read main_admin.go
  • Write review feedback

The overall implementation is solid and mirrors the Dynamo side well. I cross-checked the missing CommitTS on AdminPutBucketAcl / AdminDeleteBucket against the SigV4 path — s3.go:672-678 and s3.go:795-801 show the same pattern (no CommitTS on single-key mutations; only createBucket needs it for CreatedAtHLC). That's intentional and not a bug.

That said, I found a few things worth addressing:


Findings

1. Stale route table comment in buildAPIMux (medium)

internal/admin/server.go:197-208 still lists only two GET endpoints for S3. The three new write endpoints are completely absent from the layout comment, which is the primary reference a future reader will use to understand what this listener exposes.

// current (stale):
//	GET    /admin/api/v1/s3/buckets                 (auth required)
//	GET    /admin/api/v1/s3/buckets/{name}          (auth required)

// should be:
//	GET    /admin/api/v1/s3/buckets                 (auth required)
//	POST   /admin/api/v1/s3/buckets                 (auth required, full role)
//	GET    /admin/api/v1/s3/buckets/{name}          (auth required)
//	DELETE /admin/api/v1/s3/buckets/{name}          (auth required, full role)
//	PUT    /admin/api/v1/s3/buckets/{name}/acl      (auth required, full role)

Fix this →


2. Missing handler-level test for *ValidationError path (medium)

writeBucketsError (s3_handler.go:402-407) has a case errors.As(err, &verr) arm for *ValidationError that maps to 400 invalid_request. This path is exercised end-to-end via the bridge (adapter ErrAdminInvalidBucketName/ErrAdminInvalidACLtranslateAdminBucketsError&admin.ValidationError{...}), but no handler-level unit test directly injects a *ValidationError into the stub source. The invalid_request arm is the only writeBucketsError branch lacking a handler test.

// add to s3_handler_test.go:
func TestS3Handler_CreateBucket_ValidationErrorReturns400(t *testing.T) {
    src := &stubBucketsSource{createErr: &ValidationError{Message: "invalid bucket name: uppercase not allowed"}}
    h := newS3HandlerForTest(src)
    req := httptest.NewRequest(http.MethodPost, pathS3Buckets,
        strings.NewReader(validCreateBucketBody()))
    req = withFullPrincipal(req)
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, req)
    require.Equal(t, http.StatusBadRequest, rec.Code)
    require.Contains(t, rec.Body.String(), "invalid_request")
    require.Contains(t, rec.Body.String(), "uppercase not allowed")
}

Fix this →


3. Failed write attempts are not audited (minor)

The audit log entries in handleCreate, handlePutACL, and handleDelete (s3_handler.go:286-292, 317-323, 337-342) are only emitted on the success path. A role-rejected create or a leader-unavailable delete leaves no trace in the audit log. The Dynamo side has the same gap (so this is a pre-existing pattern, not introduced here), but admin write attempts that are refused for security reasons are typically the most forensically valuable to log.


4. decodeAdminS3JSONBody passes nil to http.MaxBytesReader (minor)

s3_handler.go:440:

limited := http.MaxBytesReader(nil, io.NopCloser(body), adminS3CreateBodyLimit)

Passing nil as the ResponseWriter is safe with Go's current implementation (the requestTooLarger interface check is a comma-ok assertion, so nil doesn't panic), but it's undocumented API usage. The Dynamo side's equivalent decoder should be checked for consistency — if it passes the real http.ResponseWriter, this is an inconsistency worth noting. If it also passes nil, add a comment explaining the intent (e.g., // nil writer: we detect the limit via errors.As; we don't need the connection-close optimisation here).


5. Misleading error message for DisallowUnknownFields failures (minor)

s3_handler.go:469:

if err := dec.Decode(&out); err != nil {
    return zero, errors.New("request body is not valid JSON")
}

When DisallowUnknownFields is set, an unknown key causes Decode to return an error, but the user sees "request body is not valid JSON" rather than something like "unknown field in request body". The test at line 479 ("unknown field", ..., "request body is not valid JSON") pins this behaviour, so it's intentional — but it's worth considering whether an operator debugging a schema mismatch will be confused. This is the same trade-off the Dynamo decoder made; just flagging it for consistency tracking.


Summary

# Severity File Area
1 Medium internal/admin/server.go:197 Stale route table comment
2 Medium internal/admin/s3_handler_test.go *ValidationError handler path untested
3 Minor internal/admin/s3_handler.go Audit log omits failed write attempts
4 Minor internal/admin/s3_handler.go:440 MaxBytesReader(nil, ...) undocumented
5 Minor internal/admin/s3_handler.go:469 Misleading error text for unknown fields

The core write paths (AdminCreateBucket, AdminPutBucketAcl, AdminDeleteBucket), the three-guard pattern (role → leader → validation), and the bridge translation are well-structured and consistent with the Dynamo precedent. Items 1 and 2 are the most worth fixing before merge; the rest can go into a follow-up.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces administrative write operations for S3 buckets, including creation, deletion, and ACL updates. It adds new methods to the S3Server adapter, updates the BucketsSource interface, and extends the S3Handler to support these operations with mandatory role re-validation. The implementation is supported by comprehensive unit tests, error translation logic, and a bridge connecting the admin API to the adapter. I have no feedback to provide.

bootjp added a commit that referenced this pull request Apr 26, 2026
…axBytes nil doc)

Three actionable findings from claude review on 91a48ef:

1) **Medium: stale route-table comment in `buildAPIMux`**
   `internal/admin/server.go`: the comment block listing the routes
   only mentioned the two GET S3 endpoints. Added the three new
   write endpoints (POST /buckets, DELETE /buckets/{name}, PUT
   /buckets/{name}/acl) so a future reader does not have to
   reverse-engineer the routing from `servePerBucket`.

2) **Medium: missing handler-level `*ValidationError` test**
   `s3_handler_test.go`: `writeBucketsError`'s `errors.As(err,
   &verr)` arm was exercised end-to-end via the bridge translation
   but had no direct handler-level coverage. Added
   `TestS3Handler_WriteEndpoints_ValidationErrorReturns400` —
   three sub-cases (create / put_acl / delete) injecting a
   `*ValidationError` directly into the stub source and asserting
   400 + `invalid_request` + the message. Locks the arm down so a
   future refactor that drops the `errors.As` branch fails CI
   immediately.

3) **Minor: `MaxBytesReader(nil, ...)` undocumented**
   `s3_handler.go`: added an explanatory comment at the call site.
   Passing nil is safe because `MaxBytesReader`'s only writer use
   is the connection-close optimisation (a comma-ok type assertion
   that no-ops on nil); the cap itself is enforced regardless. We
   pass nil because the helper takes only `io.Reader` so it remains
   usable from non-HTTP entry points (the upcoming AdminForward
   decoder reuses the same shape). The 413 path still works via
   `errAdminS3BodyTooLarge` matching on `*http.MaxBytesError`.

Findings #3 (failed-write audit log) and #5 (DisallowUnknownFields
error message) were noted as pre-existing patterns shared with the
Dynamo side; tracking them separately rather than diverging on
this slice.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

ddc3b9b addresses the three actionable findings:

  1. Medium: route-table comment in buildAPIMux now lists all 5 S3 endpoints (the 2 GETs plus POST / DELETE / PUT acl).
  2. Medium: TestS3Handler_WriteEndpoints_ValidationErrorReturns400 (create + put_acl + delete sub-cases) pins the writeBucketsError *ValidationError arm.
  3. Minor: explanatory comment at the MaxBytesReader(nil, ...) call site explaining the nil-writer is intentional and safe.

Findings #3 (failed-write audit) and #5 (DisallowUnknownFields error text) noted as pre-existing patterns shared with the Dynamo side — tracking separately to avoid diverging on this slice.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 49s —— View job


Re-review: admin S3 bucket write endpoints (P2 slice 2a)

Todo list

  • Understand the request (re-review after ddc3b9b addressed previous findings)
  • Read internal/admin/server.go — verify route-table comment fix (Configure Renovate #1)
  • Read internal/admin/s3_handler_test.go — verify ValidationError test coverage (Update dependency go to v1.21.4 #2)
  • Read internal/admin/s3_handler.go — verify MaxBytesReader nil comment (Dependency Dashboard #4)
  • Verify no regressions in the overall structure
  • Post final review

All three actionable findings from the previous review have been correctly addressed in ddc3b9b2:

Finding #1 (Medium) — Route-table comment: ✅ Fixed

server.go:202-206 now lists all five S3 endpoints including the three new write endpoints:

GET    /admin/api/v1/s3/buckets                 (auth required)
POST   /admin/api/v1/s3/buckets                 (auth required, full role)
GET    /admin/api/v1/s3/buckets/{name}          (auth required)
DELETE /admin/api/v1/s3/buckets/{name}          (auth required, full role)
PUT    /admin/api/v1/s3/buckets/{name}/acl      (auth required, full role)

Finding #2 (Medium) — ValidationError handler coverage: ✅ Fixed

TestS3Handler_WriteEndpoints_ValidationErrorReturns400 (s3_handler_test.go:651-701) covers all three write methods as sub-tests, each injecting a *ValidationError into the stub source and asserting StatusBadRequest + "invalid_request" + the full message string. The arm in writeBucketsError is now fully pinned at the handler level.

Finding #4 (Minor) — MaxBytesReader nil comment: ✅ Fixed

s3_handler.go:440-451 now has a thorough comment explaining why nil is safe (the connection-close optimisation is skipped but the cap is still enforced via *http.MaxBytesError), why the ResponseWriter is unavailable at that layer, and that this was flagged in review. The explanation is accurate and sufficient for a future reader.

Findings #3 and #5 (failed-write audit omission and DisallowUnknownFields error text) acknowledged as pre-existing patterns shared with the Dynamo side — reasonable to defer and track together with that side rather than diverge here.

No regressions found. The test file ends at 731 lines covering all sentinel error paths, role gates, method rejections, body-decode edge cases, and the now-complete ValidationError arm. The PR is ready to merge.

bootjp added a commit that referenced this pull request Apr 26, 2026
Slice 2b of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md
Section 3.3.2 + 4.1): a follower-side S3 admin write
(POST /buckets, PUT /buckets/{name}/acl, DELETE /buckets/{name})
now hands off to the leader transparently, completing the same
end-to-end forwarding contract Dynamo writes received in #644 +

Stacked on #669 (P2 slice 2a). Once #669 merges, this rebases
cleanly onto main.

Proto (proto/admin_forward.proto):
- Three new ADMIN_OP enum values (CREATE_BUCKET / DELETE_BUCKET /
  PUT_BUCKET_ACL) appended after the Dynamo block so existing
  wire-format integers stay stable. Regenerated with the pinned
  protoc 29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.

Leader-side ForwardServer (internal/admin/forward_server.go):
- WithBucketsSource lets deployments wire the S3 dispatcher
  optionally — Dynamo-only builds keep the receiver nil and the
  three new operations return 501 NotImplemented.
- Three new dispatch arms: handleCreateBucket / handleDeleteBucket /
  handlePutBucketAcl. Each one mirrors the leader-direct HTTP
  path's payload contract (NUL-byte rejection, 64 KiB limit,
  DisallowUnknownFields, trailing-token rejection, slash-in-name
  rejection) so a hostile follower cannot smuggle a payload past
  validations the leader-direct path enforces.
- forwardBucketsErrorResponse mirrors forwardErrorResponse on the
  Dynamo side: ErrBucketsForbidden / NotLeader / NotFound /
  AlreadyExists / NotEmpty + ValidationError each map to the same
  HTTP status the leader-direct writeBucketsError produces, so
  forwarded and leader-direct responses are byte-for-byte
  indistinguishable from the SPA's view.
- isStructuredSourceError extended to recognise the bucket
  sentinels so they are NOT logged at LevelError on the leader.
- Forward's switch was extracted into dispatchForward to keep the
  parent function under cyclop's 10-branch ceiling as the operation
  enum grew.

Follower-side LeaderForwarder (internal/admin/forward_client.go):
- Interface gains ForwardCreateBucket / ForwardDeleteBucket /
  ForwardPutBucketAcl. PutBucketAcl carries both the bucket name
  (from the URL path) and the new ACL (from the request body) in
  one JSON payload — same approach handleDeleteBucket takes for
  the bucket name.
- gRPCForwardClient methods reuse the existing forward() helper
  for transport, so connection-cache reuse and ErrLeaderUnavailable
  signalling behave identically across resource types.

Handler integration (internal/admin/s3_handler.go):
- New `forwarder LeaderForwarder` field + WithLeaderForwarder method.
- handleCreate / handlePutACL / handleDelete now consult tryForward*
  helpers when the source returned ErrBucketsNotLeader; the helpers
  are gated on `errors.Is(err, ErrBucketsNotLeader) && forwarder != nil`
  so a leader-direct rejection (already-exists, not-found, etc.) is
  never re-applied at the leader.
- writeForwardResult / writeForwardFailure mirror the Dynamo handler's
  pattern: nosniff + Cache-Control:no-store + Retry-After:1 on 503.
  ErrLeaderUnavailable does NOT log at LevelError (elections are
  routine); transport errors do log so operators can investigate.

Wiring (main.go + main_admin_forward.go):
- adminForwardServerDeps gains a `buckets` field; readyForRegistration
  still requires only TablesSource + RoleStore (so cluster-only or
  Dynamo-only builds keep registering Dynamo forwarding without S3).
- runtimeServerRunner.start() now creates *adapter.S3Server BEFORE
  startRaftServers (in addition to dynamoServer) so the leader-side
  ForwardServer registration sees both adapters. The reorder is safe:
  each adapter listens on its own address and the raft TCP listeners
  are independent.
- ServerDeps.Forwarder now plumbs through buildS3HandlerForDeps too,
  so the follower's S3Handler picks up the same LeaderForwarder
  instance the Dynamo handler does.

Tests:
- 9 forward-server tests covering the three new bucket operations:
  happy path / no-BucketsSource→501 / bad-JSON 400 / already-exists
  409 / not-empty 409 / slash-in-name 400 / missing-acl 400 /
  payload-too-large 413 (sweep over all three ops).
- 4 forward-client tests covering ForwardCreateBucket /
  ForwardDeleteBucket / ForwardPutBucketAcl happy-path payload
  shapes + ErrLeaderUnavailable on no-leader.
- stubLeaderForwarder gains 3 bucket-side forward methods so
  existing dynamo tests still satisfy the LeaderForwarder
  interface, and the new stub fields let bucket-handler tests
  verify the forward arguments.
- 6 handler integration tests on S3Handler.tryForward*: forwarded
  create / delete / put-acl happy paths (replay leader's status
  + payload + content-type), forwarder ErrLeaderUnavailable → 503
  + Retry-After, transport-error → 503 + no leakage, and a 3-axis
  gate sweep proving the forwarder is NOT invoked on
  AlreadyExists / Forbidden / generic source errors.

Closes design 3.3.2 acceptance criteria 2 (transparent forwarding)
+ 6 (forwarded_from in audit log) for S3 admin writes; criterion 3
(election-period 503 + retry) is also live for S3 because the
existing tryForward helpers reuse the same fallback paths.
bootjp added a commit that referenced this pull request Apr 26, 2026
Two findings from claude review #675:

1. **P2 was marked ✅ shipped but #669 / #673 are still in flight.**
   PR description correctly noted slices 2a/2b haven't merged, but
   the table cell read "✅ shipped". Future readers reading the
   table alone would conclude main has the S3 write paths today,
   which is wrong. Fixed: P2 is now 🟡 partial with the in-flight
   PRs explicitly called out, mirroring the P4 row's shape.

2. **P1 row claimed criteria 1–6 but criterion 5 is in the
   outstanding items as deferred.** The two cells contradicted each
   other. Updated P1 to "criteria 1–4 + 6 (criterion 5 deferred —
   see outstanding items)" so the table is consistent with the
   outstanding-items section below.

The cosmetic notes (Last-updated field, verbose Status header) are
left for a future cleanup; they are not factually wrong, only
non-conforming to the README convention. Calling them out as a
follow-up rather than baking the precedent into this doc.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

Rebased onto main (5a67834) to absorb #661 (keyviz Sampler) which moved startAdminFromFlags and buildAPIMux's route table forward. The route-table comment now lists all 5 S3 endpoints + the new keyviz endpoint together. Build / vet / lint / admin + main tests all pass.

bootjp added a commit that referenced this pull request Apr 26, 2026
…axBytes nil doc)

Three actionable findings from claude review on 91a48ef:

1) **Medium: stale route-table comment in `buildAPIMux`**
   `internal/admin/server.go`: the comment block listing the routes
   only mentioned the two GET S3 endpoints. Added the three new
   write endpoints (POST /buckets, DELETE /buckets/{name}, PUT
   /buckets/{name}/acl) so a future reader does not have to
   reverse-engineer the routing from `servePerBucket`.

2) **Medium: missing handler-level `*ValidationError` test**
   `s3_handler_test.go`: `writeBucketsError`'s `errors.As(err,
   &verr)` arm was exercised end-to-end via the bridge translation
   but had no direct handler-level coverage. Added
   `TestS3Handler_WriteEndpoints_ValidationErrorReturns400` —
   three sub-cases (create / put_acl / delete) injecting a
   `*ValidationError` directly into the stub source and asserting
   400 + `invalid_request` + the message. Locks the arm down so a
   future refactor that drops the `errors.As` branch fails CI
   immediately.

3) **Minor: `MaxBytesReader(nil, ...)` undocumented**
   `s3_handler.go`: added an explanatory comment at the call site.
   Passing nil is safe because `MaxBytesReader`'s only writer use
   is the connection-close optimisation (a comma-ok type assertion
   that no-ops on nil); the cap itself is enforced regardless. We
   pass nil because the helper takes only `io.Reader` so it remains
   usable from non-HTTP entry points (the upcoming AdminForward
   decoder reuses the same shape). The 413 path still works via
   `errAdminS3BodyTooLarge` matching on `*http.MaxBytesError`.

Findings #3 (failed-write audit log) and #5 (DisallowUnknownFields
error message) were noted as pre-existing patterns shared with the
Dynamo side; tracking them separately rather than diverging on
this slice.
@bootjp bootjp force-pushed the feat/admin-s3-writes branch from ddc3b9b to 5a67834 Compare April 26, 2026 12:26
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
internal/admin/s3_handler.go (1)

309-313: Minor: body.ACL is forwarded un-trimmed even though validation rejects whitespace-only input.

strings.TrimSpace(body.ACL) == "" rejects blank/whitespace input, but the un-trimmed body.ACL is what flows into the source. The adapter's adminCanonicalACL trims again, so functionally this works, but " public-read " from the SPA reaches the audit log line at 322 with the surrounding whitespace preserved (slog.String("acl", body.ACL)), which makes downstream log filtering noisier than necessary. Trimming once at the boundary keeps the audit field canonical.

♻️ Trim once at the handler boundary
-	if strings.TrimSpace(body.ACL) == "" {
+	body.ACL = strings.TrimSpace(body.ACL)
+	if body.ACL == "" {
 		writeJSONError(w, http.StatusBadRequest, "invalid_body", "acl is required")
 		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/admin/s3_handler.go` around lines 309 - 313, Trim body.ACL once at
the handler boundary and use the trimmed value everywhere: e.g., compute acl :=
strings.TrimSpace(body.ACL), validate acl == "" (return the same invalid_body
error), and then pass acl into h.source.AdminPutBucketAcl and into the audit
slog.String("acl", ...) so the audit log and source receive the canonical ACL;
you can still keep adminCanonicalACL doing further normalization if needed but
stop forwarding the untrimmed body.ACL.
internal/admin/buckets_source.go (1)

22-50: Minor: PutBucketACLRequest DTO defined but not used by the source interface.

AdminPutBucketAcl takes name, acl string directly while AdminCreateBucket takes CreateBucketRequest. The new PutBucketACLRequest type is only used by the handler's body decoder. This is functionally fine, but a future "owner override" extension (referenced in the type's own doc comment) would need to update the interface signature anyway, so deferring the typed DTO until then is also a valid choice. Worth a one-line doc note clarifying that PutBucketACLRequest lives on the wire side only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/admin/buckets_source.go` around lines 22 - 50, The
PutBucketACLRequest DTO is defined but not used by the source interface; update
the codebase by adding a one-line doc comment on the PutBucketACLRequest type
clarifying it is a wire-only/request-body DTO (used by the handler's body
decoder) and not part of the BucketsSource interface contract, and note that
AdminPutBucketAcl currently accepts name, acl string and would be the place to
change the signature if an "owner override" extension is added later; this keeps
the intent clear while leaving AdminPutBucketAcl and AdminCreateBucket
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapter/s3_admin.go`:
- Around line 369-403: AdminDeleteBucket reads a one-entry probe of
s3keys.ObjectManifestPrefixForBucket(...) via s.store.ScanAt(...) then issues a
delete via s.coordinator.Dispatch(...) but does not supply the scanned manifest
keys as ReadKeys, so OCC won't detect concurrent PutObject races; fix by
extracting the keys from kvs returned by ScanAt and populate the
OperationGroup's ReadKeys slice before calling coordinator.Dispatch in the
retryS3Mutation closure (reference functions/values: AdminDeleteBucket,
retryS3Mutation, loadBucketMetaAt, s.store.ScanAt, s.coordinator.Dispatch,
s3keys.ObjectManifestPrefixForBucket, s3keys.BucketMetaKey) so the OCC validator
will consider those manifest keys and trigger a retry on conflict.
- Around line 342-355: AdminDeleteBucket has a TOCTOU: after scanning
ObjectManifestPrefixForBucket at readTS it dispatches a delete of BucketMetaKey
without including the scanned range in the OperationGroup.ReadKeys, so
concurrent inserts can be orphaned; fix by collecting the scanned range keys
(the range returned by the scan over ObjectManifestPrefixForBucket) and include
them in the OperationGroup passed to s.coordinator.Dispatch (or, alternatively,
re-scan at StartTS before deleting) so OCC validation detects conflicts; update
the retryS3Mutation / s.coordinator.Dispatch call in AdminDeleteBucket (and
likewise in deleteBucket in s3.go) to populate ReadKeys with the scanned range.

In `@internal/admin/s3_handler.go`:
- Around line 351-376: S3Handler.principalForWrite currently skips checking the
JWT-embedded role when a live RoleStore is present, causing behavioral
divergence from DynamoHandler.principalForWrite; change principalForWrite to
first enforce the JWT role gate (check principal.Role.AllowsWrite()) regardless
of h.roles, then if h.roles != nil re-resolve the live role via
h.roles.LookupRole(principal.AccessKey) and require role.AllowsWrite() as well,
and finally return an AuthPrincipal containing the live role
(AuthPrincipal{AccessKey: principal.AccessKey, Role: role}) when both checks
pass; also update test setup (newS3HandlerForTest) or add tests to call
WithRoleStore so the h.roles != nil path is exercised.

---

Nitpick comments:
In `@internal/admin/buckets_source.go`:
- Around line 22-50: The PutBucketACLRequest DTO is defined but not used by the
source interface; update the codebase by adding a one-line doc comment on the
PutBucketACLRequest type clarifying it is a wire-only/request-body DTO (used by
the handler's body decoder) and not part of the BucketsSource interface
contract, and note that AdminPutBucketAcl currently accepts name, acl string and
would be the place to change the signature if an "owner override" extension is
added later; this keeps the intent clear while leaving AdminPutBucketAcl and
AdminCreateBucket unchanged.

In `@internal/admin/s3_handler.go`:
- Around line 309-313: Trim body.ACL once at the handler boundary and use the
trimmed value everywhere: e.g., compute acl := strings.TrimSpace(body.ACL),
validate acl == "" (return the same invalid_body error), and then pass acl into
h.source.AdminPutBucketAcl and into the audit slog.String("acl", ...) so the
audit log and source receive the canonical ACL; you can still keep
adminCanonicalACL doing further normalization if needed but stop forwarding the
untrimmed body.ACL.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6b38f8ff-8066-45eb-9dda-c712067729e1

📥 Commits

Reviewing files that changed from the base of the PR and between 5e4fe67 and ddc3b9b.

📒 Files selected for processing (7)
  • adapter/s3_admin.go
  • adapter/s3_admin_test.go
  • internal/admin/buckets_source.go
  • internal/admin/s3_handler.go
  • internal/admin/s3_handler_test.go
  • internal/admin/server.go
  • main_admin.go

Comment thread adapter/s3_admin.go
Comment on lines +342 to +355
_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
IsTxn: true,
StartTS: startTS,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Put, Key: s3keys.BucketMetaKey(name), Value: body},
},
})
return errors.WithStack(err)
})
if err != nil {
return err //nolint:wrapcheck // sentinel errors propagate as-is.
}
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate the kv.OperationGroup struct definition and check how Dispatch
# treats a zero CommitTS when IsTxn=true.
ast-grep --pattern 'type OperationGroup[$_] struct {
  $$$
}'

# Find the Dispatch implementation that consumes OperationGroup.
rg -nP --type=go -C5 'func.*Dispatch\s*\(.*OperationGroup'

# Find any other call sites that set IsTxn=true without CommitTS to see
# whether this is a known idiom or a regression.
rg -nPU --type=go -C3 'IsTxn:\s*true' -g '!**/*_test.go'

Repository: bootjp/elastickv

Length of output: 34769


🏁 Script executed:

#!/bin/bash
# Find OperationGroup struct definition
rg -nP 'type OperationGroup\[' --type=go -A 10

# Find validateOperationGroup function
rg -nP 'func.*validateOperationGroup' --type=go -A 20

# Check the Dispatch implementation in coordinator.go more closely
sed -n '240,280p' kv/coordinator.go

Repository: bootjp/elastickv

Length of output: 4198


🏁 Script executed:

#!/bin/bash
# Find AdminDeleteBucket implementation with full context
sed -n '361,403p' adapter/s3_admin.go

# Check if there's any range-read/read-intent tracking in OCC logic
rg -nP 'ReadKeys|read.*intent' kv/coordinator.go -A 3 -B 1

# Check transaction validation/conflict detection logic
rg -nP 'validateRead|checkConflict' kv/ --type=go -A 10 | head -60

Repository: bootjp/elastickv

Length of output: 6862


🏁 Script executed:

#!/bin/bash
# Check AdminPutBucketAcl (around line 312-355) for ReadKeys
sed -n '312,355p' adapter/s3_admin.go

# Check retryS3Mutation implementation
rg -nP 'func.*retryS3Mutation' adapter/s3_admin.go -A 30

Repository: bootjp/elastickv

Length of output: 1240


🏁 Script executed:

#!/bin/bash
# Find retryS3Mutation implementation
rg -nP 'func.*retryS3Mutation' adapter/s3.go -A 35

# Also check if AdminDeleteBucket passes ReadKeys for range scans
rg -nP 'ReadKeys.*=|ReadKeys:' adapter/s3_admin.go

# Check if there's any conflict detection specifically for AdminDeleteBucket
rg -nP 'ObjectManifestPrefix|retryS3Mutation' adapter/s3.go -B 2 -A 2

Repository: bootjp/elastickv

Length of output: 2854


🏁 Script executed:

#!/bin/bash
# Check if any other code passes ReadKeys for range scans
rg -nP 'ReadKeys.*=|ReadKeys:' adapter/ --type=go -B 2 | head -80

# Specifically check AdminCreateBucket to see if it uses ReadKeys
sed -n '259,310p' adapter/s3_admin.go

# Check dispatchTxn to understand how ReadKeys affect OCC
sed -n '801,870p' kv/coordinator.go | head -60

Repository: bootjp/elastickv

Length of output: 6935


🏁 Script executed:

#!/bin/bash
# Check if s3.go deleteBucket (regular, not admin) passes ReadKeys
sed -n '638,680p' adapter/s3.go

# Check if any S3 code passes ReadKeys for range scans
rg -nP 'ScanAt.*ReadKeys|ReadKeys.*ScanAt' adapter/s3.go adapter/s3_admin.go

# Look for patterns of range scan + dispatch in S3
rg -nP 'ScanAt' adapter/s3.go -A 10 | head -100

Repository: bootjp/elastickv

Length of output: 3165


🏁 Script executed:

#!/bin/bash
# Check the test coordinator that tests bucket delete races
sed -n '745,800p' adapter/s3_test.go

# Check containsObjectManifestMutation function
rg -nP 'func.*containsObjectManifestMutation' adapter/ --type=go -A 8

# Verify whether AdminDeleteBucket's dispatch is covered by the race test
rg -nP 'AdminDeleteBucket|bucketDeleteRaceCoordinator' adapter/s3_test.go adapter/s3_admin_test.go

Repository: bootjp/elastickv

Length of output: 2993


Remove the CommitTS concern; verify the range-scan TOCTOU issue in AdminDeleteBucket.

The CommitTS asymmetry is intentional. The OperationGroup struct (kv/transcoder.go:29-30) explicitly documents that CommitTS is optional: "Coordinators choose one automatically when this is zero." The validateOperationGroup function does not validate CommitTS, and this pattern is used throughout the codebase (SQS, Redis, DynamoDB adapters all omit it when unset). No comment is needed.

However, AdminDeleteBucket has a genuine TOCTOU issue: it scans ObjectManifestPrefixForBucket at readTS (lines 370–385), finds no objects, then dispatches a delete of BucketMetaKey with StartTS but no ReadKeys. If a concurrent write inserts an object after the scan but before the delete commits, the object becomes orphaned. The retryS3Mutation loop only retries on conflicts to BucketMetaKey itself, not to keys within the scanned range.

Compare to SQS code (e.g., adapter/sqs_catalog.go:656–661), which correctly passes ReadKeys after range scans to enable OCC conflict detection. The same issue exists in s3.go:deleteBucket() (lines 638–680).

Pass the scanned range as ReadKeys to enable OCC validation, or re-scan at transaction StartTS before deleting to close the window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/s3_admin.go` around lines 342 - 355, AdminDeleteBucket has a TOCTOU:
after scanning ObjectManifestPrefixForBucket at readTS it dispatches a delete of
BucketMetaKey without including the scanned range in the
OperationGroup.ReadKeys, so concurrent inserts can be orphaned; fix by
collecting the scanned range keys (the range returned by the scan over
ObjectManifestPrefixForBucket) and include them in the OperationGroup passed to
s.coordinator.Dispatch (or, alternatively, re-scan at StartTS before deleting)
so OCC validation detects conflicts; update the retryS3Mutation /
s.coordinator.Dispatch call in AdminDeleteBucket (and likewise in deleteBucket
in s3.go) to populate ReadKeys with the scanned range.

Comment thread adapter/s3_admin.go
Comment on lines +369 to +403
err := s.retryS3Mutation(ctx, func() error {
readTS := s.readTS()
startTS := s.txnStartTS(readTS)
readPin := s.pinReadTS(readTS)
defer readPin.Release()

meta, exists, err := s.loadBucketMetaAt(ctx, name, readTS)
if err != nil {
return errors.WithStack(err)
}
if !exists || meta == nil {
return ErrAdminBucketNotFound
}
start := s3keys.ObjectManifestPrefixForBucket(name, meta.Generation)
kvs, err := s.store.ScanAt(ctx, start, prefixScanEnd(start), 1, readTS)
if err != nil {
return errors.WithStack(err)
}
if len(kvs) > 0 {
return ErrAdminBucketNotEmpty
}
_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
IsTxn: true,
StartTS: startTS,
Elems: []*kv.Elem[kv.OP]{
{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
},
})
return errors.WithStack(err)
})
if err != nil {
return err //nolint:wrapcheck // sentinel errors propagate as-is.
}
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how retryS3Mutation handles dispatch errors and whether it
# retries OCC conflicts on range reads.
ast-grep --pattern 'func ($_ *S3Server) retryS3Mutation($$$) $_ {
  $$$
}'

# Look for OCC / read-intent tracking inside kv coordinator.
rg -nP --type=go -C3 'readPin|ReadIntent|ReadSet|range.*read'

# See whether the SigV4 deleteBucket path uses the same pattern (so we
# can scope this concern as pre-existing or new).
rg -nP --type=go -C5 'deleteBucket\s*\(' -g '!**/*_test.go' -g '!**/s3_admin*.go'

Repository: bootjp/elastickv

Length of output: 33198


🏁 Script executed:

# Find isRetryableS3MutationErr implementation
rg -nP --type=go -A10 'func isRetryableS3MutationErr' -g '!**/*_test.go'

# Check what error types are retryable (look for OCC/write conflict handling)
rg -nP --type=go 'WriteConflict|OCC.*conflict|retryable' -g '!**/*_test.go' -C2

# Verify: does AdminDeleteBucket need to track ScanAt as ReadKeys?
rg -nP --type=go -B5 -A10 'func.*AdminDeleteBucket' adapter/s3_admin.go

Repository: bootjp/elastickv

Length of output: 42188


🏁 Script executed:

# Check kv.OperationGroup definition and ReadKeys field
rg -nP --type=go -A5 'type OperationGroup' kv/

# Search for tests of AdminDeleteBucket or deleteBucket with concurrent writes
rg -nP --type=go 'TestAdmin.*DeleteBucket|Test.*deleteBucket' -g '**/*_test.go'

# Check if any S3 path passes keys from a range scan as ReadKeys
rg -nP --type=go -B10 'ReadKeys.*Scan' adapter/s3.go

Repository: bootjp/elastickv

Length of output: 375


🏁 Script executed:

# Find the Dispatch signature to understand ReadKeys parameter
rg -nP --type=go 'func.*Dispatch.*ctx.*error' kv/ -A8

# Check if ReadKeys is in pb.Request
rg -nP --type=go 'ReadKeys' kv/ -B2 -A2 | head -100

Repository: bootjp/elastickv

Length of output: 8737


🏁 Script executed:

# Search for existing tests of AdminDeleteBucket or deleteBucket with concurrent mutations
rg -nP --type=go 'AdminDeleteBucket|deleteBucket' adapter/s3_admin_test.go adapter/s3_test.go 2>/dev/null | head -50

# Check if there are any tests at all for these functions
fd -e _test.go -type f | xargs rg -l 'AdminDeleteBucket|deleteBucket' 2>/dev/null

Repository: bootjp/elastickv

Length of output: 1193


🏁 Script executed:

# Read the test that checks for non-empty bucket
sed -n '287,320p' adapter/s3_admin_test.go

Repository: bootjp/elastickv

Length of output: 1474


🏁 Script executed:

# Verify there are no race-condition tests for AdminDeleteBucket/deleteBucket
rg -nP --type=go 'concurrent|race|TOCTOU' adapter/s3_admin_test.go adapter/s3_test.go

Repository: bootjp/elastickv

Length of output: 181


Verify the OCC read-set handling for the range scan in AdminDeleteBucket.

The concern is valid: AdminDeleteBucket reads the object-manifest prefix at readTS (limit 1 — a "is the bucket empty?" probe) and dispatches a delete on s3keys.BucketMetaKey(name) without passing the scanned manifest keys as ReadKeys to the OCC validator. The two operations touch disjoint key spaces:

  • read set: range over s3keys.ObjectManifestPrefixForBucket(name, meta.Generation)
  • write set: single point key s3keys.BucketMetaKey(name)

The OCC layer validates only explicit point keys in the ReadKeys field (see store.ApplyMutations — it iterates readKeys slice). A concurrent PutObject that inserts a new manifest key between readTS and the delete's commit will not conflict; the delete commits anyway, leaving orphaned objects under a now-deleted bucket meta. The retryS3Mutation retry loop only activates if a conflict is detected, which will not occur.

This race exists in both AdminDeleteBucket and the SigV4 deleteBucket path (same pattern at adapter/s3.go:638), making it a pre-existing concern. To close the gap, either populate ReadKeys with the scanned manifest keys, or explicitly document that the design accepts this orphan race.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/s3_admin.go` around lines 369 - 403, AdminDeleteBucket reads a
one-entry probe of s3keys.ObjectManifestPrefixForBucket(...) via
s.store.ScanAt(...) then issues a delete via s.coordinator.Dispatch(...) but
does not supply the scanned manifest keys as ReadKeys, so OCC won't detect
concurrent PutObject races; fix by extracting the keys from kvs returned by
ScanAt and populate the OperationGroup's ReadKeys slice before calling
coordinator.Dispatch in the retryS3Mutation closure (reference functions/values:
AdminDeleteBucket, retryS3Mutation, loadBucketMetaAt, s.store.ScanAt,
s.coordinator.Dispatch, s3keys.ObjectManifestPrefixForBucket,
s3keys.BucketMetaKey) so the OCC validator will consider those manifest keys and
trigger a retry on conflict.

Comment thread internal/admin/s3_handler.go
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ddc3b9b215

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/admin/s3_handler.go Outdated
Comment on lines +358 to +362
if h.roles == nil {
// Production wiring always sets a RoleStore; nil is a test
// fallthrough we accept so single-handler unit tests can
// reach the source without standing up the auth chain.
if !principal.Role.AllowsWrite() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce JWT write role before live role lookup

The write-role check is currently gated by h.roles == nil, so in normal production wiring (where RoleStore is set) a session whose JWT role is read-only can still perform S3 writes as soon as the live role map says full. That contradicts the handler’s own “JWT freezes the role at login” contract and diverges from Dynamo’s principalForWrite behavior, making authorization semantics inconsistent across admin resources. Keep the JWT AllowsWrite check unconditional, then apply live-role revalidation.

Useful? React with 👍 / 👎.

bootjp added a commit that referenced this pull request Apr 26, 2026
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md
Section 8: a single self-contained operator-facing reference for
the admin HTTP listener.

Covers:
- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each
  guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-
  changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo +
  s3, including the slice 2 write paths and the AdminForward
  forwarding contract)
- forwarded_from audit log shape and why it carries the follower's
  node ID
- Troubleshooting guide for the common failure modes operators
  hit during initial bring-up (missing credentials, TLS hard-error,
  401 ambiguity, 503 leader_unavailable, bucket_not_empty,
  blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as
deliverables; those are already implemented (see config.go's
validateTLS / validateAccessKeyRoles, the role gates in
DynamoHandler.principalForWrite + S3Handler.principalForWrite, and
the CSRFDoubleSubmit middleware. This doc stitches them into a
single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-
surface table can describe S3 write endpoints as shipped. Once both
land in main, this rebases cleanly.
EOF
)
bootjp added a commit that referenced this pull request Apr 26, 2026
Slice 2b of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md
Section 3.3.2 + 4.1): a follower-side S3 admin write
(POST /buckets, PUT /buckets/{name}/acl, DELETE /buckets/{name})
now hands off to the leader transparently, completing the same
end-to-end forwarding contract Dynamo writes received in #644 +

Stacked on #669 (P2 slice 2a). Once #669 merges, this rebases
cleanly onto main.

Proto (proto/admin_forward.proto):
- Three new ADMIN_OP enum values (CREATE_BUCKET / DELETE_BUCKET /
  PUT_BUCKET_ACL) appended after the Dynamo block so existing
  wire-format integers stay stable. Regenerated with the pinned
  protoc 29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.

Leader-side ForwardServer (internal/admin/forward_server.go):
- WithBucketsSource lets deployments wire the S3 dispatcher
  optionally — Dynamo-only builds keep the receiver nil and the
  three new operations return 501 NotImplemented.
- Three new dispatch arms: handleCreateBucket / handleDeleteBucket /
  handlePutBucketAcl. Each one mirrors the leader-direct HTTP
  path's payload contract (NUL-byte rejection, 64 KiB limit,
  DisallowUnknownFields, trailing-token rejection, slash-in-name
  rejection) so a hostile follower cannot smuggle a payload past
  validations the leader-direct path enforces.
- forwardBucketsErrorResponse mirrors forwardErrorResponse on the
  Dynamo side: ErrBucketsForbidden / NotLeader / NotFound /
  AlreadyExists / NotEmpty + ValidationError each map to the same
  HTTP status the leader-direct writeBucketsError produces, so
  forwarded and leader-direct responses are byte-for-byte
  indistinguishable from the SPA's view.
- isStructuredSourceError extended to recognise the bucket
  sentinels so they are NOT logged at LevelError on the leader.
- Forward's switch was extracted into dispatchForward to keep the
  parent function under cyclop's 10-branch ceiling as the operation
  enum grew.

Follower-side LeaderForwarder (internal/admin/forward_client.go):
- Interface gains ForwardCreateBucket / ForwardDeleteBucket /
  ForwardPutBucketAcl. PutBucketAcl carries both the bucket name
  (from the URL path) and the new ACL (from the request body) in
  one JSON payload — same approach handleDeleteBucket takes for
  the bucket name.
- gRPCForwardClient methods reuse the existing forward() helper
  for transport, so connection-cache reuse and ErrLeaderUnavailable
  signalling behave identically across resource types.

Handler integration (internal/admin/s3_handler.go):
- New `forwarder LeaderForwarder` field + WithLeaderForwarder method.
- handleCreate / handlePutACL / handleDelete now consult tryForward*
  helpers when the source returned ErrBucketsNotLeader; the helpers
  are gated on `errors.Is(err, ErrBucketsNotLeader) && forwarder != nil`
  so a leader-direct rejection (already-exists, not-found, etc.) is
  never re-applied at the leader.
- writeForwardResult / writeForwardFailure mirror the Dynamo handler's
  pattern: nosniff + Cache-Control:no-store + Retry-After:1 on 503.
  ErrLeaderUnavailable does NOT log at LevelError (elections are
  routine); transport errors do log so operators can investigate.

Wiring (main.go + main_admin_forward.go):
- adminForwardServerDeps gains a `buckets` field; readyForRegistration
  still requires only TablesSource + RoleStore (so cluster-only or
  Dynamo-only builds keep registering Dynamo forwarding without S3).
- runtimeServerRunner.start() now creates *adapter.S3Server BEFORE
  startRaftServers (in addition to dynamoServer) so the leader-side
  ForwardServer registration sees both adapters. The reorder is safe:
  each adapter listens on its own address and the raft TCP listeners
  are independent.
- ServerDeps.Forwarder now plumbs through buildS3HandlerForDeps too,
  so the follower's S3Handler picks up the same LeaderForwarder
  instance the Dynamo handler does.

Tests:
- 9 forward-server tests covering the three new bucket operations:
  happy path / no-BucketsSource→501 / bad-JSON 400 / already-exists
  409 / not-empty 409 / slash-in-name 400 / missing-acl 400 /
  payload-too-large 413 (sweep over all three ops).
- 4 forward-client tests covering ForwardCreateBucket /
  ForwardDeleteBucket / ForwardPutBucketAcl happy-path payload
  shapes + ErrLeaderUnavailable on no-leader.
- stubLeaderForwarder gains 3 bucket-side forward methods so
  existing dynamo tests still satisfy the LeaderForwarder
  interface, and the new stub fields let bucket-handler tests
  verify the forward arguments.
- 6 handler integration tests on S3Handler.tryForward*: forwarded
  create / delete / put-acl happy paths (replay leader's status
  + payload + content-type), forwarder ErrLeaderUnavailable → 503
  + Retry-After, transport-error → 503 + no leakage, and a 3-axis
  gate sweep proving the forwarder is NOT invoked on
  AlreadyExists / Forbidden / generic source errors.

Closes design 3.3.2 acceptance criteria 2 (transparent forwarding)
+ 6 (forwarded_from in audit log) for S3 admin writes; criterion 3
(election-period 503 + retry) is also live for S3 because the
existing tryForward helpers reuse the same fallback paths.
bootjp added a commit that referenced this pull request Apr 26, 2026
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md
Section 8: a single self-contained operator-facing reference for
the admin HTTP listener.

Covers:
- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each
  guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-
  changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo +
  s3, including the slice 2 write paths and the AdminForward
  forwarding contract)
- forwarded_from audit log shape and why it carries the follower's
  node ID
- Troubleshooting guide for the common failure modes operators
  hit during initial bring-up (missing credentials, TLS hard-error,
  401 ambiguity, 503 leader_unavailable, bucket_not_empty,
  blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as
deliverables; those are already implemented (see config.go's
validateTLS / validateAccessKeyRoles, the role gates in
DynamoHandler.principalForWrite + S3Handler.principalForWrite, and
the CSRFDoubleSubmit middleware. This doc stitches them into a
single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-
surface table can describe S3 write endpoints as shipped. Once both
land in main, this rebases cleanly.
EOF
)
bootjp added a commit that referenced this pull request Apr 26, 2026
Latest claude review on #675 flagged that lines 3 and 6 still
described P2 as fully shipped:

  Line 3: "P1, **P2**, and P3 of the implementation plan have shipped"
  Line 6: "after P1 + **P2** + P3 landed"

But the status table I added in 7617daf correctly marks P2 as
🟡 partial (read-only slice 1 in #658 landed; write paths #669
and #673 still in flight). The header prose contradicting the
table is exactly the kind of inconsistency that triggers the
rename to `_partial_` in the first place — readers skimming the
top of the doc would conclude P2 was done.

Rewrote the Status line and the parenthetical on Last updated to
match the table, calling out P1 + P3 as fully shipped, P2 as the
read-only slice only, and P4 as TLS/role/CSRF live with the
operator doc landing now. Pointed readers at the per-phase status
table for the breakdown.

No other changes; the table itself was already correct.
bootjp added a commit that referenced this pull request Apr 26, 2026
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md
Section 8: a single self-contained operator-facing reference for
the admin HTTP listener.

Covers:
- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each
  guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-
  changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo +
  s3, including the slice 2 write paths and the AdminForward
  forwarding contract)
- forwarded_from audit log shape and why it carries the follower's
  node ID
- Troubleshooting guide for the common failure modes operators
  hit during initial bring-up (missing credentials, TLS hard-error,
  401 ambiguity, 503 leader_unavailable, bucket_not_empty,
  blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as
deliverables; those are already implemented (see config.go's
validateTLS / validateAccessKeyRoles, the role gates in
DynamoHandler.principalForWrite + S3Handler.principalForWrite, and
the CSRFDoubleSubmit middleware. This doc stitches them into a
single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-
surface table can describe S3 write endpoints as shipped. Once both
land in main, this rebases cleanly.
EOF
)
bootjp added a commit that referenced this pull request Apr 26, 2026
Per `docs/design/README.md`'s lifecycle convention, the admin dashboard
design doc is now "partial":

- **P1** (DynamoDB CRUD + AdminForward) — shipped via #634, #635, #644,
#648
- **P2** (S3 buckets list/create/delete/ACL + DescribeTable) — shipped
via #658, with #669 + #673 in flight
- **P3** (React SPA + embed) — shipped via #649, #650
- **P4** (TLS / role / CSRF / operator docs) — TLS, role, CSRF are
already live in P1; operator docs in #674

Independent of the in-flight slice 2 PRs (#669/#673) and the docs PR
(#674) — this rename only reflects what is already on main today, plus
an "Implementation status" table mapping each phase to the PR it landed
in.

## What this PR changes

- `git mv` the design doc from `2026_04_24_proposed_admin_dashboard.md`
to `2026_04_24_partial_admin_dashboard.md` so its history follows
- Add an "Implementation status" header table indexing each phase to the
PRs that landed it
- List the outstanding open items so future readers know what is still
owed against the original proposal:
- AdminForward acceptance criterion 5 (rolling-upgrade compat flag) —
deferred
  - S3 object browser — explicitly out of scope per Section 2 Non-goals
  - TLS cert hot-reload — restart-to-rotate is the documented model

When the rolling-upgrade flag lands, the doc gets renamed once more to
`2026_04_24_implemented_admin_dashboard.md` per the README's lifecycle
convention.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 3m 16s —— View job


Code Review: admin S3 bucket write endpoints (P2 slice 2a)

Todo list

  • Understand request (third review pass — post-rebase to 5a67834)
  • Read adapter/s3_admin.go — verify TOCTOU / OCC read-set state
  • Read internal/admin/s3_handler.go — verify principalForWrite JWT gate fix
  • Read internal/admin/buckets_source.go
  • Read internal/admin/s3_handler_test.go — verify test coverage
  • Read internal/admin/server.go — verify route table
  • Read main_admin.go — verify bridge

All three items from the previous Claude round are confirmed clean in 5a67834. CodeRabbit's second-pass comment raised three new findings. Here's where each stands:


Finding 1 — TOCTOU in AdminDeleteBucket (pre-existing, deferred) ✅

CodeRabbit flagged that the emptiness probe (ScanAt at readTS, limit 1) doesn't populate OperationGroup.ReadKeys, so a concurrent PutObject racing between the scan and the delete commit could orphan objects under a deleted bucket meta.

This is a real gap, but adapter/s3.go:658-659 uses the same pattern on the SigV4 deleteBucket path — both probe with a limit-1 scan and dispatch without ReadKeys. The OCC layer validates point keys, not ranges, so a range lock would be needed to close the window fully; that's a larger kv-level change. Treating this as pre-existing and tracking it with the SigV4 path is the right call. No action needed here.


Finding 2 — principalForWrite JWT-gate divergence from DynamoHandler (still open) ⚠️ Medium

CodeRabbit marked this "✅ Addressed in commits d1694aa to 5a67834," but the current code (internal/admin/s3_handler.go:351-376) still has the divergence. DynamoHandler.principalForWrite (dynamo_handler.go:424-448) always checks the JWT role first and then, if h.roles != nil, re-validates against the live store:

// DynamoHandler — always checks JWT role first
if !principal.Role.AllowsWrite() { return forbidden }
if h.roles != nil {
    liveRole, _ := h.roles.LookupRole(principal.AccessKey)
    if !liveRole.AllowsWrite() { return forbidden }
    principal.Role = liveRole
}
return principal, true

The current S3Handler skips the JWT-role check entirely when h.roles != nil:

// S3Handler — JWT check only runs in the test fallthrough
if h.roles == nil {
    if !principal.Role.AllowsWrite() { return forbidden }
    return principal, true
}
// production: JWT role never consulted
role, found := h.roles.LookupRole(principal.AccessKey)
if !found || !role.AllowsWrite() { return forbidden }
return AuthPrincipal{...Role: role}, true

Practical impact: For the security-critical path (revocation / demotion), both handlers behave identically — the live role is read-only, write is denied. The divergence matters only in the promotion case: a session whose JWT says read-only but whose live role is now full will be denied by DynamoDB admin writes and allowed by S3 admin writes with the same session. Inconsistent behavior within the same admin session is confusing even if the security risk is low.

Secondary gap: newS3HandlerForTest never calls WithRoleStore, so no unit test exercises the h.roles != nil branch — the production path is untested at the handler level.

The aligned fix mirrors DynamoHandler.principalForWrite exactly:

func (h *S3Handler) principalForWrite(w http.ResponseWriter, r *http.Request) (AuthPrincipal, bool) {
    principal, ok := PrincipalFromContext(r.Context())
    if !ok {
        writeJSONError(w, http.StatusUnauthorized, "unauthenticated", "no session principal")
        return AuthPrincipal{}, false
    }
    if !principal.Role.AllowsWrite() {           // ← always check JWT role
        writeJSONError(w, http.StatusForbidden, "forbidden",
            "this endpoint requires a full-access role")
        return AuthPrincipal{}, false
    }
    if h.roles != nil {                           // ← then re-validate live role
        liveRole, exists := h.roles.LookupRole(principal.AccessKey)
        if !exists || !liveRole.AllowsWrite() {
            writeJSONError(w, http.StatusForbidden, "forbidden",
                "this endpoint requires a full-access role")
            return AuthPrincipal{}, false
        }
        principal.Role = liveRole
    }
    return principal, true
}

Fix this →


Finding 3 — ACL audit log emits untrimmed value (minor) ℹ️

s3_handler.go:309-313 validates strings.TrimSpace(body.ACL) == "" but passes the original body.ACL (not the trimmed version) to both AdminPutBucketAcl and the audit log at line 322. The adapter's adminCanonicalACL trims before use, so the stored value is correct, but the audit log could record " public-read " rather than "public-read". Trimming at the handler boundary (compute acl := strings.TrimSpace(body.ACL), then use acl everywhere below) would keep the audit field canonical with no functional change.


Summary

# Status Severity Area
TOCTOU / OCC read-set Pre-existing, deferred ✅ adapter/s3_admin.go:383-396
principalForWrite JWT gate Still open Medium internal/admin/s3_handler.go:358-375
ACL audit log whitespace Still open Minor internal/admin/s3_handler.go:313,322

Finding 2 is worth fixing before merge — it's a straightforward alignment with the Dynamo handler, plugs the production test gap, and removes the inconsistent JWT-role behavior within the same admin session. Finding 3 can be a follow-up with the Dynamo-side audit-field cleanup.

bootjp added a commit that referenced this pull request Apr 26, 2026
P2 slice 2b: a follower-side S3 admin write (POST /buckets, PUT
/buckets/{name}/acl, DELETE /buckets/{name}) now hands off to the leader
transparently, completing the same end-to-end forwarding contract Dynamo
writes received in #644 + #648.

**Stacked on #669 (P2 slice 2a).** Once #669 merges, this rebases
cleanly onto main.

## Summary

- **Proto** — three new `ADMIN_OP` enum values (`CREATE_BUCKET` /
`DELETE_BUCKET` / `PUT_BUCKET_ACL`) appended after the Dynamo block so
existing wire integers stay stable. Regenerated with the pinned protoc
29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.
- **Leader-side** — `ForwardServer.WithBucketsSource` lets deployments
wire the S3 dispatcher optionally; Dynamo-only builds keep it nil and
the new operations return 501. Three new dispatch arms mirror the
leader-direct HTTP path's payload contract (NUL-byte rejection, 64 KiB
limit, `DisallowUnknownFields`, trailing-token rejection, slash-in-name
rejection). `forwardBucketsErrorResponse` mirrors
`forwardErrorResponse`: `ErrBucketsForbidden` / `NotLeader` / `NotFound`
/ `AlreadyExists` / `NotEmpty` + `*ValidationError` each map to the same
HTTP status `writeBucketsError` produces, so forwarded and leader-direct
responses are byte-identical from the SPA's view.
- **Follower-side** — `LeaderForwarder` interface gains
`ForwardCreateBucket` / `ForwardDeleteBucket` / `ForwardPutBucketAcl`.
`ForwardPutBucketAcl` carries both the bucket name (URL path) and the
new ACL (request body) in one JSON payload — same approach
`ForwardDeleteTable` takes for the table name.
- **Handler integration** — `S3Handler.WithLeaderForwarder` +
`tryForwardCreateBucket` / `tryForwardPutBucketAcl` /
`tryForwardDeleteBucket` gated on `ErrBucketsNotLeader && forwarder !=
nil`. `writeForwardResult` / `writeForwardFailure` mirror the Dynamo
handler's pattern (nosniff + Cache-Control:no-store + Retry-After:1 on
503).
- **Wiring** — `runtimeServerRunner.start()` now creates
`*adapter.S3Server` before `startRaftServers` so the leader-side
`ForwardServer` registration sees both adapters.
`adminForwardServerDeps.buckets` is wired through
`registerAdminForwardServer`. `ServerDeps.Forwarder` now plumbs into
`S3Handler` too.

## What is NOT in this PR

- Rolling-upgrade compatibility flag (criterion 5) — still deferred
behind a cluster-version bump.

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `golangci-lint run` (admin + main + adapter + proto: 0 issues)
- [x] `go test ./internal/admin/ -count=1 -race` — full admin suite
passes (19 new tests across forward_server / forward_client / s3_handler
integration)
- [x] `go test . -count=1 -race` — main package passes
- [x] Targeted adapter tests pass
- [ ] Full `./adapter/` times out due to a pre-existing flake (verified
earlier on PRs #648 / #658 / #669 — unrelated to this branch)
- [ ] End-to-end smoke against a 3-node cluster — manual exercise after
#669 + this land

## Acceptance criteria coverage (Section 3.3.2 + 4.1)

| # | Criterion | Status |
|---|---|---|
| 1 | Leader direct write | ✓ for S3 (#669 / slice 2a) |
| 2 | Follower forwards transparently | ✅ for S3 (this PR) |
| 3 | Election-period 503 + retry | ✅ for S3 (this PR — same fallback as
Dynamo) |
| 4 | Leader demotes stale full role | ✅ for S3 (existing
`validatePrincipal` runs before bucket dispatch) |
| 5 | Rolling-upgrade compat flag | ⏳ deferred (cluster-version bump) |
| 6 | `forwarded_from` in audit log | ✅ for S3 (this PR — leader-side
audit log carries it) |

## Self-review (5 lenses)

1. **Data loss**: No FSM / Raft / Pebble path changes. The leader-side
dispatcher reuses the same `BucketsSource` interface that the
leader-direct HTTP path uses; the only new code path is the JSON
envelope + principal re-validation.
2. **Concurrency**: New methods on `gRPCForwardClient` reuse the
existing `forward()` helper, so connection-cache reuse and
`ErrLeaderUnavailable` signalling behave identically across resource
types. The `runtimeServerRunner.start()` reorder is safe — each adapter
listens on its own address and the raft TCP listeners are independent.
3. **Performance**: One additional gRPC round-trip on follower-side
admin writes; identical to Dynamo's path.
4. **Data consistency**: Leader-side dispatcher applies the same strict
body decode the leader-direct HTTP path does (NUL-byte rejection,
trailing-token rejection, slash-in-name rejection).
`forwardBucketsErrorResponse` produces byte-identical envelopes to
`writeBucketsError` so forwarded and leader-direct responses are
indistinguishable.
5. **Test coverage**: 19 new tests — 9 forward-server (happy paths, 501
no-source, bad-JSON, already-exists, not-empty, slash-in-name,
missing-acl, payload-too-large sweep), 4 forward-client (3 happy-path
payload shapes + ErrLeaderUnavailable for all three), 6
handler-integration (forwarded create/delete/put-acl happy +
leader-unavailable + transport-error + 3-axis non-NotLeader gate sweep).
bootjp added a commit that referenced this pull request Apr 26, 2026
Codex P2 + coderabbitai Minor on PR #669 caught that the S3
admin handler diverged from the Dynamo path in production
wiring. The previous shape was:

  if h.roles == nil { /* JWT check */ ... return principal }
  /* h.roles != nil — production: skip JWT, only live role */
  role, _ := h.roles.LookupRole(...)
  if !role.AllowsWrite() { 403 }

So a session whose JWT was minted as `read-only` could
**escalate** to write capability the moment the live role index
promoted that access key to `full` — exactly the kind of
between-login-and-revocation window the JWT was supposed to
freeze. DynamoHandler.principalForWrite (dynamo_handler.go:413)
fires the JWT gate unconditionally and then layers the live
role check on top, which is the safer ordering: the JWT can
only narrow capability, never widen it.

Aligned the S3 handler with that contract. The fix also picks
up a second case the Dynamo path already covers: a JWT minted
as `full` whose live role has since been demoted to read-only
or removed entirely is now rejected with 403 instead of
silently sliding through.

Per CLAUDE.md "test the bug first" — `newS3HandlerForTest`
never called `WithRoleStore`, so every existing test ran the
`h.roles == nil` branch and the production path was unproven.
Added two new tests:

  TestS3Handler_PrincipalForWrite_JWTGateFiresWithRoleStore
    — table sweep over all three S3 admin write endpoints
    (create, put_acl, delete) with WithRoleStore wired and
    JWT=read-only / live=full. All three must 403.

  TestS3Handler_PrincipalForWrite_LiveRoleNarrowsJWT
    — JWT=full + live=read-only or removed-from-index.
    The full live-role-narrowing branch was unproven before;
    this pins both demotion and revocation cases.

Also documented the AdminDeleteBucket TOCTOU that coderabbitai
flagged as 🔴 / 🟠 in the same review:

  AdminDeleteBucket scans ObjectManifestPrefixForBucket at
  readTS (limit=1, an "is bucket empty?" probe) and dispatches
  the BucketMetaKey delete with that point key only in
  ReadKeys. A concurrent PutObject inserting a manifest key
  in the scanned prefix between readTS and commit will not
  conflict — the OCC validator only inspects keys appearing
  in ReadKeys, and there is no ReadRanges mechanism.

This is a pre-existing race that adapter/s3.go:deleteBucket
(SigV4 path) inherits as well — closing the gap requires
either bumping BucketGenerationKey on every PutObject so it
serves as an OCC token here, or extending OperationGroup with
ReadRanges and teaching the FSM to validate range emptiness
atomically with commit. Both are larger schema changes outside
this PR's scope. Recorded the limitation as a code comment on
AdminDeleteBucket so the next reader of the function knows the
contract; tracking the fix as a follow-up rather than expanding
the PR's surface.

go test -race ./internal/admin/ — passes.
golangci-lint run ./internal/admin/... ./adapter/... — 0 issues.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Addressed both Codex P2 + coderabbitai findings on c8165ae:

  1. JWT role gate (Codex P2 + coderabbitai 🟡)S3Handler.principalForWrite now fires the JWT AllowsWrite check unconditionally, then layers live-role re-validation on top, matching DynamoHandler.principalForWrite (dynamo_handler.go:413). Previous shape skipped the JWT check whenever h.roles != nil (production wiring), so a session minted as read-only could escalate to writes the moment the live role index promoted the access key — exactly the kind of between-login-and-revocation window the JWT was supposed to freeze. Tests added: TestS3Handler_PrincipalForWrite_JWTGateFiresWithRoleStore (table sweep over create / put_acl / delete with WithRoleStore wired and JWT=read-only / live=full → all 403) and TestS3Handler_PrincipalForWrite_LiveRoleNarrowsJWT (JWT=full + live=demoted-or-removed → 403). Previously newS3HandlerForTest never called WithRoleStore so the production branch was unproven.

  2. AdminDeleteBucket TOCTOU (coderabbitai 🔴 / 🟠) — documented as a known limitation rather than fixed in this PR. The empty-bucket probe scans ObjectManifestPrefixForBucket at readTS but the BucketMetaKey delete only carries that single point key in ReadKeys; a concurrent PutObject inserting a manifest key between readTS and commit will not conflict (no ReadRanges mechanism). Same race exists pre-existing in adapter/s3.go:deleteBucket (SigV4 path). Closing the gap requires either bumping BucketGenerationKey on every PutObject (so it can serve as an OCC token) or extending OperationGroup with ReadRanges + FSM-side range-emptiness validation — both are larger schema changes outside this PR's scope. Code comment on AdminDeleteBucket records the limitation and the operator-side workaround (pause writes against the target bucket before issuing admin delete). Will track the fix as a follow-up.

Latest commit: 71a8514.

go test -race ./internal/admin/ — passes.
golangci-lint run ./internal/admin/... ./adapter/... — 0 issues.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 1m 47s —— View job


Code Review: admin S3 bucket write endpoints (P2 slice 2a) — final pass

Todo list

  • Understand request (post-71a85144 review: JWT gate fix + TOCTOU documentation)
  • Read internal/admin/s3_handler.go — verify principalForWrite JWT gate fix
  • Read internal/admin/s3_handler_test.go — verify new JWT gate tests
  • Read adapter/s3_admin.go — verify TOCTOU comment on AdminDeleteBucket
  • Compare with DynamoHandler.principalForWrite for alignment
  • Post review findings

Both items from the previous round are correctly addressed in 71a8514.

Finding 1 — JWT gate fix: ✅ Correct

s3_handler.go:390-411 now matches dynamo_handler.go:424-447 exactly:

  1. JWT AllowsWrite() fires unconditionally — a read-only JWT can never escalate even if the live role index has since promoted the key.
  2. Live role re-validation fires only when h.roles != nil — production wiring always sets one.
  3. principal.Role = liveRole on the happy path so downstream sees the current live role.

The structure is line-for-line equivalent to DynamoHandler.principalForWrite. No divergence remains.

Finding 2 — JWT gate tests: ✅ Correct

TestS3Handler_PrincipalForWrite_JWTGateFiresWithRoleStore (lines 456–491) correctly reproduces the pre-fix escalation window: MapRoleStore{"AKIA_RO": RoleFull} with a JWT carrying AKIA_RO / RoleReadOnly. With the fix the JWT gate fires first and returns 403. The table sweeps all three write endpoints, which locks down that no handler forgets to call principalForWrite.

TestS3Handler_PrincipalForWrite_LiveRoleNarrowsJWT (lines 499–521) covers demotion and revocation correctly. Both WithRoleStore paths are now proven.

Finding 3 — TOCTOU documentation: ✅ Thorough

The comment at adapter/s3_admin.go:362–384 clearly states the race, explains why the OCC validator won't catch it, documents that the same limitation pre-exists in the SigV4 path (adapter/s3.go:deleteBucket), and gives the operator workaround (pause writes before admin delete).

Finding 4 — Stale design-doc reference in the TOCTOU comment (minor) ℹ️

adapter/s3_admin.go:381 says: "tracked in docs/design/2026_04_24_partial_admin_dashboard.md under the Outstanding open items section". Two issues:

  1. The actual file is docs/design/2026_04_24_**proposed**_admin_dashboard.md — the partial_ prefix only applies after the first milestone ships (per CLAUDE.md lifecycle). A reader following the reference today will get a 404.
  2. The "Outstanding open items" section doesn't currently exist in the design doc.

One of these would close the gap:

  • Change partial_proposed_ in the comment to match the current filename, then update the comment again when the design-doc lifecycle rename happens; or
  • Add an "Outstanding open items" section to the design doc now with the TOCTOU entry before merging.

This is the only remaining nit. The security fix and its test coverage are complete and correct — the PR is ready to merge once the doc reference is corrected.

bootjp added 12 commits April 27, 2026 02:56
Slice 2a of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md
Section 4.1): POST /admin/api/v1/s3/buckets, PUT
/admin/api/v1/s3/buckets/{name}/acl, and DELETE
/admin/api/v1/s3/buckets/{name} become reachable so the SPA's S3
write modals stop receiving 405. Slice 2b will plumb AdminForward
so a follower can hand these writes off to the leader transparently.

Adapter (`adapter/s3_admin.go`):
- `AdminCreateBucket` performs the same atomic bucket-meta + ACL +
  generation-key txn the SigV4 createBucket path does, with three
  in-method guards: principal must be AdminRoleFull, the local node
  must be the verified S3 leader, and bucket-name + ACL must pass
  the existing validators. The owner field is now populated with the
  caller's access key so the audit-log breadcrumb survives storage.
- `AdminPutBucketAcl` mutates only `meta.Acl` in place, preserving
  generation so existing object references remain valid.
- `AdminDeleteBucket` rejects non-empty buckets with
  `ErrAdminBucketNotEmpty` (mirrors the SigV4 path's
  `BucketNotEmpty`); the dashboard cannot force a recursive delete
  by design.
- New sentinels: `ErrAdminBucketAlreadyExists`,
  `ErrAdminBucketNotFound`, `ErrAdminBucketNotEmpty`,
  `ErrAdminInvalidBucketName`, `ErrAdminInvalidACL`. Reuses the
  shared `ErrAdminForbidden` and `ErrAdminNotLeader` from the
  Dynamo side.

Admin package:
- `BucketsSource` interface gains `AdminCreateBucket` /
  `AdminPutBucketAcl` / `AdminDeleteBucket`. New types
  `CreateBucketRequest` and `PutBucketACLRequest` carry the
  documented JSON shapes.
- New sentinels `ErrBucketsNotEmpty`, `ErrBucketsAlreadyExists`
  (the former is new for slice 2a; the latter was reserved earlier
  and is now wired).
- `S3Handler` now routes POST/PUT/DELETE through dedicated
  per-method handlers with `principalForWrite` re-validating the
  role on every request against the live `MapRoleStore`. The route
  switch was refactored into `serveCollection` + `servePerBucket`
  to keep the cyclomatic budget under control as the surface grew.
- `handleCreate` / `handlePutACL` / `handleDelete` log a structured
  `admin_audit` line on success matching the Dynamo handler's shape.
- `decodeAdminS3JSONBody` is the strict shared decoder
  (`DisallowUnknownFields`, NUL-byte rejection, trailing-token
  rejection, 64 KiB cap) used by both POST and PUT.
- `writeBucketsError` translates the source-side sentinels into
  the design's HTTP statuses (403 / 503+Retry-After:1 / 404 / 409 /
  400 invalid_request via `*ValidationError`).

Bridge (`main_admin.go`):
- `bucketsBridge` gains write methods that call the adapter's
  `Admin*` methods and run their errors through
  `translateAdminBucketsError` — the same pattern as
  `translateAdminTablesError`, with the leader-churn fallback
  routing kv-internal sentinels to `admin.ErrBucketsNotLeader` so
  the SPA's retry contract stays intact.

Tests:
- 19 admin-package tests covering: create happy path / read-only
  rejected / 409 already-exists / 503 not-leader / 400 invalid JSON
  shapes (5 sub-cases) / NUL-byte / 413 oversize / put_acl happy +
  read-only + missing field + 404 + non-PUT method-not-allowed /
  delete happy + read-only + 409 not-empty + 404 + 503 not-leader /
  cross-method missing-principal 401.
- 10 adapter-level tests against the in-memory MVCC store covering
  the same surfaces from the storage side: happy path round-trips
  through describe, ACL default to private, role / leader / ACL /
  bucket-name validation rejections, AlreadyExists / NotFound /
  NotEmpty paths.
- Existing `TestS3Handler_DescribeBucket_SubpathReturns404` was
  superseded by two more precise tests:
  `TestS3Handler_PerBucket_UnknownSubpathReturns404` (any non-/acl
  sub-path 404s) and `TestS3Handler_PerBucket_AclSubpathRejectsGet`
  (the now-valid /acl path returns 405 on GET).
…axBytes nil doc)

Three actionable findings from claude review on 91a48ef:

1) **Medium: stale route-table comment in `buildAPIMux`**
   `internal/admin/server.go`: the comment block listing the routes
   only mentioned the two GET S3 endpoints. Added the three new
   write endpoints (POST /buckets, DELETE /buckets/{name}, PUT
   /buckets/{name}/acl) so a future reader does not have to
   reverse-engineer the routing from `servePerBucket`.

2) **Medium: missing handler-level `*ValidationError` test**
   `s3_handler_test.go`: `writeBucketsError`'s `errors.As(err,
   &verr)` arm was exercised end-to-end via the bridge translation
   but had no direct handler-level coverage. Added
   `TestS3Handler_WriteEndpoints_ValidationErrorReturns400` —
   three sub-cases (create / put_acl / delete) injecting a
   `*ValidationError` directly into the stub source and asserting
   400 + `invalid_request` + the message. Locks the arm down so a
   future refactor that drops the `errors.As` branch fails CI
   immediately.

3) **Minor: `MaxBytesReader(nil, ...)` undocumented**
   `s3_handler.go`: added an explanatory comment at the call site.
   Passing nil is safe because `MaxBytesReader`'s only writer use
   is the connection-close optimisation (a comma-ok type assertion
   that no-ops on nil); the cap itself is enforced regardless. We
   pass nil because the helper takes only `io.Reader` so it remains
   usable from non-HTTP entry points (the upcoming AdminForward
   decoder reuses the same shape). The 413 path still works via
   `errAdminS3BodyTooLarge` matching on `*http.MaxBytesError`.

Findings #3 (failed-write audit log) and #5 (DisallowUnknownFields
error message) were noted as pre-existing patterns shared with the
Dynamo side; tracking them separately rather than diverging on
this slice.
Slice 2b of P2 (docs/design/2026_04_24_proposed_admin_dashboard.md
Section 3.3.2 + 4.1): a follower-side S3 admin write
(POST /buckets, PUT /buckets/{name}/acl, DELETE /buckets/{name})
now hands off to the leader transparently, completing the same
end-to-end forwarding contract Dynamo writes received in #644 +

Stacked on #669 (P2 slice 2a). Once #669 merges, this rebases
cleanly onto main.

Proto (proto/admin_forward.proto):
- Three new ADMIN_OP enum values (CREATE_BUCKET / DELETE_BUCKET /
  PUT_BUCKET_ACL) appended after the Dynamo block so existing
  wire-format integers stay stable. Regenerated with the pinned
  protoc 29.3 / protoc-gen-go 1.36.11 / protoc-gen-go-grpc 1.6.1.

Leader-side ForwardServer (internal/admin/forward_server.go):
- WithBucketsSource lets deployments wire the S3 dispatcher
  optionally — Dynamo-only builds keep the receiver nil and the
  three new operations return 501 NotImplemented.
- Three new dispatch arms: handleCreateBucket / handleDeleteBucket /
  handlePutBucketAcl. Each one mirrors the leader-direct HTTP
  path's payload contract (NUL-byte rejection, 64 KiB limit,
  DisallowUnknownFields, trailing-token rejection, slash-in-name
  rejection) so a hostile follower cannot smuggle a payload past
  validations the leader-direct path enforces.
- forwardBucketsErrorResponse mirrors forwardErrorResponse on the
  Dynamo side: ErrBucketsForbidden / NotLeader / NotFound /
  AlreadyExists / NotEmpty + ValidationError each map to the same
  HTTP status the leader-direct writeBucketsError produces, so
  forwarded and leader-direct responses are byte-for-byte
  indistinguishable from the SPA's view.
- isStructuredSourceError extended to recognise the bucket
  sentinels so they are NOT logged at LevelError on the leader.
- Forward's switch was extracted into dispatchForward to keep the
  parent function under cyclop's 10-branch ceiling as the operation
  enum grew.

Follower-side LeaderForwarder (internal/admin/forward_client.go):
- Interface gains ForwardCreateBucket / ForwardDeleteBucket /
  ForwardPutBucketAcl. PutBucketAcl carries both the bucket name
  (from the URL path) and the new ACL (from the request body) in
  one JSON payload — same approach handleDeleteBucket takes for
  the bucket name.
- gRPCForwardClient methods reuse the existing forward() helper
  for transport, so connection-cache reuse and ErrLeaderUnavailable
  signalling behave identically across resource types.

Handler integration (internal/admin/s3_handler.go):
- New `forwarder LeaderForwarder` field + WithLeaderForwarder method.
- handleCreate / handlePutACL / handleDelete now consult tryForward*
  helpers when the source returned ErrBucketsNotLeader; the helpers
  are gated on `errors.Is(err, ErrBucketsNotLeader) && forwarder != nil`
  so a leader-direct rejection (already-exists, not-found, etc.) is
  never re-applied at the leader.
- writeForwardResult / writeForwardFailure mirror the Dynamo handler's
  pattern: nosniff + Cache-Control:no-store + Retry-After:1 on 503.
  ErrLeaderUnavailable does NOT log at LevelError (elections are
  routine); transport errors do log so operators can investigate.

Wiring (main.go + main_admin_forward.go):
- adminForwardServerDeps gains a `buckets` field; readyForRegistration
  still requires only TablesSource + RoleStore (so cluster-only or
  Dynamo-only builds keep registering Dynamo forwarding without S3).
- runtimeServerRunner.start() now creates *adapter.S3Server BEFORE
  startRaftServers (in addition to dynamoServer) so the leader-side
  ForwardServer registration sees both adapters. The reorder is safe:
  each adapter listens on its own address and the raft TCP listeners
  are independent.
- ServerDeps.Forwarder now plumbs through buildS3HandlerForDeps too,
  so the follower's S3Handler picks up the same LeaderForwarder
  instance the Dynamo handler does.

Tests:
- 9 forward-server tests covering the three new bucket operations:
  happy path / no-BucketsSource→501 / bad-JSON 400 / already-exists
  409 / not-empty 409 / slash-in-name 400 / missing-acl 400 /
  payload-too-large 413 (sweep over all three ops).
- 4 forward-client tests covering ForwardCreateBucket /
  ForwardDeleteBucket / ForwardPutBucketAcl happy-path payload
  shapes + ErrLeaderUnavailable on no-leader.
- stubLeaderForwarder gains 3 bucket-side forward methods so
  existing dynamo tests still satisfy the LeaderForwarder
  interface, and the new stub fields let bucket-handler tests
  verify the forward arguments.
- 6 handler integration tests on S3Handler.tryForward*: forwarded
  create / delete / put-acl happy paths (replay leader's status
  + payload + content-type), forwarder ErrLeaderUnavailable → 503
  + Retry-After, transport-error → 503 + no leakage, and a 3-axis
  gate sweep proving the forwarder is NOT invoked on
  AlreadyExists / Forbidden / generic source errors.

Closes design 3.3.2 acceptance criteria 2 (transparent forwarding)
+ 6 (forwarded_from in audit log) for S3 admin writes; criterion 3
(election-period 503 + retry) is also live for S3 because the
existing tryForward helpers reuse the same fallback paths.
Three findings from claude + Gemini review on 70213e0:

1) **Issue 1 — `logUnexpectedSourceError` slog key was "table"**
   When called for bucket operations the field key was `"table"` but
   the value was a bucket name. Log queries on `table=` would find
   spurious bucket-error entries; queries on `bucket=` would miss
   the audit lines entirely. Renamed the parameter and the slog
   key to `resource` so the same forensic query works for both
   resource families.

2) **Gemini security-high + Claude Issue 2 — validation divergence**
   `handleCreateBucket` only checked `strings.TrimSpace(name) == ""`
   while the HTTP path's `validateCreateBucketRequest` also rejects
   whitespace-padded names like `" bucket "`. The forward path
   would have accepted them, then hit the adapter's
   `validateS3BucketName` with a less actionable error message —
   different SPA behaviour depending on whether the request was
   leader-direct or follower-forwarded. Fix: call
   `validateCreateBucketRequest(body)` in `handleCreateBucket`
   exactly like `decodeCreateTableRequest` is shared between the
   table-side handlers.

3) **Issue 3 — only CREATE_BUCKET tested for nil-BucketsSource → 501**
   `DELETE_BUCKET` and `PUT_BUCKET_ACL` had identical
   `if s.buckets == nil` guards but no coverage. Replaced
   `TestForwardServer_CreateBucket_NoBucketsSourceReturns501` with
   a table-driven `TestForwardServer_BucketOps_NoBucketsSourceReturns501`
   sweeping all three operations. A future op added without the
   nil guard fails CI immediately.

Plus a new `TestForwardServer_CreateBucket_RejectsWhitespacePaddedName`
that pins the validation-parity fix from #2.

Rebased onto the latest `feat/admin-s3-writes` (which now carries
the slice 2a review fixes) so the stack stays clean.
Codex on PR #673 caught that `readyForRegistration()` predicate
gated registration on `tables != nil`, which meant a cluster
started with `--dynamoAddr` empty but S3 enabled never registered
the AdminForward gRPC service at all. Followers attempting to
forward S3 admin writes (CREATE_BUCKET / DELETE_BUCKET /
PUT_BUCKET_ACL) hit gRPC `Unimplemented` and surfaced as 503 to
the SPA, instead of transparently reaching the leader. The S3
write path was advertised as "follower-forwarded" by P2 slice 2b
but actually only worked when Dynamo was also configured.

Fix is symmetric: relax the predicate to require RoleStore plus
**at least one** source (TablesSource or BucketsSource) and have
the dispatcher reject ops for the missing surface with the same
501 it already returns when the inverse source is nil.

Per CLAUDE.md "test the bug first" convention, add the
regression test before the fix:

  TestForwardServer_DynamoOps_NoTablesSourceReturns501
    — symmetric with the existing
    TestForwardServer_BucketOps_NoBucketsSourceReturns501.
    With nil source, CREATE_TABLE used to slip past the source
    nil-deref because validation rejected the payload first;
    DELETE_TABLE panicked on `s.source.AdminDeleteTable`. Both
    now return 501.

Plus three readyForRegistration assertions:
  - buckets-only deployment registers (the regression case)
  - missing-roles + buckets-only does NOT register
  - tables-only / both-sources continue to register

While in there, the per-op nil-source guards I added tipped the
linter's dupl detector. Refactored the dispatch path to keep
forward_server tidy:

1. **`checkOpAvailability`**: pulls the per-op nil-source switch
   out of `dispatchForward` so the latter stays under the
   cyclomatic ceiling. One place to update when a future op
   joins the enum, instead of five.

2. **`decodeNamedPayload`**: the {"name": "..."} JSON shape is
   shared between Dynamo handleDelete and S3 handleDeleteBucket
   (NUL guard, DisallowUnknownFields, dec.More(), empty-name
   reject, slash reject — every guard had a paired comment
   citing the original review that introduced it). Centralising
   the helper keeps that audit trail in one place; the opLabel
   parameter ("delete" / "delete-bucket") preserves the
   path-specific wording in error responses.

3. **`auditDeleteSuccess`**: handleDelete and handleDeleteBucket
   each emitted a 7-line `admin_audit` block that differed only
   in the operation label and the `table` vs `bucket` field
   name. Folded into one helper so a future delete-shaped op
   automatically picks up the same field set.

4. **`notImplementedForwardResponse`**: deduplicates the 501
   construction across the five Dynamo + S3 op cases.

No behaviour change to any of the existing successful paths;
all forward_server tests pass unchanged. Tests added: the
S3-only registration assertion + the new symmetric
DynamoOps_NoTablesSourceReturns501 sweep.

go test -race ./internal/admin/ . — passes.
golangci-lint run ./internal/admin/... . — 0 issues.
Codex P2 + coderabbitai Minor on PR #669 caught that the S3
admin handler diverged from the Dynamo path in production
wiring. The previous shape was:

  if h.roles == nil { /* JWT check */ ... return principal }
  /* h.roles != nil — production: skip JWT, only live role */
  role, _ := h.roles.LookupRole(...)
  if !role.AllowsWrite() { 403 }

So a session whose JWT was minted as `read-only` could
**escalate** to write capability the moment the live role index
promoted that access key to `full` — exactly the kind of
between-login-and-revocation window the JWT was supposed to
freeze. DynamoHandler.principalForWrite (dynamo_handler.go:413)
fires the JWT gate unconditionally and then layers the live
role check on top, which is the safer ordering: the JWT can
only narrow capability, never widen it.

Aligned the S3 handler with that contract. The fix also picks
up a second case the Dynamo path already covers: a JWT minted
as `full` whose live role has since been demoted to read-only
or removed entirely is now rejected with 403 instead of
silently sliding through.

Per CLAUDE.md "test the bug first" — `newS3HandlerForTest`
never called `WithRoleStore`, so every existing test ran the
`h.roles == nil` branch and the production path was unproven.
Added two new tests:

  TestS3Handler_PrincipalForWrite_JWTGateFiresWithRoleStore
    — table sweep over all three S3 admin write endpoints
    (create, put_acl, delete) with WithRoleStore wired and
    JWT=read-only / live=full. All three must 403.

  TestS3Handler_PrincipalForWrite_LiveRoleNarrowsJWT
    — JWT=full + live=read-only or removed-from-index.
    The full live-role-narrowing branch was unproven before;
    this pins both demotion and revocation cases.

Also documented the AdminDeleteBucket TOCTOU that coderabbitai
flagged as 🔴 / 🟠 in the same review:

  AdminDeleteBucket scans ObjectManifestPrefixForBucket at
  readTS (limit=1, an "is bucket empty?" probe) and dispatches
  the BucketMetaKey delete with that point key only in
  ReadKeys. A concurrent PutObject inserting a manifest key
  in the scanned prefix between readTS and commit will not
  conflict — the OCC validator only inspects keys appearing
  in ReadKeys, and there is no ReadRanges mechanism.

This is a pre-existing race that adapter/s3.go:deleteBucket
(SigV4 path) inherits as well — closing the gap requires
either bumping BucketGenerationKey on every PutObject so it
serves as an OCC token here, or extending OperationGroup with
ReadRanges and teaching the FSM to validate range emptiness
atomically with commit. Both are larger schema changes outside
this PR's scope. Recorded the limitation as a code comment on
AdminDeleteBucket so the next reader of the function knows the
contract; tracking the fix as a follow-up rather than expanding
the PR's surface.

go test -race ./internal/admin/ — passes.
golangci-lint run ./internal/admin/... ./adapter/... — 0 issues.
P4 deliverable from docs/design/2026_04_24_proposed_admin_dashboard.md
Section 8: a single self-contained operator-facing reference for
the admin HTTP listener.

Covers:
- Quick-start invocation for a loopback dev cluster
- Required + optional flag reference, with explanations of why each
  guard exists (TLS hard-error, rolling-update caveats, etc.)
- TLS topologies (loopback / TLS / discouraged plaintext-non-loopback)
- Role model + how live role re-validation works on every state-
  changing request
- The full /admin/api/v1/* surface (auth + cluster + dynamo +
  s3, including the slice 2 write paths and the AdminForward
  forwarding contract)
- forwarded_from audit log shape and why it carries the follower's
  node ID
- Troubleshooting guide for the common failure modes operators
  hit during initial bring-up (missing credentials, TLS hard-error,
  401 ambiguity, 503 leader_unavailable, bucket_not_empty,
  blank-screen / placeholder bundle)
- Cross-references to the design doc and architecture overview

The Section 8 P4 plan also called out "TLS, read-only role, CSRF" as
deliverables; those are already implemented (see config.go's
validateTLS / validateAccessKeyRoles, the role gates in
DynamoHandler.principalForWrite + S3Handler.principalForWrite, and
the CSRFDoubleSubmit middleware. This doc stitches them into a
single reference operators can land on without reading code.

Stacked on #669 (P2 slice 2a) + #673 (P2 slice 2b) so the API-
surface table can describe S3 write endpoints as shipped. Once both
land in main, this rebases cleanly.
EOF
)
…ane label

Three substantive findings + one nit from claude review on #674:

1. **HS256 key size (high)**: said "≥ 32 raw bytes / 44 base64 chars"
   but the validator in internal/admin/config.go enforces exactly 64
   raw bytes. A 32-byte key fails startup with a confusing error.
   Updated to "exactly 64 raw bytes — 88 base64 chars (standard
   padding) or 86 (RawURLEncoding)". Operators copying the doc into
   key-generation tooling now produce a key that actually validates.

2. **Audit log sample (moderate)**: showed method=POST / path=... /
   status=201 / duration=8.2ms — none of those fields are emitted.
   The actual slog entries use operation= + a resource key
   (bucket= or table=). Replaced the example with two real shapes
   (leader-direct and forwarded), so an operator building a log
   parser against the documented contract finds matches.

3. **Cross-reference link (moderate)**: line 249 used the
   _partial_ filename for the design doc, but #675 (the rename PR)
   hasnt landed yet. Point to _proposed_ (matches main today)
   with an inline note that the link will follow once #675 lands.
   The intro paragraph at line 5 already used _proposed_, so this
   removes the inconsistency claude flagged.

4. **"data-plane" label nit**: /admin/api/v1/cluster is an admin-
   plane endpoint, not data-plane. "data-plane" implied the
   DynamoDB / S3 / Redis ports. Replaced with "admin".

The cosmetic Last-updated / Status verbosity items on #675 are
tracked separately.
Two findings from the latest claude review on #674:

1. **Audit log shapes (Bug A)**: the previous example claimed
   `admin_audit ... operation=delete_table table=orders` was a
   leader-direct sample, but `DynamoHandler.handleDelete`
   (dynamo_handler.go:379-396) does not emit any handler-level
   admin_audit — that shape only appears via `ForwardServer` on
   the forwarded path. It also omitted the `Audit` middleware line
   (middleware.go:206-232) which wraps every non-GET admin request
   with method= / path= / status= / remote= / duration=. An
   operator building a parser against the doc would have missed
   half of the actual lines.

   Rewrote the section as three explicit shapes — Audit middleware
   (always), S3Handler op line (only emits from s3_handler.go:299
   /333/355), ForwardServer op line (covers both Dynamo and S3
   forwarded paths). Removed the misleading `delete_table`
   leader-direct sample and replaced the Dynamo examples with
   forwarded-only ones, which is where they actually originate.
   Added a one-sentence note that a single request typically
   produces two lines (op-specific + middleware) so log parsers
   know to treat the key as a union.

2. **Login 401 vs 403 (Codex P2)**: the troubleshooting entry said
   "both produce 401" but auth_handler.go:336-345 returns 403
   forbidden when the credentials match but the access key is not
   in `-adminFullAccessKeys` / `-adminReadOnlyAccessKeys`. Split
   into two entries: 401 invalid_credentials (bad access_key or
   secret) and 403 forbidden (valid SigV4 creds, but no admin
   role assignment), with the remediation specific to each case.

No functional changes. Doc-only.
Two Codex P2 findings on the latest review of #674:

1. **Cookie Secure attribute (line 55 + the TLS topology
   section)**: I wrote that `-adminAllowPlaintextNonLoopback`
   "enables the dashboard to mint cookies without the `Secure`
   attribute". That conflates two flags. In the actual wiring
   (auth_handler.go:114 + auth_handler.go:458-468 +
   main_admin.go:597), `secureCookie` is `!opts.InsecureCookie`
   which comes from `-adminAllowInsecureDevCookie`, **not**
   `-adminAllowPlaintextNonLoopback`. The latter only bypasses
   the startup TLS guard (config.go:112).

   Operationally this is the difference between "fails closed"
   and "broken end-to-end": an operator who reads the old doc
   and sets only `-adminAllowPlaintextNonLoopback` gets a
   listener that accepts plaintext, but the dashboard still
   mints `Secure=true` cookies that the browser refuses to send
   back over the plaintext channel — login appears to succeed
   and every subsequent request 401s, exactly the bug Codex was
   guarding against.

   Rewrote the line-55 entry to scope `-adminAllowPlaintextNonLoopback`
   to the listener TLS guard only, and added the cross-reference
   to `-adminAllowInsecureDevCookie` for the cookie attribute.
   Updated the loopback section to drop the misleading "Secure
   regardless" claim and explain when a loopback rig actually
   needs the dev-cookie flag. Added a paragraph at the bottom
   of "Discouraged: plaintext non-loopback" calling out that a
   working plaintext rig needs both flags, and that the two are
   deliberately separate so misconfiguration fails closed on
   either axis instead of silently downgrading both at once.

2. **SPA Retry-After claim (line 172)**: I wrote that "the SPA's
   API client honours Retry-After and re-issues the request
   once." Codex caught that web/admin/src/api/client.ts:75 does
   a single `fetch` call with no retry path, and useApi.ts has
   no 503 retry logic either. The doc was setting an operator
   expectation the code does not meet — under leader election a
   user would see the 503 surface directly and conclude
   "intermittent failure" instead of re-issuing the action.

   Rewrote the section to describe the current behaviour
   accurately: server emits `Retry-After: 1`, but the SPA
   surfaces the 503 directly and the operator must re-issue.
   Kept the header documented because a future client (or an
   external operator script driving the JSON API) can still
   honour it — it is part of the server's contract.

No code changes; doc-only. The cookie wiring claim was verified
against auth_handler.go and main_admin.go; the SPA claim was
verified against web/admin/src/api/client.ts and useApi.ts.
Codex P2 on the latest review of #674: the audit-log section
said the Audit middleware "is emitted for every non-GET/HEAD/
OPTIONS request that reaches the admin mux on this node,
regardless of which handler served it." That overstates coverage:

- `/auth/login` is wired through `publicAuth`, not `protect` —
  the middleware can't identify the actor without a pre-existing
  session, so it is intentionally skipped. AuthService emits
  `admin_audit action=login` directly (auth_handler.go:432).
- `/auth/logout` is wired through `protectNoAudit` — the
  middleware is omitted to avoid emitting two audit lines per
  logout (a generic one plus the action-specific one); the
  specific line is the one operators want (auth_handler.go:442).
- The `protect` chain is `BodyLimit → SessionAuth → Audit →
  CSRFDoubleSubmit → handler` (server.go:247-255). Audit only
  fires AFTER SessionAuth accepts the session, so an
  unauthenticated write attempt (missing / expired / invalid
  cookie) gets 401'd before reaching Audit and produces no
  middleware line. CSRF-rejected requests still produce one,
  by design, because the actor is known by then.

An operator building log-monitoring on top of the previous text
could conclude that every write attempt is guaranteed to
appear in `admin_audit`, then miss CSRF-bypassing probes
against an unauthenticated endpoint.

Rewrote the middleware-shape section to:
- Scope the rule to the protected mux chain.
- Explain the ordering choice (Audit between SessionAuth and
  CSRFDoubleSubmit) so the CSRF-rejection-still-audits guarantee
  is preserved in the doc.
- Enumerate the two auth endpoints that bypass the middleware
  and what they emit instead.
- Adjust the intro paragraph: protected-chain mutations produce
  two audit lines, auth endpoints produce one (was: "every
  state-changing request typically produces two").

No behaviour change; doc-only. Verified against
internal/admin/server.go:236-296 and
internal/admin/auth_handler.go:432-449.
Codex P2 on the latest review of #674: my line 260 said both
login and logout `admin_audit` entries include `claimed_actor`,
but `auditLogout` (auth_handler.go:442-449) only emits `action`,
`actor`, `remote`, and `status`. `claimed_actor` is login-only
because the field exists to distinguish the typed access key
from the authenticated one — logout has no claim to verify (the
actor comes from the already-validated session cookie).

A log pipeline that validated field presence against the
previous text would misclassify every successful logout as
malformed, exactly the operator-facing bug Codex flagged.

Replaced the one-line summary with two concrete sample lines —
one for login (with `claimed_actor`), one for logout (without)
— and a closing sentence telling parsers that `claimed_actor`
is present-only-on-login. Same shape verified against
auth_handler.go:432-449.

No code changes; doc-only.
@bootjp bootjp force-pushed the feat/admin-s3-writes branch from 07bdc20 to 885eb50 Compare April 26, 2026 17:57
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
adapter/s3_admin_test.go (1)

154-310: Consider consolidating the negative-path tests into table-driven cases.

The Create/PutACL/Delete suites each repeat the same scaffold (NewMVCCStore + NewS3Server + one call + ErrorIs) for read-only / invalid-ACL / invalid-name / missing-bucket cases. A single table per operation (or one shared table keyed on op + principal + want-err) would shrink ~150 lines to ~30 and make adding a new sentinel a one-line append.

As per coding guidelines: "Co-locate test files in the same package and use table-driven test patterns."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/s3_admin_test.go` around lines 154 - 310, Consolidate the repeated
negative-path tests (functions named TestS3Server_AdminCreateBucket_*,
TestS3Server_AdminPutBucketAcl_*, TestS3Server_AdminDeleteBucket_*) into
table-driven subtests: define a table with entries containing a name, principal
generator (e.g., fullAdminBucketsPrincipal() or
readOnlyAdminBucketsPrincipal()), bucket name, ACL (if applicable), any pre-test
setup closure (for example to create a bucket when testing existing/duplicate
cases), and expected error (ErrAdminForbidden, ErrAdminInvalidACL,
ErrAdminInvalidBucketName, ErrAdminBucketNotFound, ErrAdminBucketAlreadyExists,
ErrAdminBucketNotEmpty); then loop over the table with t.Run and for each case
create a fresh store via store.NewMVCCStore() and server via NewS3Server(...),
invoke the appropriate method (AdminCreateBucket, AdminPutBucketAcl, or
AdminDeleteBucket) and assert require.ErrorIs/require.NoError as specified; keep
helper functions fullAdminBucketsPrincipal(), readOnlyAdminBucketsPrincipal(),
and any object-creation setup (the HTTP PUT via server.handle) to reuse within
case.setup so behavior remains identical.
internal/admin/buckets_source.go (1)

44-44: Rename AdminPutBucketAcl to AdminPutBucketACL for acronym consistency with DTO field and Go conventions.

The method name uses lowercase Acl while the DTO field PutBucketACLRequest.ACL and all field references use uppercase. Go style conventions (and stylecheck/revive) keep acronyms uniformly capitalized. The refactor is mechanical: rename the interface method in buckets_source.go:44 and its implementations in adapter/s3_admin.go, main_admin.go, and tests. Also apply the same fix to the related ForwardPutBucketAcl method in the forward client.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/admin/buckets_source.go` at line 44, Rename the interface method
AdminPutBucketAcl to AdminPutBucketACL and update all implementations and
callers (including the forward client) to use the new name; also rename
ForwardPutBucketAcl to ForwardPutBucketACL to match the DTO field
PutBucketACLRequest.ACL and Go acronym conventions. Locate the method
declaration in the buckets source interface (AdminPutBucketAcl), update its
signature to AdminPutBucketACL(ctx context.Context, principal AuthPrincipal,
name, acl string) error, then change the corresponding method definitions in the
S3 admin adapter (function named AdminPutBucketAcl), the main admin
implementation (AdminPutBucketAcl), the forward client wrapper
(ForwardPutBucketAcl), and any tests or references to call the new
AdminPutBucketACL and ForwardPutBucketACL names. After renaming, run the
build/tests and fix any remaining references or imports that fail due to the
renamed symbols.
internal/admin/forward_server.go (2)

147-167: Optional: simplify checkOpAvailability's tri-valued return.

The (*pb.AdminForwardResponse, error, bool) shape returns error in the middle position and the error is always nil because rejectForward (and therefore notImplementedForwardResponse) never returns a non-nil error. The signature could be tightened to (*pb.AdminForwardResponse, bool) (or just return *pb.AdminForwardResponse and let nil mean "continue"), removing the dead err propagation in dispatchForward. Minor readability nit; defer if you'd rather keep the signature aligned with the gRPC method shape it's eventually reused alongside.

♻️ Sketch
-func (s *ForwardServer) checkOpAvailability(op pb.AdminOperation) (*pb.AdminForwardResponse, error, bool) {
+func (s *ForwardServer) checkOpAvailability(op pb.AdminOperation) *pb.AdminForwardResponse {
 	switch op {
 	case pb.AdminOperation_ADMIN_OP_CREATE_TABLE, pb.AdminOperation_ADMIN_OP_DELETE_TABLE:
 		if s.source == nil {
-			resp, err := notImplementedForwardResponse("DynamoDB")
-			return resp, err, false
+			resp, _ := notImplementedForwardResponse("DynamoDB")
+			return resp
 		}
 	...
 	}
-	return nil, nil, true
+	return nil
 }

And in dispatchForward:

-	if resp, err, ok := s.checkOpAvailability(op); !ok {
-		return resp, err
+	if resp := s.checkOpAvailability(op); resp != nil {
+		return resp, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/admin/forward_server.go` around lines 147 - 167, The
checkOpAvailability function currently returns (*pb.AdminForwardResponse, error,
bool) but the error slot is always nil because
notImplementedForwardResponse/rejectForward never return a non-nil error;
simplify the signature to either (*pb.AdminForwardResponse, bool) or just
*pb.AdminForwardResponse (where nil means "ok") and update all call sites
(notably dispatchForward) to stop expecting a middle error value; change the
function declaration for checkOpAvailability and adjust dispatchForward to
handle the new two-valued (or single-valued) return by removing the unused err
variable and using the returned AdminForwardResponse (from
notImplementedForwardResponse) to short-circuit when present.

251-295: Consider extracting a shared strict-decode helper for the bucket payloads.

handleCreateBucket (251–295) and handlePutBucketAcl (336–385) each repeat the same 4-step preamble — size cap, NUL-byte rejection, DisallowUnknownFields decode, dec.More() trailing-token rejection — that decodeNamedPayload already centralises for the {"name":"…"} shape. Generifying decodeNamedPayload over the body type (e.g. a decodeForwardedJSON[T any]) would let both handlers drop ~15 lines apiece and guarantee future ops cannot diverge on the size/NUL/strict-decode contract. Optional given the bodies are typed differently and the duplication is fairly mechanical.

Also applies to: 336-385

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/admin/forward_server.go` around lines 251 - 295, Both
handleCreateBucket and handlePutBucketAcl duplicate the same 4-step strict JSON
preamble (size limit, NUL-byte check, DisallowUnknownFields decode,
trailing-data check); extract that into a generic helper (e.g.
decodeForwardedJSON[T any](payload []byte, limit int) (T, error)) that enforces
adminForwardPayloadLimit, rejects NUL bytes, uses json.Decoder with
DisallowUnknownFields and verifies !dec.More(), returning either the decoded
typed body or a canonical error you can map to rejectForward; then replace the
inline decoding in handleCreateBucket and handlePutBucketAcl with calls to
decodeForwardedJSON and keep the subsequent validateCreateBucketRequest /
AdminCreateBucket call and logging unchanged.
adapter/s3_admin.go (1)

414-420: Add a comment explaining why BucketGenerationKey is not deleted.

The dispatch deletes only BucketMetaKey(name) while preserving BucketGenerationKey(name). This is necessary to prevent manifest key namespace collisions across bucket recreations: the preserved generation counter ensures a subsequent AdminCreateBucket on the same name picks up the next generation via nextBucketGenerationAt, so any manifest key orphaned by the TOCTOU race documented above (lines 362–384) cannot collide with the recreated bucket's generation namespace.

♻️ Suggested comment
+		// Preserve BucketGenerationKey so a recreated bucket with
+		// the same name gets a fresh generation, preventing manifest
+		// namespace collisions with keys orphaned by the TOCTOU race
+		// documented above.
 		_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
 			IsTxn:   true,
 			StartTS: startTS,
 			Elems: []*kv.Elem[kv.OP]{
 				{Op: kv.Del, Key: s3keys.BucketMetaKey(name)},
 			},
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapter/s3_admin.go` around lines 414 - 420, Add a clarifying comment in the
deletion operation explaining why we only remove BucketMetaKey(name) and
intentionally preserve BucketGenerationKey(name): note that keeping the
generation counter prevents manifest key namespace collisions when a bucket is
deleted and recreated (AdminCreateBucket uses nextBucketGenerationAt to pick a
new generation), so orphaned manifest keys from the TOCTOU window cannot collide
with a recreated bucket’s generation namespace; place this comment adjacent to
the s.coordinator.Dispatch call that deletes BucketMetaKey and reference
BucketGenerationKey, AdminCreateBucket, and nextBucketGenerationAt to make the
rationale discoverable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/admin.md`:
- Around line 228-230: The audit-log example line (the `admin_audit` example)
shows the wrong request path; update the example so the path includes the S3
segment by changing `path=/admin/api/v1/buckets` to
`path=/admin/api/v1/s3/buckets` (or alternatively replace it with a Dynamo
example like `/admin/api/v1/dynamo/tables` if you intend a Dynamo example) so
the example matches the API surface declared earlier.

---

Nitpick comments:
In `@adapter/s3_admin_test.go`:
- Around line 154-310: Consolidate the repeated negative-path tests (functions
named TestS3Server_AdminCreateBucket_*, TestS3Server_AdminPutBucketAcl_*,
TestS3Server_AdminDeleteBucket_*) into table-driven subtests: define a table
with entries containing a name, principal generator (e.g.,
fullAdminBucketsPrincipal() or readOnlyAdminBucketsPrincipal()), bucket name,
ACL (if applicable), any pre-test setup closure (for example to create a bucket
when testing existing/duplicate cases), and expected error (ErrAdminForbidden,
ErrAdminInvalidACL, ErrAdminInvalidBucketName, ErrAdminBucketNotFound,
ErrAdminBucketAlreadyExists, ErrAdminBucketNotEmpty); then loop over the table
with t.Run and for each case create a fresh store via store.NewMVCCStore() and
server via NewS3Server(...), invoke the appropriate method (AdminCreateBucket,
AdminPutBucketAcl, or AdminDeleteBucket) and assert
require.ErrorIs/require.NoError as specified; keep helper functions
fullAdminBucketsPrincipal(), readOnlyAdminBucketsPrincipal(), and any
object-creation setup (the HTTP PUT via server.handle) to reuse within
case.setup so behavior remains identical.

In `@adapter/s3_admin.go`:
- Around line 414-420: Add a clarifying comment in the deletion operation
explaining why we only remove BucketMetaKey(name) and intentionally preserve
BucketGenerationKey(name): note that keeping the generation counter prevents
manifest key namespace collisions when a bucket is deleted and recreated
(AdminCreateBucket uses nextBucketGenerationAt to pick a new generation), so
orphaned manifest keys from the TOCTOU window cannot collide with a recreated
bucket’s generation namespace; place this comment adjacent to the
s.coordinator.Dispatch call that deletes BucketMetaKey and reference
BucketGenerationKey, AdminCreateBucket, and nextBucketGenerationAt to make the
rationale discoverable.

In `@internal/admin/buckets_source.go`:
- Line 44: Rename the interface method AdminPutBucketAcl to AdminPutBucketACL
and update all implementations and callers (including the forward client) to use
the new name; also rename ForwardPutBucketAcl to ForwardPutBucketACL to match
the DTO field PutBucketACLRequest.ACL and Go acronym conventions. Locate the
method declaration in the buckets source interface (AdminPutBucketAcl), update
its signature to AdminPutBucketACL(ctx context.Context, principal AuthPrincipal,
name, acl string) error, then change the corresponding method definitions in the
S3 admin adapter (function named AdminPutBucketAcl), the main admin
implementation (AdminPutBucketAcl), the forward client wrapper
(ForwardPutBucketAcl), and any tests or references to call the new
AdminPutBucketACL and ForwardPutBucketACL names. After renaming, run the
build/tests and fix any remaining references or imports that fail due to the
renamed symbols.

In `@internal/admin/forward_server.go`:
- Around line 147-167: The checkOpAvailability function currently returns
(*pb.AdminForwardResponse, error, bool) but the error slot is always nil because
notImplementedForwardResponse/rejectForward never return a non-nil error;
simplify the signature to either (*pb.AdminForwardResponse, bool) or just
*pb.AdminForwardResponse (where nil means "ok") and update all call sites
(notably dispatchForward) to stop expecting a middle error value; change the
function declaration for checkOpAvailability and adjust dispatchForward to
handle the new two-valued (or single-valued) return by removing the unused err
variable and using the returned AdminForwardResponse (from
notImplementedForwardResponse) to short-circuit when present.
- Around line 251-295: Both handleCreateBucket and handlePutBucketAcl duplicate
the same 4-step strict JSON preamble (size limit, NUL-byte check,
DisallowUnknownFields decode, trailing-data check); extract that into a generic
helper (e.g. decodeForwardedJSON[T any](payload []byte, limit int) (T, error))
that enforces adminForwardPayloadLimit, rejects NUL bytes, uses json.Decoder
with DisallowUnknownFields and verifies !dec.More(), returning either the
decoded typed body or a canonical error you can map to rejectForward; then
replace the inline decoding in handleCreateBucket and handlePutBucketAcl with
calls to decodeForwardedJSON and keep the subsequent validateCreateBucketRequest
/ AdminCreateBucket call and logging unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 63834f9c-0607-4c5e-9e4f-3982aeca95cc

📥 Commits

Reviewing files that changed from the base of the PR and between ddc3b9b and 07bdc20.

⛔ Files ignored due to path filters (2)
  • proto/admin_forward.pb.go is excluded by !**/*.pb.go
  • proto/admin_forward_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (17)
  • adapter/s3_admin.go
  • adapter/s3_admin_test.go
  • docs/admin.md
  • internal/admin/buckets_source.go
  • internal/admin/forward_client.go
  • internal/admin/forward_client_test.go
  • internal/admin/forward_integration_test.go
  • internal/admin/forward_server.go
  • internal/admin/forward_server_test.go
  • internal/admin/s3_handler.go
  • internal/admin/s3_handler_test.go
  • internal/admin/server.go
  • main.go
  • main_admin.go
  • main_admin_forward.go
  • main_admin_forward_test.go
  • proto/admin_forward.proto
✅ Files skipped from review due to trivial changes (1)
  • proto/admin_forward.proto
🚧 Files skipped from review as they are similar to previous changes (2)
  • main_admin.go
  • internal/admin/server.go

Comment thread docs/admin.md
bootjp added 2 commits April 27, 2026 03:02
Operator-facing follow-up to docs/admin.md. The original operator
guide covered configuration semantics; this commit adds the
runbook side: how to actually deploy with the dashboard enabled,
how operators log in, and how to roll out role / key / TLS
changes without breaking sessions.

Three pieces, separable but mutually reinforcing:

1. **scripts/rolling-update.sh — admin-flag plumbing**

   The script previously had no awareness of the admin dashboard:
   none of the --admin* flags were forwarded to docker run, so an
   operator wanting to enable the dashboard had to fork the
   script or hand-edit the deployment manifest outside any
   reviewed code path. Added a build_admin_flags helper that the
   docker run composes alongside the existing s3_flags pattern:

   - ADMIN_ENABLED=false → emit nothing, skip every validation,
     no behaviour change for existing deployments.
   - ADMIN_ENABLED=true → require ADMIN_SESSION_SIGNING_KEY_FILE
     plus at least one role list, validate every referenced path
     exists / is readable on the remote host BEFORE
     stop_container fires, bind-mount each file read-only at the
     same in-container path, and pass the corresponding --admin*
     flags to the entrypoint.

   The pre-flight validation is deliberate: a missing signing-key
   file would otherwise hard-error the daemon at startup AFTER the
   previous container was already stopped, leaving the node
   unhealthy and the rollout's per-node health check exiting the
   script — but at most one node has by then been touched, so
   quorum survives. Failing earlier in build_admin_flags converts
   that to a no-op rollout with a script-level error pointing at
   the variable name.

   Also enforced the pair-or-neither contract for
   ADMIN_TLS_CERT_FILE / _KEY_FILE here, mirroring the daemon's
   own startup validation but at the script layer so the operator
   sees the variable name rather than a generic "exactly one of
   cert/key" daemon message.

2. **scripts/rolling-update.env.example — ADMIN_* block**

   Added the same nine variables as commented-out entries, with a
   leading paragraph pointing at docs/admin.md and the new
   docs/admin_deployment.md. ADMIN_ENABLED defaults to false so a
   user who copies the file verbatim and only fills in the data-
   plane variables gets the same behaviour as before.

3. **docs/admin_deployment.md — operator runbook**

   Six sections covering pre-deployment prep (HS256 key
   generation, credentials file, role lists, TLS material), the
   first-time rolling-update.sh invocation, the operator login
   flow + the four primary failure modes (401 / 403-login /
   403-write / 503), day-2 procedures (key add/remove,
   HS256 rotation with the previous-key window, TLS cert
   rotation, listener bind change, disabling the dashboard), and
   a failure-mode runbook for the messages an operator is most
   likely to hit (the "is not loopback but TLS is not configured"
   hard-error, the cookie-Secure mismatch that produces
   login-succeeds-then-401-on-every-request, the rollout-window
   role-flag inconsistency).

   The HS256 rotation section is the one that does not exist in
   admin.md: it walks through the two-key verification window
   using ADMIN_SESSION_SIGNING_KEY_PREVIOUS_FILE so a rotation
   does not invalidate existing sessions. Without this, an
   operator reading admin.md alone could not safely rotate keys.

   Cross-linked from docs/admin.md so a reader landing on the
   per-flag reference finds the runbook in one click. The
   deployment doc itself points back at admin.md for per-flag
   semantics so the two stay decoupled — admin.md owns the "what
   each flag means" contract, admin_deployment.md owns the "what
   to do as an operator" contract.

No code changes; bash syntax-check passes.
coderabbitai on PR #669 caught that the Audit-middleware sample
line on docs/admin.md:229 used `path=/admin/api/v1/buckets` but
the actual API surface declared earlier in the same doc is
`/admin/api/v1/s3/buckets`. An operator copy-pasting this into a
log-parser regex would miss every real S3 admin write.

Swapped to the correct path. The Dynamo path
(`/admin/api/v1/dynamo/tables`) was the other option suggested,
but the surrounding section already calls out that the Audit
middleware fires for the protected mux chain across all admin
resources — keeping S3 here matches the doc's working example
(create_bucket / put_bucket_acl / delete_bucket samples).
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 26, 2026

@claude review

Addressed the new coderabbitai 🟡 finding on docs/admin.md:229:

The Audit-middleware sample line used path=/admin/api/v1/buckets but the actual API surface is /admin/api/v1/s3/buckets. Operators copy-pasting the sample into log-parser regexes would have missed every real S3 admin write. Swapped to the correct path; the surrounding section already calls out that the middleware fires across all admin resources, so keeping S3 here matches the existing op-line samples for create_bucket / put_bucket_acl / delete_bucket.

The other three comments showing on cadffe4 are stale re-anchorings of the JWT-role / TOCTOU findings from 2026-04-26T12:27:57Z, all addressed in 71a8514.

Latest commit: 55c93c8. Doc-only.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Claude finished @bootjp's task in 43s —— View job


Code Review: admin S3 bucket write endpoints (P2 slice 2a) — final doc-fix pass

Todo list

  • Understand request (post-55c93c86 review: audit-log path fix in docs/admin.md)
  • Read docs/admin.md — verify the /s3/ segment fix at line 229
  • Confirm stale coderabbitai re-anchors are for findings already addressed in 71a8514
  • Post review

Finding — audit-log path: ✅ Fixed

docs/admin.md:229 now correctly reads:

admin_audit actor=AKIA_ADMIN role=full method=POST path=/admin/api/v1/s3/buckets status=201 ...

The /s3/ segment is present. Operators copy-pasting the line into log-parser regexes will match the actual path.

Stale coderabbitai re-anchors

The three coderabbitai comments showing against cadffe48 are re-anchorings of the JWT-role gate and TOCTOU findings from the 12:27 UTC run — both resolved in 71a8514. s3_handler.go:390-411 shows the aligned principalForWrite implementation with the unconditional JWT AllowsWrite() check followed by live-role re-validation, matching DynamoHandler exactly.

No new findings. This is a doc-only commit and the PR is ready to merge.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55c93c8626

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/rolling-update.sh
Comment on lines +841 to +843
if [[ "${ADMIN_ENABLED}" != "true" ]]; then
return 0
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Forward ADMIN_ env vars before gating admin flags*

This gate runs on the remote host, but in update_one_node the SSH env assignment still does not pass any ADMIN_* variables (I checked the env ... bash -s block), so ADMIN_ENABLED falls back to its default and this branch always returns early. In practice, setting ADMIN_ENABLED=true (and related key/cert settings) in deploy.env is ignored during rollout, and nodes come up without --adminEnabled or admin key/TLS mounts.

Useful? React with 👍 / 👎.

Comment thread scripts/rolling-update.sh
Comment on lines +797 to +800
# node with auth disabled.
local admin_flags=()
local admin_volumes=()
build_admin_flags admin_flags admin_volumes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Validate admin files before stopping the running container

The new build_admin_flags validation can exit 1 for missing admin key/cert files, but it is invoked inside run_container; the rollout flow stops the current container first and only then calls run_container (stop_container precedes run_container in this script). A bad admin path therefore turns a config typo into avoidable downtime by taking the node offline before failing. The validation step should run in preflight before stop_container.

Useful? React with 👍 / 👎.

@bootjp bootjp enabled auto-merge April 26, 2026 18:16
@bootjp bootjp merged commit 2bae541 into main Apr 26, 2026
12 of 13 checks passed
bootjp added a commit that referenced this pull request Apr 27, 2026
…669) (#678)

## Summary

Follow-up to #669 (now merged). The admin-flag plumbing I added in
that PR has a real bug I caught while re-reading the script: every
rollout would crash on the first remote node because the `ADMIN_*`
variables were never forwarded across the SSH heredoc.

## What was broken

`build_admin_flags` lives inside the remote SSH heredoc in
`update_one_node` (`bash -s <<'REMOTE'`), but the `env` block that
seeds the remote shell only forwarded the existing `IMAGE` /
`RAFT_PORT` / `EXTRA_ENV` / etc. variables — no `ADMIN_*`. With
`set -u` active on the remote, the first access of
`${ADMIN_ENABLED}` inside `build_admin_flags` crashes the rollout
**regardless** of whether admin is enabled (the helper is invoked
unconditionally from `run_container`).

So an operator running this script after #669 with the default
`ADMIN_ENABLED=false` would have seen `ADMIN_ENABLED: unbound variable`
on the first node touched, leaving at most one node restarted but
otherwise the cluster intact (the per-node health check would
exit non-zero before moving on).

## Fix

1. **Forward all 9 `ADMIN_*` variables through `env`**, alongside
   the existing forwarding pattern. Path-like values (`*_FILE`,
   `*_KEYS`, `ADDRESS`) get `printf %q` quoting at the bottom of
   the local script (matches the existing `RAFT_TO_REDIS_MAP_Q`
   etc. pattern). The three boolean flags (`ADMIN_ENABLED`,
`ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK`, `ADMIN_ALLOW_INSECURE_DEV_COOKIE`)
   are forwarded unquoted for readability — but only after a
   local validation pass that rejects anything other than the
   literal `true` / `false` so the unquoted forwarding stays
   metacharacter-safe.

2. **Defense-in-depth `:-` defaults inside `build_admin_flags`.**
   Every `ADMIN_*` reference inside the helper now reads through
   `${VAR:-}` once at the top into a local. A future refactor
   that ever drops one of the forwarded variables will produce
   the targeted "ADMIN_* required" error instead of an opaque
   `unbound variable` crash with no hint at which variable.

## Test plan

- [x] `bash -n scripts/rolling-update.sh` — passes
- [x] `ADMIN_ENABLED=invalid bash scripts/rolling-update.sh` →
      "must be 'true' or 'false', got 'invalid'"
- [x] `ADMIN_ALLOW_PLAINTEXT_NON_LOOPBACK=yes` → same validator catches
it
- [x] `ADMIN_ENABLED=true` (no signing key) → reaches the remote branch
      where build_admin_flags would 'aborting' on the missing key
- [ ] End-to-end rollout against a 3-node staging cluster with
      `ADMIN_ENABLED=true` (operator to verify before merging into
      a production deploy.env)
- [ ] End-to-end rollout with `ADMIN_ENABLED=false` (the previously
      broken default path)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Improved validation and error reporting for deployment configuration
parameters.
* Enhanced environment variable forwarding and path handling in
deployment scripts to increase reliability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
bootjp added a commit that referenced this pull request Apr 27, 2026
## Summary

State-refresh of `docs/design/2026_04_24_partial_admin_dashboard.md` now
that all four phases of the original P1–P4 plan have shipped.

| Phase | Was | Now |
|---|---|---|
| P1 | shipped | shipped (no change) |
| P2 | partial — read-only #658 only | **shipped** (+ #669 writes + #673
forwarding) |
| P3 | shipped | shipped (no change) |
| P4 | mostly | **shipped** (+ operator doc #674, deployment runbook
#669, script wiring #669+#678) |

Also added the **AdminDeleteBucket TOCTOU** to Outstanding open items —
coderabbitai flagged it during PR #669 review (pre-existing race,
`s3.go:deleteBucket` inherits the same shape; recorded as a code comment
on `AdminDeleteBucket` for the future fix).

Doc stays at `_partial_` rather than promoting to `_implemented_`
because two outstanding items remain: AdminForward acceptance criterion
5 (rolling-upgrade flag, explicitly deferred at design time) and the new
TOCTOU. Updated the closing rename trigger to cover both.

## Test plan

- [x] No design changes — this is a state refresh
- [x] Status table entries cross-checked against the merge commits on
main
- [x] AdminDeleteBucket TOCTOU description matches the code comment in
adapter/s3_admin.go and the operator-side workaround in
docs/admin_deployment.md
bootjp added a commit that referenced this pull request Apr 28, 2026
Per docs/design/README.md's lifecycle convention. The original
P1–P4 plan has fully shipped:

  - P1 (admin skeleton + Dynamo + AdminForward) — #634/#635/#644/#648
  - P2 (S3 endpoints incl. write paths and AdminForward integration)
        — #658 / #669 / #673 / #695 (TOCTOU safety net)
  - P3 (React SPA + embed) — #649 / #650
  - P4 (TLS / role / CSRF / operator doc / deployment runbook
        / scripts/rolling-update.sh admin support) — #674 / #669
        / #678

The AdminDeleteBucket TOCTOU caught during PR #669 review (the
last "in-flight" item that kept the doc at _partial_) is fully
resolved by the safety-net design landed in #695.

What changed:

  - git mv 2026_04_24_partial_admin_dashboard.md →
            2026_04_24_implemented_admin_dashboard.md (history
            follows the rename)
  - Header Status line: "Partial" → "Implemented", explanation
    updated to reflect the post-fix state and the rationale for
    promotion.
  - "Last updated" bumped to 2026-04-28 with the rename trigger.
  - Section heading "Outstanding open items" → "Out-of-scope
    follow-ups" — the remaining three entries (criterion 5,
    object browser, TLS hot-reload) are not in-flight work; they
    are deferred-at-design or Non-goal items. The TOCTOU bullet
    is removed (resolved) and replaced with a one-line
    cross-link to the safety-net design + admin_deployment.md
    §4.6 contract.
  - Removed the closing "rename trigger" sentence — we just did
    the rename.
  - Status table: P2 row now lists #695 alongside #658/#669/#673
    so a future reader can find the TOCTOU fix from the index.
  - Cross-references updated everywhere the old filename
    appeared:
      docs/admin.md (header link + Cross-references)
      docs/admin_deployment.md (header link + final cross-ref)
      docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md
        (Background section pointer)
      internal/admin/config.go (Section 7.1 reference comment)

No code changes other than the comment-only filename refresh in
config.go.
bootjp added a commit that referenced this pull request Apr 28, 2026
## Summary

Promote the admin dashboard design doc from `_partial_` →
`_implemented_` per `docs/design/README.md`'s lifecycle convention. PR
#695 landed the TOCTOU safety-net fix (the last in-flight item that kept
the doc at `_partial_`), so the original P1–P4 plan is now fully
shipped:

| Phase | Landed via |
|---|---|
| P1 (admin skeleton + Dynamo + AdminForward) | #634 / #635 / #644 /
#648 |
| P2 (S3 endpoints + writes + AdminForward S3 + TOCTOU fix) | #658 /
#669 / #673 / **#695** |
| P3 (React SPA + embed) | #649 / #650 |
| P4 (TLS / role / CSRF / operator doc / runbook / script wiring) | #674
/ #669 / #678 |

The remaining three items in the doc move from "Outstanding open items"
(in-flight) to **"Out-of-scope follow-ups"** (deferred-at-design or
Non-goal):

- AdminForward criterion 5 — rolling-upgrade flag, deferred behind a
cluster-version bump that doesn't exist yet
- S3 object browser — Non-goal per §2.2
- TLS cert hot-reload — out of scope per `docs/admin.md`

## Changes

- `git mv` partial → implemented (history follows the rename)
- Status line / Last-updated / status-table / Out-of-scope section
content reflects the promotion
- Cross-references updated in all 4 referencing files: `docs/admin.md`,
`docs/admin_deployment.md`,
`docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md`,
`internal/admin/config.go` (comment-only)

## Test plan

- [x] No code changes other than a comment-only filename refresh in
`config.go`
- [x] `go build ./...` passes
- [x] `golangci-lint run ./internal/admin/...` — 0 issues
- [x] `grep -rn "2026_04_24_partial\|2026_04_24_proposed_admin"` returns
nothing — no stale references
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant