Commit 69c808c
s3: split bucket delete into two-phase dispatch (Codex P1 #695)
Codex P1 on PR #695 caught that the original single-OperationGroup shape would
be rejected by the production coordinator on every bucket delete. The fix lands
here together with a test-coordinator tightening that catches this class of bug
in unit tests.

Background: kv/sharded_coordinator.go:dispatchDelPrefixBroadcast rejects any
OperationGroup that mixes DEL_PREFIX with other ops AND any OperationGroup that
uses DEL_PREFIX inside a transaction. The earlier shape
([Del BucketMetaKey, DelPrefix×6] with IsTxn=true) hit both rejection rules, so
production would have failed every bucket delete with ErrInvalidRequest. The
local test coordinator bypassed those checks, so the regression test passed
despite the production-breaking shape.

The fix is two-part:

1. **Split AdminDeleteBucket and s3.go:deleteBucket into two Dispatch calls.**
   Phase 1 — `Del BucketMetaKey` in a txn (OCC-protected against a concurrent
   AdminCreateBucket racing the delete). It lives inside retryS3Mutation so an
   OCC conflict retries the whole closure. Phase 2 — DEL_PREFIX broadcast over
   every per-bucket key family in a non-txn OperationGroup. It lives outside
   retryS3Mutation because Phase 1 is the point of no return: a Phase-2 retry
   would 404 at loadBucketMetaAt. A Phase-2 failure is logged via
   slog.WarnContext and not propagated: the bucket meta is already gone from
   the operator's POV, and orphan keys (if any) are no worse than the pre-fix
   state on main and can be recovered by a future sweep tool. Surfacing a 500
   to the operator after a successful delete would be a worse UX.
   Phase-1-first ordering is deliberate: a Phase-2-first ordering could leave
   the bucket meta extant while per-bucket data was wiped if Phase 1 then
   failed (concurrent recreate). Phase-1-first localises any partial failure
   to "bucket gone, orphan data may persist", which has a clean audit trail.

2. **Tighten localAdapterCoordinator validation** to mirror the production
   coordinator's dispatch-time rejection rules: reject IsTxn=true with any
   DelPrefix, and reject mixed Del+DelPrefix groups. Without this, a future
   regression that ships the rejected shape would silently pass tests while
   breaking production. The existing
   TestS3Server_AdminDeleteBucket_SweepsOrphansAcrossAllPerBucketPrefixes test
   now exercises the production-realistic dispatch path and would have caught
   Codex P1 directly.

Refactor: the single `bucketDeleteOperationGroupElems` helper is split into
`bucketDeleteSafetyNetElems` (DEL_PREFIX-only, used by Phase 2) and a new
`runBucketDeleteSafetyNet` method that shares the dispatch-and-log logic
between the Admin and SigV4 paths. The Phase-1 Del op shape is small enough to
inline at each call site.

Design doc §6.2 is rewritten with the two-phase rationale, the Codex P1
finding, the Phase-2 failure semantics, and why Phase-1-first ordering is
correct.

Tests:
- go test -count=1 -run TestS3Server_AdminDeleteBucket ./adapter/ — passes
- go test -race -count=1 -run TestS3 ./adapter/ — passes (1.455s)
- golangci-lint run ./adapter/... — 0 issues
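For orientation, here is a condensed sketch of the before/after dispatch shapes. It is illustrative only and not compilable on its own: `coordinator`, `ctx`, `startTS`, `bucket`, and `gen` stand in for values the real call sites get from the surrounding handler, and the types come from the `kv` and `s3keys` packages shown in the diffs below.

```go
// Rejected (pre-fix) shape: trips BOTH dispatchDelPrefixBroadcast rules,
// since it puts DEL_PREFIX inside a txn AND mixes it with Del.
_, err := coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
	IsTxn:   true,
	StartTS: startTS,
	Elems: []*kv.Elem[kv.OP]{
		{Op: kv.Del, Key: s3keys.BucketMetaKey(bucket)},
		{Op: kv.DelPrefix, Key: s3keys.ObjectManifestPrefixForBucket(bucket, gen)},
		// ...five more DelPrefix elems, one per per-bucket key family.
	},
}) // production coordinator: ErrInvalidRequest

// Accepted (post-fix) shape: Phase 1 commits the meta delete in a txn...
_, err = coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
	IsTxn:   true,
	StartTS: startTS,
	Elems:   []*kv.Elem[kv.OP]{{Op: kv.Del, Key: s3keys.BucketMetaKey(bucket)}},
})
// ...and only after it succeeds, Phase 2 broadcasts the non-txn sweep.
_, _ = coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
	Elems: bucketDeleteSafetyNetElems(bucket, gen),
})
```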
1 parent 5f48cd7 commit 69c808c

4 files changed: 206 additions & 58 deletions

adapter/dynamodb_migration_test.go

Lines changed: 43 additions & 0 deletions
```diff
@@ -8,6 +8,7 @@ import (
 
 	"github.com/bootjp/elastickv/kv"
 	"github.com/bootjp/elastickv/store"
+	"github.com/cockroachdb/errors"
 	"github.com/stretchr/testify/require"
 )
 
@@ -27,6 +28,9 @@ func (c *localAdapterCoordinator) Dispatch(ctx context.Context, req *kv.Operatio
 	if req == nil {
 		return &kv.CoordinateResponse{}, nil
 	}
+	if err := c.validateDispatchShape(req); err != nil {
+		return nil, err
+	}
 	commitTS, err := c.commitTSForRequest(req)
 	if err != nil {
 		return nil, err
@@ -37,6 +41,45 @@ func (c *localAdapterCoordinator) Dispatch(ctx context.Context, req *kv.Operatio
 	return &kv.CoordinateResponse{}, nil
 }
 
+// validateDispatchShape mirrors the production coordinator's
+// dispatch-time rejection rules so tests catch the same class of
+// bug that real clusters would reject. Specifically:
+//
+//   - DEL_PREFIX cannot be in a transactional OperationGroup
+//     (kv/sharded_coordinator.go: dispatchDelPrefixBroadcast
+//     refuses IsTxn=true).
+//   - DEL_PREFIX cannot be mixed with Put or Del in the same
+//     OperationGroup (validateDelPrefixOnly enforces all-or-none).
+//
+// Without these checks, a regression that ships
+// `IsTxn:true with [Del, DelPrefix...]` (Codex P1 on PR #695)
+// would silently pass the local coordinator while production
+// rejected every bucket delete with ErrInvalidRequest.
+func (c *localAdapterCoordinator) validateDispatchShape(req *kv.OperationGroup[kv.OP]) error {
+	hasDelPrefix := false
+	hasOther := false
+	for _, elem := range req.Elems {
+		if elem == nil {
+			continue
+		}
+		if elem.Op == kv.DelPrefix {
+			hasDelPrefix = true
+		} else {
+			hasOther = true
+		}
+	}
+	if !hasDelPrefix {
+		return nil
+	}
+	if req.IsTxn {
+		return errors.Wrap(kv.ErrInvalidRequest, "DEL_PREFIX not supported in transactions")
+	}
+	if hasOther {
+		return errors.Wrap(kv.ErrInvalidRequest, "DEL_PREFIX cannot be mixed with other operations")
+	}
+	return nil
+}
+
 func (c *localAdapterCoordinator) commitTSForRequest(req *kv.OperationGroup[kv.OP]) (uint64, error) {
 	if req == nil {
 		return 0, nil
```
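As an illustration of what the tightened coordinator buys, a test shipping the exact Codex P1 shape now fails fast. The test below is hypothetical (not part of this commit); the zero-value coordinator is assumed sufficient because, per the diff above, `Dispatch` validates the shape before touching any coordinator state.

```go
// Hypothetical regression test, sketched against the tightened coordinator.
func TestLocalAdapterCoordinator_RejectsCodexP1Shape(t *testing.T) {
	c := &localAdapterCoordinator{} // zero value: shape validation runs first
	_, err := c.Dispatch(context.Background(), &kv.OperationGroup[kv.OP]{
		IsTxn: true,
		Elems: []*kv.Elem[kv.OP]{
			{Op: kv.Del, Key: []byte("bucket-meta")},
			{Op: kv.DelPrefix, Key: []byte("bucket-data/")},
		},
	})
	require.ErrorIs(t, err, kv.ErrInvalidRequest)
}
```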

adapter/s3.go

Lines changed: 18 additions & 9 deletions
```diff
@@ -636,6 +636,7 @@ func (s *S3Server) headBucket(w http.ResponseWriter, r *http.Request, bucket str
 }
 
 func (s *S3Server) deleteBucket(w http.ResponseWriter, r *http.Request, bucket string) {
+	var deletedGeneration uint64
 	err := s.retryS3Mutation(r.Context(), func() error {
 		readTS := s.readTS()
 		startTS := s.txnStartTS(readTS)
@@ -669,24 +670,32 @@ func (s *S3Server) deleteBucket(w http.ResponseWriter, r *http.Request, bucket s
 			}
 		}
 
-		// Same DEL_PREFIX safety net as AdminDeleteBucket — see
-		// design doc 2026_04_28_proposed_admin_delete_bucket_safety_net.md
-		// for the race analysis. The empty-probe above is racy
-		// against concurrent PutObject; the DEL_PREFIX ops in the
-		// shared OperationGroup tombstone every per-bucket prefix
-		// at the commit timestamp so anything that snuck in after
-		// readTS is swept along with BucketMetaKey.
+		// Phase 1: Del BucketMetaKey in a txn (OCC-protected
+		// against concurrent createBucket racing this delete).
+		// Phase 2 (DEL_PREFIX safety net) runs outside the txn
+		// because the production coordinator rejects DEL_PREFIX
+		// inside transactions and rejects mixed Del+DelPrefix
+		// groups (kv/sharded_coordinator.go: dispatchDelPrefixBroadcast).
+		// See AdminDeleteBucket's doc comment for the full
+		// rationale.
 		_, err = s.coordinator.Dispatch(r.Context(), &kv.OperationGroup[kv.OP]{
 			IsTxn:   true,
 			StartTS: startTS,
-			Elems:   bucketDeleteOperationGroupElems(bucket, meta.Generation),
+			Elems:   []*kv.Elem[kv.OP]{{Op: kv.Del, Key: s3keys.BucketMetaKey(bucket)}},
 		})
-		return errors.WithStack(err)
+		if err != nil {
+			return errors.WithStack(err)
+		}
+		deletedGeneration = meta.Generation
+		return nil
 	})
 	if err != nil {
 		writeS3MutationError(w, err, bucket, "")
 		return
 	}
+	// Phase 2: best-effort DEL_PREFIX safety net. See
+	// AdminDeleteBucket / runBucketDeleteSafetyNet for the contract.
+	s.runBucketDeleteSafetyNet(r.Context(), bucket, deletedGeneration)
 	w.WriteHeader(http.StatusNoContent)
 }
 
```
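The `deletedGeneration` capture above follows a general pattern worth spelling out: the retried closure writes into a variable in the enclosing scope, and only the attempt that actually commits sets it. A minimal generic sketch, with stand-in names (`retryOnConflict`, `commitPhase1Txn`, and `sweepPhase2` are hypothetical; the repo's `retryS3Mutation` has its own signature and retry predicate):

```go
var deletedGeneration uint64
err := retryOnConflict(func() error {
	gen, err := commitPhase1Txn() // stands in for the txn Dispatch above
	if err != nil {
		return err // an OCC conflict makes the helper re-run this closure
	}
	deletedGeneration = gen // set only by the attempt that commits
	return nil
})
if err == nil {
	// Phase 2 runs exactly once, after the retry loop has settled;
	// retrying it inside the closure would 404 on the meta reload.
	sweepPhase2(deletedGeneration)
}
```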

adapter/s3_admin.go

Lines changed: 81 additions & 39 deletions
```diff
@@ -3,6 +3,7 @@ package adapter
 import (
 	"bytes"
 	"context"
+	"log/slog"
 	"sort"
 	"strings"
 
@@ -359,34 +360,46 @@ func (s *S3Server) AdminPutBucketAcl(ctx context.Context, principal AdminPrincip
 // bucket-must-be-empty rule mirrors the SigV4 deleteBucket path —
 // the dashboard cannot force a recursive delete, by design.
 //
-// The empty-probe (ScanAt with limit=1 on the manifest prefix) is
-// the operator-facing UX: a non-empty bucket returns 409 with
-// ErrAdminBucketNotEmpty. The probe is racy on its own — a
-// PutObject that commits between readTS and the delete's commitTS
-// is not visible to the probe and would have left orphan keys
-// under a deleted bucket meta. See design doc
-// 2026_04_28_proposed_admin_delete_bucket_safety_net.md for the
-// race analysis. The DEL_PREFIX ops appended to the dispatch
-// below close that window: the same OperationGroup wipes every
-// per-bucket prefix at the shared commitTS, so anything that
-// snuck in during the race is tombstoned together with the
-// bucket meta.
+// The dispatch happens in two phases because the production
+// coordinator (kv/sharded_coordinator.go: dispatchDelPrefixBroadcast)
+// rejects DEL_PREFIX inside a transaction and rejects DEL_PREFIX
+// mixed with Del or Put in the same OperationGroup:
+//
+//	Phase 1: Del BucketMetaKey in a txn (OCC-protected against
+//	         a concurrent AdminCreateBucket landing between our
+//	         readTS and commitTS).
+//	Phase 2: DEL_PREFIX over every per-bucket key family in a
+//	         non-txn broadcast — the safety net that sweeps
+//	         orphans left by any PutObject that committed
+//	         chunks/manifest between the empty-probe and the
+//	         Phase-1 commit. See design doc
+//	         2026_04_28_proposed_admin_delete_bucket_safety_net.md
+//	         §6.2 for the original single-OperationGroup design
+//	         and the dispatch-shape rejection that forced the
+//	         two-phase split.
+//
+// Phase 2 is best-effort: a Phase-2 failure leaves the bucket meta
+// already deleted (Phase 1 succeeded) but per-bucket prefixes
+// possibly still containing orphans. That state is no worse than
+// the pre-fix behaviour on main and recovers on operator-driven
+// re-cleanup. We log a warning rather than propagate the error so
+// the operator-visible delete reports success — the bucket really
+// is gone from the API surface, and a retry would 404 because
+// loadBucketMetaAt no longer finds the meta.
 //
 // BucketGenerationKey is intentionally NOT deleted. Re-creating
-// the bucket bumps the generation; orphan blobs that escaped
-// this delete (e.g. on an older generation from a previous
-// delete-recreate cycle) stay isolated under the old generation
-// prefix and never surface in the new bucket. Removing the
-// generation key would lose this property — pinned by
-// TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives.
+// the bucket bumps the generation; orphan blobs that escaped this
+// delete (e.g. on an older generation) stay isolated under the
+// old generation prefix and never surface in the new bucket.
+// Pinned by TestS3Server_AdminDeleteBucket_BucketGenerationKeySurvives.
 //
-// The contract change for clients: a PutObject that returned
-// 200 OK during the race window can have its data swept by the
-// concurrent delete. Operators are advised to pause writes
-// before AdminDeleteBucket; the alternative (orphan objects
-// that no API can enumerate or remove) is strictly worse.
+// The contract change for clients: a PutObject that returned 200
+// OK during the race window can have its data swept by the
+// concurrent delete. Operators are advised to pause writes before
+// AdminDeleteBucket; the alternative (orphan objects that no API
+// can enumerate or remove) is strictly worse.
 //
-// The same DEL_PREFIX shape is mirrored on the SigV4 path
+// The same shape is mirrored on the SigV4 path
 // (adapter/s3.go:deleteBucket) so both delete entrypoints share
 // the same race-window guarantees.
 func (s *S3Server) AdminDeleteBucket(ctx context.Context, principal AdminPrincipal, name string) error {
@@ -397,6 +410,7 @@ func (s *S3Server) AdminDeleteBucket(ctx context.Context, principal AdminPrincip
 		return ErrAdminNotLeader
 	}
 
+	var deletedGeneration uint64
 	err := s.retryS3Mutation(ctx, func() error {
 		readTS := s.readTS()
 		startTS := s.txnStartTS(readTS)
@@ -418,39 +432,49 @@ func (s *S3Server) AdminDeleteBucket(ctx context.Context, principal AdminPrincip
 		if len(kvs) > 0 {
 			return ErrAdminBucketNotEmpty
 		}
+		// Phase 1: Del BucketMetaKey in a txn so a concurrent
+		// AdminCreateBucket racing the delete is rejected by OCC.
+		// retryS3Mutation handles ErrWriteConflict / ErrTxnLocked
+		// by re-running this whole closure.
 		_, err = s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
 			IsTxn:   true,
 			StartTS: startTS,
-			Elems:   bucketDeleteOperationGroupElems(name, meta.Generation),
+			Elems:   []*kv.Elem[kv.OP]{{Op: kv.Del, Key: s3keys.BucketMetaKey(name)}},
 		})
-		return errors.WithStack(err)
+		if err != nil {
+			return errors.WithStack(err)
+		}
+		deletedGeneration = meta.Generation
+		return nil
 	})
 	if err != nil {
 		return err //nolint:wrapcheck // sentinel errors propagate as-is.
 	}
+	// Phase 2: best-effort safety-net DEL_PREFIX. Outside the
+	// retryS3Mutation closure because retrying after Phase 1
+	// committed would 404 at loadBucketMetaAt; we want the error
+	// (if any) logged but not propagated to the operator.
+	s.runBucketDeleteSafetyNet(ctx, name, deletedGeneration)
 	return nil
 }
 
-// bucketDeleteOperationGroupElems returns the OperationGroup elem
-// list for a bucket delete: a Del on BucketMetaKey followed by
-// DEL_PREFIX over each per-bucket key family. Shared between
-// AdminDeleteBucket and the SigV4 deleteBucket path so both
-// entrypoints sweep the same set of orphan-prone prefixes; if a
-// future per-bucket key family is added, updating this one helper
-// covers both delete paths in lockstep.
+// bucketDeleteSafetyNetElems returns the DEL_PREFIX elem list for
+// the Phase-2 safety-net dispatch shared between AdminDeleteBucket
+// and the SigV4 deleteBucket path. One helper so a future
+// per-bucket key family added to the data plane covers both delete
+// entrypoints in lockstep.
 //
 // BucketGenerationKey is intentionally not in the list — see the
 // AdminDeleteBucket doc comment for the orphan-isolation rationale.
 //
 // The 6 DEL_PREFIX ops broadcast across every shard
 // (kv/sharded_coordinator.go: DEL_PREFIX cannot be routed to a
-// single shard). This is acceptable because (a) the empty-probe
-// already confirmed the manifest prefix is empty in the common
-// case, so per-shard scans return 0 keys, (b) AdminDeleteBucket /
-// SigV4 deleteBucket are operator-frequency, not data-plane.
-func bucketDeleteOperationGroupElems(bucket string, generation uint64) []*kv.Elem[kv.OP] {
+// single shard). Acceptable because (a) the empty-probe already
+// confirmed the manifest prefix is empty in the common case, so
+// per-shard scans return 0 keys, (b) bucket delete is operator-
+// frequency, not data-plane.
+func bucketDeleteSafetyNetElems(bucket string, generation uint64) []*kv.Elem[kv.OP] {
 	return []*kv.Elem[kv.OP]{
-		{Op: kv.Del, Key: s3keys.BucketMetaKey(bucket)},
 		{Op: kv.DelPrefix, Key: s3keys.ObjectManifestPrefixForBucket(bucket, generation)},
 		{Op: kv.DelPrefix, Key: s3keys.UploadMetaPrefixForBucket(bucket, generation)},
 		{Op: kv.DelPrefix, Key: s3keys.UploadPartPrefixForBucket(bucket, generation)},
@@ -460,6 +484,24 @@ func bucketDeleteOperationGroupElems(bucket string, generation uint64) []*kv.Ele
 	}
 }
 
+// runBucketDeleteSafetyNet runs the Phase-2 DEL_PREFIX dispatch
+// and swallows transport / cluster errors after logging — the
+// caller has already deleted the bucket meta and the operator-
+// visible state is consistent with that. Shared between admin and
+// SigV4 paths.
+func (s *S3Server) runBucketDeleteSafetyNet(ctx context.Context, bucket string, generation uint64) {
+	if _, err := s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
+		Elems: bucketDeleteSafetyNetElems(bucket, generation),
+	}); err != nil {
+		slog.WarnContext(ctx,
+			"bucket delete safety-net DEL_PREFIX failed; bucket meta is gone but orphan sweep incomplete",
+			slog.String("bucket", bucket),
+			slog.Uint64("generation", generation),
+			slog.String("error", err.Error()),
+		)
+	}
+}
+
 // adminCanonicalACL normalises an empty input to the canned
 // "private" default. The SigV4 createBucket / putBucketAcl paths
 // apply the same default after trimming the x-amz-acl header.
```
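The "covers both delete entrypoints in lockstep" claim is easiest to see with a hypothetical extension: if a new per-bucket key family were ever added, a single edit to `bucketDeleteSafetyNetElems` would make both the Admin and SigV4 sweeps pick it up. `TagIndexPrefixForBucket` below is invented for illustration and does not exist in `s3keys`.

```go
// Hypothetical future edit: one new elem, both delete entrypoints covered.
func bucketDeleteSafetyNetElems(bucket string, generation uint64) []*kv.Elem[kv.OP] {
	return []*kv.Elem[kv.OP]{
		// ...the six existing DEL_PREFIX elems...
		{Op: kv.DelPrefix, Key: s3keys.TagIndexPrefixForBucket(bucket, generation)}, // invented
	}
}
```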

docs/design/2026_04_28_proposed_admin_delete_bucket_safety_net.md

Lines changed: 64 additions & 10 deletions
````diff
@@ -170,21 +170,75 @@ identical to the existing `ObjectManifestPrefixForBucket`.
 
 ### 6.2 `AdminDeleteBucket` (and `s3.go:deleteBucket`) commit shape
 
+The original revision of this design proposed a **single
+OperationGroup** carrying both the `Del BucketMetaKey` and the
+six `DelPrefix` ops, relying on the FSM to apply them all at one
+commitTS. That shape is rejected by the production coordinator
+(Codex P1 on PR #695):
+
+- `kv/sharded_coordinator.go:dispatchDelPrefixBroadcast` rejects
+  any `OperationGroup` containing `DelPrefix` when `IsTxn` is
+  true: *"DEL_PREFIX not supported in transactions"*.
+- The same broadcast path runs `validateDelPrefixOnly` and rejects
+  mixed `Del` / `Put` + `DelPrefix` groups: *"DEL_PREFIX cannot be
+  mixed with other operations"*.
+
+Together those two rules forbid the single-group shape. The
+implementation splits into two `Dispatch` calls:
+
 ```go
-elems := []*kv.Elem[kv.OP]{
-    {Op: kv.Del, Key: BucketMetaKey(name)},
-    {Op: kv.DelPrefix, Key: ObjectManifestPrefixForBucket(name, gen)},
-    {Op: kv.DelPrefix, Key: UploadMetaPrefixForBucket(name, gen)},
-    {Op: kv.DelPrefix, Key: UploadPartPrefixForBucket(name, gen)},
-    {Op: kv.DelPrefix, Key: BlobPrefixForBucket(name, gen)},
-    {Op: kv.DelPrefix, Key: GCUploadPrefixForBucket(name, gen)},
-    {Op: kv.DelPrefix, Key: RoutePrefixForBucket(name, gen)},
-}
+// Phase 1: Del BucketMetaKey in a txn (OCC-protected against
+// concurrent AdminCreateBucket racing this delete).
+s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
+    IsTxn:   true,
+    StartTS: startTS,
+    Elems:   []*kv.Elem[kv.OP]{{Op: kv.Del, Key: BucketMetaKey(name)}},
+})
+
+// Phase 2: DEL_PREFIX safety net broadcast (non-txn). Only runs
+// after Phase 1 commits; failure here is best-effort and logged
+// rather than propagated.
+s.coordinator.Dispatch(ctx, &kv.OperationGroup[kv.OP]{
+    Elems: []*kv.Elem[kv.OP]{
+        {Op: kv.DelPrefix, Key: ObjectManifestPrefixForBucket(name, gen)},
+        {Op: kv.DelPrefix, Key: UploadMetaPrefixForBucket(name, gen)},
+        {Op: kv.DelPrefix, Key: UploadPartPrefixForBucket(name, gen)},
+        {Op: kv.DelPrefix, Key: BlobPrefixForBucket(name, gen)},
+        {Op: kv.DelPrefix, Key: GCUploadPrefixForBucket(name, gen)},
+        {Op: kv.DelPrefix, Key: RoutePrefixForBucket(name, gen)},
+    },
+})
 ```
 
 The empty-probe stays as the operator-facing UX: when the bucket
 is genuinely non-empty (no race), the operator still sees 409 and
-the `DEL_PREFIX` ops never run.
+neither phase runs.
+
+**Phase-2 failure semantics**: Phase 1 is the point of no return.
+If Phase 2's `Dispatch` returns an error (transport blip,
+cluster transient, etc.), the bucket meta is already gone but
+the per-bucket prefixes may still contain orphans. The operator-
+visible delete reports success (the bucket really is gone from
+the API surface, and a retry would 404 at `loadBucketMetaAt`) and
+the failure is recorded via `slog.WarnContext`. The resulting
+state is no worse than the pre-fix behaviour on main — orphans
+were already the failure mode the original race produced. A
+future cluster-wide sweep tool (when one exists) can recover the
+disk space.
+
+**Why not Phase 2 first?** A "DEL_PREFIX before Del" ordering
+would wipe per-bucket data while the bucket meta still exists. If
+Phase 1 then fails (concurrent recreate races the OCC), readers
+see "bucket exists" but their chunks/manifests don't — confusing
+state with no clean recovery. Phase-1-first localises any partial
+failure to "bucket gone, orphan data may persist", which has a
+well-defined audit trail in slog.
+
+**Test-coordinator parity**: `adapter/dynamodb_migration_test.go`
+mirrors the production rejection rules in `localAdapterCoordinator.
+validateDispatchShape`. Without that, the original single-group
+shape passed local tests while production rejected every bucket
+delete with `ErrInvalidRequest` — exactly the gap Codex P1 caught.
 
 ### 6.3 Apply-time cost
 
````