
Commit 1fb39ba

docs(design): address Claude bot round-5 review (1 MEDIUM, 4 LOW)
s3_raft_blob_offload.md — MEDIUM: §3.2 PushChunkBlob latency claim ("PUT p99 ≈ legacy") was load-bearing but the local-write/peer-push pipeline was unspecified. Sequential ordering would silently double the per-chunk latency (`chunkBlobMinReplicas × fsync_latency`). Pin the contract: local Pebble write and PushChunkBlob fan out **concurrently**; multiple followers' pushes are **fanned out in parallel**, not sequentially. Update the flow diagram to show the pipeline explicitly and call out that this is part of the contract, not an optional optimization. s3_admission_control.md — LOW: §3.3.1 malformed-client failure mode said "we 503 immediately" but the accumulation design (`acquire(s3ChunkSize)` only) means the immediate-503 path is never reachable on the chunked side — the per-frame acquire is always 1 MiB, well under the 256 MiB cap. Reword to specify the actual path: successive 1 MiB acquires block under Raft pressure and the PUT eventually surfaces 503 on `dispatchAdmissionTimeout`. The "immediate 503 for oversized request" path is fixed-length only. s3_admission_control.md — LOW: §5 milestone table had M2 saying "Add `dispatchAdmissionTimeout`" but M1 already ships the chunked per-frame admission B path which is gated on it. Move the constant into M1; M2 narrows to "add fixed-length per-batch admission B + cleanup," with chunked already using the path from M1. s3_raft_blob_offload.md — LOW: the §3.2 flow-diagram step 3 phrasing "synchronously replicate to ≥ chunkBlobMinReplicas peers" was inconsistent with the prose's "chunkBlobMinReplicas - 1 followers." Resolved as part of the MEDIUM rewrite — the diagram now reads "PushChunkBlob to chunkBlobMinReplicas-1 followers" with parallel fan-out, matching the prose count. s3_raft_blob_offload.md — LOW: §3.5 orphan scan was framed as the recovery path for "sweeper crash between Phase i and ii," but it implicitly also covers a more common scenario — chunkblobs written to local Pebble by §3.2 step 2 when the PUT then fails before `coordinator.Dispatch` is called (admission 503, client disconnect, PushChunkBlob quorum failure). In that case neither RC nor GC queue entry exists, so the sweeper never sees the orphan; only the orphan scan does. Make this case explicit so the PUT-handler abort path can rely on the orphan scan rather than needing its own best-effort local delete. No code changes; design docs only.
1 parent fabc81e

2 files changed: 69 additions & 21 deletions

docs/design/2026_04_25_proposed_s3_admission_control.md

Lines changed: 14 additions & 8 deletions
@@ -283,12 +283,18 @@ Failure modes:
     flushes on stream EOF: a final `acquire(actual_buffered_bytes)`
     rounded up to one slot is taken (semaphore charges in 1-slot
     units regardless of actual byte count), so the bound holds.
-  - If the awsChunkedReader frame size ever exceeds
-    `s3RaftEntryByteBudget` (a malformed client whose decoded
-    cumulative output between framing acks already exceeds 4 MiB),
-    the first per-frame acquire asks for more than the cap allows
-    and we 503 immediately — same as a fixed-length PUT whose
-    `Content-Length` exceeds the global cap.
+  - A malformed client that decodes bytes faster than Raft drains
+    *cannot* trigger the immediate-503 path the way a fixed-length
+    PUT can. The accumulation design (callback always calls
+    `acquire(s3ChunkSize)`, never larger) means the per-frame
+    acquire request is bounded by 1 MiB — the
+    "if `bytes > capacity * s3ChunkSize`" early-return in
+    `acquire`'s spec is never hit on the chunked path. Instead,
+    successive 1 MiB acquires block under Raft pressure and the
+    PUT eventually surfaces 503 on `dispatchAdmissionTimeout`,
+    the same path a slow follower triggers. The "immediate 503 for
+    oversized request" failure mode applies only to fixed-length
+    PUTs (via `peekHeadroom(Content-Length > 256 MiB)`).
 
   This change moves chunked admission from M4 (originally "deferred
   optimisation") into M1 (the first shippable milestone). M1 ships
@@ -383,8 +389,8 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load).
 
 | Milestone | Scope | Risk |
 |---|---|---|
-| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. |
-| M2 | Per-batch admission B inside `flushBatch` for fixed-length PUTs. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.) | Medium. Cleanup path on partial failure. |
+| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. |
+| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. |
 | M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. |
 | M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. |

docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

Lines changed: 55 additions & 13 deletions
@@ -133,8 +133,15 @@ client ─► HTTP PUT body
     ─► chunk loop (s3ChunkSize):
          1. compute SHA-256 of chunk
          2. write chunk to LOCAL Pebble at !s3|chunkblob|<SHA>
-         3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers ***
-         4. queue ChunkRef into pendingBatch
+            (fsync)          ─┐   pipelined: bytes also stream out
+         3. PushChunkBlob to ─┤   to chunkBlobMinReplicas-1 followers
+            followers             in parallel (one RPC per follower; bytes
+                                  start flowing the moment the leader has them — not
+                                  after local fsync completes)
+         4. wait until BOTH local fsync AND a quorum of follower
+            fsync-acks have returned (= chunkBlobMinReplicas
+            durable copies including the leader)
+         5. queue ChunkRef into pendingBatch
     ─► flushBatch:
          coordinator.Dispatch(OperationGroup{
            Elems: [ chunkref Puts ... + manifest Put ],
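
The updated diagram renders naturally as code. A hypothetical Go fragment of the per-chunk loop — `chunkRef`, `flushBatch`, `streamPut`, and the `leader` receiver are illustrative names, not the real API, and `writeChunkBlobDurably` is sketched after the next hunk — assuming `s3ChunkSize` = 1 MiB and imports of `io`, `crypto/sha256`, and `context`:

```go
// chunkRef is a stand-in for the real chunkref payload proposed via Raft.
type chunkRef struct {
	sha  [32]byte
	size int
}

// streamPut mirrors diagram steps 1-5 for each s3ChunkSize window of the body.
func (l *leader) streamPut(ctx context.Context, body io.Reader) error {
	var pending []chunkRef
	buf := make([]byte, s3ChunkSize)
	for {
		n, err := io.ReadFull(body, buf)
		if n > 0 {
			sha := sha256.Sum256(buf[:n]) // step 1: content hash
			// Steps 2-4: local fsync plus the parallel PushChunkBlob fan-out,
			// returning only once chunkBlobMinReplicas copies are durable.
			if derr := l.writeChunkBlobDurably(ctx, sha, buf[:n]); derr != nil {
				return derr // any local copy is reclaimed later by the orphan scan
			}
			pending = append(pending, chunkRef{sha: sha, size: n}) // step 5
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // EOF after a final (possibly short) chunk
		}
		if err != nil {
			return err
		}
	}
	// flushBatch wraps coordinator.Dispatch(OperationGroup{chunkref Puts + manifest Put}).
	return l.flushBatch(ctx, pending)
}
```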
@@ -151,13 +158,27 @@ quorum guarantees the chunkref but tells you nothing about the
 blob payload. We close that gap by treating the chunkblob like a
 mini-Raft entry of its own with **semi-synchronous quorum**:
 
-1. Leader writes the chunkblob to local Pebble (fsync).
-2. Leader pushes the chunkblob to `chunkBlobMinReplicas - 1`
-   followers via the `S3BlobFetch.PushChunkBlob` RPC and waits for
-   each follower's "fsync ack." (`PushChunkBlob` is the leader-
-   initiated counterpart to the follower-initiated `FetchChunkBlob`
-   defined in §3.6.)
-3. Only after the chunkblob is durable on a quorum of nodes does
+1. Leader starts writing the chunkblob to local Pebble (fsync in
+   flight).
+2. **Concurrently** with step 1, the leader streams the chunkblob
+   to `chunkBlobMinReplicas - 1` followers via parallel
+   `S3BlobFetch.PushChunkBlob` RPCs. Pushes are **fanned out in
+   parallel**, not sequential — each follower's RPC is started
+   immediately, and the leader waits on a quorum of fsync-acks
+   rather than serially blocking on each one. (`PushChunkBlob` is
+   the leader-initiated counterpart to the follower-initiated
+   `FetchChunkBlob` defined in §3.6.)
+3. The leader waits for *both* the local fsync AND a quorum of
+   follower fsync-acks. The dominant cost is therefore
+   `max(local_fsync, slowest_quorum_follower_fsync)` — typically
+   ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write.
+   This is what makes the p99 latency claim below load-bearing:
+   if step 1 and step 2 were *sequential* (write local → then
+   push to followers → then wait), per-chunk latency would be
+   `chunkBlobMinReplicas × fsync_latency` and silently double the
+   PUT p99 vs. the legacy path. The pipelined / parallel model
+   is part of the contract, not an optimization.
+4. Only after the chunkblob is durable on a quorum of nodes does
    the leader propose the chunkref through Raft.
 
 `chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a
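
A minimal sketch of the pipelined contract above, under stated assumptions: a toy `follower` interface stands in for the real `S3BlobFetch.PushChunkBlob` client, and a `pebbleWriteSync` hook stands in for the Pebble write. The point is that every leg is in flight before any leg is awaited, so the wait is a max, not a sum:

```go
package blobpush

import (
	"context"
	"errors"
)

type follower interface {
	// PushChunkBlob returns once the follower has fsynced the chunkblob.
	PushChunkBlob(ctx context.Context, sha [32]byte, chunk []byte) error
}

type leader struct {
	followers            []follower
	chunkBlobMinReplicas int
	pebbleWriteSync      func(sha [32]byte, chunk []byte) error
}

func (l *leader) writeChunkBlobDurably(ctx context.Context, sha [32]byte, chunk []byte) error {
	localDone := make(chan error, 1)
	followerAcks := make(chan error, len(l.followers))

	// Step 1: the local Pebble write (fsync) starts immediately.
	go func() { localDone <- l.pebbleWriteSync(sha, chunk) }()

	// Step 2: concurrently, fan PushChunkBlob out to every follower —
	// one RPC each, all started before any ack is awaited.
	for _, f := range l.followers {
		f := f
		go func() { followerAcks <- f.PushChunkBlob(ctx, sha, chunk) }()
	}

	// Step 3: wait for chunkBlobMinReplicas-1 follower fsync-acks ...
	need, failed := l.chunkBlobMinReplicas-1, 0
	for got := 0; got < need; {
		select {
		case err := <-followerAcks:
			if err != nil {
				if failed++; failed > len(l.followers)-need {
					return errors.New("chunkblob quorum unreachable")
				}
				continue
			}
			got++
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	// ... AND the local fsync; only then may the chunkref be proposed.
	return <-localDone
}
```

With `chunkBlobMinReplicas = 2` and ≈ 10 ms fsyncs, both legs overlap for a ≈ 10 ms per-chunk wait; a serial write-then-push would pay ≈ 20 ms — exactly the doubling the contract language above rules out.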
@@ -386,10 +407,31 @@ sweeper needs to know not just *that* the RC reached zero but
    that for the inverse failure mode: a crash between the two
    phases leaves the queue entry deleted but the local
    chunkblob still on disk — a **bounded local space leak,
-   not a correctness bug**. A periodic "orphan scan" (chunkblob
-   keys whose SHA has RC=0 *and* no queue entry) reclaims
-   these without urgency. The orphan scan runs at low priority
-   out of band from the sweeper.
+   not a correctness bug**. A periodic "orphan scan" reclaims
+   these.
+
+   The orphan scan covers two distinct sources of orphans:
+
+   - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the
+     case described above; queue entry was removed via Raft
+     but the local Pebble delete never fired.
+   - **PUT failure before chunkref Dispatch** — chunkblob
+     bytes were written to local Pebble in §3.2 step 2, then
+     the PUT aborted before reaching `coordinator.Dispatch`
+     (admission control 503, client disconnect, `PushChunkBlob`
+     quorum failure, request context cancel). In that
+     scenario neither an RC entry nor a GC queue entry was
+     ever written, so the sweeper's queue-range scan never
+     sees these orphans — only the orphan scan does.
+
+   Detection criterion (covers both): `!s3|chunkblob|<SHA>`
+   keys whose SHA has either no RC entry at all, or RC=0 with
+   no corresponding queue entry. The orphan scan runs at low
+   priority out of band from the sweeper (proposed default
+   `chunkBlobOrphanScanInterval = 1 hour`); it is the safety
+   net behind both the sweeper crash path and the PUT-abort
+   cleanup path, so the PUT handler does not need its own
+   best-effort local-delete on the abort path.
  c. if the RC has bounced above 0 in the meantime, the queue
     entry is stale (a re-reference txn forgot to remove it, or
     the sweeper raced) and the sweeper deletes only the queue
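
The detection criterion added above reduces to a two-clause check per chunkblob key. A toy sketch — the `store` interface and its three helpers are hypothetical stand-ins; real access goes through the Pebble keyspace the doc defines:

```go
package orphanscan

// store abstracts the three lookups the scan needs (hypothetical interface).
type store interface {
	chunkBlobSHAs() []string           // SHA of every !s3|chunkblob|<SHA> key
	rc(sha string) (n uint64, ok bool) // RC entry; ok=false if absent
	hasQueueEntry(sha string) bool     // GC-queue entry present for this SHA?
}

// orphans returns SHAs eligible for low-priority local deletion.
func orphans(db store) []string {
	var out []string
	for _, sha := range db.chunkBlobSHAs() {
		n, ok := db.rc(sha)
		switch {
		case !ok:
			// PUT aborted before Dispatch: no RC entry was ever written,
			// so neither the sweeper nor its queue scan will see this blob.
			out = append(out, sha)
		case n == 0 && !db.hasQueueEntry(sha):
			// Sweeper crashed between Phase (3b.i) and (3b.ii): queue entry
			// already removed via Raft, but the local delete never fired.
			out = append(out, sha)
		}
	}
	return out
}
```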
