Skip to content

Commit fabc81e

Browse files
committed
docs(design): address Claude bot round-5 review (2 MEDIUM, 2 LOW)
s3_admission_control.md — MEDIUM: §3.3.1 "Bootstrap reservation" was ambiguous between peek and acquire. Pin it as a peek (`peekHeadroom(s3RaftEntryByteBudget)`, no slot acquisition, matching admission A's contract) and rename the heading to "Bootstrap headroom check." Document why it must be a peek (an acquire would multiply per-request slot hold by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit with no corresponding payload, reintroducing the head-of-line hazard the design exists to prevent). s3_admission_control.md — LOW: §3.3.1 "frame size up to 64 KiB" was incoherent with the §3.3 semaphore's 1 MiB slot unit (a channel-backed semaphore can't acquire fractional slots). Clarify that the awsChunkedReader progress callback **buffers decoded bytes until a full s3ChunkSize is accumulated, then calls acquire(s3ChunkSize)**. Worst-case extra buffer per concurrent chunked PUT is bounded by 1 MiB; on stream EOF the partial buffer flushes via one final acquire rounded up to one slot. Also adds `s3RaftEntryByteBudget` to §3.2's constant block (it was used throughout §3.3.1 but never defined) with a comment showing the derivation (s3ChunkSize × s3ChunkBatchOps). s3_raft_blob_offload.md — MEDIUM: §3.2 degraded path floor of 2 chunkblob copies provides weaker-than-Raft durability for N > 3 clusters. On a 5-node cluster Raft tolerates 2 simultaneous failures for the chunkref but the degraded chunkblob path (leader + 1 follower) tolerates only 1. Add an explicit note acknowledging the asymmetry, recommend `chunkBlobMinReplicas = N` for operators who need the legacy "blob durability == Raft durability" guarantee, and clarify that the default `(N/2)+1` is sized for "match Raft quorum" not "match Raft fault tolerance" — a distinction that is invisible at N=3 and material at N≥5. s3_raft_blob_offload.md — LOW: §3.5 Phase (3b.i) needs to specify that the queue-entry delete is **conditional** on (a) the entry existing and (b) the RC counter still being 0 at the txn's read timestamp. An unconditional delete would silently succeed on a queue entry that a re-reference txn has just removed, then proceed to phase (3b.ii) and local-delete a chunkblob whose RC has bounced back to 1 — a correctness bug, not just a space leak. The conditional form is what makes the sweeper safe against the re-reference race. No code changes; design docs only.
1 parent 7475060 commit fabc81e

2 files changed

Lines changed: 102 additions & 28 deletions

File tree

docs/design/2026_04_25_proposed_s3_admission_control.md

Lines changed: 59 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -132,6 +132,18 @@ const (
132132
// per PR #617) — leaving headroom for Lua, scan buffers, and Pebble
133133
// memtables.
134134
s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB
135+
// s3RaftEntryByteBudget is the per-batch unit acquired and
136+
// released against the semaphore. It must equal the byte
137+
// budget of one Raft entry produced by PutObject /
138+
// UploadPart's flush loop (PR #636: s3ChunkSize ×
139+
// s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing
140+
// overhead — kept abstract here so the admission contract
141+
// does not lock the entry-size choice). The semaphore's
142+
// capacity is `s3PutAdmissionMaxInflightBytes /
143+
// s3ChunkSize` 1 MiB-units; per-batch acquire takes
144+
// `s3RaftEntryByteBudget / s3ChunkSize` units at a time
145+
// (= 4 units on the default tunables).
146+
s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps
135147
// dispatchAdmissionTimeout is how long a per-batch flush will wait
136148
// for a slot before giving up. The 256 MiB cap drains in ~2 s at
137149
// 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s),
@@ -225,34 +237,58 @@ chunked stream finishes — exactly the failure mode admission control
225237
exists to prevent. We therefore split chunked admission across two
226238
mechanisms instead of pre-charging:
227239

228-
1. **Bootstrap reservation = `s3RaftEntryByteBudget` (4 MiB)** at
229-
request entry. This is enough to admit the request and let the
230-
awsChunkedReader produce its first decoded window. Chunked PUTs
231-
are not "free" — they still must beat the same admission queue
232-
as fixed-length PUTs at the per-batch level.
240+
1. **Bootstrap headroom check** at request entry. Calls
241+
`peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A
242+
contract: a fast-fail check that 4 MiB *would have fit* at the
243+
moment we asked. **No slot is acquired.** This is intentionally
244+
racy with concurrent PUTs (same as fixed-length admission A);
245+
its job is to fail at request entry rather than partway
246+
through the first decoded frame. Chunked PUTs are not "free"
247+
— they still must beat the same admission queue as fixed-length
248+
PUTs at the per-frame level.
233249
2. **Pay-as-you-decode** thereafter, charged via an
234-
`awsChunkedReader` progress callback. Each decoded chunk frame
235-
(typically up to 64 KiB on the wire after framing overhead)
236-
acquires a slot equal to the bytes about to flow into Pebble; the
237-
slot is released once the corresponding `coordinator.Dispatch`
238-
acks. This is the same path admission B uses for fixed-length
239-
PUTs — chunked traffic just hooks into it incrementally instead of
240-
pre-charging the worst case.
250+
`awsChunkedReader` progress callback. The callback **buffers
251+
decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is
252+
accumulated**, then calls `acquire(s3ChunkSize)` on the
253+
semaphore (same path as fixed-length admission B). This keeps
254+
the slot unit coherent: the semaphore's capacity is
255+
`s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so
256+
acquiring at sub-MiB granularity is not representable. The
257+
slot is released once the corresponding
258+
`coordinator.Dispatch` acks the chunk. The buffer never holds
259+
more than `s3ChunkSize - 1` decoded bytes, so the worst-case
260+
memory overhead beyond the semaphore-tracked bytes is bounded
261+
by 1 MiB per concurrent chunked PUT.
241262

242263
Failure modes:
243264

244-
- If the awsChunkedReader produces frames faster than Raft drains, the
245-
per-batch acquire blocks (capped by `dispatchAdmissionTimeout`).
246-
Beyond that timeout, mid-stream 503 closes the connection. The
247-
legacy "reserve 5 GiB" approach would have surfaced as 503 *at
248-
request entry* for unrelated PUTs; this approach surfaces as
249-
mid-stream 503 for the chunked PUT itself, which is the right
250-
blame attribution.
265+
- If the awsChunkedReader produces decoded bytes faster than Raft
266+
drains, the next 1 MiB acquire blocks (capped by
267+
`dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503
268+
closes the connection. The legacy "reserve 5 GiB" approach
269+
would have surfaced as 503 *at request entry* for unrelated
270+
PUTs; this approach surfaces as mid-stream 503 for the chunked
271+
PUT itself, which is the right blame attribution.
272+
- The bootstrap check at step 1 is racy: another PUT can consume
273+
the headroom between the check and the first per-frame acquire.
274+
When that happens the first acquire blocks (or 503s on
275+
timeout) — the same path the fixed-length admission B handles
276+
for the contending case. The race is intentional: making the
277+
check a real reservation would multiply per-request slot hold
278+
by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit
279+
with no corresponding payload, reintroducing a head-of-line
280+
hazard.
281+
- If the awsChunkedReader produces a single frame whose decoded
282+
size never accumulates to a full `s3ChunkSize`, the buffer
283+
flushes on stream EOF: a final `acquire(actual_buffered_bytes)`
284+
rounded up to one slot is taken (semaphore charges in 1-slot
285+
units regardless of actual byte count), so the bound holds.
251286
- If the awsChunkedReader frame size ever exceeds
252-
`s3RaftEntryByteBudget` (a malformed client), the per-batch acquire
253-
asks for more than the cap allows and we 503 immediately — same
254-
as a fixed-length PUT whose `Content-Length` exceeds the global
255-
cap.
287+
`s3RaftEntryByteBudget` (a malformed client whose decoded
288+
cumulative output between framing acks already exceeds 4 MiB),
289+
the first per-frame acquire asks for more than the cap allows
290+
and we 503 immediately — same as a fixed-length PUT whose
291+
`Content-Length` exceeds the global cap.
256292

257293
This change moves chunked admission from M4 (originally "deferred
258294
optimisation") into M1 (the first shippable milestone). M1 ships

docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

Lines changed: 43 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -197,6 +197,29 @@ configuration. Tuning `chunkBlobMinReplicas` higher trades PUT
197197
availability for stronger durability; tuning lower than 2 is
198198
rejected at config-load.
199199

200+
**Important durability note for N > 3 clusters.** On a 3-node
201+
cluster the degraded floor of 2 chunkblob copies happens to match
202+
Raft's quorum-of-2, so a single node failure is tolerated by both
203+
the chunkref *and* the chunkblob. For N > 3 this is no longer
204+
true: a 5-node cluster has Raft quorum 3 and tolerates 2
205+
simultaneous failures for the chunkref, but the degraded
206+
chunkblob path (leader + 1 follower) tolerates only 1. If the
207+
leader and the chunkblob-holding follower both fail during the
208+
degraded window, the surviving Raft quorum elects a new leader,
209+
finds a committed chunkref, and discovers that no surviving node
210+
holds the chunkblob — the chunkref is durable but the object data
211+
is lost. This is **weaker than the legacy "every byte through
212+
Raft" path**, which loses data only when Raft itself loses quorum
213+
(3 simultaneous failures on N=5). Operators on N > 3 clusters who
214+
need the legacy "blob durability == Raft durability" guarantee
215+
should configure `chunkBlobMinReplicas = N` (full replication;
216+
trades some PUT availability — any single peer outage stalls
217+
PUTs — for the strongest durability the cluster can offer).
218+
The default `(N/2)+1` is sized for "match Raft quorum," not "match
219+
Raft fault tolerance"; this distinction is invisible at N=3 but
220+
material at N≥5 and is what makes this configuration knob
221+
operationally meaningful.
222+
200223
The trade-off is PUT latency: a PUT now blocks on
201224
`chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft
202225
quorum write of the chunkref. Empirically the chunkblob fsync is
@@ -331,13 +354,28 @@ sweeper needs to know not just *that* the RC reached zero but
331354
`!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
332355
MUST run in this order:
333356

334-
i. **Raft phase first.** Delete the queue entry through a
335-
Raft txn. Concurrent sweepers serialise here on
336-
write-write-conflict; only the winner proceeds to the
337-
local phase. This makes the queue the global "we are
338-
GC-ing this SHA" lock.
357+
i. **Raft phase first — conditional delete.** Delete the
358+
queue entry through a Raft txn that is **conditional on
359+
the queue entry existing AND the RC counter still being
360+
0 at the txn's read timestamp**. The conditional form is
361+
load-bearing: if a re-reference txn has committed
362+
between the sweeper's queue scan and this txn (driving
363+
RC back to 1 and atomically removing the queue entry —
364+
see §3.1's atomic invariant), the conditional delete
365+
fails and the sweeper aborts before reaching the local
366+
phase. An *unconditional* delete would silently succeed
367+
on the now-absent queue entry and let the sweeper
368+
proceed to local-delete a chunkblob that is currently
369+
live (RC=1) — a **correctness bug, not just a space
370+
leak**. Concurrent sweepers also serialise on this txn
371+
(write-write-conflict on the queue key); only the winner
372+
proceeds.
339373
ii. **Local phase second.** Delete the local
340374
`!s3|chunkblob|<SHA>` from Pebble. No Raft round-trip.
375+
Reaching this phase implies (i) succeeded, which
376+
implies the RC was 0 at the txn read timestamp and
377+
remained 0 throughout the txn's commit window — i.e.
378+
the blob is genuinely unreachable.
341379

342380
The phase ordering is the load-bearing detail. If we did
343381
local-first then Raft, a crash between the two phases would

0 commit comments

Comments
 (0)