docs(design): address Claude bot round-5 review (2 MEDIUM, 2 LOW)

bootjp · bootjp · commit fabc81e40612 · 2026-04-26T04:36:19.000+09:00
s3_admission_control.md — MEDIUM: §3.3.1 "Bootstrap reservation"
was ambiguous between peek and acquire. Pin it as a peek
(`peekHeadroom(s3RaftEntryByteBudget)`, no slot acquisition,
matching admission A's contract) and rename the heading to
"Bootstrap headroom check." Document why it must be a peek (an
acquire would hold `concurrent_chunked_PUTs × 4 MiB` of
bootstrap-only credit with no corresponding payload, reintroducing
the head-of-line hazard the design exists to prevent).

s3_admission_control.md — LOW: §3.3.1 "frame size up to 64 KiB"
was incoherent with the §3.3 semaphore's 1 MiB slot unit (a
channel-backed semaphore can't acquire fractional slots). Clarify
that the awsChunkedReader progress callback **buffers decoded
bytes until a full s3ChunkSize is accumulated, then calls
acquire(s3ChunkSize)**. Worst-case extra buffer per concurrent
chunked PUT is bounded by 1 MiB; on stream EOF the partial buffer
flushes via one final acquire rounded up to one slot. Also add
`s3RaftEntryByteBudget` to §3.2's constant block (it was used
throughout §3.3.1 but never defined) with a comment showing
the derivation (s3ChunkSize × s3ChunkBatchOps).
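
To make the peek-versus-acquire distinction and the 1 MiB buffering
rule concrete, here is a minimal Go sketch. It is not the project's
implementation: `slotSemaphore`, `chunkedAdmitter`, `tryAcquire`,
`onDecoded`, and `onEOF` are hypothetical names, and the acquire is
non-blocking here, whereas the design blocks up to
`dispatchAdmissionTimeout`. Only the constant names and
`peekHeadroom` come from the design doc.

```go
package main

import "fmt"

const (
	s3ChunkSize           = 1 << 20 // 1 MiB slot unit
	s3ChunkBatchOps       = 4
	s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps // 4 MiB
)

// slotSemaphore is a hypothetical channel-backed semaphore.
// Each token in the channel represents one s3ChunkSize slot.
type slotSemaphore chan struct{}

func newSlotSemaphore(maxInflightBytes int) slotSemaphore {
	return make(slotSemaphore, maxInflightBytes/s3ChunkSize)
}

// peekHeadroom reports whether n bytes would fit right now.
// It takes no slot, so the answer can be stale immediately.
func (s slotSemaphore) peekHeadroom(n int) bool {
	need := (n + s3ChunkSize - 1) / s3ChunkSize
	return cap(s)-len(s) >= need
}

// tryAcquire takes one slot without blocking (sketch only).
func (s slotSemaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// release returns one slot, e.g. after a Dispatch ack.
func (s slotSemaphore) release() { <-s }

// chunkedAdmitter buffers decoded byte counts and charges the
// semaphore only in whole-slot units.
type chunkedAdmitter struct {
	sem      slotSemaphore
	buffered int // decoded bytes not yet charged, always < s3ChunkSize
	held     int // slots currently charged to this request
}

// onDecoded is the progress-callback body: accumulate, then
// acquire one slot per full s3ChunkSize.
func (a *chunkedAdmitter) onDecoded(n int) bool {
	a.buffered += n
	for a.buffered >= s3ChunkSize {
		if !a.sem.tryAcquire() {
			return false
		}
		a.held++
		a.buffered -= s3ChunkSize
	}
	return true
}

// onEOF flushes a partial buffer, rounded up to one slot.
func (a *chunkedAdmitter) onEOF() bool {
	if a.buffered == 0 {
		return true
	}
	if !a.sem.tryAcquire() {
		return false
	}
	a.held++
	a.buffered = 0
	return true
}

func main() {
	sem := newSlotSemaphore(8 << 20) // small cap for the demo
	fmt.Println("peek ok:", sem.peekHeadroom(s3RaftEntryByteBudget),
		"slots in use after peek:", len(sem))

	a := &chunkedAdmitter{sem: sem}
	for i := 0; i < 40; i++ { // 40 frames of 64 KiB = 2.5 MiB
		a.onDecoded(64 << 10)
	}
	a.onEOF()
	fmt.Println("slots held at EOF:", a.held)
	for ; a.held > 0; a.held-- {
		sem.release()
	}
}
```

The peek leaves `len(sem)` unchanged, which is the property the
renamed heading is meant to pin down; the 64 KiB frames are only
ever charged once they add up to a whole slot.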

s3_raft_blob_offload.md — MEDIUM: §3.2 degraded path floor of 2
chunkblob copies provides weaker-than-Raft durability for N > 3
clusters. On a 5-node cluster Raft tolerates 2 simultaneous
failures for the chunkref but the degraded chunkblob path
(leader + 1 follower) tolerates only 1. Add an explicit note
acknowledging the asymmetry, recommend `chunkBlobMinReplicas = N`
for operators who need the legacy "blob durability == Raft
durability" guarantee, and clarify that the default `(N/2)+1` is
sized for "match Raft quorum" not "match Raft fault tolerance" —
a distinction that is invisible at N=3 and material at N≥5.
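
The asymmetry is easy to check with arithmetic. The sketch below is
illustrative only: the helper names are hypothetical, and it assumes
a chunkblob with `c` copies survives `c - 1` node failures, while
Raft on `N` nodes survives `N - quorum` failures.

```go
package main

import "fmt"

// raftQuorum returns the majority size for an n-node cluster.
func raftQuorum(n int) int { return n/2 + 1 }

// raftFaultTolerance returns how many simultaneous failures
// Raft survives while still holding a quorum.
func raftFaultTolerance(n int) int { return n - raftQuorum(n) }

// blobFaultTolerance returns how many holders of a chunkblob
// can fail before no copy survives (assumption: copies - 1).
func blobFaultTolerance(copies int) int { return copies - 1 }

func main() {
	const degradedFloor = 2 // leader + 1 follower
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("N=%d quorum=%d raftTolerates=%d degradedBlobTolerates=%d\n",
			n, raftQuorum(n), raftFaultTolerance(n),
			blobFaultTolerance(degradedFloor))
	}
}
```

At N=3 the two tolerances coincide; from N=5 upward the degraded
chunkblob path tolerates fewer failures than the chunkref does.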

s3_raft_blob_offload.md — LOW: §3.5 Phase (3b.i) needs to specify
that the queue-entry delete is **conditional** on (a) the entry
existing and (b) the RC counter still being 0 at the txn's read
timestamp. An unconditional delete would silently succeed on a
queue entry that a re-reference txn has just removed, then proceed
to phase (3b.ii) and local-delete a chunkblob whose RC has bounced
back to 1 — a correctness bug, not just a space leak. The
conditional form is what makes the sweeper safe against the
re-reference race.
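
The race can be sketched with an in-memory stand-in. Everything here
is hypothetical (the `store` type, its method names, and the use of a
mutex in place of a Raft txn); it only illustrates why the delete
must check both conditions before the local phase runs, and it does
not model crashes or replication.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy stand-in for the replicated state: the mutex
// plays the role of Raft txn serialisation.
type store struct {
	mu      sync.Mutex
	rc      map[string]int  // reference count per SHA
	gcQueue map[string]bool // stand-in for the chunkblob GC queue
	blobs   map[string]bool // local chunkblob presence
}

func newStore() *store {
	return &store{
		rc:      map[string]int{},
		gcQueue: map[string]bool{},
		blobs:   map[string]bool{},
	}
}

// put stores a blob with one reference.
func (s *store) put(sha string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rc[sha] = 1
	s.blobs[sha] = true
}

// unreference drops one reference and enqueues for GC at zero.
func (s *store) unreference(sha string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rc[sha]--
	if s.rc[sha] == 0 {
		s.gcQueue[sha] = true
	}
}

// reReference bumps RC and removes the queue entry atomically.
func (s *store) reReference(sha string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rc[sha]++
	delete(s.gcQueue, sha)
}

// conditionalDequeue is phase (3b.i): it succeeds only if the
// queue entry exists AND the RC is still 0.
func (s *store) conditionalDequeue(sha string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.gcQueue[sha] || s.rc[sha] != 0 {
		return false
	}
	delete(s.gcQueue, sha)
	return true
}

// sweep runs phase (3b.i) and, only on success, phase (3b.ii).
func (s *store) sweep(sha string) bool {
	if !s.conditionalDequeue(sha) {
		return false // abort before touching the local blob
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.blobs, sha)
	return true
}

func main() {
	s := newStore()
	s.put("a")
	s.unreference("a") // RC 1 -> 0, enqueued
	s.reReference("a") // RC 0 -> 1, dequeued before the sweep
	fmt.Println("swept a:", s.sweep("a"), "blob a present:", s.blobs["a"])

	s.put("b")
	s.unreference("b")
	fmt.Println("swept b:", s.sweep("b"), "blob b present:", s.blobs["b"])
}
```

With an unconditional delete in place of `conditionalDequeue`, the
sweep of "a" would reach the local phase and remove a live blob.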

No code changes; design docs only.
diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md
@@ -132,6 +132,18 @@ const (
     // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble
     // memtables.
     s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB
+    // s3RaftEntryByteBudget is the per-batch unit acquired and
+    // released against the semaphore. It must equal the byte
+    // budget of one Raft entry produced by PutObject /
+    // UploadPart's flush loop (PR #636: s3ChunkSize ×
+    // s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing
+    // overhead — kept abstract here so the admission contract
+    // does not lock the entry-size choice). The semaphore's
+    // capacity is `s3PutAdmissionMaxInflightBytes /
+    // s3ChunkSize` 1 MiB-units; per-batch acquire takes
+    // `s3RaftEntryByteBudget / s3ChunkSize` units at a time
+    // (= 4 units on the default tunables).
+    s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps
     // dispatchAdmissionTimeout is how long a per-batch flush will wait
     // for a slot before giving up. The 256 MiB cap drains in ~2 s at
     // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s),
@@ -225,34 +237,58 @@ chunked stream finishes — exactly the failure mode admission control
 exists to prevent. We therefore split chunked admission across two
 mechanisms instead of pre-charging:
 
-1. **Bootstrap reservation = `s3RaftEntryByteBudget` (4 MiB)** at
-   request entry. This is enough to admit the request and let the
-   awsChunkedReader produce its first decoded window. Chunked PUTs
-   are not "free" — they still must beat the same admission queue
-   as fixed-length PUTs at the per-batch level.
+1. **Bootstrap headroom check** at request entry. Calls
+   `peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A
+   contract: a fast-fail check that 4 MiB *would have fit* at the
+   moment we asked. **No slot is acquired.** This is intentionally
+   racy with concurrent PUTs (same as fixed-length admission A);
+   its job is to fail at request entry rather than partway
+   through the first decoded frame. Chunked PUTs are not "free"
+   — they still must beat the same admission queue as fixed-length
+   PUTs at the per-frame level.
 2. **Pay-as-you-decode** thereafter, charged via an
-   `awsChunkedReader` progress callback. Each decoded chunk frame
-   (typically up to 64 KiB on the wire after framing overhead)
-   acquires a slot equal to the bytes about to flow into Pebble; the
-   slot is released once the corresponding `coordinator.Dispatch`
-   acks. This is the same path admission B uses for fixed-length
-   PUTs — chunked traffic just hooks into it incrementally instead of
-   pre-charging the worst case.
+   `awsChunkedReader` progress callback. The callback **buffers
+   decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is
+   accumulated**, then calls `acquire(s3ChunkSize)` on the
+   semaphore (same path as fixed-length admission B). This keeps
+   the slot unit coherent: the semaphore's capacity is
+   `s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so
+   acquiring at sub-MiB granularity is not representable. The
+   slot is released once the corresponding
+   `coordinator.Dispatch` acks the chunk. The buffer never holds
+   more than `s3ChunkSize - 1` decoded bytes, so the worst-case
+   memory overhead beyond the semaphore-tracked bytes is bounded
+   by 1 MiB per concurrent chunked PUT.
 
 Failure modes:
 
-- If the awsChunkedReader produces frames faster than Raft drains, the
-  per-batch acquire blocks (capped by `dispatchAdmissionTimeout`).
-  Beyond that timeout, mid-stream 503 closes the connection. The
-  legacy "reserve 5 GiB" approach would have surfaced as 503 *at
-  request entry* for unrelated PUTs; this approach surfaces as
-  mid-stream 503 for the chunked PUT itself, which is the right
-  blame attribution.
+- If the awsChunkedReader produces decoded bytes faster than Raft
+  drains, the next 1 MiB acquire blocks (capped by
+  `dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503
+  closes the connection. The legacy "reserve 5 GiB" approach
+  would have surfaced as 503 *at request entry* for unrelated
+  PUTs; this approach surfaces as mid-stream 503 for the chunked
+  PUT itself, which is the right blame attribution.
+- The bootstrap check at step 1 is racy: another PUT can consume
+  the headroom between the check and the first per-frame acquire.
+  When that happens the first acquire blocks (or 503s on
+  timeout) — the same path the fixed-length admission B handles
+  for the contending case. The race is intentional: making the
+  check a real reservation would multiply per-request slot hold
+  by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit
+  with no corresponding payload, reintroducing a head-of-line
+  hazard.
+- If the stream ends before the buffered decoded bytes reach a
+  full `s3ChunkSize`, the buffer flushes on stream EOF: a final
+  `acquire(actual_buffered_bytes)` rounded up to one slot is taken
+  (the semaphore charges in 1-slot units regardless of the actual
+  byte count), so the bound holds.
 - If the awsChunkedReader frame size ever exceeds
-  `s3RaftEntryByteBudget` (a malformed client), the per-batch acquire
-  asks for more than the cap allows and we 503 immediately — same
-  as a fixed-length PUT whose `Content-Length` exceeds the global
-  cap.
+  `s3RaftEntryByteBudget` (a malformed client whose decoded
+  cumulative output between framing acks already exceeds 4 MiB),
+  the first per-frame acquire asks for more than the cap allows
+  and we 503 immediately — same as a fixed-length PUT whose
+  `Content-Length` exceeds the global cap.
 
 This change moves chunked admission from M4 (originally "deferred
 optimisation") into M1 (the first shippable milestone). M1 ships
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
@@ -197,6 +197,29 @@ configuration. Tuning `chunkBlobMinReplicas` higher trades PUT
 availability for stronger durability; tuning lower than 2 is
 rejected at config-load.
 
+**Important durability note for N > 3 clusters.** On a 3-node
+cluster the degraded floor of 2 chunkblob copies happens to match
+Raft's quorum-of-2, so a single node failure is tolerated by both
+the chunkref *and* the chunkblob. For N > 3 this is no longer
+true: a 5-node cluster has Raft quorum 3 and tolerates 2
+simultaneous failures for the chunkref, but the degraded
+chunkblob path (leader + 1 follower) tolerates only 1. If the
+leader and the chunkblob-holding follower both fail during the
+degraded window, the surviving Raft quorum elects a new leader,
+finds a committed chunkref, and discovers that no surviving node
+holds the chunkblob — the chunkref is durable but the object data
+is lost. This is **weaker than the legacy "every byte through
+Raft" path**, which loses data only when Raft itself loses quorum
+(3 simultaneous failures on N=5). Operators on N > 3 clusters who
+need the legacy "blob durability == Raft durability" guarantee
+should configure `chunkBlobMinReplicas = N` (full replication;
+trades some PUT availability — any single peer outage stalls
+PUTs — for the strongest durability the cluster can offer).
+The default `(N/2)+1` is sized for "match Raft quorum," not "match
+Raft fault tolerance"; this distinction is invisible at N=3 but
+material at N≥5 and is what makes this configuration knob
+operationally meaningful.
+
 The trade-off is PUT latency: a PUT now blocks on
 `chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft
 quorum write of the chunkref. Empirically the chunkblob fsync is
@@ -331,13 +354,28 @@ sweeper needs to know not just *that* the RC reached zero but
       `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
       MUST run in this order:
 
-      i. **Raft phase first.** Delete the queue entry through a
-         Raft txn. Concurrent sweepers serialise here on
-         write-write-conflict; only the winner proceeds to the
-         local phase. This makes the queue the global "we are
-         GC-ing this SHA" lock.
+      i. **Raft phase first — conditional delete.** Delete the
+         queue entry through a Raft txn that is **conditional on
+         the queue entry existing AND the RC counter still being
+         0 at the txn's read timestamp**. The conditional form is
+         load-bearing: if a re-reference txn has committed
+         between the sweeper's queue scan and this txn (driving
+         RC back to 1 and atomically removing the queue entry —
+         see §3.1's atomic invariant), the conditional delete
+         fails and the sweeper aborts before reaching the local
+         phase. An *unconditional* delete would silently succeed
+         on the now-absent queue entry and let the sweeper
+         proceed to local-delete a chunkblob that is currently
+         live (RC=1) — a **correctness bug, not just a space
+         leak**. Concurrent sweepers also serialise on this txn
+         (write-write-conflict on the queue key); only the winner
+         proceeds.
       ii. **Local phase second.** Delete the local
           `!s3|chunkblob|<SHA>` from Pebble. No Raft round-trip.
+          Reaching this phase implies (i) succeeded, which
+          implies the RC was 0 at the txn read timestamp and
+          remained 0 throughout the txn's commit window — i.e.
+          the blob is genuinely unreachable.
 
       The phase ordering is the load-bearing detail. If we did
       local-first then Raft, a crash between the two phases would