
Commit 1fb39ba

docs(design): address Claude bot round-5 review (1 MEDIUM, 4 LOW)
s3_raft_blob_offload.md — MEDIUM: §3.2 PushChunkBlob latency claim ("PUT p99 ≈ legacy") was load-bearing but the local-write/peer-push pipeline was unspecified. Sequential ordering would silently double the per-chunk latency (`chunkBlobMinReplicas × fsync_latency`). Pin the contract: local Pebble write and PushChunkBlob fan out **concurrently**; multiple followers' pushes are **fanned out in parallel**, not sequentially. Update the flow diagram to show the pipeline explicitly and call out that this is part of the contract, not an optional optimization. s3_admission_control.md — LOW: §3.3.1 malformed-client failure mode said "we 503 immediately" but the accumulation design (`acquire(s3ChunkSize)` only) means the immediate-503 path is never reachable on the chunked side — the per-frame acquire is always 1 MiB, well under the 256 MiB cap. Reword to specify the actual path: successive 1 MiB acquires block under Raft pressure and the PUT eventually surfaces 503 on `dispatchAdmissionTimeout`. The "immediate 503 for oversized request" path is fixed-length only. s3_admission_control.md — LOW: §5 milestone table had M2 saying "Add `dispatchAdmissionTimeout`" but M1 already ships the chunked per-frame admission B path which is gated on it. Move the constant into M1; M2 narrows to "add fixed-length per-batch admission B + cleanup," with chunked already using the path from M1. s3_raft_blob_offload.md — LOW: the §3.2 flow-diagram step 3 phrasing "synchronously replicate to ≥ chunkBlobMinReplicas peers" was inconsistent with the prose's "chunkBlobMinReplicas - 1 followers." Resolved as part of the MEDIUM rewrite — the diagram now reads "PushChunkBlob to chunkBlobMinReplicas-1 followers" with parallel fan-out, matching the prose count. s3_raft_blob_offload.md — LOW: §3.5 orphan scan was framed as the recovery path for "sweeper crash between Phase i and ii," but it implicitly also covers a more common scenario — chunkblobs written to local Pebble by §3.2 step 2 when the PUT then fails before `coordinator.Dispatch` is called (admission 503, client disconnect, PushChunkBlob quorum failure). In that case neither RC nor GC queue entry exists, so the sweeper never sees the orphan; only the orphan scan does. Make this case explicit so the PUT-handler abort path can rely on the orphan scan rather than needing its own best-effort local delete. No code changes; design docs only.
1 parent fabc81e

2 files changed: 69 additions & 21 deletions

docs/design/2026_04_25_proposed_s3_admission_control.md

Lines changed: 14 additions & 8 deletions
@@ -283,12 +283,18 @@ Failure modes:
     flushes on stream EOF: a final `acquire(actual_buffered_bytes)`
     rounded up to one slot is taken (semaphore charges in 1-slot
     units regardless of actual byte count), so the bound holds.
-  - If the awsChunkedReader frame size ever exceeds
-    `s3RaftEntryByteBudget` (a malformed client whose decoded
-    cumulative output between framing acks already exceeds 4 MiB),
-    the first per-frame acquire asks for more than the cap allows
-    and we 503 immediately — same as a fixed-length PUT whose
-    `Content-Length` exceeds the global cap.
+  - A malformed client that decodes bytes faster than Raft drains
+    *cannot* trigger the immediate-503 path the way a fixed-length
+    PUT can. The accumulation design (callback always calls
+    `acquire(s3ChunkSize)`, never larger) means the per-frame
+    acquire request is bounded by 1 MiB — the
+    "if `bytes > capacity * s3ChunkSize`" early-return in
+    `acquire`'s spec is never hit on the chunked path. Instead,
+    successive 1 MiB acquires block under Raft pressure and the
+    PUT eventually surfaces 503 on `dispatchAdmissionTimeout`,
+    the same path a slow follower triggers. The "immediate 503 for
+    oversized request" failure mode applies only to fixed-length
+    PUTs (via `peekHeadroom(Content-Length > 256 MiB)`).
 
   This change moves chunked admission from M4 (originally "deferred
   optimisation") into M1 (the first shippable milestone). M1 ships
@@ -383,8 +389,8 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load).
 
 | Milestone | Scope | Risk |
 |---|---|---|
-| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. |
-| M2 | Per-batch admission B inside `flushBatch` for fixed-length PUTs. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.) | Medium. Cleanup path on partial failure. |
+| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. |
+| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. |
 | M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. |
 | M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. |

docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

Lines changed: 55 additions & 13 deletions
@@ -133,8 +133,15 @@ client ─► HTTP PUT body
     ─► chunk loop (s3ChunkSize):
          1. compute SHA-256 of chunk
          2. write chunk to LOCAL Pebble at !s3|chunkblob|<SHA>
-         3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers ***
-         4. queue ChunkRef into pendingBatch
+            (fsync)          ─┐   pipelined: bytes also stream out
+         3. PushChunkBlob to ─┤   to chunkBlobMinReplicas-1 followers
+            followers             in parallel (one RPC per follower; bytes
+                                  start flowing the moment the leader has them — not
+                                  after local fsync completes)
+         4. wait until BOTH local fsync AND a quorum of follower
+            fsync-acks have returned (= chunkBlobMinReplicas
+            durable copies including the leader)
+         5. queue ChunkRef into pendingBatch
     ─► flushBatch:
          coordinator.Dispatch(OperationGroup{
            Elems: [ chunkref Puts ... + manifest Put ],
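
The updated diagram renders naturally as code. A hypothetical Go fragment of the per-chunk loop — `chunkRef`, `flushBatch`, `streamPut`, and the `leader` receiver are illustrative names, not the real API, and `writeChunkBlobDurably` is sketched after the next hunk — assuming `s3ChunkSize` = 1 MiB and imports of `io`, `crypto/sha256`, and `context`:

```go
// chunkRef is a stand-in for the real chunkref payload proposed via Raft.
type chunkRef struct {
	sha  [32]byte
	size int
}

// streamPut mirrors diagram steps 1-5 for each s3ChunkSize window of the body.
func (l *leader) streamPut(ctx context.Context, body io.Reader) error {
	var pending []chunkRef
	buf := make([]byte, s3ChunkSize)
	for {
		n, err := io.ReadFull(body, buf)
		if n > 0 {
			sha := sha256.Sum256(buf[:n]) // step 1: content hash
			// Steps 2-4: local fsync plus the parallel PushChunkBlob fan-out,
			// returning only once chunkBlobMinReplicas copies are durable.
			if derr := l.writeChunkBlobDurably(ctx, sha, buf[:n]); derr != nil {
				return derr // any local copy is reclaimed later by the orphan scan
			}
			pending = append(pending, chunkRef{sha: sha, size: n}) // step 5
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // EOF after a final (possibly short) chunk
		}
		if err != nil {
			return err
		}
	}
	// flushBatch wraps coordinator.Dispatch(OperationGroup{chunkref Puts + manifest Put}).
	return l.flushBatch(ctx, pending)
}
```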
@@ -151,13 +158,27 @@ quorum guarantees the chunkref but tells you nothing about the
 blob payload. We close that gap by treating the chunkblob like a
 mini-Raft entry of its own with **semi-synchronous quorum**:
 
-1. Leader writes the chunkblob to local Pebble (fsync).
-2. Leader pushes the chunkblob to `chunkBlobMinReplicas - 1`
-   followers via the `S3BlobFetch.PushChunkBlob` RPC and waits for
-   each follower's "fsync ack." (`PushChunkBlob` is the leader-
-   initiated counterpart to the follower-initiated `FetchChunkBlob`
-   defined in §3.6.)
-3. Only after the chunkblob is durable on a quorum of nodes does
+1. Leader starts writing the chunkblob to local Pebble (fsync in
+   flight).
+2. **Concurrently** with step 1, the leader streams the chunkblob
+   to `chunkBlobMinReplicas - 1` followers via parallel
+   `S3BlobFetch.PushChunkBlob` RPCs. Pushes are **fanned out in
+   parallel**, not sequential — each follower's RPC is started
+   immediately, and the leader waits on a quorum of fsync-acks
+   rather than serially blocking on each one. (`PushChunkBlob` is
+   the leader-initiated counterpart to the follower-initiated
+   `FetchChunkBlob` defined in §3.6.)
+3. The leader waits for *both* the local fsync AND a quorum of
+   follower fsync-acks. The dominant cost is therefore
+   `max(local_fsync, slowest_quorum_follower_fsync)` — typically
+   ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write.
+   This is what makes the p99 latency claim below load-bearing:
+   if step 1 and step 2 were *sequential* (write local → then
+   push to followers → then wait), per-chunk latency would be
+   `chunkBlobMinReplicas × fsync_latency` and silently double the
+   PUT p99 vs. the legacy path. The pipelined / parallel model
+   is part of the contract, not an optimization.
+4. Only after the chunkblob is durable on a quorum of nodes does
    the leader propose the chunkref through Raft.
 
 `chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a
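
A minimal sketch of the pipelined contract above, under stated assumptions: a toy `follower` interface stands in for the real `S3BlobFetch.PushChunkBlob` client, and a `pebbleWriteSync` hook stands in for the Pebble write. The point is that every leg is in flight before any leg is awaited, so the wait is a max, not a sum:

```go
package blobpush

import (
	"context"
	"errors"
)

type follower interface {
	// PushChunkBlob returns once the follower has fsynced the chunkblob.
	PushChunkBlob(ctx context.Context, sha [32]byte, chunk []byte) error
}

type leader struct {
	followers            []follower
	chunkBlobMinReplicas int
	pebbleWriteSync      func(sha [32]byte, chunk []byte) error
}

func (l *leader) writeChunkBlobDurably(ctx context.Context, sha [32]byte, chunk []byte) error {
	localDone := make(chan error, 1)
	followerAcks := make(chan error, len(l.followers))

	// Step 1: the local Pebble write (fsync) starts immediately.
	go func() { localDone <- l.pebbleWriteSync(sha, chunk) }()

	// Step 2: concurrently, fan PushChunkBlob out to every follower —
	// one RPC each, all started before any ack is awaited.
	for _, f := range l.followers {
		f := f
		go func() { followerAcks <- f.PushChunkBlob(ctx, sha, chunk) }()
	}

	// Step 3: wait for chunkBlobMinReplicas-1 follower fsync-acks ...
	need, failed := l.chunkBlobMinReplicas-1, 0
	for got := 0; got < need; {
		select {
		case err := <-followerAcks:
			if err != nil {
				if failed++; failed > len(l.followers)-need {
					return errors.New("chunkblob quorum unreachable")
				}
				continue
			}
			got++
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	// ... AND the local fsync; only then may the chunkref be proposed.
	return <-localDone
}
```

With `chunkBlobMinReplicas = 2` and ≈ 10 ms fsyncs, both legs overlap for a ≈ 10 ms per-chunk wait; a serial write-then-push would pay ≈ 20 ms — exactly the doubling the contract language above rules out.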
@@ -386,10 +407,31 @@ sweeper needs to know not just *that* the RC reached zero but
    that for the inverse failure mode: a crash between the two
    phases leaves the queue entry deleted but the local
    chunkblob still on disk — a **bounded local space leak,
-   not a correctness bug**. A periodic "orphan scan" (chunkblob
-   keys whose SHA has RC=0 *and* no queue entry) reclaims
-   these without urgency. The orphan scan runs at low priority
-   out of band from the sweeper.
+   not a correctness bug**. A periodic "orphan scan" reclaims
+   these.
+
+   The orphan scan covers two distinct sources of orphans:
+
+   - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the
+     case described above; queue entry was removed via Raft
+     but the local Pebble delete never fired.
+   - **PUT failure before chunkref Dispatch** — chunkblob
+     bytes were written to local Pebble in §3.2 step 2, then
+     the PUT aborted before reaching `coordinator.Dispatch`
+     (admission control 503, client disconnect, `PushChunkBlob`
+     quorum failure, request context cancel). In that
+     scenario neither an RC entry nor a GC queue entry was
+     ever written, so the sweeper's queue-range scan never
+     sees these orphans — only the orphan scan does.
+
+   Detection criterion (covers both): `!s3|chunkblob|<SHA>`
+   keys whose SHA has either no RC entry at all, or RC=0 with
+   no corresponding queue entry. The orphan scan runs at low
+   priority out of band from the sweeper (proposed default
+   `chunkBlobOrphanScanInterval = 1 hour`); it is the safety
+   net behind both the sweeper crash path and the PUT-abort
+   cleanup path, so the PUT handler does not need its own
+   best-effort local-delete on the abort path.
  c. if the RC has bounced above 0 in the meantime, the queue
     entry is stale (a re-reference txn forgot to remove it, or
     the sweeper raced) and the sweeper deletes only the queue
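
The detection criterion added above reduces to a two-clause check per chunkblob key. A toy sketch — the `store` interface and its three helpers are hypothetical stand-ins; real access goes through the Pebble keyspace the doc defines:

```go
package orphanscan

// store abstracts the three lookups the scan needs (hypothetical interface).
type store interface {
	chunkBlobSHAs() []string           // SHA of every !s3|chunkblob|<SHA> key
	rc(sha string) (n uint64, ok bool) // RC entry; ok=false if absent
	hasQueueEntry(sha string) bool     // GC-queue entry present for this SHA?
}

// orphans returns SHAs eligible for low-priority local deletion.
func orphans(db store) []string {
	var out []string
	for _, sha := range db.chunkBlobSHAs() {
		n, ok := db.rc(sha)
		switch {
		case !ok:
			// PUT aborted before Dispatch: no RC entry was ever written,
			// so neither the sweeper nor its queue scan will see this blob.
			out = append(out, sha)
		case n == 0 && !db.hasQueueEntry(sha):
			// Sweeper crashed between Phase (3b.i) and (3b.ii): queue entry
			// already removed via Raft, but the local delete never fired.
			out = append(out, sha)
		}
	}
	return out
}
```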
