docs(design): address Claude bot round-4 review (1 HIGH, 3 LOW)

bootjp · bootjp · commit 747506002296 · 2026-04-26T04:25:32.000+09:00
s3_raft_blob_offload.md — HIGH: §3.5 step (3b) claimed "delete the
local chunkblob AND the queue entry as a single Raft txn" — but
chunkblob is local-only Pebble per §3.1 and the queue is
Raft-replicated, so the two ops live in different storage layers
and cannot share a Raft txn. Rewrite the step to specify a
two-phase ordering:

  i.  Raft phase first: delete the queue entry through Raft.
      Concurrent sweepers serialise here on a write-write
      conflict, and only the winner proceeds; the queue is
      therefore the global "we are GC-ing this SHA" lock.
  ii. Local phase second: delete the local chunkblob from Pebble.

Document the failure mode of the inverse ordering (local-first
would orphan the queue entry on crash) and of the chosen ordering
(crash between phases leaves a local space leak — bounded, no
correctness consequence — recoverable by a periodic orphan scan).
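
A minimal Go sketch of the ordering described above, for
illustration only. `queueStore`, `blobStore`, `ErrWriteConflict`,
`sweepOne` and the in-memory fakes are hypothetical names, not
elastickv APIs; the fake queue reports a missing entry as a
conflict to stand in for the losing sweeper's failed Raft txn.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrWriteConflict is a hypothetical sentinel standing in for the
// write-write conflict a losing sweeper would get from its Raft txn.
var ErrWriteConflict = errors.New("write-write conflict")

// queueStore models the Raft-replicated GC queue (assumed interface).
type queueStore interface {
	DeleteQueueEntry(sha string) error
}

// blobStore models the node-local Pebble chunkblob store (assumed interface).
type blobStore interface {
	DeleteChunkBlob(sha string) error
}

// sweepOne runs the two phases for one SHA whose RC was re-checked as 0.
// It reports whether this sweeper won the queue entry.
func sweepOne(q queueStore, b blobStore, sha string) (bool, error) {
	// Phase i: Raft first. Losing the race means another sweeper owns the SHA.
	if err := q.DeleteQueueEntry(sha); err != nil {
		if errors.Is(err, ErrWriteConflict) {
			return false, nil
		}
		return false, err
	}
	// A crash at this point leaves the blob on disk with no queue entry:
	// a local space leak that the periodic orphan scan would reclaim.

	// Phase ii: local second. No Raft round-trip.
	if err := b.DeleteChunkBlob(sha); err != nil {
		return true, fmt.Errorf("local delete of %s deferred to orphan scan: %w", sha, err)
	}
	return true, nil
}

// memQueue is an in-memory fake; a missing entry is reported as a conflict.
type memQueue map[string]bool

func (m memQueue) DeleteQueueEntry(sha string) error {
	if !m[sha] {
		return ErrWriteConflict
	}
	delete(m, sha)
	return nil
}

// memBlobs is an in-memory fake; deleting a missing blob is a no-op.
type memBlobs map[string]bool

func (m memBlobs) DeleteChunkBlob(sha string) error {
	delete(m, sha)
	return nil
}

func main() {
	q := memQueue{"sha-a": true}
	b := memBlobs{"sha-a": true}
	for i := 1; i <= 2; i++ {
		won, err := sweepOne(q, b, "sha-a")
		fmt.Printf("sweeper %d: won=%v err=%v\n", i, won, err)
	}
}
```

In this sketch the second call never reaches the local phase,
which is the property the Raft-first ordering is meant to give.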

s3_raft_blob_offload.md — LOW: the same §3.5 closing paragraph
said "Both are Raft-replicated" referring to the queue and RC.
That phrasing implied the chunkblob deletes were Raft-replicated
too. Rewrite to distinguish the two explicitly: queue + RC are
Raft-replicated; local chunkblob deletes are deliberately
node-local, because keeping chunkblob data out of Raft is the
point of the offload architecture.

s3_admission_control.md — LOW: §4 retry-budget bound formula
referenced `redisDispatchTimeout`, a Redis-path constant copy-
pasted into the S3 design. The S3 PUT path actually uses the
inbound `*http.Request` context (no S3-specific Dispatch timeout),
so the formula now reads `single_dispatch_budget` with an explicit
note that the upper bound is whatever the request context allows
at that moment.
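
As a worked illustration of the bound, a small Go sketch follows.
The function names and every numeric value are assumptions made
for the example; the design docs do not state concrete values for
`s3TxnRetryMaxAttempts`, `s3TxnRetryMaxBackoff` or
`dispatchAdmissionTimeout` in this commit.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// singleDispatchBudget returns the time left on the request context,
// or false when the context carries no deadline (bound is then open-ended).
func singleDispatchBudget(ctx context.Context, now time.Time) (time.Duration, bool) {
	deadline, ok := ctx.Deadline()
	if !ok {
		return 0, false
	}
	return deadline.Sub(now), true
}

// retryChainBound computes maxAttempts × (singleDispatch + maxBackoff).
func retryChainBound(maxAttempts int, singleDispatch, maxBackoff time.Duration) time.Duration {
	return time.Duration(maxAttempts) * (singleDispatch + maxBackoff)
}

func main() {
	// Illustrative values only; none of them come from the design docs.
	const maxAttempts = 3
	maxBackoff := 500 * time.Millisecond
	admissionTimeout := 5 * time.Second

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	budget, ok := singleDispatchBudget(ctx, time.Now())
	if !ok {
		fmt.Println("no deadline on the request context: bound is open-ended")
		return
	}
	bound := retryChainBound(maxAttempts, budget, maxBackoff)
	fmt.Printf("retry chain bound: %v\n", bound)
	fmt.Printf("exceeds admission timeout: %v\n", bound > admissionTimeout)
}
```

With these made-up numbers the chain can outlast the admission
timeout, which is the case where the next batch's acquire would
surface as 503.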

s3_admission_control.md — LOW: §3.5 metrics spec defined only
`stage="prereserve" | "perbatch"` but §6 and the Rolling Upgrade
subsection both reference a `stage="perbatch", protocol="chunked"`
label combination for isolating chunked-PUT rejection events. Add
the `protocol="fixed-length" | "chunked"` label dimension to
`elastickv_s3_put_admission_rejections_total` and
`elastickv_s3_put_admission_wait_seconds`, with a brief paragraph
explaining why the split is operationally meaningful (chunked
head-of-line (HoL) events vs. fixed-length client-concurrency
events).
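
To show why the extra dimension matters, here is a
standard-library-only Go sketch of a counter keyed by both labels.
It is not the project's metrics code and deliberately avoids any
real metrics client; `rejectionCounter` and `rejectionKey` are
hypothetical names.

```go
package main

import "fmt"

// rejectionKey is the (stage, protocol) label pair from the metrics spec.
type rejectionKey struct {
	Stage    string
	Protocol string
}

// Allowed label values, as listed in the spec.
var (
	validStages    = map[string]bool{"prereserve": true, "perbatch": true}
	validProtocols = map[string]bool{"fixed-length": true, "chunked": true}
)

// rejectionCounter is a toy stand-in for a labelled counter.
type rejectionCounter map[rejectionKey]int

// Inc bumps the series for one label pair and rejects unknown label values.
func (c rejectionCounter) Inc(stage, protocol string) error {
	if !validStages[stage] || !validProtocols[protocol] {
		return fmt.Errorf("unknown labels stage=%q protocol=%q", stage, protocol)
	}
	c[rejectionKey{Stage: stage, Protocol: protocol}]++
	return nil
}

func main() {
	c := rejectionCounter{}
	_ = c.Inc("perbatch", "chunked")        // chunked client outpacing Raft drain
	_ = c.Inc("perbatch", "chunked")        // same failure mode again
	_ = c.Inc("prereserve", "fixed-length") // client concurrency over the cap

	for key, n := range c {
		fmt.Printf("stage=%s protocol=%s rejections=%d\n", key.Stage, key.Protocol, n)
	}
}
```

With a single `stage`-only counter the first two increments and a
fixed-length per-batch rejection would land in the same series.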

No code changes; design docs only.
diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md
@@ -283,10 +283,23 @@ materialise by default.
 
 ```
 elastickv_s3_put_admission_inflight_bytes        gauge
-elastickv_s3_put_admission_rejections_total      counter (label: stage="prereserve" | "perbatch")
-elastickv_s3_put_admission_wait_seconds          histogram (label: stage)
+elastickv_s3_put_admission_rejections_total      counter (labels:
+                                                    stage    = "prereserve" | "perbatch",
+                                                    protocol = "fixed-length" | "chunked")
+elastickv_s3_put_admission_wait_seconds          histogram (labels: stage, protocol)
 ```
 
+The `protocol` label distinguishes fixed-length PUTs (those with a
+declared `Content-Length`, hitting admission A's `peekHeadroom`)
+from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode).
+This split is what makes the chunked-PUT 503 surface (§6) and the
+rolling-upgrade alerting story actionable: a spike on
+`stage="perbatch", protocol="chunked"` points at "chunked clients
+beat Raft drain"; a spike on `stage="prereserve",
+protocol="fixed-length"` points at "client concurrency exceeds
+the per-node aggregate cap." Without the dimension the two
+failure modes are indistinguishable in a single counter.
+
 Grafana panel: inflight gauge with the cap as a horizontal line so
 the operator sees how often the system saturates. Rejection rate
 suggests bumping the cap or scaling out (more nodes spreads PUT load).
@@ -316,13 +329,19 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load).
   pendingBatch slice for the entire retry window, so the budget
   must reflect them; a release-between-retries scheme would let a
   second PUT proceed while the first is still memory-resident,
-  breaking the bound. The total wall-clock cost of holding through
-  one full retry chain is bounded by
-  `s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)`;
-  if that ever exceeds `dispatchAdmissionTimeout` the per-batch
-  acquire on the *next* batch surfaces as 503, which is the right
-  failure mode (chronic dispatch failure → caller learns instead of
-  silently consuming the budget).
+  breaking the bound. The S3 PUT path uses the inbound
+  `*http.Request` context for `coordinator.Dispatch` (no
+  S3-specific Dispatch timeout — the HTTP server's
+  `writeTimeout` / client-side cancellation is the upper bound on
+  one Dispatch attempt), so the wall-clock cost of holding the
+  slot through one full retry chain is bounded by
+  `s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)`
+  where `single_dispatch_budget` is whatever the request context
+  permits at that moment. If the retry chain duration ever
+  exceeds `dispatchAdmissionTimeout` the per-batch acquire on the
+  *next* batch surfaces as 503 — the right failure mode
+  (chronic dispatch failure → caller learns instead of silently
+  consuming the budget).
 
 ## 5. Implementation plan
 
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
@@ -323,22 +323,48 @@ sweeper needs to know not just *that* the RC reached zero but
    a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now-gracePeriod>|)`
       for entries whose grace window has elapsed,
    b. for each `<SHA>` returned, re-checks the RC counter at the
-      sweeper's read timestamp; if the RC is still 0, deletes the
-      local `!s3|chunkblob|<SHA>` AND deletes the queue entry —
-      both as a single Raft txn so the queue stays consistent with
-      the keyspace,
+      sweeper's read timestamp.
+
+      *The deletion is two-phase across two storage layers and is
+      NOT a single transaction* — `!s3|chunkblob|<SHA>` is local
+      Pebble (per §3.1, never written through Raft), while
+      `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
+      MUST run in this order:
+
+      i. **Raft phase first.** Delete the queue entry through a
+         Raft txn. Concurrent sweepers serialise here on
+         write-write-conflict; only the winner proceeds to the
+         local phase. This makes the queue the global "we are
+         GC-ing this SHA" lock.
+      ii. **Local phase second.** Delete the local
+          `!s3|chunkblob|<SHA>` from Pebble. No Raft round-trip.
+
+      The phase ordering is the load-bearing detail. If we did
+      local-first then Raft, a crash between the two phases would
+      leave the chunkblob gone locally but the queue entry still
+      present — every subsequent sweep would re-attempt the local
+      delete (no-op) and the queue entry would never get removed
+      until manual intervention. The Raft-first ordering trades
+      that for the inverse failure mode: a crash between the two
+      phases leaves the queue entry deleted but the local
+      chunkblob still on disk — a **bounded local space leak,
+      not a correctness bug**. A periodic "orphan scan" (chunkblob
+      keys whose SHA has RC=0 *and* no queue entry) reclaims
+      these without urgency. The orphan scan runs at low priority
+      out of band from the sweeper.
    c. if the RC has bounced above 0 in the meantime, the queue
       entry is stale (a re-reference txn forgot to remove it, or
       the sweeper raced) and the sweeper deletes only the queue
-      entry, leaving the chunkblob in place.
+      entry through a Raft txn, leaving the chunkblob in place.
 
 The queue is the authoritative "blob is GC-eligible since T"
-signal; the RC is the authoritative "is reachable" signal. Both
-are Raft-replicated, so every node arrives at the same set of
-sweepable SHAs and the same eligibility window. Different nodes
-running sweepers concurrently is safe because step (3b) commits
-through Raft and the txn fails with a write-write conflict on the
-second sweeper, leaving the first sweeper's deletion authoritative.
+signal *and* the global "we are GC-ing this SHA" lock — its
+Raft-replicated single-writer-per-key property is what makes
+concurrent sweepers safe across nodes. The RC is the
+authoritative "is reachable" signal, also Raft-replicated. Local
+chunkblob deletes are deliberately *not* replicated: each node
+deletes its own copy independently after the queue-entry txn
+commits, because that's the whole point of the architecture.
 
 `chunkBlobGCGracePeriod` defaults to 1 hour. The grace window
 absorbs in-flight reads (a peer that has already started fetching