Skip to content

Commit 7475060

Browse files
committed
docs(design): address Claude bot round-4 review (1 HIGH, 3 LOW)
s3_raft_blob_offload.md — HIGH: §3.5 step (3b) claimed "delete the local chunkblob AND the queue entry as a single Raft txn" — but chunkblob is local-only Pebble per §3.1 and the queue is Raft-replicated, so the two ops live in different storage layers and cannot share a Raft txn. Rewrite the step to specify a two-phase ordering: i. Raft phase first: delete the queue entry through Raft. Concurrent sweepers serialise here on write-write-conflict; the queue is therefore the global "we are GC-ing this SHA" lock. ii. Local phase second: delete the local chunkblob from Pebble. Document the failure mode of the inverse ordering (local-first would orphan the queue entry on crash) and of the chosen ordering (crash between phases leaves a local space leak — bounded, no correctness consequence — recoverable by a periodic orphan scan). s3_raft_blob_offload.md — LOW: the same §3.5 closing paragraph said "Both are Raft-replicated" referring to the queue and RC. That phrasing implied the chunkblob deletes were Raft-replicated too. Rewrite to explicitly distinguish: queue + RC are Raft-replicated; local chunkblob deletes are deliberately node-local because that is the whole point of the architecture. s3_admission_control.md — LOW: §4 retry-budget bound formula referenced `redisDispatchTimeout`, a Redis-path constant copy- pasted into the S3 design. The S3 PUT path actually uses the inbound `*http.Request` context (no S3-specific Dispatch timeout), so the formula now reads `single_dispatch_budget` with an explicit note that the upper bound is whatever the request context allows at that moment. s3_admission_control.md — LOW: §3.5 metrics spec defined only `stage="prereserve" | "perbatch"` but §6 and the Rolling Upgrade subsection both reference a `stage="perbatch", protocol="chunked"` label combination for isolating chunked-PUT rejection events. Add the `protocol="fixed-length" | "chunked"` label dimension to `elastickv_s3_put_admission_rejections_total` and `elastickv_s3_put_admission_wait_seconds`, with a brief paragraph explaining why the split is operationally meaningful (chunked HoL events vs. fixed-length client-concurrency events). No code changes; design docs only.
1 parent a69d902 commit 7475060

2 files changed

Lines changed: 65 additions & 20 deletions

File tree

docs/design/2026_04_25_proposed_s3_admission_control.md

Lines changed: 28 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -283,10 +283,23 @@ materialise by default.
283283

284284
```
285285
elastickv_s3_put_admission_inflight_bytes gauge
286-
elastickv_s3_put_admission_rejections_total counter (label: stage="prereserve" | "perbatch")
287-
elastickv_s3_put_admission_wait_seconds histogram (label: stage)
286+
elastickv_s3_put_admission_rejections_total counter (labels:
287+
stage = "prereserve" | "perbatch",
288+
protocol = "fixed-length" | "chunked")
289+
elastickv_s3_put_admission_wait_seconds histogram (labels: stage, protocol)
288290
```
289291

292+
The `protocol` label distinguishes fixed-length PUTs (those with a
293+
declared `Content-Length`, hitting admission A's `peekHeadroom`)
294+
from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode).
295+
This split is what makes the chunked-PUT 503 surface (§6) and the
296+
rolling-upgrade alerting story actionable: a spike on
297+
`stage="perbatch", protocol="chunked"` points at "chunked clients
298+
beat Raft drain"; a spike on `stage="prereserve",
299+
protocol="fixed-length"` points at "client concurrency exceeds
300+
the per-node aggregate cap." Without the dimension the two
301+
failure modes are indistinguishable in a single counter.
302+
290303
Grafana panel: inflight gauge with the cap as a horizontal line so
291304
the operator sees how often the system saturates. Rejection rate
292305
suggests bumping the cap or scaling out (more nodes spreads PUT load).
@@ -316,13 +329,19 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load).
316329
pendingBatch slice for the entire retry window, so the budget
317330
must reflect them; a release-between-retries scheme would let a
318331
second PUT proceed while the first is still memory-resident,
319-
breaking the bound. The total wall-clock cost of holding through
320-
one full retry chain is bounded by
321-
`s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)`;
322-
if that ever exceeds `dispatchAdmissionTimeout` the per-batch
323-
acquire on the *next* batch surfaces as 503, which is the right
324-
failure mode (chronic dispatch failure → caller learns instead of
325-
silently consuming the budget).
332+
breaking the bound. The S3 PUT path uses the inbound
333+
`*http.Request` context for `coordinator.Dispatch` (no
334+
S3-specific Dispatch timeout — the HTTP server's
335+
`writeTimeout` / client-side cancellation is the upper bound on
336+
one Dispatch attempt), so the wall-clock cost of holding the
337+
slot through one full retry chain is bounded by
338+
`s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)`
339+
where `single_dispatch_budget` is whatever the request context
340+
permits at that moment. If the retry chain duration ever
341+
exceeds `dispatchAdmissionTimeout` the per-batch acquire on the
342+
*next* batch surfaces as 503 — the right failure mode
343+
(chronic dispatch failure → caller learns instead of silently
344+
consuming the budget).
326345

327346
## 5. Implementation plan
328347

docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

Lines changed: 37 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -323,22 +323,48 @@ sweeper needs to know not just *that* the RC reached zero but
323323
a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now-gracePeriod>|)`
324324
for entries whose grace window has elapsed,
325325
b. for each `<SHA>` returned, re-checks the RC counter at the
326-
sweeper's read timestamp; if the RC is still 0, deletes the
327-
local `!s3|chunkblob|<SHA>` AND deletes the queue entry —
328-
both as a single Raft txn so the queue stays consistent with
329-
the keyspace,
326+
sweeper's read timestamp.
327+
328+
*The deletion is two-phase across two storage layers and is
329+
NOT a single transaction*`!s3|chunkblob|<SHA>` is local
330+
Pebble (per §3.1, never written through Raft), while
331+
`!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
332+
MUST run in this order:
333+
334+
i. **Raft phase first.** Delete the queue entry through a
335+
Raft txn. Concurrent sweepers serialise here on
336+
write-write-conflict; only the winner proceeds to the
337+
local phase. This makes the queue the global "we are
338+
GC-ing this SHA" lock.
339+
ii. **Local phase second.** Delete the local
340+
`!s3|chunkblob|<SHA>` from Pebble. No Raft round-trip.
341+
342+
The phase ordering is the load-bearing detail. If we did
343+
local-first then Raft, a crash between the two phases would
344+
leave the chunkblob gone locally but the queue entry still
345+
present — every subsequent sweep would re-attempt the local
346+
delete (no-op) and the queue entry would never get removed
347+
until manual intervention. The Raft-first ordering trades
348+
that for the inverse failure mode: a crash between the two
349+
phases leaves the queue entry deleted but the local
350+
chunkblob still on disk — a **bounded local space leak,
351+
not a correctness bug**. A periodic "orphan scan" (chunkblob
352+
keys whose SHA has RC=0 *and* no queue entry) reclaims
353+
these without urgency. The orphan scan runs at low priority
354+
out of band from the sweeper.
330355
c. if the RC has bounced above 0 in the meantime, the queue
331356
entry is stale (a re-reference txn forgot to remove it, or
332357
the sweeper raced) and the sweeper deletes only the queue
333-
entry, leaving the chunkblob in place.
358+
entry through a Raft txn, leaving the chunkblob in place.
334359

335360
The queue is the authoritative "blob is GC-eligible since T"
336-
signal; the RC is the authoritative "is reachable" signal. Both
337-
are Raft-replicated, so every node arrives at the same set of
338-
sweepable SHAs and the same eligibility window. Different nodes
339-
running sweepers concurrently is safe because step (3b) commits
340-
through Raft and the txn fails with a write-write conflict on the
341-
second sweeper, leaving the first sweeper's deletion authoritative.
361+
signal *and* the global "we are GC-ing this SHA" lock — its
362+
Raft-replicated single-writer-per-key property is what makes
363+
concurrent sweepers safe across nodes. The RC is the
364+
authoritative "is reachable" signal, also Raft-replicated. Local
365+
chunkblob deletes are deliberately *not* replicated: each node
366+
deletes its own copy independently after the queue-entry txn
367+
commits, because that's the whole point of the architecture.
342368

343369
`chunkBlobGCGracePeriod` defaults to 1 hour. The grace window
344370
absorbs in-flight reads (a peer that has already started fetching

0 commit comments

Comments
 (0)