@@ -94,6 +94,18 @@ command byte" simplicity, and a bounded `ZRANGE 0 10` contributes at
 most one unmarshal per request (cheap). Dynamic (observed-cost)
 classification is a follow-up; v1 bias is a boring, reviewable list.

+**Blocking `XREAD BLOCK ms` is a special case** — it may hold a
+worker slot for up to `ms` milliseconds while doing no work, which
+can trivially exhaust the pool if even a handful of long-polling
+consumers are active. v1 resolution: **blocking variants (`XREAD
+BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`BZPOPMAX`) bypass the
+heavy-command pool and are handled each on its own goroutine**. They
+are I/O-bound waiting, not CPU-bound; their CPU cost lands on
+wake-up, when the dispatcher can re-evaluate whether to gate the
+follow-up work. The gating decision is made in `dispatchCommand`
+when it sees the command name plus the `BLOCK`/`B*` prefix — the
+simplest arg inspection we allow, limited to "is this command
+blocking?".
+
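+A minimal sketch of that gate; `isBlocking`, `heavyCommands`, and
+`submitToPool` are illustrative names and signatures, not the repo's
+API:
+
+```go
+package adapter
+
+import (
+	"context"
+	"strings"
+)
+
+type server struct{} // stub for the adapter's server type
+
+func (s *server) handle(ctx context.Context, cmd string, args [][]byte) {}
+
+// submitToPool hands fn to the heavy-command worker pool; one
+// possible implementation is sketched in the context-propagation
+// note further down.
+func (s *server) submitToPool(ctx context.Context, fn func(context.Context)) error {
+	fn(ctx) // stub
+	return nil
+}
+
+// The static heavy list lives next to dispatchCommand per the doc.
+var heavyCommands = map[string]bool{"KEYS": true, "HGETALL": true, "EVAL": true}
+
+// isBlocking is the entire arg inspection the gate allows: command
+// name plus the BLOCK / B* prefix, nothing deeper.
+func isBlocking(cmd string, args [][]byte) bool {
+	switch cmd {
+	case "BLPOP", "BRPOP", "BZPOPMIN", "BZPOPMAX":
+		return true
+	case "XREAD":
+		for _, a := range args {
+			if strings.EqualFold(string(a), "BLOCK") {
+				return true
+			}
+		}
+	}
+	return false
+}
+
+func (s *server) dispatchCommand(ctx context.Context, cmd string, args [][]byte) {
+	cmd = strings.ToUpper(cmd)
+	switch {
+	case isBlocking(cmd, args):
+		// Bypass the pool: hold no worker slot while parked; CPU
+		// cost lands on wake-up and can be re-gated there.
+		go s.handle(ctx, cmd, args)
+	case heavyCommands[cmd]:
+		_ = s.submitToPool(ctx, func(ctx context.Context) {
+			s.handle(ctx, cmd, args)
+		})
+	default:
+		s.handle(ctx, cmd, args) // ungated commands stay synchronous
+	}
+}
+```
+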
 ### Tradeoffs

 - Adds an enqueue → pickup hop for gated commands. Pool-has-capacity
@@ -121,6 +133,17 @@ outer Lua waiting for an inner call that can never start. Two options:
 (A) preserves "one client request = one slot." Ship must pick one
 explicitly; do not discover this at test time.

+**Implementation note for (A): context propagation.** `Submit`
+identifies "inside a pool slot" by attaching a sentinel value to
+`context.Context` (`ctxKeyInPoolSlot`). The Lua adapter threads that
+`ctx` into every `redis.call` it makes; the dispatcher's pool-gate
+check returns immediately when `ctx.Value(ctxKeyInPoolSlot) != nil`
+instead of attempting another `Submit`. This is the only mechanism
+that reliably distinguishes "new client request" from "inner call"
+without tagging every goroutine or holding a pool-wide set of
+goroutine IDs. The sentinel must be package-private so external
+callers cannot fake it.
+
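+A sketch of that mechanism, assuming a channel-semaphore pool and a
+`Submit(ctx, fn)` shape; only `ctxKeyInPoolSlot` and the -BUSY reply
+come from this doc, the rest is illustrative:
+
+```go
+package adapter
+
+import (
+	"context"
+	"errors"
+)
+
+// ErrPoolBusy is surfaced to the client as -BUSY server overloaded.
+var ErrPoolBusy = errors.New("BUSY server overloaded")
+
+// Unexported key type: code outside the package cannot construct
+// it, so it cannot fake "inside a pool slot".
+type ctxKey struct{}
+
+var ctxKeyInPoolSlot ctxKey
+
+type Pool struct {
+	slots chan struct{} // buffered to the pool size
+}
+
+func NewPool(size int) *Pool {
+	return &Pool{slots: make(chan struct{}, size)}
+}
+
+func (p *Pool) Submit(ctx context.Context, fn func(context.Context) error) error {
+	// Inner call (e.g. a redis.call made by a Lua script that is
+	// already holding a slot): run inline instead of acquiring a
+	// second slot, which is what option (A) requires.
+	if ctx.Value(ctxKeyInPoolSlot) != nil {
+		return fn(ctx)
+	}
+	select {
+	case p.slots <- struct{}{}: // acquire a worker slot
+	default:
+		return ErrPoolBusy // pool full: reject rather than queue forever
+	}
+	defer func() { <-p.slots }()
+	return fn(context.WithValue(ctx, ctxKeyInPoolSlot, true))
+}
+```
+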
 ### Recommended v1 shape

 Package-level pool in `adapter/` with a `Submit(command, fn)` entry
@@ -129,15 +152,18 @@ commands in `dispatchCommand` call `Submit`; ungated stay
 synchronous. Static list lives next to `dispatchCommand`. Pool-full →
 `-BUSY server overloaded`. Lua follows option (A).

-**Container-aware sizing.** `runtime.GOMAXPROCS(0)` on Linux returns
-the host CPU count, not the container's cgroup quota (unless
-`GOMAXPROCS` is set explicitly or something has configured it).
-Operators running under Kubernetes/Docker with a CPU limit should
-either (a) set `GOMAXPROCS` in the deploy environment to match the
-cgroup quota (the project's rolling-update script is the natural
-place), or (b) wire `go.uber.org/automaxprocs` into `main.go` so the
-correction happens at startup. v1 prefers (a) for auditability;
-(b) is acceptable as a follow-up if operators want it automatic.
+**Container-aware sizing.** Go 1.25+ (which this repo uses) derives
+the default `GOMAXPROCS` from the cgroup v2 CPU quota on Linux
+automatically, so in most cases `runtime.GOMAXPROCS(0)` already
+reflects the container's share. Two caveats remain: (a) Go runtimes
+older than 1.25 do not, and (b) explicitly setting `GOMAXPROCS`
+disables the runtime's periodic quota-change detection, so an
+operator who hard-codes the value in the deploy environment loses
+auto-updates if the quota changes at runtime. v1 leaves the runtime
+default in place and documents the two caveats; a `GOMAXPROCS` env
+override is still honoured for operators who want explicit control.
+`go.uber.org/automaxprocs` remains an option for pre-1.25 toolchains
+but is not needed for this repo.
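+
+In pool-sizing terms (a sketch; the one-proc reserve is illustrative,
+not from this doc):
+
+```go
+package adapter
+
+import "runtime"
+
+func defaultPoolSize() int {
+	// On Go 1.25+ Linux this already reflects the cgroup v2 CPU
+	// quota; an explicit GOMAXPROCS env override flows through here
+	// too (at the cost of disabling quota-change detection).
+	n := runtime.GOMAXPROCS(0)
+	if n <= 1 {
+		return 1
+	}
+	return n - 1 // keep one proc free for the accept loop
+}
+```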

 **Single pool vs per-class sub-pools.** v1 uses a single global pool.
 The risk: a burst of `KEYS *` or `SCAN` from a management client can
@@ -285,6 +311,18 @@ network failure. Per-client in-flight semaphore is deferred: it
 requires threading client identity through every dispatch, which is
 a bigger change than 2026-04-24 justifies.

+**Avoiding a reject-storm feedback loop.** A client with an
+aggressive reconnect pool can answer each `-ERR max connections`
+with an immediate new `connect()` — the server spends CPU on the
+accept/write/close cycle and the client makes no progress. Two
+mitigations: (a) **rate-limit the reject itself**: once a peer IP
+has been rejected `R` times in the last second, the next rejects
+are answered with `RST` (cheap kernel-level reset) instead of an
+accept + write + close; (b) document operator-side client
+configuration (e.g., for redis-rb: `reconnect_attempts=3` plus an
+exponential backoff). (a) ships in v1 behind a compile-time
+constant; (b) belongs in the ops runbook.
+
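+One way (a) could look; `rejectLimiter` and `maxPoliteRejects` are
+hypothetical names. Note that Go must still accept() to learn the
+peer IP; `SetLinger(0)` then turns the close into a single RST with
+no RESP write:
+
+```go
+package adapter
+
+import (
+	"net"
+	"sync"
+	"time"
+)
+
+// maxPoliteRejects plays the role of the compile-time constant R.
+const maxPoliteRejects = 8
+
+// rejectLimiter counts rejects per peer IP in one-second windows.
+type rejectLimiter struct {
+	mu      sync.Mutex
+	counts  map[string]int
+	resetAt time.Time
+}
+
+func (l *rejectLimiter) overLimit(ip string) bool {
+	l.mu.Lock()
+	defer l.mu.Unlock()
+	if now := time.Now(); now.After(l.resetAt) {
+		l.counts = map[string]int{} // start a new one-second window
+		l.resetAt = now.Add(time.Second)
+	}
+	l.counts[ip]++
+	return l.counts[ip] > maxPoliteRejects
+}
+
+// reject answers one connection that exceeded max connections.
+func reject(c *net.TCPConn, l *rejectLimiter) {
+	ip, _, _ := net.SplitHostPort(c.RemoteAddr().String())
+	if l.overLimit(ip) {
+		c.SetLinger(0) // Close now emits RST instead of FIN
+		c.Close()
+		return
+	}
+	c.Write([]byte("-ERR max connections\r\n"))
+	c.Close()
+}
+```
+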
 ### Where in the code

 - `adapter/redis.go:631` — `Run`, where `redcon.Serve` is called.
@@ -338,34 +376,55 @@ new entries. O(new), matching the XREAD spec.
 ### Migration path

 Streams persist across restarts and can be large, so no flag-day
-rewrite. Dual-format read:
-
-1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
-2. If empty AND the legacy single-blob key exists, fall back to the
-   legacy path.
-3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
-   delete the legacy blob in the same commit.
-
-Remove the legacy fallback later once
-`elastickv_stream_legacy_format_reads_total` has sat at zero across
-all nodes for a soak window.
-
-**Chunked migration for large legacy streams.** A single `XADD` on a
-100k-entry legacy stream would rewrite every entry in one Raft
-transaction — regressing into the same CPU and commit-time spike the
-design is supposed to prevent. The migration write therefore caps how
-many entries it rewrites per transaction (`STREAM_MIGRATION_CHUNK`,
-default 1024). When the legacy stream exceeds the chunk, the first
-write migrates the oldest `CHUNK` entries and the remaining legacy
-tail stays in a legacy-*suffix* key (the blob minus the entries
-already promoted). Subsequent writes drain another chunk each, until
-the legacy tail is empty and can be deleted. A background "migrator"
-goroutine driven by the same `legacy_format_reads_total` metric is a
-follow-up if operator-driven migration proves too slow.
-
-The existing stream PR (#620) ships the simpler "rewrite all in one
-txn" version; chunked migration is a stacked follow-up once we see
-legacy stream sizes in production.
+rewrite.
+
+**Two migration modes** — simple (PR #620, the v1 stream PR) and
+chunked (a stacked follow-up). The dual-read rule differs between
+them; the distinction matters for correctness.
+
+**Mode A — simple migration (PR #620 ships this):** the first write
+rewrites the entire legacy blob and deletes it in one Raft commit.
+At any given instant a stream is either entirely legacy or entirely
+per-entry; there is no mixed state. Read rule (a sketch follows the
+list):
+
+1. On XREAD/XRANGE/XLEN/XREVRANGE, read the per-entry layout.
+2. If the per-entry meta key is absent AND the legacy blob key
+   exists, fall back to the legacy path.
+3. On the next write, rewrite to per-entry and delete the legacy
+   blob in the same commit.
+
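+A minimal sketch of the Mode A rule, over a hypothetical `store`
+interface (none of these names are the repo's API):
+
+```go
+package stream
+
+// Entry is a decoded stream entry.
+type Entry struct {
+	ID     string
+	Fields map[string]string
+}
+
+// store is a hypothetical read-side view of the KV layer.
+type store interface {
+	hasMeta(key string) bool             // per-entry meta key present?
+	readPerEntry(key string) []Entry     // new layout
+	hasLegacyBlob(key string) bool       // old single-blob key present?
+	decodeLegacyBlob(key string) []Entry // legacy fallback
+}
+
+// readModeA: no mixed state exists, so keying the fallback off the
+// meta key's absence (not off "zero entries") is safe and cheap.
+func readModeA(s store, key string) ([]Entry, bool) {
+	if s.hasMeta(key) {
+		return s.readPerEntry(key), true
+	}
+	if s.hasLegacyBlob(key) {
+		return s.decodeLegacyBlob(key), true // counts as a legacy read
+	}
+	return nil, false // no such stream
+}
+```
+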
+**Mode B — chunked migration (follow-up):** each write drains at
+most `STREAM_MIGRATION_CHUNK` (default 1024) entries from the legacy
+blob into per-entry keys, and leaves the rest in a *legacy-suffix*
+key until a subsequent write drains more (one drain step is sketched
+below). During this window the stream exists in BOTH layouts
+simultaneously: the oldest N entries are per-entry, the newer M
+entries are still in the suffix blob.
+
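+A sketch of one drain step; the `kvTxn` accessors are hypothetical,
+and only `STREAM_MIGRATION_CHUNK` and the one-commit rule come from
+this doc:
+
+```go
+package stream
+
+const streamMigrationChunk = 1024 // STREAM_MIGRATION_CHUNK default
+
+type Entry struct{ ID, Payload string } // as in the Mode A sketch
+
+// kvTxn is a hypothetical view of the Raft transaction carrying the
+// triggering write; every mutation below lands in that same commit.
+type kvTxn interface {
+	readLegacySuffix(key string) []Entry // un-migrated tail, oldest first
+	putPerEntry(key string, e Entry)
+	writeLegacySuffix(key string, tail []Entry)
+	deleteLegacySuffix(key string)
+}
+
+func drainChunk(txn kvTxn, key string) {
+	legacy := txn.readLegacySuffix(key)
+	n := min(streamMigrationChunk, len(legacy))
+	for _, e := range legacy[:n] {
+		txn.putPerEntry(key, e) // promote the oldest entries
+	}
+	if n == len(legacy) {
+		txn.deleteLegacySuffix(key) // fully drained
+	} else {
+		txn.writeLegacySuffix(key, legacy[n:]) // shrink the tail
+	}
+}
+```
+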
+Read rule for Mode B — **always merge both layouts** (see the sketch
+after this list), do not fall through on "new layout empty":
+
+1. Read `meta` if present; read all per-entry keys that match the
+   requested ID range.
+2. Read the legacy-suffix blob if present; decode only entries
+   falling in the ID range.
+3. Merge by ID order, deduplicate (the migrator is responsible for
+   never writing the same ID in both layouts in a single commit),
+   and return.
+
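+A sketch of the merged read, with a struct ID so ordering is numeric
+(lexicographic string comparison would mis-order stream IDs); all
+names are illustrative:
+
+```go
+package stream
+
+// ID is a stream entry ID; both layouts store entries in ascending
+// ID order.
+type ID struct{ Ms, Seq uint64 }
+
+func (a ID) less(b ID) bool {
+	return a.Ms < b.Ms || (a.Ms == b.Ms && a.Seq < b.Seq)
+}
+
+type Entry struct {
+	ID     ID
+	Fields map[string]string
+}
+
+// mergeLayouts implements the Mode B rule: consult BOTH layouts and
+// merge by ID order. Both inputs are already range-filtered and
+// sorted ascending.
+func mergeLayouts(perEntry, legacyTail []Entry) []Entry {
+	out := make([]Entry, 0, len(perEntry)+len(legacyTail))
+	i, j := 0, 0
+	for i < len(perEntry) && j < len(legacyTail) {
+		switch {
+		case perEntry[i].ID.less(legacyTail[j].ID):
+			out = append(out, perEntry[i])
+			i++
+		case legacyTail[j].ID.less(perEntry[i].ID):
+			out = append(out, legacyTail[j])
+			j++
+		default:
+			// Same ID in both layouts should never happen (the
+			// migrator never writes one ID to both in a commit);
+			// deduplicate defensively, preferring the new layout.
+			out = append(out, perEntry[i])
+			i, j = i+1, j+1
+		}
+	}
+	out = append(out, perEntry[i:]...)
+	return append(out, legacyTail[j:]...)
+}
+```
+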
+The v1 dual-read (Mode A) is safe because there is no mixed state.
+Extending it verbatim to Mode B would return incomplete results
+during chunked migration — entries still in the legacy suffix would
+be invisible to readers until the suffix was fully drained. Mode B
+must ship together with the "always merge" read rule.
+
+`elastickv_stream_legacy_format_reads_total` counts reads that
+touched a legacy-format key in either mode. Remove the legacy
+fallback only after it has sat at zero across all nodes for a soak
+window.
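+
+If the metrics layer is prometheus/client_golang (an assumption, not
+something this doc states), the counter could look like:
+
+```go
+package stream
+
+import "github.com/prometheus/client_golang/prometheus"
+
+var legacyFormatReads = prometheus.NewCounter(prometheus.CounterOpts{
+	Name: "elastickv_stream_legacy_format_reads_total",
+	Help: "Reads that touched a legacy-format stream key (any mode).",
+})
+
+func init() { prometheus.MustRegister(legacyFormatReads) }
+
+// Call legacyFormatReads.Inc() on every read that touches either the
+// legacy blob (Mode A) or the legacy-suffix key (Mode B).
+```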
+
+The existing stream PR (#620) ships **Mode A only**. Chunked
+migration (Mode B) is explicitly deferred and must not be enabled
+before the merged-read rule lands alongside it.

 ### Other one-blob-per-key collections

@@ -427,8 +486,17 @@ Recommended order of implementation:
    workload.
 2. **Layer 1 second.** Generic defense for the next unknown
    hotspot. Static command list is small, reviewable, and composes
-   with Layer 4 — streams become a cheap command once Layer 4 ships,
-   but the pool still catches Lua, KEYS, and HGETALL.
+   with Layer 4. **Once Layer 4 ships, XREAD's per-call cost is
+   O(new), so in steady state it is cheap**, but we deliberately
+   keep it gated in Layer 1 v1 for three reasons: (i) a client can
+   still request a huge ID range via XRANGE or a massive `COUNT` on
+   XREAD that the adapter must scan; (ii) the legacy fallback path
+   is still reachable during the migration soak window, and that
+   path is still O(n); (iii) revisiting the classification once the
+   Layer 4 + Layer 6 metrics are in is a reviewable, data-driven
+   decision, not a v1 speculation. The
+   `elastickv_heavy_command_pool_submit_total{cmd="XREAD"}` metric
+   added in Layer 1 is the signal that tells us when XREAD can
+   graduate to ungated.
 3. **Layer 3 third.** Per-client fairness. Coordinate with the
    resilience roadmap item-6 work so we don't ship two overlapping
    admission-control mechanisms. If item 6 ships first, Layer 3 is