Commit 731475e
docs(design): address round-2 workload-isolation review
- Layer 1 classification (line 87, Gemini HIGH): carve out blocking
  XREAD/BLPOP/BRPOP/BZPOP from pool gating. Blocking variants hold a
  slot while idle, trivially exhaust the pool; handle them on their
  own goroutine and re-gate on wake-up.
- Layer 1 Lua recursion (line 117, Gemini Medium): document the
  context-propagation mechanism that makes option A implementable. A
  package-private sentinel value on ctx distinguishes inner call from
  new request; external callers cannot fake it.
- Layer 1 GOMAXPROCS (line 134, Codex P2): correct stale guidance. Go
  1.25+ already derives GOMAXPROCS from cgroup v2 quota, so the
  automaxprocs recommendation no longer applies to this repo; keep the
  override env knob for operators who want explicit control.
- Layer 3 reject semantics (line 283, Gemini Medium): add a
  reject-storm mitigation. Rate-limit the reject itself — after R
  rejects/s to one peer, switch from accept+write+close to RST;
  recommend client-side backoff in the ops runbook.
- Layer 4 migration (line 345, Gemini HIGH + Codex P1): split the
  migration doc into Mode A (simple, PR #620 ships this) and Mode B
  (chunked, stacked follow-up). The read rules differ: Mode A has no
  mixed state so fall-through is correct; Mode B has a legal mixed
  state and MUST always merge both layouts. Making the distinction
  explicit prevents the dual-read correctness bug the reviewers
  flagged — Mode B with Mode A's fall-through rule returns incomplete
  results during chunking.
- Sequencing (line 431, Gemini Medium): resolve the XREAD
  gated-vs-cheap contradiction. Layer 4 makes XREAD steady-state
  O(new) but we keep it gated in Layer 1 v1 for three concrete
  reasons (large XRANGE bounds, legacy-fallback window, data-driven
  promotion); promotion to ungated is gated on the pool-submit metric
  added in Layer 1.

1 file changed: docs/design/2026_04_24_proposed_workload_isolation.md
107 additions & 39 deletions
@@ -94,6 +94,18 @@ command byte" simplicity, and a bounded `ZRANGE 0 10` contributes at
 most one unmarshal per request (cheap). Dynamic (observed-cost)
 classification is a follow-up; v1 bias is a boring, reviewable list.
 
+**Blocking `XREAD BLOCK ms` is a special case** — it may hold a
+worker slot for up to `ms` milliseconds while doing no work, which
+can trivially exhaust the pool if even a handful of long-polling
+consumers are active. v1 resolution: **blocking variants (`XREAD
+BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
+pool and are handled on their own goroutine**. They are I/O-bound
+waiting, not CPU-bound; their CPU cost lands on wake-up, when the
+dispatcher can re-evaluate whether to gate the follow-up work. The
+gating decision is made in `dispatchCommand` when it sees the command
+name plus the `BLOCK`/`B*` prefix — the simplest arg inspection we
+allow, limited to "is this command blocking?".
+
 ### Tradeoffs
 
 - Adds an enqueue → pickup hop for gated commands. Pool-has-capacity
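
A minimal Go sketch of the carve-out above, to pin down the dispatch order (the blocking check runs before the pool gate). Only `dispatchCommand`, `Submit`, and the existence of a static list are fixed by the design; `server`, `pool`, `handle`, `handleBlocking`, `isBlockingVariant`, and the list contents here are hypothetical scaffolding.

```go
package adapter

import (
	"context"
	"strings"
)

// Hypothetical scaffolding; only dispatchCommand, Submit, and the
// static heavy-command list are named by the design.
type pool struct{}

func (p *pool) Submit(cmd string, fn func()) { fn() } // placeholder

type server struct{ pool *pool }

func (s *server) handle(ctx context.Context, cmd string, args [][]byte)         {}
func (s *server) handleBlocking(ctx context.Context, cmd string, args [][]byte) {}

// Illustrative subset of the static list (cmd assumed upper-cased).
var heavyCommands = map[string]bool{"EVAL": true, "KEYS": true, "HGETALL": true, "XREAD": true}

// isBlockingVariant answers the one arg inspection the design allows:
// "is this command blocking?"
func isBlockingVariant(cmd string, args [][]byte) bool {
	switch cmd {
	case "BLPOP", "BRPOP", "BZPOPMIN", "BZPOPMAX":
		return true
	case "XREAD":
		for _, a := range args {
			if strings.EqualFold(string(a), "BLOCK") {
				return true // XREAD ... BLOCK ms ...
			}
		}
	}
	return false
}

func (s *server) dispatchCommand(ctx context.Context, cmd string, args [][]byte) {
	switch {
	case isBlockingVariant(cmd, args):
		// Bypass the pool: park on a dedicated goroutine; the handler
		// re-enters dispatch on wake-up so follow-up work can be re-gated.
		go s.handleBlocking(ctx, cmd, args)
	case heavyCommands[cmd]:
		s.pool.Submit(cmd, func() { s.handle(ctx, cmd, args) })
	default:
		s.handle(ctx, cmd, args) // cheap commands stay synchronous
	}
}
```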
@@ -121,6 +133,17 @@ outer Lua waiting for an inner call that can never start. Two options:
 (A) preserves "one client request = one slot." Ship must pick one
 explicitly; do not discover this at test time.
 
+**Implementation note for (A): context propagation.** `Submit`
+identifies "inside a pool slot" by attaching a sentinel value to
+`context.Context` (`ctxKeyInPoolSlot`). The Lua adapter threads that
+`ctx` into every `redis.call` it makes; the dispatcher's pool-gate
+check returns immediately when `ctx.Value(ctxKeyInPoolSlot) != nil`
+instead of attempting another `Submit`. This is the only mechanism
+that reliably distinguishes "new client request" from "inner call"
+without tagging every goroutine or holding a pool-wide set of
+goroutine IDs. The sentinel must be package-private so external
+callers cannot fake it.
+
 ### Recommended v1 shape
 
 Package-level pool in `adapter/` with a `Submit(command, fn)` entry
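
A sketch of the sentinel mechanism, assuming a channel-backed pool. `Pool`, `workCh`, `ErrPoolFull`, and the `ctx`-taking `Submit` signature are assumptions (the doc writes `Submit(command, fn)`; threading the context through is exactly what the note above adds):

```go
package adapter

import (
	"context"
	"errors"
)

// ctxKeyInPoolSlot is package-private: code outside adapter/ cannot
// reference it, so it cannot forge the "already in a slot" sentinel.
var ctxKeyInPoolSlot = &struct{}{}

var ErrPoolFull = errors.New("heavy-command pool full") // hypothetical name

// Pool is a hypothetical channel-backed worker pool.
type Pool struct{ workCh chan func() }

// Submit gates fn on the pool unless ctx already carries the sentinel,
// i.e. fn is an inner redis.call made from Lua while a slot is held.
// Re-submitting the inner call would deadlock at pool size 1: the
// outer script waits on an inner call that can never get a slot.
func (p *Pool) Submit(ctx context.Context, cmd string, fn func(context.Context)) error {
	if ctx.Value(ctxKeyInPoolSlot) != nil {
		fn(ctx) // inner call: run inline in the slot we already hold
		return nil
	}
	inSlot := context.WithValue(ctx, ctxKeyInPoolSlot, true)
	select {
	case p.workCh <- func() { fn(inSlot) }:
		return nil
	default:
		return ErrPoolFull // dispatcher maps this to -BUSY server overloaded
	}
}
```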
@@ -129,15 +152,18 @@ commands in `dispatchCommand` call `Submit`; ungated stay
 synchronous. Static list lives next to `dispatchCommand`. Pool-full →
 `-BUSY server overloaded`. Lua follows option (A).
 
-**Container-aware sizing.** `runtime.GOMAXPROCS(0)` on Linux returns
-the host CPU count, not the container's cgroup quota (unless
-`GOMAXPROCS` is set explicitly or something has configured it).
-Operators running under Kubernetes/Docker with a CPU limit should
-either (a) set `GOMAXPROCS` in the deploy environment to match the
-cgroup quota (the project's rolling-update script is the natural
-place), or (b) wire `go.uber.org/automaxprocs` into `main.go` so the
-correction happens at startup. v1 prefers (a) for auditability;
-(b) is acceptable as a follow-up if operators want it automatic.
+**Container-aware sizing.** Go 1.25+ (which this repo uses) derives
+the default `GOMAXPROCS` from the cgroup v2 CPU quota on Linux
+automatically, so in most cases `runtime.GOMAXPROCS(0)` already
+reflects the container's share. Two caveats remain: (a) Go runtimes
+older than 1.25 do not, and (b) explicitly setting `GOMAXPROCS`
+disables the runtime's periodic quota-change detection, so an
+operator who hard-codes the value in the deploy environment loses
+auto-updates if the quota changes at runtime. v1 leaves the runtime
+default in place and documents the two caveats; a `GOMAXPROCS` env
+override is still honoured for operators who want explicit control.
+`go.uber.org/automaxprocs` remains an option for pre-1.25 toolchains
+but is not needed for this repo.
 
 **Single pool vs per-class sub-pools.** v1 uses a single global pool.
 The risk: a burst of `KEYS *` or `SCAN` from a management client can
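
A startup-time sketch of the documented behaviour, assuming only the standard `GOMAXPROCS` env override that the Go runtime itself honours; the log wording is illustrative:

```go
package main

import (
	"log"
	"os"
	"runtime"
)

func main() {
	// On Go 1.25+ the default GOMAXPROCS already tracks the cgroup v2
	// quota; an explicit env override pins it and disables auto-updates.
	if v, ok := os.LookupEnv("GOMAXPROCS"); ok {
		log.Printf("GOMAXPROCS pinned to %s by env; quota auto-updates disabled", v)
	}
	log.Printf("effective GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
}
```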
@@ -285,6 +311,18 @@ network failure. Per-client in-flight semaphore is deferred: it
 requires threading client identity through every dispatch, which is
 a bigger change than 2026-04-24 justifies.
 
+**Avoiding a reject-storm feedback loop.** A client with an
+aggressive reconnect pool can answer each `-ERR max connections`
+with an immediate new `connect()` — the server spends CPU on the
+accept/write/close cycle and the client makes no progress. Two
+mitigations: (a) **rate-limit the reject itself**: once a peer IP
+has been rejected `R` times in the last second, the next rejects
+are answered with `RST` (cheap kernel-level reset) instead of an
+accept + write + close; (b) document operator-side client
+configuration (e.g., for redis-rb: `reconnect_attempts=3` plus an
+exponential backoff). (a) ships in v1 behind a compile-time
+constant; (b) belongs in the ops runbook.
+
 ### Where in the code
 
 - `adapter/redis.go:631` — `Run`, where `redcon.Serve` is called.
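
A sketch of mitigation (a). One caveat: by the time Go code holds a `net.Conn` the kernel has already completed the accept, so the cheapest pure-Go approximation is `SetLinger(0)` + `Close`, which skips the error write and turns the close into an RST; suppressing the accept itself would need kernel help. `rejectLimiter` and the constant name are hypothetical.

```go
package adapter

import (
	"net"
	"sync"
	"time"
)

// rejectsPerPeerPerSecond is the compile-time constant R from the
// design; the default here is a placeholder.
const rejectsPerPeerPerSecond = 5

// rejectLimiter counts rejects per peer IP in the current one-second
// window.
type rejectLimiter struct {
	mu     sync.Mutex
	window time.Time
	counts map[string]int
}

// rejectConn answers the first R rejects per second per peer with the
// normal error payload; past R it resets the connection instead, so a
// reconnect storm costs an RST rather than a write plus a clean close.
func (l *rejectLimiter) rejectConn(c net.Conn) {
	ip, _, _ := net.SplitHostPort(c.RemoteAddr().String())

	l.mu.Lock()
	if now := time.Now(); now.Sub(l.window) >= time.Second {
		l.window, l.counts = now, map[string]int{}
	}
	l.counts[ip]++
	storming := l.counts[ip] > rejectsPerPeerPerSecond
	l.mu.Unlock()

	if storming {
		if tc, ok := c.(*net.TCPConn); ok {
			tc.SetLinger(0) // the Close below now sends RST, not FIN
		}
		c.Close()
		return
	}
	c.Write([]byte("-ERR max connections\r\n"))
	c.Close()
}
```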
@@ -338,34 +376,55 @@ new entries. O(new), matching the XREAD spec.
 ### Migration path
 
 Streams persist across restarts and can be large, so no flag-day
-rewrite. Dual-format read:
-
-1. On XREAD/XRANGE/XLEN/XREVRANGE, try the new per-entry layout first.
-2. If empty AND the legacy single-blob key exists, fall back to the
-   legacy path.
-3. On the next write (`XADD`/`XDEL`/...), rewrite to per-entry and
-   delete the legacy blob in the same commit.
-
-Remove the legacy fallback later once
-`elastickv_stream_legacy_format_reads_total` has sat at zero across
-all nodes for a soak window.
-
-**Chunked migration for large legacy streams.** A single `XADD` on a
-100k-entry legacy stream would rewrite every entry in one Raft
-transaction — regressing into the same CPU and commit-time spike the
-design is supposed to prevent. The migration write therefore caps how
-many entries it rewrites per transaction (`STREAM_MIGRATION_CHUNK`,
-default 1 024). When the legacy stream exceeds the chunk, the first
-write migrates the oldest `CHUNK` entries and the remaining legacy
-tail stays in a legacy-*suffix* key (the blob minus the entries
-already promoted). Subsequent writes drain another chunk each, until
-the legacy tail is empty and can be deleted. A background "migrator"
-goroutine driven by the same `legacy_format_reads_total` metric is a
-follow-up if operator-driven migration proves too slow.
-
-The existing stream PR (#620) ships the simpler "rewrite all in one
-txn" version; chunked migration is a stacked follow-up once we see
-legacy stream sizes in production.
+rewrite.
+
+**Two migration modes** — simple (PR #620, v1 stream PR) and chunked
+(stacked follow-up). The dual-read rule differs between them; the
+distinction matters for correctness.
+
+**Mode A — simple migration (PR #620 ships this):** the first write
+rewrites the entire legacy blob and deletes it in one Raft commit.
+At any given instant a stream is either entirely legacy or entirely
+per-entry; there is no mixed state. Read rule:
+
+1. On XREAD/XRANGE/XLEN/XREVRANGE, read the per-entry layout.
+2. If the per-entry meta key is absent AND the legacy blob key
+   exists, fall back to the legacy path.
+3. On the next write, rewrite to per-entry and delete the legacy blob
+   in the same commit.
+
+**Mode B — chunked migration (follow-up):** each write drains at
+most `STREAM_MIGRATION_CHUNK` (default 1 024) entries from the legacy
+blob into per-entry keys, and leaves the rest in a *legacy-suffix*
+key until a subsequent write drains more. During this window the
+stream exists in BOTH layouts simultaneously: the oldest N entries
+are per-entry, the newer M entries are still in the suffix blob.
+
+Read rule for Mode B — **always merge both layouts**, do not
+fall-through on "new layout empty":
+
+1. Read `meta` if present; read all per-entry keys that match the
+   requested ID range.
+2. Read the legacy-suffix blob if present; decode only entries
+   falling in the ID range.
+3. Merge by ID order, deduplicate (the migrator is responsible for
+   never writing the same ID in both layouts in a single commit), and
+   return.
+
+The v1 dual-read (Mode A) is safe because there is no mixed state.
+Extending it verbatim to Mode B would return incomplete results
+during chunked migration — entries still in the legacy suffix would
+be invisible to readers until the suffix was fully drained. Mode B
+must ship together with the "always merge" read rule.
+
+`elastickv_stream_legacy_format_reads_total` counts reads that
+touched a legacy-format key in either mode. Remove the legacy
+fallback only after it has sat at zero across all nodes for a soak
+window.
+
+The existing stream PR (#620) ships **Mode A only**. Chunked
+migration (Mode B) is explicitly deferred and must not be enabled
+before the merged-read rule lands alongside it.
 
 ### Other one-blob-per-key collections
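
To make the Mode B read rule above concrete, a Go sketch of the merged read. `streamID`, `streamEntry`, and the two range loaders are hypothetical; the design fixes only the rule (read both layouts, merge by ID order, deduplicate), not this API:

```go
package adapter

import "sort"

// Hypothetical types; the design fixes the merge rule, not the API.
type streamID struct{ Ms, Seq uint64 }

func (a streamID) less(b streamID) bool {
	return a.Ms < b.Ms || (a.Ms == b.Ms && a.Seq < b.Seq)
}

type streamEntry struct {
	ID     streamID
	Fields map[string]string
}

// Hypothetical loaders, each returning entries already filtered to
// [from, to] in its own layout.
func perEntryInRange(key string, from, to streamID) []streamEntry     { return nil }
func legacySuffixInRange(key string, from, to streamID) []streamEntry { return nil }

// readStreamRange is the Mode B rule: ALWAYS merge both layouts.
// Falling through on "per-entry empty" (the Mode A rule) would hide
// entries still sitting in the legacy suffix.
func readStreamRange(key string, from, to streamID) []streamEntry {
	merged := append(perEntryInRange(key, from, to), legacySuffixInRange(key, from, to)...)
	sort.Slice(merged, func(i, j int) bool { return merged[i].ID.less(merged[j].ID) })

	// Deduplicate by ID; the migrator never commits the same ID in
	// both layouts at once, so this is belt and braces.
	out := merged[:0]
	for _, e := range merged {
		if len(out) == 0 || out[len(out)-1].ID != e.ID {
			out = append(out, e)
		}
	}
	return out
}
```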

@@ -427,8 +486,17 @@ Recommended order of implementation:
    workload.
 2. **Layer 1 second.** Generic defense for the next unknown
    hotspot. Static command list is small, reviewable, and composes
-   with Layer 4 — streams become a cheap command once Layer 4 ships,
-   but the pool still catches Lua, KEYS, and HGETALL.
+   with Layer 4. **Once Layer 4 ships, XREAD's per-call cost is
+   O(new) so in steady state it is cheap**, but we deliberately
+   keep it gated in Layer 1 v1 for three reasons: (i) a client can
+   still request a huge ID range via XRANGE / a massive `COUNT` on
+   XREAD that the adapter must scan; (ii) the legacy fallback path
+   is still reachable during the migration soak window and that
+   path is still O(n); (iii) revisiting the classification after
+   Layer 4 + Layer 6 metric is a reviewable data-driven decision,
+   not a v1 speculation. The `elastickv_heavy_command_pool_submit_total{cmd="XREAD"}`
+   metric added in Layer 1 is the signal that tells us when XREAD
+   can graduate to ungated.
 3. **Layer 3 third.** Per-client fairness. Coordinate with the
    resilience roadmap item-6 work so we don't ship two overlapping
    admission-control mechanisms. If item 6 ships first, Layer 3 is
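
For reference, the graduation signal as a `client_golang` registration sketch; the metric name is the one Layer 1 specifies, while the variable name and `promauto` wiring are assumptions:

```go
package adapter

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Incremented in Submit before enqueueing, e.g.
// heavyCommandPoolSubmits.WithLabelValues(cmd).Inc(). A sustained
// near-zero rate for cmd="XREAD" after Layer 4 ships is the signal
// that XREAD can graduate out of the gated list.
var heavyCommandPoolSubmits = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "elastickv_heavy_command_pool_submit_total",
		Help: "Heavy commands submitted to the Layer 1 pool, by command.",
	},
	[]string{"cmd"},
)
```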
