
Commit 104237d

docs(design): address round-3 workload-isolation review
- Layer 1 blocking variants (Codex P1 + Gemini Medium): retract the previous "I/O-bound, bypass the pool" decision. The current adapter busy-polls XREAD BLOCK / B*POP with time.Sleep + KV+leader re-check on each wake, so bypassing would hand them unbounded CPU. v1 keeps blocking variants gated; a condvar-based wake-up is listed as the prereq for an honest bypass in a stacked follow-up.
- Layer 1 Lua ctx caveat (Gemini Medium): runLuaScript currently rebuilds the script context from context.Background(), which drops the ctxKeyInPoolSlot sentinel Submit installed when it dispatched the outer EVAL. Called out explicitly that Layer 1 MUST NOT ship before runLuaScript derives its context from the caller-supplied ctx; otherwise Lua scripts inside the pool self-deadlock on every redis.call.
- Layer 4 Mode B cost (Gemini HIGH): Mode B preserves the O(N_suffix) unmarshal cost during the migration window, not constant O(new). Added a read-driven drain mitigation, a top-N legacy-suffix-size metric for operator visibility, and an explicit note that Mode A is the conservative choice for operators who can't tolerate the transient cost.
- Layer 4 removal criterion (Codex P1): "no legacy reads" alone could keep the fallback alive on cold streams forever. Require a paired elastickv_stream_legacy_format_keys_total derived from a periodic prefix scan; the fallback is safe to remove only when both counters are zero.
1 parent 731475e

1 file changed: docs/design/2026_04_24_proposed_workload_isolation.md (70 additions, 15 deletions)
@@ -94,17 +94,30 @@ command byte" simplicity, and a bounded `ZRANGE 0 10` contributes at
 most one unmarshal per request (cheap). Dynamic (observed-cost)
 classification is a follow-up; v1 bias is a boring, reviewable list.
 
-**Blocking `XREAD BLOCK ms` is a special case** — it may hold a
-worker slot for up to `ms` milliseconds while doing no work, which
-can trivially exhaust the pool if even a handful of long-polling
-consumers are active. v1 resolution: **blocking variants (`XREAD
-BLOCK`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX`) bypass the heavy-command
-pool and are handled on their own goroutine**. They are I/O-bound
-waiting, not CPU-bound; their CPU cost lands on wake-up, when the
-dispatcher can re-evaluate whether to gate the follow-up work. The
-gating decision is made in `dispatchCommand` when it sees the command
-name plus the `BLOCK`/`B*` prefix — the simplest arg inspection we
-allow, limited to "is this command blocking?".
+**Blocking variants are NOT I/O-bound in the current implementation.**
+`XREAD BLOCK ms`, `BLPOP`, `BRPOP`, `BZPOPMIN`/`MAX` look idle from
+the outside, but the adapter (`adapter/redis_compat_commands.go:xread`,
+`:bzpopmin` around line 3432) implements them as a **busy-poll loop**:
+on a miss it calls `time.Sleep(redisBusyPollBackoff)` and re-issues
+the underlying KV+leader lookup. Every wake-up does CPU work and a
+Raft leadership round-trip, then sleeps again. A pool bypass for
+"blocking" variants under this implementation would hand them
+unbounded CPU on the fast path, the opposite of what we want.
+
+**v1 resolution: keep the blocking variants gated** alongside the
+other heavy commands. Reject with `-BUSY` when the pool is full,
+same as XREAD. The behaviour is strictly worse than a true
+condition-variable wake-up (which would be slot-free), but correct
+under the existing busy-poll, and consistent with the rest of the
+heavy-command accounting.
+
+**Stacked follow-up to unblock a real bypass:** replace the
+busy-poll with a condvar/notification hook fed by the write path.
+Only after that lands can blocking variants honestly be called
+I/O-bound; at that point, carve them out of the pool with the
+simplest form of arg inspection (`XREAD …BLOCK…`, `B*POP`) and
+re-evaluate pool sizing. Tracked as a separate item in the stream
+and list/zset adapters; not required by Layer 1 v1.
 
 ### Tradeoffs

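For concreteness, here is a minimal sketch of what the condvar/notification hook in that follow-up could look like. It is channel-based rather than a literal `sync.Cond` because channels compose with `context` cancellation, which the `BLOCK ms` deadline needs. The `keyNotifier` type and everything in it are hypothetical; the adapter has no such hook today.

```go
// Hypothetical sketch: the write path signals parked blocking readers
// instead of each reader re-polling KV+leader on a timer.
package adapter

import (
	"context"
	"sync"
)

type keyNotifier struct {
	mu      sync.Mutex
	waiters map[string][]chan struct{} // key -> parked blocking readers
}

func newKeyNotifier() *keyNotifier {
	return &keyNotifier{waiters: make(map[string][]chan struct{})}
}

// wait parks the caller until the write path signals key, or until ctx
// expires (which covers the BLOCK ms deadline). No CPU is spent parked.
func (n *keyNotifier) wait(ctx context.Context, key string) error {
	ch := make(chan struct{}, 1)
	n.mu.Lock()
	n.waiters[key] = append(n.waiters[key], ch)
	n.mu.Unlock()

	select {
	case <-ch:
		return nil // woken by a write; re-run the read (gated as usual)
	case <-ctx.Done():
		// A production version would also deregister the timed-out
		// waiter here; the buffered channel keeps signal non-blocking.
		return ctx.Err()
	}
}

// signal is called by the write path after a successful XADD/LPUSH/etc.
func (n *keyNotifier) signal(key string) {
	n.mu.Lock()
	chans := n.waiters[key]
	delete(n.waiters, key)
	n.mu.Unlock()
	for _, ch := range chans {
		ch <- struct{}{} // buffered(1): never blocks the write path
	}
}
```

Under this shape, a blocking read becomes wait, re-check, return, holding no pool slot while parked; that is what would make the eventual bypass honest.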
@@ -144,6 +157,22 @@ without tagging every goroutine or holding a pool-wide set of
 goroutine IDs. The sentinel must be package-private so external
 callers cannot fake it.
 
+**Caveat — `runLuaScript` currently clobbers the parent ctx.**
+`adapter/redis_lua.go:runLuaScript` builds its per-script context as
+`context.WithTimeout(context.Background(), ...)`, which throws away
+the `ctxKeyInPoolSlot` sentinel that `Submit` attached when it
+dispatched the outer `EVAL`/`EVALSHA`. Option (A) is therefore not
+implementable as-is — the inner `redis.call` would see a plain
+background context and try to acquire another pool slot, triggering
+the exact deadlock we were trying to avoid.
+
+The implementation PR MUST fix this before enabling the pool gate
+on Lua. The fix is to replace `context.Background()` with the
+caller-supplied `ctx` (a timeout derived from it, not from
+`Background`) so the sentinel propagates. This is a one-line change
+but a blocker for Layer 1 v1; Layer 1 must not ship without it, or
+Lua inside the pool will self-deadlock under steady load.
 
 ### Recommended v1 shape
 
 Package-level pool in `adapter/` with a `Submit(command, fn)` entry
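The shape of that fix, sketched below. Only the `context.WithTimeout` line is the point; the signature and the `scriptTimeout` constant are placeholders, not the adapter's actual code.

```go
// Hypothetical sketch of the runLuaScript context fix.
package adapter

import (
	"context"
	"time"
)

const scriptTimeout = 5 * time.Second // illustrative; the real value is config-driven

func runLuaScript(ctx context.Context, script string) error {
	// Before (broken): building the script context from
	// context.Background() drops the ctxKeyInPoolSlot sentinel that
	// Submit installed on ctx, so the first inner redis.call tries to
	// take a second pool slot and deadlocks:
	//
	//   scriptCtx, cancel := context.WithTimeout(context.Background(), scriptTimeout)

	// After: derive the timeout from the caller-supplied ctx so the
	// sentinel (and the caller's cancellation) propagate into every
	// redis.call the script issues.
	scriptCtx, cancel := context.WithTimeout(ctx, scriptTimeout)
	defer cancel()

	_ = scriptCtx // ... evaluate script against scriptCtx ...
	return nil
}
```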
@@ -417,10 +446,36 @@ during chunked migration — entries still in the legacy suffix would
 be invisible to readers until the suffix was fully drained. Mode B
 must ship together with the "always merge" read rule.
 
-`elastickv_stream_legacy_format_reads_total` counts reads that
-touched a legacy-format key in either mode. Remove the legacy
-fallback only after it has sat at zero across all nodes for a soak
-window.
+**Mode B cost model.** Decoding the legacy-suffix blob on every read
+is an O(N_suffix) protobuf unmarshal — the exact cost that Layer 4 was
+introduced to eliminate. A partially migrated stream therefore
+still has the pre-fix hot path, just bounded by the suffix size
+rather than the full stream size. Two mitigations:
+
+- The migrator uses a **read-driven drain**: when a read observes a
+  suffix blob, it enqueues a low-priority rewrite-N-entries job, so
+  hot streams drain first. Cold streams drain on their next write.
+- Exporting `elastickv_stream_legacy_suffix_entries{stream}` as a
+  top-N sketch lets operators see which streams still carry a
+  suffix and size the `STREAM_MIGRATION_CHUNK` accordingly.
+
+Neither fully reclaims Layer 4's O(new) guarantee during the
+migration window; operators who cannot tolerate the transient cost
+must stay on Mode A and accept the single-txn cost of the initial
+migration instead.
+
+**Legacy-fallback removal criterion.** Just watching
+`elastickv_stream_legacy_format_reads_total == 0` is insufficient —
+a cold legacy-format stream that is neither read nor written for
+the soak window would keep the counter at zero while still needing
+the fallback. Add a paired counter,
+`elastickv_stream_legacy_format_keys_total`, populated from a
+periodic prefix scan (`!redis|stream|<...>` with no matching
+`!stream|meta|<...>`). The fallback is safe to remove only when
+**both** counters are zero across every node. The periodic scan
+runs at the same cadence as snapshot cleanup; its cost is bounded
+by the number of legacy keys remaining, which decays as migration
+progresses.
 
 The existing stream PR (#620) ships **Mode A only**. Chunked
 migration (Mode B) is explicitly deferred and must not be enabled
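A sketch of the removal gate those two counters imply. The sampling types and the function are hypothetical; only the both-zero-on-every-node rule comes from the criterion above.

```go
// Hypothetical sketch of the paired-counter removal gate. Names and the
// sampling plumbing are illustrative, not existing code.
package adapter

// legacyFormatCounters is one node's sample for a scrape interval.
type legacyFormatCounters struct {
	ReadsDelta int64 // increase of elastickv_stream_legacy_format_reads_total
	KeysTotal  int64 // elastickv_stream_legacy_format_keys_total (periodic prefix scan)
}

// legacyFallbackRemovable applies the rule from the doc: the fallback
// may be removed only when BOTH counters sit at zero on EVERY node for
// the whole soak window. A cold legacy stream keeps ReadsDelta at zero
// but is still caught by KeysTotal from the prefix scan.
func legacyFallbackRemovable(soakSamples [][]legacyFormatCounters) bool {
	if len(soakSamples) == 0 {
		return false // no soak data yet: keep the fallback
	}
	for _, nodes := range soakSamples { // one per-node slice per interval
		for _, c := range nodes {
			if c.ReadsDelta != 0 || c.KeysTotal != 0 {
				return false
			}
		}
	}
	return true
}
```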
