
Commit 2ae4056

docs(design): round-4 workload-isolation review fixes
- Gemini HIGH (line 89 XADD starvation): explicitly flag that the
  Mode A first-XADD migration rewrite is O(N) and therefore the
  ungated classification is a latent incident if Layer 4 Mode A lands
  before Layer 1. Add two ordering options: ship together, or
  dynamically gate XADD while the legacy-keys counter is non-zero.
- Gemini Medium (line 438 protobuf partial decode): correct the Mode B
  read rule. Protobuf cannot decode a repeated field partially, so the
  suffix blob is fully unmarshaled and filtered in memory. This
  matches (not contradicts) the Mode B cost model.
- Gemini HIGH (line 475 scan cost) + Codex P2 (line 474 key pattern):
  the prefix-scan approach to counting legacy keys would iterate every
  entry of every migrated stream, because entries share the
  !redis|stream| prefix; the prior wording also compared
  !redis|stream| keys to a phantom !stream|meta| key. Replace both
  with either a sidecar legacy index or a per-stream layout walk
  (O(num_streams), not O(num_entries)). Retract the wrong wording.
- Gemini Medium (line 350 RST overhead): pure-Go net cannot emit a
  true kernel RST without Accept(); clarify that mitigation (a)'s
  "cheap" path means skipping the RESP write plus a SetLinger(0)
  close, not skipping the accept. A real RST-before-accept is v2 via
  a raw listener or eBPF.
- Codex P1 (line 597 removal criterion contradiction): open question 5
  referenced only reads_total == 0, contradicting the Layer 4 rule
  that requires BOTH reads_total and keys_total. Tie them together.
1 parent 104237d commit 2ae4056

1 file changed

Lines changed: 84 additions & 19 deletions


docs/design/2026_04_24_proposed_workload_isolation.md

@@ -88,6 +88,29 @@ so the dispatcher can gate on the command byte without allocating):
 - **Ungated:** `GET`, `SET`, `DEL`, `EXISTS`, `INCR`, `EXPIRE`, `TTL`,
   `HGET`, `HSET`, `LPUSH`/`RPUSH`, `XADD`, single-key fast paths.
 
+  **XADD during the Mode A migration window is a latent starvation
+  risk.** Mode A rewrites the entire legacy blob in the first XADD that
+  touches a migrated stream (see Layer 4, line 414). On a 100k-entry
+  stream that single "ungated" XADD does O(N) unmarshal + re-marshal +
+  per-entry Put, which is exactly the CPU profile Layer 1 is trying to
+  bound. v1 mitigations, ordered by preference:
+
+  1. Ship Layer 4 Mode A and Layer 1 together. The very first migration
+     XADD is expensive, but it happens once per stream; subsequent XADDs
+     are O(1) and genuinely ungated.
+  2. If Mode A lands ahead of Layer 1, XADD is promoted to gated **only
+     while the periodic scan still reports legacy keys present** (see
+     Layer 4 removal criterion). Once the legacy-keys-total counter
+     reaches zero cluster-wide, XADD demotes back to ungated. This is
+     dynamic classification for exactly one command for exactly the
+     migration window; worth the complexity because the alternative is
+     a repeat of the 2026-04-24 incident triggered by a single large
+     stream's first write.
+
+  Either ordering is acceptable; (1) is simpler and preferred. The
+  doc does not pick an ordering — the implementer of whichever PR lands
+  second owns the promotion/demotion logic.
+
 The entire `ZRANGE` family is gated, not only "full-range" variants —
 arg inspection (e.g., detecting `LIMIT 0 N`) breaks the "classify by
 command byte" simplicity, and a bounded `ZRANGE 0 10` contributes at
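Option (2)'s promotion/demotion can be sketched as a dynamic classifier. This is a minimal illustration, not the project's code: the `legacyKeys` mirror of the cluster-wide counter, the `IsGated` name, and the command set shown are all hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// legacyKeys mirrors the cluster-wide
// elastickv_stream_legacy_format_keys_total gauge; the periodic scan
// stores into it. (The variable and its wiring are hypothetical.)
var legacyKeys atomic.Int64

// IsGated classifies a command name. XADD is the only command whose
// classification is dynamic: it stays gated only while legacy-format
// blobs remain, so the O(N) first-write migration rewrite runs under
// the Layer 1 concurrency bound, then demotes back to ungated.
func IsGated(cmd string) bool {
	switch cmd {
	case "ZRANGE", "ZRANGEBYSCORE": // examples of the statically gated set
		return true
	case "XADD":
		return legacyKeys.Load() > 0 // dynamic: migration window only
	default:
		return false // GET, SET, DEL, ... single-key fast paths
	}
}

func main() {
	legacyKeys.Store(3)
	fmt.Println(IsGated("XADD")) // true while legacy blobs remain
	legacyKeys.Store(0)
	fmt.Println(IsGated("XADD")) // false once the counter hits zero
}
```

The classifier stays allocation-free and branch-cheap, which keeps the "classify by command byte" dispatcher property intact for every command except XADD during the window.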
@@ -345,12 +368,17 @@ aggressive reconnect pool can answer each `-ERR max connections`
 with an immediate new `connect()` — the server spends CPU on the
 accept/write/close cycle and the client makes no progress. Two
 mitigations: (a) **rate-limit the reject itself**: once a peer IP
-has been rejected `R` times in the last second, the next rejects
-are answered with `RST` (cheap kernel-level reset) instead of an
-accept + write + close; (b) document operator-side client
-configuration (e.g., for redis-rb: `reconnect_attempts=3` plus an
-exponential backoff). (a) ships in v1 behind a compile-time
-constant; (b) belongs in the ops runbook.
+has been rejected `R` times in the last second, *skip the RESP
+write* for subsequent rejects and close the fd with `SetLinger(0)`.
+Pure Go `net` cannot emit a true kernel-level `RST` without
+`Accept()`; the connection is already accepted by the time we know
+to reject it, so "cheap" here means "fewer syscalls per reject
+(accept + close)," not "no accept." A true `RST`-before-accept
+requires dropping to a raw listener (`syscall.Accept4` + direct
+`SO_LINGER` setup or an eBPF filter), deferred to v2. (b) document
+operator-side client configuration (e.g., for redis-rb:
+`reconnect_attempts=3` plus an exponential backoff). (a) ships in
+v1 behind a compile-time constant; (b) belongs in the ops runbook.
 
 ### Where in the code
 
@@ -434,12 +462,20 @@ fall-through on "new layout empty":
 
 1. Read `meta` if present; read all per-entry keys that match the
    requested ID range.
-2. Read the legacy-suffix blob if present; decode only entries
-   falling in the ID range.
+2. Read the legacy-suffix blob if present; **protobuf cannot decode a
+   repeated field partially**, so the blob is fully unmarshaled and
+   then filtered to the requested ID range in memory. There is no
+   cheap "decode only the range" path without custom wire-format
+   parsing, which is out of scope for v1.
 3. Merge by ID order, deduplicate (the migrator is responsible for
    never writing the same ID in both layouts in a single commit), and
    return.
 
+Because step 2 is a full unmarshal, the suffix-blob cost is O(N_suffix)
+per read regardless of how narrow the requested ID range is. This
+matches the Mode B cost model below and is the reason Mode B ships
+together with the read-driven drain.
+
 The v1 dual-read (Mode A) is safe because there is no mixed state.
 Extending it verbatim to Mode B would return incomplete results
 during chunked migration — entries still in the legacy suffix would
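The unmarshal-then-filter rule in step 2 can be illustrated with a plain struct standing in for the generated protobuf message (all names here are hypothetical); in the real code the slice would come from a single `proto.Unmarshal` of the suffix blob, since the wire format offers no partial decode of a repeated field:

```go
package main

import "fmt"

// Entry stands in for the generated protobuf entry message; the ID is
// simplified to a single integer for the sketch.
type Entry struct {
	ID uint64
}

// filterRange is the in-memory step after the full unmarshal. The
// O(N_suffix) cost has already been paid by the decode, so this adds
// only one linear pass over the decoded entries.
func filterRange(entries []Entry, lo, hi uint64) []Entry {
	out := entries[:0:0] // fresh backing array, no aliasing of the decoded slice
	for _, e := range entries {
		if e.ID >= lo && e.ID <= hi {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	decoded := []Entry{{1}, {5}, {9}, {12}} // pretend proto.Unmarshal output
	fmt.Println(filterRange(decoded, 4, 10))
}
```

However narrow `[lo, hi]` is, `decoded` always holds every suffix entry, which is precisely why the cost model charges O(N_suffix) per read.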
@@ -469,13 +505,40 @@ migration instead.
 a cold legacy-format stream that is neither read nor written for
 the soak window would keep the counter at zero while still needing
 the fallback. Add a paired counter
-`elastickv_stream_legacy_format_keys_total` populated from a
-periodic prefix scan (`!redis|stream|<...>` with no matching
-`!stream|meta|<...>`). The fallback is safe to remove only when
-**both** counters are zero across every node. The periodic scan
-runs at the same cadence as snapshot cleanup; its cost is bounded
-by the number of legacy keys remaining, which decays as migration
-progresses.
+`elastickv_stream_legacy_format_keys_total`.
+
+A naive implementation would scan the `!redis|stream|<...>` prefix,
+but **that prefix is shared by every per-entry key**
+(`!redis|stream|<key>|entry|<id>`), so a scan over it is
+O(total_entries_in_cluster), not O(legacy_blobs). In a deployment
+with many large migrated streams this is the exact cost profile
+Layer 4 was introduced to eliminate.
+
+Two implementable options; pick whichever is cheaper in the target
+backend:
+
+1. **Bloom-filter / sidecar index.** On every write that creates a
+   legacy blob, record the logical stream name in a dedicated
+   `!redis|stream_legacy_index|<key>` tombstone-style marker. The
+   migration write that rewrites the stream deletes that marker in
+   the same commit. The counter becomes `SCAN !redis|stream_legacy_index|`,
+   bounded by the number of legacy blobs, not total entries.
+2. **Layout walk.** Iterate `!redis|stream|<key>|meta` keys and, for
+   each stream, probe `!redis|stream|<key>` (the legacy blob key has
+   no suffix). Scan cost is O(num_streams), not O(num_entries).
+   Equivalent answer; avoids the sidecar index.
+
+Both pass the key-pattern sanity check: legacy keys live at
+`!redis|stream|<logical>` with no further suffix, per-entry keys at
+`!redis|stream|<logical>|entry|<id>`, meta at
+`!redis|stream|<logical>|meta`. The "prefix scan on `!redis|stream|`"
+wording in earlier drafts was wrong and has been retracted; the
+layout above is authoritative.
+
+The fallback is safe to remove only when **both** counters are zero
+across every node. The index/walk runs at the same cadence as
+snapshot cleanup; its cost is bounded by the chosen option as
+described.
 
 The existing stream PR (#620) ships **Mode A only**. Chunked
 migration (Mode B) is explicitly deferred and must not be enabled
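Option 2 (the layout walk) can be sketched against a toy in-memory map standing in for the ordered-KV backend. The `kv` type and `countLegacyBlobs` name are hypothetical; a real backend would range-scan the meta keys rather than iterate every key, which is what keeps the probe count at O(num_streams):

```go
package main

import (
	"fmt"
	"strings"
)

// kv is a toy stand-in for the ordered-KV backend; only lookup and
// iteration are needed here. The key layout mirrors the doc's scheme.
type kv map[string]string

// countLegacyBlobs walks the per-stream meta keys and probes the bare
// legacy-blob key for each stream. Only one probe per stream is made,
// never one per entry. (With a map we still touch every key to find
// the meta keys; an ordered backend would range-scan them directly.)
func countLegacyBlobs(store kv) int {
	const prefix = "!redis|stream|"
	n := 0
	for k := range store {
		if !strings.HasPrefix(k, prefix) || !strings.HasSuffix(k, "|meta") {
			continue
		}
		logical := strings.TrimSuffix(strings.TrimPrefix(k, prefix), "|meta")
		if _, ok := store[prefix+logical]; ok { // bare key = legacy blob
			n++
		}
	}
	return n
}

func main() {
	store := kv{
		"!redis|stream|orders|meta":      "m",
		"!redis|stream|orders|entry|1-0": "e",
		"!redis|stream|orders":           "legacy blob still present",
		"!redis|stream|clicks|meta":      "m",
		"!redis|stream|clicks|entry|7-0": "e", // fully migrated, no bare key
	}
	fmt.Println(countLegacyBlobs(store)) // 1
}
```

The sanity check from the text is visible in the keys: the legacy blob is the bare `!redis|stream|<logical>` key, so its presence next to a `|meta` key is exactly the "not yet migrated" signal.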
@@ -592,10 +655,12 @@ Recommended order of implementation:
    (mTLS or PROXY protocol identity) is unsurprising.
 
 5. **Layer 4 — migration window.** When can the dual-read
-   compatibility code go away? Proposed:
-   `elastickv_stream_legacy_format_reads_total` = 0 for 30 days
-   across all nodes → remove in a follow-up PR. 30 days is arbitrary;
-   revisit.
+   compatibility code go away? Proposed: **both**
+   `elastickv_stream_legacy_format_reads_total` = 0 **and**
+   `elastickv_stream_legacy_format_keys_total` = 0 for 30 days across
+   all nodes → remove in a follow-up PR. `reads_total` alone would
+   miss cold streams (see Layer 4 removal criterion); the paired
+   counter closes that gap. 30 days is arbitrary; revisit.
 
 6. **Interaction with memwatch (PR #612).** memwatch fires graceful
    shutdown on hard-threshold crossing. Admission (Layer 3 / roadmap
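The tightened removal criterion reduces to a predicate over per-node counter snapshots. The struct and function names below are hypothetical glue; the two fields correspond to the doc's Prometheus counters:

```go
package main

import "fmt"

// nodeCounters is one node's snapshot of the two Layer 4 metrics.
type nodeCounters struct {
	LegacyReadsTotal uint64 // elastickv_stream_legacy_format_reads_total
	LegacyKeysTotal  uint64 // elastickv_stream_legacy_format_keys_total
}

// safeToRemoveFallback encodes the criterion: BOTH counters must be
// zero on EVERY node. The 30-day soak window is enforced by evaluating
// this predicate over time, not by the predicate itself.
func safeToRemoveFallback(nodes []nodeCounters) bool {
	for _, n := range nodes {
		if n.LegacyReadsTotal != 0 || n.LegacyKeysTotal != 0 {
			return false
		}
	}
	return true
}

func main() {
	// A cold legacy stream: never read, but its blob still exists.
	cold := []nodeCounters{{LegacyReadsTotal: 0, LegacyKeysTotal: 2}}
	fmt.Println(safeToRemoveFallback(cold)) // false: reads_total alone would miss this

	drained := []nodeCounters{{0, 0}, {0, 0}}
	fmt.Println(safeToRemoveFallback(drained)) // true
}
```

The `cold` case is the exact scenario the paired counter exists for: `reads_total` is zero, yet removal would strand a legacy-format blob.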
