@@ -88,6 +88,29 @@ so the dispatcher can gate on the command byte without allocating):
 - **Ungated:** `GET`, `SET`, `DEL`, `EXISTS`, `INCR`, `EXPIRE`, `TTL`,
   `HGET`, `HSET`, `LPUSH`/`RPUSH`, `XADD`, single-key fast paths.

+**XADD during the Mode A migration window is a latent starvation
+risk.** Mode A rewrites the entire legacy blob in the first XADD that
+touches a migrated stream (see Layer 4, line 414). On a 100k-entry
+stream that single "ungated" XADD does O(N) unmarshal + re-marshal +
+per-entry Put, which is exactly the CPU profile Layer 1 is trying to
+bound. v1 mitigations, ordered by preference:
+
+1. Ship Layer 4 Mode A and Layer 1 together. The very first migration
+   XADD is expensive, but it happens once per stream; subsequent XADDs
+   are O(1) and genuinely ungated.
+2. If Mode A lands ahead of Layer 1, XADD is promoted to gated **only
+   while the periodic scan still reports legacy keys present** (see
+   the Layer 4 removal criterion). Once the legacy-keys-total counter
+   reaches zero cluster-wide, XADD demotes back to ungated. This is
+   dynamic classification for exactly one command for exactly the
+   migration window; it is worth the complexity because the alternative
+   is a repeat of the 2026-04-24 incident, triggered by a single large
+   stream's first write.
+
+Either ordering is acceptable; (1) is simpler and preferred. The
+doc does not pick an ordering — the implementer of whichever PR lands
+second owns the promotion/demotion logic.
+
 The entire `ZRANGE` family is gated, not only "full-range" variants —
 arg inspection (e.g., detecting `LIMIT 0 N`) breaks the "classify by
 command byte" simplicity, and a bounded `ZRANGE 0 10` contributes at
@@ -345,12 +368,17 @@ aggressive reconnect pool can answer each `-ERR max connections`
 with an immediate new `connect()` — the server spends CPU on the
 accept/write/close cycle and the client makes no progress. Two
 mitigations: (a) **rate-limit the reject itself**: once a peer IP
-has been rejected `R` times in the last second, the next rejects
-are answered with `RST` (cheap kernel-level reset) instead of an
-accept + write + close; (b) document operator-side client
-configuration (e.g., for redis-rb: `reconnect_attempts=3` plus an
-exponential backoff). (a) ships in v1 behind a compile-time
-constant; (b) belongs in the ops runbook.
+has been rejected `R` times in the last second, *skip the RESP
+write* for subsequent rejects and close the fd with `SetLinger(0)`.
+Pure Go `net` cannot emit a true kernel-level `RST` without
+`Accept()`; the connection is already accepted by the time we know
+to reject it, so "cheap" here means "fewer syscalls per reject
+(accept + close)," not "no accept." A true `RST`-before-accept
+requires dropping to a raw listener (`syscall.Accept4` + direct
+`SO_LINGER` setup or an eBPF filter), deferred to v2. (b) Document
+operator-side client configuration (e.g., for redis-rb:
+`reconnect_attempts=3` plus an exponential backoff). (a) ships in
+v1 behind a compile-time constant; (b) belongs in the ops runbook.

 ### Where in the code

@@ -434,12 +462,20 @@ fall-through on "new layout empty":

 1. Read `meta` if present; read all per-entry keys that match the
    requested ID range.
-2. Read the legacy-suffix blob if present; decode only entries
-   falling in the ID range.
+2. Read the legacy-suffix blob if present; **protobuf cannot decode a
+   repeated field partially**, so the blob is fully unmarshaled and
+   then filtered to the requested ID range in memory. There is no
+   cheap "decode only the range" path without custom wire-format
+   parsing, which is out of scope for v1.
 3. Merge by ID order, deduplicate (the migrator is responsible for
    never writing the same ID in both layouts in a single commit), and
    return.

+Because step 2 is a full unmarshal, the suffix-blob cost is O(N_suffix)
+per read regardless of how narrow the requested ID range is. This
+matches the Mode B cost model below and is the reason Mode B ships
+together with the read-driven drain.
+
 The v1 dual-read (Mode A) is safe because there is no mixed state.
 Extending it verbatim to Mode B would return incomplete results
 during chunked migration — entries still in the legacy suffix would
@@ -469,13 +505,40 @@ migration instead.
 a cold legacy-format stream that is neither read nor written for
 the soak window would keep the counter at zero while still needing
 the fallback. Add a paired counter
-`elastickv_stream_legacy_format_keys_total` populated from a
-periodic prefix scan (`!redis|stream|<...>` with no matching
-`!stream|meta|<...>`). The fallback is safe to remove only when
-**both** counters are zero across every node. The periodic scan
-runs at the same cadence as snapshot cleanup; its cost is bounded
-by the number of legacy keys remaining, which decays as migration
-progresses.
+`elastickv_stream_legacy_format_keys_total`.
+
+A naive implementation would scan the `!redis|stream|<...>` prefix,
+but **that prefix is shared by every per-entry key**
+(`!redis|stream|<key>|entry|<id>`), so a scan over it is
+O(total_entries_in_cluster), not O(legacy_blobs). In a deployment
+with many large migrated streams this is the exact cost profile
+Layer 4 was introduced to eliminate.
+
+Two implementable options; pick whichever is cheaper in the target
+backend:
+
+1. **Bloom-filter / sidecar index.** On every write that creates a
+   legacy blob, record the logical stream name in a dedicated
+   `!redis|stream_legacy_index|<key>` tombstone-style marker. The
+   migration write that rewrites the stream deletes that marker in the
+   same commit. The counter becomes a scan over
+   `!redis|stream_legacy_index|`, bounded by legacy blobs, not entries.
+2. **Layout-walk.** Iterate `!redis|stream|<key>|meta` keys and, for
+   each stream, probe `!redis|stream|<key>` (the legacy blob key has
+   no suffix). Scan cost is O(num_streams), not O(num_entries).
+   Equivalent answer; avoids the sidecar index.
+
+Both pass the key-pattern sanity check: legacy keys live at
+`!redis|stream|<logical>` with no further suffix, per-entry keys at
+`!redis|stream|<logical>|entry|<id>`, meta at
+`!redis|stream|<logical>|meta`. The "prefix scan on `!redis|stream|`"
+wording in earlier drafts was wrong and has been retracted; the
+layout above is authoritative.
+
+The fallback is safe to remove only when **both** counters are zero
+across every node. The index/walk runs at the same cadence as
+snapshot cleanup; its cost is bounded by the chosen option as
+described.

 The existing stream PR (#620) ships **Mode A only**. Chunked
 migration (Mode B) is explicitly deferred and must not be enabled
@@ -592,10 +655,12 @@ Recommended order of implementation:
    (mTLS or PROXY protocol identity) is unsurprising.

 5. **Layer 4 — migration window.** When can the dual-read
-   compatibility code go away? Proposed:
-   `elastickv_stream_legacy_format_reads_total` = 0 for 30 days
-   across all nodes → remove in a follow-up PR. 30 days is arbitrary;
-   revisit.
+   compatibility code go away? Proposed: **both**
+   `elastickv_stream_legacy_format_reads_total` = 0 **and**
+   `elastickv_stream_legacy_format_keys_total` = 0 for 30 days across
+   all nodes → remove in a follow-up PR. `reads_total` alone would
+   miss cold streams (see the Layer 4 removal criterion); the paired
+   counter closes that gap. 30 days is arbitrary; revisit.

 6. **Interaction with memwatch (PR #612).** memwatch fires graceful
    shutdown on hard-threshold crossing. Admission (Layer 3 / roadmap