Commit 9d44b6c

docs(design): address Gemini review on workload-isolation proposal
- Layer 1 (line 84): replace "full-range ZRANGE/ZRANGEBYSCORE" with unconditional
  gating of the whole ZRANGE family. Arg-inspection contradicted the stated
  "static, byte-level classification"; a bounded ZRANGE 0 10 costs at most one
  unmarshal and gating it is cheaper than the dispatcher branch that would
  distinguish the bounded vs unbounded case.
- Layer 1 (line 122, container awareness): document that runtime.GOMAXPROCS(0)
  returns the host CPU count on Linux without cgroup awareness and call out two
  mitigations (operator-set GOMAXPROCS env at rolling-update level, or wire
  uber-go/automaxprocs). v1 prefers the operator-set path for auditability;
  automaxprocs acceptable as follow-up.
- Layer 1 (line 139, single-pool starvation): acknowledge the risk that KEYS/SCAN
  bursts can exhaust pool slots and starve XREAD/Lua. v1 still ships a single
  pool but requires a per-command submit metric so a tier split is measurable
  from observability rather than guessed; sub-pools/slot reservation are the
  named follow-up.
- Layer 3 (line 252, reject semantics): change the v1 shape from "close TCP
  without RESP" to "accept, write -ERR, then close." The protocol-level error is
  the signal that distinguishes "server overload" from "network blip" on the
  client side.
- Layer 4 (line 315, synchronous migration cost): add a chunked migration
  section. A single XADD on a 100k-entry legacy blob would rewrite every entry
  in one Raft commit, reproducing the CPU/commit spike the design is supposed to
  prevent. Document STREAM_MIGRATION_CHUNK (default 1024) and the rolling drain
  model; explicitly scope chunked migration as a stacked follow-up to PR #620
  (which ships the simple one-txn version).
1 parent af3fa59 commit 9d44b6c

1 file changed: docs/design/2026_04_24_proposed_workload_isolation.md

Lines changed: 58 additions & 9 deletions
@@ -78,16 +78,21 @@ commands stay on the accept-goroutine path. Pool full → reply
 `-BUSY server overloaded` and return. Redis clients already treat
 `-BUSY` as retryable; reusing it means no client-library changes.

-Static v1 classification:
+Static v1 classification (name-based only — no argument inspection,
+so the dispatcher can gate on the command byte without allocating):

 - **Pool-gated:** `XREAD`, `XRANGE`, `XREVRANGE`, `KEYS`, `SCAN`,
-`HGETALL`, `SMEMBERS`, full-range `ZRANGE`/`ZRANGEBYSCORE`,
-`EVAL`/`EVALSHA`, `FCALL`/`FCALL_RO`, the `*SCAN` family.
+`HGETALL`, `HVALS`, `HKEYS`, `SMEMBERS`, `SUNION`, `SINTER`,
+`ZRANGE`, `ZRANGEBYSCORE`, `ZRANGEBYLEX`, `EVAL`/`EVALSHA`,
+`FCALL`/`FCALL_RO`, the `*SCAN` family.
 - **Ungated:** `GET`, `SET`, `DEL`, `EXISTS`, `INCR`, `EXPIRE`, `TTL`,
 `HGET`, `HSET`, `LPUSH`/`RPUSH`, `XADD`, single-key fast paths.

-Dynamic (observed-cost) classification is a follow-up; v1 bias is a
-boring, reviewable list.
+The entire `ZRANGE` family is gated, not only "full-range" variants —
+arg inspection (e.g., detecting `LIMIT 0 N`) breaks the "classify by
+command byte" simplicity, and a bounded `ZRANGE 0 10` contributes at
+most one unmarshal per request (cheap). Dynamic (observed-cost)
+classification is a follow-up; v1 bias is a boring, reviewable list.

 ### Tradeoffs

@@ -124,6 +129,30 @@ commands in `dispatchCommand` call `Submit`; ungated stay
 synchronous. Static list lives next to `dispatchCommand`. Pool-full →
 `-BUSY server overloaded`. Lua follows option (A).

+**Container-aware sizing.** `runtime.GOMAXPROCS(0)` on Linux returns
+the host CPU count, not the container's cgroup quota (unless
+`GOMAXPROCS` is set explicitly or something has configured it).
+Operators running under Kubernetes/Docker with a CPU limit should
+either (a) set `GOMAXPROCS` in the deploy environment to match the
+cgroup quota (the project's rolling-update script is the natural
+place), or (b) wire `go.uber.org/automaxprocs` into `main.go` so the
+correction happens at startup. v1 prefers (a) for auditability;
+(b) is acceptable as a follow-up if operators want it automatic.
+
+**Single pool vs per-class sub-pools.** v1 uses a single global pool.
+The risk: a burst of `KEYS *` or `SCAN` from a management client can
+exhaust all slots and force `-BUSY` onto latency-sensitive `XREAD` or
+Lua requests. Two mitigations exist: (i) classify gated commands into
+priority tiers and reserve a minimum slot share per tier (e.g., 50%
+data-path, 25% scan, 25% Lua), (ii) ship separate sub-pools per
+tier. Both add complexity that's only justified if we actually
+observe a scan-command burst displacing data-path work. **v1 defers
+sub-pools; observability must call this out so the need is
+measurable.** New metric `elastickv_heavy_command_pool_submit_total`
+labelled by command name is sufficient: if pool-full rejections
+concentrate on `XREAD` while `KEYS` dominates successful submissions,
+the tier split is warranted.
+
 ### Where in the code

 - `adapter/redis.go:575` (`dispatchCommand`), `:631` (`Run`) — gate
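A sketch of how option (b) and the proposed submit metric might be wired together, assuming the `go.uber.org/automaxprocs` and `prometheus/client_golang` libraries; the `poolSize` helper and the registration code are illustrative, only the metric name comes from the text above.

```go
package main

import (
	"log"
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
	_ "go.uber.org/automaxprocs" // option (b): clamps GOMAXPROCS to the cgroup quota at init
)

// Proposed metric: pool submissions labelled by command name, so a scan
// burst displacing data-path work is visible before any tier split exists.
var heavyPoolSubmits = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "elastickv_heavy_command_pool_submit_total",
		Help: "Heavy-command pool submissions, labelled by command name.",
	},
	[]string{"command"},
)

func init() { prometheus.MustRegister(heavyPoolSubmits) }

// poolSize derives the single global pool's slot count from GOMAXPROCS.
// Under option (a) the operator sets GOMAXPROCS in the deploy environment;
// with the automaxprocs import above, GOMAXPROCS(0) already reflects the
// container's CPU quota instead of the host CPU count.
func poolSize() int {
	n := runtime.GOMAXPROCS(0)
	if n < 2 {
		n = 2 // keep at least two slots on very small containers
	}
	return n
}

func main() {
	log.Printf("heavy-command pool slots: %d", poolSize())
}
```

At submit time the dispatcher would call `heavyPoolSubmits.WithLabelValues(cmdName).Inc()`, so pool-full rejections on `XREAD` can be compared against which commands dominate successful submissions.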
@@ -248,10 +277,13 @@ one check per accept, not per command.
 ### Recommended v1 shape

 **Per-peer-IP connection cap, default `N=8`, env-configurable,
-enforced at accept.** On reject, close the TCP connection immediately
-(no RESP — clients retry the connect). Per-client in-flight
-semaphore is deferred: it requires threading client identity through
-every dispatch, which is a bigger change than 2026-04-24 justifies.
+enforced at accept.** On reject, accept the TCP connection, write a
+`-ERR max connections per client exceeded` RESP error, then close —
+so the client sees a protocol-level message instead of a bare
+`connection reset` or `EOF` that's indistinguishable from a real
+network failure. Per-client in-flight semaphore is deferred: it
+requires threading client identity through every dispatch, which is
+a bigger change than 2026-04-24 justifies.

 ### Where in the code

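A minimal sketch of the accept-time cap, with illustrative names (`perIPLimiter`, `handleAccept`); only the reject behaviour, write the `-ERR` line and then close, is taken from the proposal.

```go
package adapter // illustrative package name

import (
	"io"
	"net"
	"sync"
)

// perIPLimiter counts open connections per peer IP. The limit defaults to
// N=8 in the proposal and is env-configurable; the shape here is a sketch.
type perIPLimiter struct {
	mu    sync.Mutex
	open  map[string]int
	limit int
}

func newPerIPLimiter(limit int) *perIPLimiter {
	return &perIPLimiter{open: make(map[string]int), limit: limit}
}

func (l *perIPLimiter) tryAcquire(ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.open[ip] >= l.limit {
		return false
	}
	l.open[ip]++
	return true
}

func (l *perIPLimiter) release(ip string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.open[ip] > 0 {
		l.open[ip]--
	}
}

// handleAccept enforces the cap at accept time. Over the limit the server
// still accepts, writes a protocol-level RESP error so clients can tell
// "server said no" from a network blip, and then closes the connection.
func handleAccept(l *perIPLimiter, conn net.Conn, serve func(net.Conn)) {
	ip, _, _ := net.SplitHostPort(conn.RemoteAddr().String())
	if !l.tryAcquire(ip) {
		io.WriteString(conn, "-ERR max connections per client exceeded\r\n")
		conn.Close()
		return
	}
	defer l.release(ip)
	defer conn.Close()
	serve(conn)
}
```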
@@ -318,6 +350,23 @@ Remove the legacy fallback later once
 `elastickv_stream_legacy_format_reads_total` has sat at zero across
 all nodes for a soak window.

+**Chunked migration for large legacy streams.** A single `XADD` on a
+100k-entry legacy stream would rewrite every entry in one Raft
+transaction — regressing into the same CPU and commit-time spike the
+design is supposed to prevent. The migration write therefore caps how
+many entries it rewrites per transaction (`STREAM_MIGRATION_CHUNK`,
+default 1 024). When the legacy stream exceeds the chunk, the first
+write migrates the oldest `CHUNK` entries and the remaining legacy
+tail stays in a legacy-*suffix* key (the blob minus the entries
+already promoted). Subsequent writes drain another chunk each, until
+the legacy tail is empty and can be deleted. A background "migrator"
+goroutine driven by the same `legacy_format_reads_total` metric is a
+follow-up if operator-driven migration proves too slow.
+
+The existing stream PR (#620) ships the simpler "rewrite all in one
+txn" version; chunked migration is a stacked follow-up once we see
+legacy stream sizes in production.
+
 ### Other one-blob-per-key collections

 Spot-check of `adapter/redis_compat_helpers.go` confirms the same
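To make the drain model concrete, a sketch of one chunked migration step under assumed types; only `STREAM_MIGRATION_CHUNK` and its 1024 default come from the text, while the entry/txn shapes and the `:legacy` key suffix are invented for illustration.

```go
package adapter // illustrative package name

import (
	"os"
	"strconv"
)

// chunkSize reads STREAM_MIGRATION_CHUNK (default 1024): the maximum number
// of legacy entries rewritten inside a single Raft transaction.
func chunkSize() int {
	if v, err := strconv.Atoi(os.Getenv("STREAM_MIGRATION_CHUNK")); err == nil && v > 0 {
		return v
	}
	return 1024
}

// entry and txn are stand-ins for the real stream entry and Raft
// transaction types.
type entry struct {
	id    string
	value []byte
}

type txn interface {
	Put(key string, value []byte)     // promote one entry to its own key
	PutBlob(key string, tail []entry) // re-encode the unmigrated legacy tail
	Delete(key string)
}

// migrateChunk performs one drain step: promote the oldest chunk of entries
// out of the legacy blob and leave the remainder under the legacy-suffix
// key. It returns the tail so the caller knows whether another step is
// needed on the next write.
func migrateChunk(t txn, streamKey string, legacy []entry) []entry {
	n := chunkSize()
	if n > len(legacy) {
		n = len(legacy)
	}
	for _, e := range legacy[:n] { // legacy is assumed oldest-first
		t.Put(streamKey+":"+e.id, e.value)
	}
	tail := legacy[n:]
	if len(tail) == 0 {
		t.Delete(streamKey + ":legacy") // fully drained: drop the blob
	} else {
		t.PutBlob(streamKey+":legacy", tail)
	}
	return tail
}
```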
