Commit bbf31eb
docs(design): propose workload-class isolation after 2026-04-24 XREAD starvation (#619)
## Summary
Design doc only (no code in this PR) for a four-layer
workload-isolation model, prompted by the afternoon phase of the
2026-04-24 incident.
**Problem:** Today, one client host with 37 connections running a tight
XREAD loop consumed 14 CPU cores on the leader via `loadStreamAt →
unmarshalStreamValue → proto.Unmarshal` (81% of CPU per pprof). Raft
goroutines couldn't get CPU → `step_queue_full` = 75,692 on the leader
(vs 0-119 on followers) → Raft commit p99 jumped to 6-10s, Lua p99 stuck
at 6-8s. Follower replication was healthy (applied index within 34 of
the leader); the damage was entirely a CPU-scheduling problem on the
leader.
**Gap:** elastickv has no explicit workload-class isolation. Go's
scheduler treats every goroutine equally; a single heavy command path
can starve unrelated paths (raft, lease, Lua, GET/SET).
## Four-layer defense model
- **Layer 1 — heavy-command worker pool**: gate XREAD / KEYS / SCAN /
Lua onto a bounded pool (~`2 × GOMAXPROCS`); reply `-BUSY` when full.
Cheap commands keep their own fast path (see the pool sketch after this
list).
- **Layer 2 — locked OS threads for raft**: `runtime.LockOSThread()` on
the Ready loop + dispatcher lanes so the Go scheduler can't starve them
(sketched after this list). **Not v1** — only if measurement after
Layers 1 and 4 still shows `step_queue_full > 0`.
- **Layer 3 — per-client admission control**: per-peer-IP connection cap
(default 8), sketched after this list. Extends, rather than replaces,
roadmap item 6's global in-flight semaphore.
- **Layer 4 — XREAD O(N) → O(new)**: entry-per-key layout
(`!redis|stream|<key>|entry|<id>`) with range scans, a dual-read
migration fallback, and legacy removal gated on
`elastickv_stream_legacy_format_reads_total == 0` (see the key-layout
sketch after this list). Hashes/sets/zsets share the same one-blob
pattern and are called out as a follow-up.
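
To make Layer 1 concrete, here is a minimal sketch of the bounded pool,
assuming a `runHeavy` gate wrapped around the XREAD/KEYS/SCAN/Lua
handlers; the names (`heavySlots`, `runHeavy`, `ErrBusy`) are
illustrative, not elastickv's actual API:

```go
package isolation

import (
	"errors"
	"runtime"
)

// ErrBusy maps to a RESP -BUSY reply at the protocol layer.
var ErrBusy = errors.New("BUSY heavy-command pool saturated, retry later")

// heavySlots bounds concurrent heavy commands (XREAD/KEYS/SCAN/Lua) at
// roughly 2 × GOMAXPROCS, leaving cores for raft and cheap commands.
var heavySlots = make(chan struct{}, 2*runtime.GOMAXPROCS(0))

// runHeavy admits fn if a slot is free and otherwise fails fast, so a
// hot XREAD loop sheds load immediately instead of queueing.
func runHeavy(fn func() error) error {
	select {
	case heavySlots <- struct{}{}:
		defer func() { <-heavySlots }()
		return fn()
	default:
		return ErrBusy
	}
}
```

Failing fast rather than queueing is the point: a queue in front of the
pool would only relocate the backlog, while `-BUSY` pushes backpressure
out to the client.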
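
Layer 2, if measurement ever forces it, would look roughly like the
sketch below: pin the Ready-loop goroutine to a dedicated OS thread.
`Ready` and `process` stand in for the real raft types:

```go
package isolation

import "runtime"

// Ready stands in for the raft library's ready-batch type.
type Ready struct{}

// readyLoop drains raft Ready batches from a goroutine wired to its
// own OS thread, so it is not multiplexed onto the same threads as
// heavy command goroutines. LockOSThread must be called from the
// goroutine it is meant to pin, and it holds until UnlockOSThread or
// goroutine exit.
func readyLoop(ready <-chan Ready, process func(Ready)) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for rd := range ready {
		process(rd)
	}
}
```

Note that locking alone dedicates a thread but does not raise its OS
scheduling priority, one more reason to defer this layer until
measurement proves it necessary.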
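
The Layer 3 cap can be a counted map consulted at accept time. The
default of 8 comes from the doc; everything else here (names, error
handling) is an illustrative sketch:

```go
package isolation

import (
	"net"
	"sync"
)

// ipLimiter enforces a per-peer-IP connection cap. admit is called at
// accept time; release when an admitted connection closes.
type ipLimiter struct {
	mu    sync.Mutex
	limit int            // default 8 per the doc
	conns map[string]int // open connections per peer IP
}

func newIPLimiter(limit int) *ipLimiter {
	return &ipLimiter{limit: limit, conns: make(map[string]int)}
}

func (l *ipLimiter) admit(remote net.Addr) bool {
	host, _, err := net.SplitHostPort(remote.String())
	if err != nil {
		host = remote.String() // fall back to the raw address
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.conns[host] >= l.limit {
		return false // caller closes the conn or replies with an error
	}
	l.conns[host]++
	return true
}

func (l *ipLimiter) release(remote net.Addr) {
	host, _, err := net.SplitHostPort(remote.String())
	if err != nil {
		host = remote.String()
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	l.conns[host]--
	if l.conns[host] <= 0 {
		delete(l.conns, host) // keep the map bounded
	}
}
```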
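
For Layer 4, the essence is that the entry-per-key layout turns
"entries after ID x" into a range scan. A sketch of the encoding and
scan bounds, assuming bytewise-ordered keys (a real implementation
would need a fixed-width binary ID encoding so lexicographic order
matches stream-ID order, plus escaping of `|` in user keys):

```go
package isolation

import "fmt"

// streamEntryKey builds the proposed per-entry key
// !redis|stream|<key>|entry|<id>.
func streamEntryKey(stream, id string) []byte {
	return []byte(fmt.Sprintf("!redis|stream|%s|entry|%s", stream, id))
}

// xreadBounds returns [start, end) so a range scan visits only entries
// strictly after lastID: appending 0x00 yields the smallest key that
// sorts after the exact lastID key under bytewise comparison.
func xreadBounds(stream, lastID string) (start, end []byte) {
	start = append(streamEntryKey(stream, lastID), 0x00)
	end = append([]byte(fmt.Sprintf("!redis|stream|%s|entry|", stream)), 0xff)
	return start, end
}
```

This is what turns XREAD from O(stream length) into O(new entries): the
scan starts at the caller's cursor instead of unmarshalling one blob
holding the whole stream. The dual-read path would consult this layout
first and fall back to the legacy one-blob key, with each fallback
counted by `elastickv_stream_legacy_format_reads_total`, which is what
gates legacy removal.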
## Recommended sequencing
Layer 4 (correctness bug, concentrated change) → Layer 1 (generic
defense for next unknown hotspot) → Layer 3 (reconcile with roadmap item
6) → Layer 2 (only if forced by measurement).
## Relationship to other in-flight work
- Complements (does not replace)
`docs/design/2026_04_24_proposed_resilience_roadmap.md` item 6
(admission control). This doc's Layer 3 focuses on per-client fairness;
the roadmap's item 6 is global in-flight capping. Both are needed.
- Consistent with memwatch (#612): Layer 3's admission threshold should
fire **before** memwatch's shutdown threshold; this ordering is flagged
as an open question in the doc.
- Assumes WAL auto-repair (#613) and the GOMEMLIMIT defaults (#617) have
landed, so the cluster survives long enough for this to matter.
## Open questions called out in the doc
- Static vs dynamic command classification (Layer 1)
- `-BUSY` backoff semantics — how do we avoid client retry spinning
becoming the new hot loop? (one possible shape sketched after this list)
- Number of locked OS threads on variable-core hosts (Layer 2)
- Stream migration soak window before removing the legacy-format
fallback (Layer 4; the current 30 days is arbitrary)
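
On the `-BUSY` backoff question, one plausible shape (not settled in
the doc) is full-jitter exponential backoff mandated in the client SDK,
so rejected clients spread out instead of re-arriving in lockstep. A
sketch with placeholder constants:

```go
package isolation

import (
	"math/rand"
	"time"
)

// busyBackoff returns the sleep before retry n (0-based) after a -BUSY
// reply: a uniform draw from [0, min(ceiling, base·2ⁿ)], i.e. "full
// jitter", which keeps rejected clients from synchronizing into a new
// hot loop.
func busyBackoff(n int) time.Duration {
	const (
		base    = 5 * time.Millisecond
		ceiling = 2 * time.Second
	)
	max := base << uint(n)
	if max <= 0 || max > ceiling { // <= 0 guards shift overflow
		max = ceiling
	}
	return time.Duration(rand.Int63n(int64(max)))
}
```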
## Deliverable
`docs/design/2026_04_24_proposed_workload_isolation.md` — 446 lines,
dated-prefix / `**Status: Proposed**` convention matching the rest of
`docs/design/`. No code.
## Test plan
- [x] File paths and function references in the doc spot-checked against
`origin/main`
- [x] Cross-references to `2026_04_24_proposed_resilience_roadmap.md`
reconciled (complements, doesn't duplicate)
- [ ] Design review — decide on the open questions before implementing
Layer 4 (which blocks Layer 1 on XREAD specifically)