
Commit 40cd013

xrl and claude committed
docs(e2e): stress tiers + measured spinup-burst budget
Document why the stress e2e runs as three batches, grounded in salvaged spinup-burst measurements (10-CPU/8GB Docker VM):

- Tier 1 (size 1-3): a full size-7 bring-up is only ~8 CPU-s, front-loaded on the bootstrap member, so small clusters parallelize freely.
- Tier 2 (size 7, churn): 4 concurrent size-7 spinups peak at 12.34 busy cores on a 10-core VM (~1.4-3 cores each) -> only ~1 fits a 2-vCPU runner; --max-concurrent-reconciles, not namespace isolation, is the overlap lever.
- Tier 3 (TestStressCrashDuringScale): gofail failpoints are operator-global (one pod), so a crash test must run alone.

Also documents the 50m etcd CPU request (cpu.shares 2 -> ~51, request not limit so zero throttling) with the honest caveat that the 10-core VM could not reproduce 2-vCPU contention, and a core-constrained re-run is needed to demonstrate the QoS benefit. No throttling observed in any run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
1 parent ed87bac commit 40cd013

1 file changed

Lines changed: 103 additions & 0 deletions

test/e2e/STRESS.md

@@ -0,0 +1,103 @@
1+
# Stress e2e: three batches, and the spinup-burst budget behind them
2+
3+
The stress e2e suite is deliberately split into **three batches that must not be mixed**. The
4+
split is not cosmetic — it falls out of how much CPU an etcd bring-up actually burns, of how
5+
gofail failpoints are scoped, and of what a 2-vCPU CI runner can physically do at once. This
6+
doc records the measured numbers and the reasoning so the batching is not re-litigated.
7+
8+
## TL;DR
9+
10+
| Tier | What | Concurrency | Why |
11+
|------|------|-------------|-----|
12+
| **1 — cheap-parallel** | size-1 / size-3 bring-ups | parallelize freely | each bring-up is a few CPU-seconds; bootstrap member dominates |
13+
| **2 — heavy-throttled** | size-7 bring-ups, scale churn | low width (~1 size-7 per 2 vCPU) | one size-7 spinup peaks **~1.4–3 cores**; 4 of them saturate a **10-core** VM |
14+
| **3 — crash-exclusive** | `TestStressCrashDuringScale` | run alone | gofail failpoints are **operator-global**; arming one panics the single operator pod for *every* cluster |
15+
16+
The real lever for overlapping Tier-2 work is the reconcile worker pool
17+
(`--max-concurrent-reconciles`, default 5), **not** namespace isolation — namespaces isolate
18+
state, they do not buy you CPU.
19+
20+
## Measured on (honesty box)
21+
22+
> **All numbers below were measured on a Docker Desktop kind cluster with 10 CPUs / 7.75 GiB
23+
> (`00_docker_envelope.txt`), NOT a 2-vCPU CI runner.** They are **spinup-cost measurements +
24+
> extrapolation**. The 2-vCPU starvation burst this tiering is designed around was *not*
25+
> reproduced at the CI core count: the local VM had too much headroom (which is exactly why,
26+
> see the caveat in the 50m CPU request section, BestEffort etcd looked fine here). Treat the per-size-7 core peak as a
27+
> measured cost and the "how many fit on 2 vCPU" as an **extrapolation, confidence medium**.
28+
29+
Method: W concurrent size-7 `EtcdCluster`s applied simultaneously, operator at
30+
`--max-concurrent-reconciles=5`, polled to "7 voting members healthy". Per-etcd
31+
`usage_usec` read from cgroup `cpu.stat` at end of spinup (clusters start from zero, so
32+
`usage_usec` / 10^6 ≈ CPU-seconds to reach healthy). W6 excluded as over-escalation beyond the envelope.
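
The `usage_usec` to CPU-seconds conversion in the method can be sketched as follows. This is a minimal illustration, not the script used for the measurements: the function name is hypothetical, and the sample `cpu.stat` text is made up (its `usage_usec` is chosen to mirror the ec-0 figure quoted in Tier 1, not copied from a real file). It assumes the cgroup v2 `cpu.stat` layout of one `key value` pair per line.

```python
def cpu_seconds_from_cpu_stat(text: str) -> float:
    """Return cumulative CPU-seconds from the text of a cgroup v2 cpu.stat file."""
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key == "usage_usec":
            return int(value) / 1_000_000  # microseconds -> seconds
    raise ValueError("usage_usec not found in cpu.stat")


# Illustrative contents only (hypothetical values, not from the measurement runs).
sample = """usage_usec 3300000
user_usec 2100000
system_usec 1200000
nr_periods 0
nr_throttled 0
throttled_usec 0
"""

member_cpu_s = cpu_seconds_from_cpu_stat(sample)
print(f"{member_cpu_s:.1f} CPU-s")
```

Summing this value over the seven members of one cluster would give the per-cluster CPU-s column, assuming each member's cgroup was created fresh for the run.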
33+
34+
| W | per-cluster CPU-s (×7) | time-to-healthy s (min/med/max) | node peak busy-cores (of 10) | peak mem | throttled |
35+
|---|------------------------|---------------------------------|------------------------------|----------|-----------|
36+
| 1 | 8.1 | 70 / 70 / 70 | ~1.4 (coarse) | 1.14 GiB | 0 |
37+
| 2 | ~21.8 | 75 / 75 / 75 | n/a | 1.52 GiB | 0 |
38+
| 3 | ~14.9 | 77 / 77 / 81 | n/a | 1.90 GiB | 0 |
39+
| 4 | ~62.8 | 102 / 121 / 142 | **12.34** | 2.28 GiB | 0 |
40+
41+
(Full data + the `docker stats` CPU% caveat: `/tmp/etcd-burst-stats/SUMMARY.md`. The hi-res
42+
busy-core sampler exists only for W4; the `docker stats` CPU% column is jittery and not used
43+
for load-bearing claims.)
44+
45+
## Tier 1 — cheap-parallel (size 1–3)
46+
47+
A whole **size-7** bring-up in isolation costs only **8.1 CPU-seconds** total, and it is
48+
heavily front-loaded on the bootstrap member (ec-0 = 3.3 CPU-s, ~40% of the cluster) with each
49+
later-joined member costing less (down to 0.25 CPU-s). A size-1 is therefore roughly one
50+
member's worth of work and a size-3 roughly three — a few CPU-seconds each, spread over a
51+
~70s window. These never come close to saturating a runner. **Parallelize them freely**; the
52+
limit is test-harness bookkeeping, not CPU.
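
As a rough check on the figures above, the arithmetic can be worked through directly. The inputs are the numbers quoted in this section and the W1 table row; the variable names are just for illustration. The average hides the front-loaded burst, so it is a lower bound on instantaneous demand, not a peak.

```python
total_cpu_s = 8.1      # isolated size-7 bring-up (W1 row)
bootstrap_cpu_s = 3.3  # ec-0, the bootstrap member
window_s = 70          # W1 time-to-healthy

bootstrap_share = bootstrap_cpu_s / total_cpu_s
avg_cores = total_cpu_s / window_s  # mean cores busy across the whole window

print(f"bootstrap share: {bootstrap_share:.0%}")
print(f"average cores over the window: {avg_cores:.2f}")
```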
53+
54+
## Tier 2 — heavy-throttled (size 7, churn)
55+
56+
This is where the budget bites. The hi-res node sampler shows **4 simultaneous size-7 spinups
57+
peaking at 12.34 busy cores on a 10-core VM** — i.e. they oversubscribe a 10-core machine.
58+
Dividing the overlapped peak by 4 gives **~3 cores per concurrent size-7 spinup** at the burst;
59+
a single isolated size-7 peaked ~1.4 cores. So budget **~1.4–3 cores of instantaneous peak per
60+
size-7 bring-up.**
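
The budget range comes from two measured peaks and one division, spelled out here with illustrative variable names. Dividing a node-wide peak evenly across four spinups is itself a simplification, since the four bursts need not line up exactly.

```python
w4_peak_busy_cores = 12.34  # hi-res node sampler, 4 concurrent size-7 spinups
w4_width = 4
w1_peak_busy_cores = 1.4    # coarse sample, single isolated size-7 spinup

per_spinup_overlapped = w4_peak_busy_cores / w4_width
budget = (w1_peak_busy_cores, per_spinup_overlapped)

print(f"budget per size-7 spinup: {budget[0]:.1f} to {budget[1]:.1f} cores")
```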
61+
62+
Consequence for a **2-vCPU** CI runner (extrapolation, confidence medium): **only ~1 size-7
63+
spinup fits.** A second concurrent size-7 pushes instantaneous demand well past 2 cores; the
64+
spinups don't fail (no CPU *limit* is set, so nothing is CFS-throttled — `nr_throttled=0` in
65+
every run) but they contend for the same cores and time-to-healthy stretches. So Tier 2
66+
runs **at low width / throttled**, and the way you safely overlap a *little* Tier-2 work is by
67+
sizing the reconcile worker pool — **`--max-concurrent-reconciles`** — to match the cores you
68+
have, rather than relying on namespaces to "isolate" load. Namespaces isolate Kubernetes state;
69+
they do nothing for CPU contention.
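
One way to turn the per-spinup budget into a width is a simple floor division, sketched here. This is a back-of-the-envelope heuristic, not the operator's logic: `size7_width` is a hypothetical helper, and the floor of 1 reflects only that at least one spinup has to run, however slowly.

```python
import math


def size7_width(vcpus: float, peak_cores_per_spinup: float) -> int:
    """Concurrent size-7 spinups that fit under a per-spinup peak budget (at least 1)."""
    return max(1, math.floor(vcpus / peak_cores_per_spinup))


for vcpus in (2, 10):
    optimistic = size7_width(vcpus, 1.4)     # isolated-spinup peak
    pessimistic = size7_width(vcpus, 3.085)  # overlapped W4 peak / 4
    print(f"{vcpus} vCPU: width {pessimistic} to {optimistic}")
```

On 2 vCPU both ends of the budget give a width of 1, which matches the "only ~1 fits" extrapolation. On 10 cores the pessimistic end gives 3, consistent with W4 overshooting the VM.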
70+
71+
## Tier 3 — crash-exclusive (`TestStressCrashDuringScale`)
72+
73+
gofail failpoints are armed over HTTP on the **single operator pod** (`enableGoFailPoint` in
74+
`helpers_test.go`, hitting the operator's gofail port). There is one operator reconciling every
75+
cluster, so **arming a failpoint panics that one pod for all clusters at once** — it is a global
76+
switch, not per-cluster. Any other stress cluster sharing the operator during a crash test gets
77+
collateral reconcile failures and corrupts the result. `TestStressCrashDuringScale` therefore
78+
**must run alone**, with no Tier-1 or Tier-2 work in flight.
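
The three-batch rule can be expressed as a small planner, sketched here under assumptions. Only `TestStressCrashDuringScale` is a real test name from this doc; the other names and the `plan_batches` helper are hypothetical, and the actual harness may order or select tests differently.

```python
from collections import defaultdict


def plan_batches(tests):
    """Group (name, tier) pairs into sequential batches; every Tier-3 test runs alone."""
    by_tier = defaultdict(list)
    for name, tier in tests:
        by_tier[tier].append(name)

    batches = []
    for tier in (1, 2):
        if by_tier[tier]:
            batches.append(by_tier[tier])  # Tier 1 and Tier 2 never share a batch
    for name in by_tier[3]:
        batches.append([name])             # crash tests get the operator to themselves
    return batches


tests = [
    ("TestStressSize1", 1),                # hypothetical name
    ("TestStressSize3", 1),                # hypothetical name
    ("TestStressSize7Churn", 2),           # hypothetical name
    ("TestStressCrashDuringScale", 3),
]
for batch in plan_batches(tests):
    print(batch)
```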
79+
80+
## Why etcd gets a 50m CPU request
81+
82+
The operator gives the etcd container a default **50m CPU *request*** (`--etcd-cpu-request`,
83+
default `50m`; set `0` for BestEffort). Rationale:
84+
85+
- A BestEffort pod gets cgroup **`cpu.shares=2`** — the floor. Under CPU contention the kernel
86+
hands it almost no weight, which during a multi-member bring-up means missed heartbeats →
87+
**spurious leader elections / election churn** and stretched time-to-healthy.
88+
- A 50m request lifts shares to **~51** (50m/1000m × 1024) — a **~25× weight improvement** — and
89+
also gives the scheduler a real request to place against (a scheduling floor). QoS goes
90+
BestEffort → Burstable.
91+
- It is a **request, not a limit**, so there is no CFS quota and **no throttling** (confirmed:
92+
`nr_throttled=0` across all runs). It costs nothing when CPU is free.
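
The shares arithmetic in the bullets can be reproduced with a short function. This is a sketch modelled on the kubelet's millicore-to-shares conversion as commonly described for cgroup v1 (floor of 2, 1024 shares per core); the function and constant names here are illustrative, and on cgroup v2 nodes the value is further translated into `cpu.weight`, which this sketch does not cover.

```python
MIN_SHARES = 2        # floor applied to BestEffort / zero-request containers
MAX_SHARES = 262144
SHARES_PER_CPU = 1024
MILLI_CPU_PER_CPU = 1000


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to cgroup v1 cpu.shares."""
    if milli_cpu == 0:
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(MIN_SHARES, min(MAX_SHARES, shares))


best_effort = milli_cpu_to_shares(0)
with_request = milli_cpu_to_shares(50)
print(best_effort, with_request, with_request / best_effort)
```

Shares are relative weights, so the ratio only matters when runnable tasks outnumber cores; with idle cores both settings get all the CPU they ask for.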
93+
94+
**Honest caveat:** the local 10-core run **could not demonstrate this benefit**. With 10 cores
95+
available there was little sustained contention, so the 50m-vs-BestEffort A/B at W4 came out as *noise*: BestEffort
96+
even finished marginally faster (median time-to-healthy 115s vs 121s) and showed cleaner elections. That
97+
does **not** disprove the request: the only thing it buys (a higher shares/scheduling floor)
98+
only manifests under sustained oversubscription, which this VM reached at most briefly at the W4 peak (12.34 busy cores). To prove it, re-run
99+
**core-constrained** — pin the kind node to 2 cores (`docker update --cpus=2`) or use a 2-vCPU
100+
runner — and repeat the 50m-vs-BestEffort A/B at W2–W4. Under that constraint the BestEffort
101+
`cpu.shares=2` members should show measurably longer time-to-healthy and/or spurious leader
102+
changes. Until that run exists, the 50m default rests on the cgroup-shares argument plus the
103+
spinup-cost measurements here, not on a reproduced contention burst.
