| 1 | +# Stress e2e: three batches, and the spinup-burst budget behind them |
| 2 | + |
| 3 | +The stress e2e suite is deliberately split into **three batches (tiers) that must not be mixed**. The |
| 4 | +split is not cosmetic — it falls out of how much CPU an etcd bring-up actually burns, of how |
| 5 | +gofail failpoints are scoped, and of what a 2-vCPU CI runner can physically do at once. This |
| 6 | +doc records the measured numbers and the reasoning so the batching is not re-litigated. |
| 7 | + |
| 8 | +## TL;DR |
| 9 | + |
| 10 | +| Tier | What | Concurrency | Why | |
| 11 | +|------|------|-------------|-----| |
| 12 | +| **1 — cheap-parallel** | size-1 / size-3 bring-ups | parallelize freely | each bring-up is a few CPU-seconds; bootstrap member dominates | |
| 13 | +| **2 — heavy-throttled** | size-7 bring-ups, scale churn | low width (~1 size-7 per 2 vCPU) | one size-7 spinup peaks **~1.4–3 cores**; 4 of them saturate a **10-core** VM | |
| 14 | +| **3 — crash-exclusive** | `TestStressCrashDuringScale` | run alone | gofail failpoints are **operator-global**; arming one panics the single operator pod for *every* cluster | |
| 15 | + |
| 16 | +The real lever for overlapping Tier-2 work is the reconcile worker pool |
| 17 | +(`--max-concurrent-reconciles`, default 5), **not** namespace isolation — namespaces isolate |
| 18 | +state, they do not buy you CPU. |
| 19 | + |
| 20 | +## Measured on (honesty box) |
| 21 | + |
| 22 | +> **All numbers below were measured on a Docker Desktop kind cluster with 10 CPUs / 7.75 GiB |
| 23 | +> (`00_docker_envelope.txt`), NOT a 2-vCPU CI runner.** They are **spinup-cost measurements + |
| 24 | +> extrapolation**. The 2-vCPU starvation burst this tiering is designed around was *not* |
| 25 | +> reproduced at the CI core count: the local VM had too much headroom (which is exactly why |
| 26 | +> BestEffort etcd looked fine here; see the caveat under "Why etcd gets a 50m CPU request"). Treat the per-size-7 core peak as a |
| 27 | +> measured cost and the "how many fit on 2 vCPU" as an **extrapolation, confidence medium**. |
| 28 | + |
| 29 | +Method: W size-7 `EtcdCluster`s (W = number applied concurrently) were applied simultaneously, |
| 30 | +with the operator at `--max-concurrent-reconciles=5`, and polled until "7 voting members healthy". |
| 31 | +Per-etcd `usage_usec` was read from cgroup `cpu.stat` at the end of spinup; clusters start from |
| 32 | +zero, so `usage_usec` / 10^6 ≈ CPU-seconds to reach healthy. W=6 is excluded as over-escalation beyond the envelope. |
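
The conversion described in the method can be sketched as follows. This is a minimal illustration, assuming the cgroup v2 `cpu.stat` layout of one `key value` pair per line. The helper names `parse_cpu_stat` and `cpu_seconds` are hypothetical and not part of the repo, and the sample text is made up apart from `usage_usec`, which is set to match the 3.3 CPU-s bootstrap-member figure quoted elsewhere in this doc.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup v2 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def cpu_seconds(stats):
    """usage_usec is in microseconds; divide by 10^6 to get CPU-seconds."""
    return stats["usage_usec"] / 1_000_000


# Illustrative sample only: the user/system split below is invented.
SAMPLE = """usage_usec 3300000
user_usec 2100000
system_usec 1200000
nr_periods 0
nr_throttled 0
throttled_usec 0
"""

stats = parse_cpu_stat(SAMPLE)
print(cpu_seconds(stats), stats["nr_throttled"])
```

Because each cluster's cgroup counters start from zero, a single read at the end of spinup is enough; no before/after delta is needed.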
| 33 | + |
| 34 | +| W | per-cluster CPU-s (sum over 7 members) | time-to-healthy, s (min/med/max) | node peak busy cores (of 10) | peak memory | `nr_throttled` | |
| 35 | +|---|------------------------|---------------------------------|------------------------------|----------|-----------| |
| 36 | +| 1 | 8.1 | 70 / 70 / 70 | ~1.4 (coarse) | 1.14 GiB | 0 | |
| 37 | +| 2 | ~21.8 | 75 / 75 / 75 | — | 1.52 GiB | 0 | |
| 38 | +| 3 | ~14.9 | 77 / 77 / 81 | — | 1.90 GiB | 0 | |
| 39 | +| 4 | ~62.8 | 102 / 121 / 142 | **12.34** | 2.28 GiB | 0 | |
| 40 | + |
| 41 | +(Full data + the `docker stats` CPU% caveat: `/tmp/etcd-burst-stats/SUMMARY.md`. The hi-res |
| 42 | +busy-core sampler exists only for W4; the `docker stats` CPU% column is jittery and not used |
| 43 | +for load-bearing claims.) |
| 44 | + |
| 45 | +## Tier 1 — cheap-parallel (size 1–3) |
| 46 | + |
| 47 | +A whole **size-7** bring-up in isolation costs only **8.1 CPU-seconds** total, and it is |
| 48 | +heavily front-loaded on the bootstrap member (ec-0 = 3.3 CPU-s, ~40% of the cluster) with each |
| 49 | +later-joined member costing less (down to 0.25 CPU-s). A size-1 is therefore roughly one |
| 50 | +member's worth of work and a size-3 roughly three — a few CPU-seconds each, spread over a |
| 51 | +~70s window. These never come close to saturating a runner. **Parallelize them freely**; the |
| 52 | +limit is test-harness bookkeeping, not CPU. |
| 53 | + |
| 54 | +## Tier 2 — heavy-throttled (size 7, churn) |
| 55 | + |
| 56 | +This is where the budget bites. The hi-res node sampler shows **4 simultaneous size-7 spinups |
| 57 | +peaking at 12.34 busy cores on a 10-core VM** — i.e. they oversubscribe a 10-core machine. |
| 58 | +Dividing the overlapped peak by 4 gives **~3 cores per concurrent size-7 spinup** at the burst; |
| 59 | +a single isolated size-7 peaked ~1.4 cores. So budget **~1.4–3 cores of instantaneous peak per |
| 60 | +size-7 bring-up.** |
| 61 | + |
| 62 | +Consequence for a **2-vCPU** CI runner (extrapolation, confidence medium): **only ~1 size-7 |
| 63 | +spinup fits.** A second concurrent size-7 pushes instantaneous demand well past 2 cores; the |
| 64 | +spinups don't fail (no CPU *limit* is set, so nothing is CFS-throttled; `nr_throttled=0` in |
| 65 | +every run), but they contend for the available cores and time-to-healthy stretches. So Tier 2 |
| 66 | +runs **at low width / throttled**, and the way you safely overlap a *little* Tier-2 work is by |
| 67 | +sizing the reconcile worker pool — **`--max-concurrent-reconciles`** — to match the cores you |
| 68 | +have, rather than relying on namespaces to "isolate" load. Namespaces isolate Kubernetes state; |
| 69 | +they do nothing for CPU contention. |
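
The budget arithmetic in this section can be written out as a short sketch. The function names are hypothetical. Flooring to a whole number of spinups, with a floor of one, is an assumption taken from the text's statement that spinups slow down rather than fail.

```python
import math


def per_spinup_peak(node_peak_cores, width):
    """Average instantaneous peak per spinup when `width` spinups overlap."""
    return node_peak_cores / width


def spinups_that_fit(runner_cores, peak_per_spinup):
    """Whole spinups that fit under the core count.

    Assumption: never report fewer than 1, since a lone spinup on a small
    runner still completes, only more slowly.
    """
    return max(1, math.floor(runner_cores / peak_per_spinup))


burst_peak = per_spinup_peak(12.34, 4)   # overlapped W4 measurement
isolated_peak = 1.4                      # coarse W1 measurement

for label, peak in (("isolated", isolated_peak), ("burst", burst_peak)):
    print(label, round(peak, 2), spinups_that_fit(2, peak), spinups_that_fit(10, peak))
```

Both ends of the measured range give one size-7 spinup on a 2-vCPU runner. At the burst figure the 10-core VM fits three, which is consistent with four overlapping spinups exceeding it.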
| 70 | + |
| 71 | +## Tier 3 — crash-exclusive (`TestStressCrashDuringScale`) |
| 72 | + |
| 73 | +gofail failpoints are armed over HTTP on the **single operator pod** (`enableGoFailPoint` in |
| 74 | +`helpers_test.go`, hitting the operator's gofail port). There is one operator reconciling every |
| 75 | +cluster, so **arming a failpoint panics that one pod for all clusters at once** — it is a global |
| 76 | +switch, not per-cluster. Any other stress cluster sharing the operator during a crash test suffers |
| 77 | +collateral reconcile failures, which corrupts the results. `TestStressCrashDuringScale` therefore |
| 78 | +**must run alone**, with no Tier-1 or Tier-2 work in flight. |
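
To make the "global switch" point concrete, here is a rough sketch of what arming a failpoint over gofail's HTTP endpoint looks like: a `PUT` to `/<failpoint-name>` whose body is the failpoint term. This is not the repo's `enableGoFailPoint` helper; `build_arm_request`, the endpoint address, and the failpoint name are all hypothetical, and the request is only constructed, never sent.

```python
import urllib.request


def build_arm_request(endpoint, failpoint, term):
    """Build (but do not send) the HTTP PUT that arms one gofail failpoint."""
    url = f"{endpoint.rstrip('/')}/{failpoint}"
    return urllib.request.Request(url, data=term.encode("utf-8"), method="PUT")


# Hypothetical operator gofail address and failpoint name, for illustration only.
req = build_arm_request("http://127.0.0.1:22381", "exampleFailpoint", "panic")
print(req.get_method(), req.full_url, req.data)
```

Nothing in the request identifies a cluster. The failpoint lives in the operator process, so whichever reconcile reaches it first triggers the panic, regardless of which `EtcdCluster` it was working on.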
| 79 | + |
| 80 | +## Why etcd gets a 50m CPU request |
| 81 | + |
| 82 | +The operator gives the etcd container a default **50m CPU *request*** (`--etcd-cpu-request`, |
| 83 | +default `50m`; set `0` for BestEffort). Rationale: |
| 84 | + |
| 85 | +- A BestEffort pod gets cgroup **`cpu.shares=2`** — the floor. Under CPU contention the kernel |
| 86 | + hands it almost no weight, which during a multi-member bring-up means missed heartbeats → |
| 87 | + **spurious leader elections / election churn** and stretched time-to-healthy. |
| 88 | +- A 50m request lifts shares to **~51** (50m/1000m × 1024) — a **~25× weight improvement** — and |
| 89 | + also gives the scheduler a real request to place against (a scheduling floor). QoS goes |
| 90 | + BestEffort → Burstable. |
| 91 | +- It is a **request, not a limit**, so there is no CFS quota and **no throttling** (confirmed: |
| 92 | + `nr_throttled=0` across all runs). It costs nothing when CPU is free. |
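
The shares arithmetic in the bullets above can be checked with a small sketch. It assumes the usual Kubernetes conversion of 1024 shares per full CPU with a floor of 2; the function and constant names are illustrative. It models only the shares value quoted here; on a cgroup v2 node the runtime further translates shares into `cpu.weight`, which is not modelled.

```python
MIN_SHARES = 2          # floor applied to pods with no CPU request (BestEffort)
SHARES_PER_CPU = 1024   # shares granted per full CPU requested
MILLI_PER_CPU = 1000


def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU request in millicores to a cpu.shares value."""
    if milli_cpu <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // MILLI_PER_CPU)


best_effort = milli_cpu_to_shares(0)
burstable = milli_cpu_to_shares(50)
print(best_effort, burstable, burstable / best_effort)
```

Shares are relative weights, so the ratio only matters while runnable tasks compete for the same cores; with idle cores both settings get as much CPU as they ask for.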
| 93 | + |
| 94 | +**Honest caveat:** the local 10-core run **could not demonstrate this benefit**. With 10 cores |
| 95 | +idle there is no contention, so the 50m-vs-BestEffort A/B at W4 came out as *noise* — BestEffort |
| 96 | +even finished marginally faster (median time-to-healthy 115 s vs 121 s) and showed cleaner elections. That |
| 97 | +does **not** disprove the request: its sole benefit (a higher shares/scheduling floor) |
| 98 | +only manifests when cores are oversubscribed, which this VM never was. To prove it, re-run |
| 99 | +**core-constrained** — pin the kind node to 2 cores (`docker update --cpus=2`) or use a 2-vCPU |
| 100 | +runner — and repeat the 50m-vs-BestEffort A/B at W2–W4. Under that constraint the BestEffort |
| 101 | +`cpu.shares=2` members should show measurably longer time-to-healthy and/or spurious leader |
| 102 | +changes. Until that run exists, the 50m default rests on the cgroup-shares argument plus the |
| 103 | +spinup-cost measurements here, not on a reproduced contention burst. |