| 1 | +--- |
| 2 | +title: "Scheduler Stress Testing" |
| 3 | +nav_order: 102 |
| 4 | +parent: Reference |
| 5 | +layout: default |
| 6 | +linkTitle: "Scheduler Stress Testing" |
| 7 | +date: 2026-06-12 |
| 8 | +description: > |
| 9 | + How to run, tune, and interpret the Rust scheduler's booking and accounting |
| 10 | + stress suite, locally and in CI |
| 11 | +--- |
| 12 | + |
| 13 | +# Scheduler Stress Testing |
| 14 | + |
| 15 | +### Running and interpreting the booking + accounting stress suite |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Overview |
| 20 | + |
| 21 | +The stress suite (`rust/crates/scheduler/tests/stress_tests.rs`) exercises the |
| 22 | +[Rust scheduler](/docs/developer-guide/scheduler/)'s full production dispatch |
| 23 | +path at scale — `pipeline::run` end to end: Redis accounting bootstrap → |
| 24 | +cluster feed → pending-job query → host matching → dispatch (proc insert, host |
| 25 | +ledger decrement, frame start) — against a deterministic, bulk-seeded farm. |
| 26 | + |
| 27 | +It is both a **correctness gate** and a **benchmark harness**: |
| 28 | + |
| 29 | +- **Correctness**: after each phase an audit cross-checks every Redis `acct:*` |
| 30 | + hash the run touched against `SUM(proc)` in Postgres (the canonical record — |
| 31 | + see the [Redis-Backed Accounting Reference](/docs/developer-guide/redis-accounting/)), |
| 32 | + and verifies cap enforcement and ledger invariants. |
| 33 | +- **Benchmark**: it reports booking throughput (frames/s over the active |
| 34 | + booking window), host-matching efficiency (wasted attempt %), host-cache hit |
| 35 | + ratio, and Redis Lua op counts. |
| 36 | + |
| 37 | +The suite runs two phases in one process: |
| 38 | + |
| 39 | +| Phase | Shape | What it proves | |
| 40 | +|---|---|---| |
| 41 | +| **drain** | Farm capacity comfortably exceeds demand (default: 1,200 hosts, 6,000 frames) | ≥90% of frames book; throughput measured; accounting stays exact under concurrency, including the force-rollback compensation path | |
| 42 | +| **saturation** | Demand vastly exceeds tight subscription bursts and per-job core caps (default: 400 hosts, 3,000 frames, 150-core bursts) | The Redis Lua cap check is the binding constraint: bookings stop exactly at burst, caps are never breached, rejections flow through the hot path | |
| 43 | + |
| 44 | +### Invariants the audit asserts |
| 45 | + |
| 46 | +1. Every `acct:{sub,folder,job,layer,point}` hash holds exactly |
| 47 | + `SUM(proc.int_cores_reserved)/100` cores and `SUM(proc.int_gpus_reserved)` |
| 48 | + GPUs for its grouping — the same 5-dimension grouping and centicore→core |
| 49 | + conversion the recompute loop uses. The suite pushes the recompute and |
| 50 | + limit-reseed loops out to a 1-hour interval, so agreement here proves the |
| 51 | + *dispatch hot path alone* (Lua book + force-rollback) kept Redis exact — |
| 52 | + reconciliation never got a chance to paper over drift. |
| 53 | +2. Jobs with no bookings have no leaked Redis counters. |
| 54 | +3. Per-(show, alloc) booked cores never exceed the subscription burst. |
| 55 | +4. Per-job booked cores never exceed `job_resource.int_max_cores`. |
| 56 | +5. Host ledger: `int_cores - int_cores_idle == SUM(proc)` per host, never negative. |
| 57 | +6. One `RUNNING` frame per proc row. |
| 58 | +7. Trigger-maintained `job_stat.int_waiting_count` matches the frame table. |
| 59 | +8. After teardown, zero `stress_%` rows remain in any table the suite touches. |
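
The arithmetic behind invariants 1, 3, 4, and 5 can be sketched as a few pure functions. This is an illustrative sketch, not the suite's audit code: `HostLedger`, `expected_redis_cores`, `within_cap`, and `host_ledger_ok` are hypothetical names, and reading "never negative" as a bound on `int_cores_idle` is an assumption.

```rust
/// One host's ledger columns plus the summed reservations of its procs,
/// all in centicores. Hypothetical struct, not a type from the suite.
pub struct HostLedger {
    pub int_cores: i64,
    pub int_cores_idle: i64,
    pub proc_cores_reserved_sum: i64,
}

/// Invariant 1: Postgres stores centicores, the Redis hashes hold whole
/// cores, so the expected Redis value is SUM(int_cores_reserved) / 100.
pub fn expected_redis_cores(sum_int_cores_reserved: i64) -> i64 {
    sum_int_cores_reserved / 100
}

/// Invariants 3 and 4: booked cores never exceed the cap (subscription
/// burst or per-job max). Both arguments must use the same unit.
pub fn within_cap(booked: i64, cap: i64) -> bool {
    booked <= cap
}

/// Invariant 5: the cores missing from the idle pool equal what the
/// host's procs reserve. Assumption: "never negative" is read here as
/// 0 <= int_cores_idle <= int_cores.
pub fn host_ledger_ok(h: &HostLedger) -> bool {
    h.int_cores_idle >= 0
        && h.int_cores_idle <= h.int_cores
        && h.int_cores - h.int_cores_idle == h.proc_cores_reserved_sum
}
```

For example, a grouping whose procs reserve 2,400 centicores in total should show 24 cores in its `acct:*` hash, and a host with `int_cores = 800` and `int_cores_idle = 300` should have procs summing to 500 centicores.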
| 60 | + |
| 61 | +## Running locally |
| 62 | + |
| 63 | +### Prerequisites |
| 64 | + |
| 65 | +- A migrated Postgres on `localhost:5432` (`cuebot` / `cuebot_password`). |
| 66 | + From the repo root: `docker compose up -d flyway` (brings up `db` and applies |
| 67 | + migrations). If the Flyway image won't build in your environment (e.g. |
| 68 | + SSL-inspecting proxies break its package mirrors), apply the migrations |
| 69 | + directly — they are plain SQL: |
| 70 | + |
| 71 | + ```bash |
| 72 | + cd cuebot/src/main/resources/conf/ddl/postgres/migrations |
| 73 | + for f in $(ls V*.sql | sort -t V -k2 -n); do |
| 74 | + docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f" |
| 75 | + done |
| 76 | + ``` |
| 77 | + |
| 78 | +- A running Docker daemon. The suite starts its own throwaway Redis container |
| 79 | + via testcontainers; all accounting state dies with it. |
| 80 | + |
| 81 | +### Run |
| 82 | + |
| 83 | +```bash |
| 84 | +cd rust |
| 85 | +cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture |
| 86 | +``` |
| 87 | + |
| 88 | +For meaningful benchmark numbers, use a release build: |
| 89 | + |
| 90 | +```bash |
| 91 | +cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture |
| 92 | +``` |
| 93 | + |
| 94 | +### Tuning |
| 95 | + |
| 96 | +| Env var | Default | Meaning | |
| 97 | +|---|---|---| |
| 98 | +| `STRESS_JOBS` | 300 | drain-phase job count | |
| 99 | +| `STRESS_LAYERS` | 4 | drain-phase layers per job | |
| 100 | +| `STRESS_FRAMES_PER_LAYER` | 5 | drain-phase frames per layer | |
| 101 | +| `STRESS_HOSTS` | 1200 | drain-phase host count | |
| 102 | +| `STRESS_TAGS` | 8 | drain-phase manual tag count | |
| 103 | +| `STRESS_SAT_JOBS` | 150 | saturation-phase job count | |
| 104 | +| `STRESS_SAT_HOSTS` | 400 | saturation-phase host count | |
| 105 | +| `STRESS_DRAIN_TARGET` | 0.9 | fraction of drain frames that must book | |
| 106 | +| `STRESS_STALL_SECS` | 30 | watchdog: pause jobs after this long without a new booking | |
| 107 | +| `STRESS_TIMEOUT_SECS` | 600 | watchdog: per-phase hard timeout | |
| 108 | + |
| 109 | +Seeding is deterministic for a given scale (fixed RNG seed), so consecutive |
| 110 | +runs at the same scale book the same workload — diffs in throughput between |
| 111 | +runs reflect the code, not the data. |
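
As a rough sketch of how such variables could be read, the helper below parses an environment variable and falls back to a default. The names `parse_or`, `env_or`, and `DrainScale` are hypothetical, and the fallback on malformed input is an assumption (the suite may instead fail loudly). The frame total assumes the drain workload is jobs × layers × frames per layer, which matches the defaults: 300 × 4 × 5 = 6,000.

```rust
use std::env;
use std::str::FromStr;

/// Parse an optional raw value; use `default` when it is unset or malformed.
pub fn parse_or<T: FromStr>(raw: Option<&str>, default: T) -> T {
    raw.and_then(|s| s.trim().parse::<T>().ok()).unwrap_or(default)
}

/// Read a tuning variable such as STRESS_JOBS from the environment.
pub fn env_or<T: FromStr>(name: &str, default: T) -> T {
    parse_or(env::var(name).ok().as_deref(), default)
}

/// Drain-phase scale knobs (hypothetical struct, defaults from the table).
pub struct DrainScale {
    pub jobs: u64,
    pub layers: u64,
    pub frames_per_layer: u64,
}

impl DrainScale {
    pub fn from_env() -> Self {
        DrainScale {
            jobs: env_or("STRESS_JOBS", 300),
            layers: env_or("STRESS_LAYERS", 4),
            frames_per_layer: env_or("STRESS_FRAMES_PER_LAYER", 5),
        }
    }

    /// Assumed relation: every job has `layers` layers, each with
    /// `frames_per_layer` frames.
    pub fn total_frames(&self) -> u64 {
        self.jobs * self.layers * self.frames_per_layer
    }
}
```

Under that assumption, doubling `STRESS_JOBS` to 600 while leaving the other knobs at their defaults would seed 12,000 drain frames, so `STRESS_HOSTS` likely needs to grow with it to keep capacity above demand.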
| 112 | + |
| 113 | +### Reading the report |
| 114 | + |
| 115 | +``` |
| 116 | +================ phase: drain ================ |
| 117 | +frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12 |
| 118 | +throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s) |
| 119 | +matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98% |
| 120 | +accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...] |
| 121 | +audit : OK |
| 122 | +``` |
| 123 | + |
| 124 | +- **throughput** is measured from the first to the last `proc.ts_booked`, so it |
| 125 | + excludes the post-drain shutdown tail of the feed (the `wall` figure includes it). |
| 126 | +- A **redis lua ops** count above the dispatch count means the compensation
| 127 | + path ran: each failed dispatch costs one book plus one force-rollback. An
| 128 | + audit that passes despite the surplus is a *positive* signal: the rollbacks netted out.
| 129 | +- In the saturation phase, expect large `subscription=` rejection counts and |
| 130 | + every subscription pinned at exactly `burst/burst` cores. |
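
The figures in the sample report can be cross-checked with simple arithmetic. The functions below are a hypothetical sketch, not the suite's reporting code, and they take the booking window as seconds rather than reading `proc.ts_booked` directly.

```rust
/// Frames per second over the booking window (first to last booking).
/// Returns None for an empty or zero-length window.
pub fn throughput(dispatched: u64, window_secs: f64) -> Option<f64> {
    if window_secs > 0.0 {
        Some(dispatched as f64 / window_secs)
    } else {
        None
    }
}

/// Share of seeded frames that were dispatched, as a percentage.
pub fn dispatched_pct(dispatched: u64, seeded: u64) -> f64 {
    100.0 * dispatched as f64 / seeded as f64
}

/// Lua ops beyond one per successful dispatch. A non-zero value means
/// extra Redis work happened, such as book plus force-rollback pairs.
pub fn surplus_lua_ops(lua_ops: u64, dispatches: u64) -> u64 {
    lua_ops.saturating_sub(dispatches)
}
```

With the sample numbers, `dispatched_pct(5988, 6000)` gives 99.8 and `surplus_lua_ops(7452, 5988)` gives 1464. Recomputing throughput from the printed 6.1 s window yields roughly 981.6 frames/s rather than the reported 975.1, because the printed window is rounded to one decimal place.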
| 131 | + |
| 132 | +### Cleanup guarantees |
| 133 | + |
| 134 | +Every database row the suite creates carries a `stress_`-prefixed identifier. The suite sweeps
| 135 | +that prefix **before** seeding (so leftovers from a crashed earlier run never |
| 136 | +skew results) and **after** the run, then asserts zero residue. Redis state |
| 137 | +needs no cleanup — the container is destroyed with the test. If a run is |
| 138 | +killed hard (e.g. SIGKILL mid-phase), the next run's pre-sweep removes the |
| 139 | +leftovers. |
| 140 | + |
| 141 | +## CI integration |
| 142 | + |
| 143 | +The suite runs in the |
| 144 | +[`scheduler-stress-pipeline.yml`](https://github.com/AcademySoftwareFoundation/OpenCue/blob/master/.github/workflows/scheduler-stress-pipeline.yml) |
| 145 | +workflow. |
| 146 | + |
| 147 | +### When it runs |
| 148 | + |
| 149 | +| Trigger | Scale | Purpose | |
| 150 | +|---|---|---| |
| 151 | +| Pull request touching `rust/crates/scheduler/**`, `rust/crates/opencue-proto/**`, `rust/Cargo.toml`, the Postgres migrations, or the workflow itself | defaults | Gate scheduler changes on booking/accounting correctness | |
| 152 | +| Nightly (cron, master) | defaults | Catch drift from changes outside the paths filter; daily throughput data point | |
| 153 | +| Manual (`workflow_dispatch`) | custom via inputs | Benchmark a branch at chosen scale | |
| 154 | + |
| 155 | +### When it deliberately does not run |
| 156 | + |
| 157 | +- **PRs that don't touch the scheduler or schema** (Python, CueGUI, CueWeb, |
| 158 | + docs, …). The suite needs a migrated Postgres, a Docker daemon, and several |
| 159 | + minutes of runner time; for those changes it produces zero signal. |
| 160 | +- **As a performance gate.** Shared CI runners have noisy CPU/IO, so the |
| 161 | + workflow never asserts on frames/s — throughput is published in the job's |
| 162 | + step summary (and the full log as an artifact) for humans to eyeball trends. |
| 163 | + Benchmark conclusions should come from local release-mode runs on quiet |
| 164 | + hardware. |
| 165 | + |
| 166 | +### What fails the job |
| 167 | + |
| 168 | +Only correctness regressions: accounting drift between Redis and Postgres, a |
| 169 | +cap breach, booking liveness failures (drain below target, or a saturated farm |
| 170 | +producing no Redis rejections), a phase that never converges (hard timeout),
| 171 | +or test data left behind after cleanup. |
| 172 | + |
| 173 | +### Launching a manual benchmark run |
| 174 | + |
| 175 | +GitHub → Actions → *OpenCue Scheduler Stress Pipeline* → *Run workflow*, then |
| 176 | +optionally override the job/host/frame counts and timeout. Results appear in |
| 177 | +the run's step summary; the complete log is attached as the |
| 178 | +`scheduler-stress-output` artifact (kept 30 days). |
| 179 | + |
| 180 | +## Scope and limitations |
| 181 | + |
| 182 | +- **RQD is not exercised.** The suite runs in `dry_run_mode`: the full booking |
| 183 | + path executes (Redis Lua, proc insert, host ledger, frame start) but no gRPC |
| 184 | + launch is sent. Frame *completion* and the Cuebot release path are out of |
| 185 | + scope — see the [Redis-Backed Accounting Reference](/docs/developer-guide/redis-accounting/) |
| 186 | + for how releases are reconciled. |
| 187 | +- **Only scheduler-managed shows** (`show.b_scheduler_managed = true`) are |
| 188 | + covered; Cuebot-managed accounting is Cuebot's test territory. |
| 189 | +- The recompute / limit-reseed loops are intentionally dormant during the run |
| 190 | + (see invariant 1); their CAS semantics are covered separately by |
| 191 | + `tests/redis_integration.rs` (`--features redis-tests`). |
| 192 | + |
| 193 | +## Schema gotchas the suite encodes |
| 194 | + |
| 195 | +These bit us during development and are asserted or documented in the test code.
| 196 | +Keep them in mind when extending the seeding:
| 197 | + |
| 198 | +- `alloc.str_tag` is `VARCHAR(24)` and `host.str_name` is `VARCHAR(30)`: |
| 199 | + generated names must stay short. |
| 200 | +- The pending-job query `INNER JOIN`s `folder_resource`: a folder without that |
| 201 | + row makes every job in it silently unbookable. |
| 202 | +- The `vs_waiting` view requires `job_resource.int_max_cores - int_cores >= 100` |
| 203 | + (centicores): `int_max_cores = 0` does **not** mean "unlimited" on the query |
| 204 | + path — use a large value instead. |
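
Two of these gotchas reduce to checks that seeding code could run before inserting anything. The sketch below is illustrative only: `fits_varchar`, `visible_in_vs_waiting`, and the `stress_h00042`-style name format are hypothetical, and the predicate covers just the one `vs_waiting` condition described above, not the whole view.

```rust
/// Column widths quoted above.
pub const ALLOC_TAG_MAX: usize = 24; // alloc.str_tag VARCHAR(24)
pub const HOST_NAME_MAX: usize = 30; // host.str_name VARCHAR(30)

/// Postgres VARCHAR(n) limits characters, not bytes, so count chars.
pub fn fits_varchar(value: &str, max_chars: usize) -> bool {
    value.chars().count() <= max_chars
}

/// The single vs_waiting condition from the list above, in centicores:
/// at least one whole core of headroom under the job's cap.
pub fn visible_in_vs_waiting(int_max_cores: i64, int_cores: i64) -> bool {
    int_max_cores - int_cores >= 100
}

/// Hypothetical short host name that keeps the `stress_` prefix.
pub fn stress_host_name(index: u32) -> String {
    format!("stress_h{index:05}")
}
```

Here `visible_in_vs_waiting(0, 0)` is false, which is the trap described above: a zero cap hides the job instead of lifting the limit, so seeding should use a large `int_max_cores` value instead.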