Feat/warmup fix and reliability stats #53
Merged
The online and burst scenarios read `online_warmup_runs` from suite.json
but the value was never applied — every reported TTFT p99 was contaminated
by cold-engine spikes (JIT compile, CUDA-graph allocation, KV cache
priming) at the start of the timed phase. This made the first QPS level's
p99 unreliable and any submission's relative-burst ratio noisy.
Changes
- loadgen/loadgen.py: new `_warmup_requests` helper used by
`_run_online_async` and `_run_burst_async`. Fires N dummy requests
sequentially before the timed phase; results are discarded and
warmup-time exceptions are swallowed (logged via tqdm.write) so a flaky
engine cannot block a submission.
- suites/suite_{A,B,D,E,F,G}/suite.json: replace dead `online_warmup_runs: 0`
with `online_warmup_requests: 10`. Add `burst_warmup_requests: 10`
to suite_A and suite_B (the two suites that include the burst scenario).
- schema/suite.schema.json: declare the new properties with descriptions.
`online_warmup_runs` kept as deprecated alias to avoid breaking any
third-party suite still carrying it.
- DEVELOPMENT.md: mirror the new field names in the suite template.
- loadgen/tests/test_warmup.py: regression coverage. Asserts (a) warmup
fires the configured count of dummy calls, (b) fast warmup latencies
do NOT leak into recorded p50/p99 distributions, (c) zero warmup is a
no-op, (d) a failing warmup request does not abort the timed phase.
Tested locally with `pytest loadgen/tests -q` (8 passed).
Co-authored-by: Cursor <cursoragent@cursor.com>
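The warmup behaviour described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the actual `_warmup_requests` code: the name `warmup_requests`, the `send` callable, and the returned failure count are hypothetical, and `print` stands in for the `tqdm.write` logging so the block needs only the standard library.

```python
import asyncio
from typing import Awaitable, Callable


async def warmup_requests(send: Callable[[], Awaitable[object]], n: int) -> int:
    """Fire n dummy requests one after another before the timed phase.

    Results are discarded and exceptions are swallowed, so a flaky engine
    cannot block the run. Returns how many warmup requests failed
    (hypothetical return value, for illustration only).
    """
    failures = 0
    for i in range(max(n, 0)):  # n <= 0 is a no-op
        try:
            await send()  # result intentionally ignored
        except Exception as exc:  # warmup errors must never abort the run
            failures += 1
            print(f"[warmup] request {i} failed, ignoring: {exc!r}")
    return failures


async def _demo() -> None:
    async def fake_send() -> str:
        await asyncio.sleep(0)  # stand-in for a real engine call
        return "ok"

    failed = await warmup_requests(fake_send, 10)
    print(f"warmup done, {failed} failures; timed phase would start here")


if __name__ == "__main__":
    asyncio.run(_demo())
```

Because the warmup calls never touch the list of recorded latencies, their timings cannot leak into the reported p50/p99 values.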
Adds inter-run variance metrics so leaderboard visitors can judge how reproducible each submission is, plus an opt-in vendor_details field for environment data that does not fit any cross-vendor schema.
Loadgen
- New helpers `_cv_pct`, `_stability_label`, `_reliability_block`, `_compute_recovery_time` (with regression tests).
- offline: emit `throughput_tokens_per_sec_reliability` per concurrency level: a list of per-run throughputs + CV + stability label.
- online: emit `ttft_ms_p99_reliability` per QPS: per-run TTFT p99s computed independently (the headline pooled p99 is unchanged).
- interactive: emit `ttft_ms_p99_reliability` across `num_runs`.
- sustained: emit `throughput_post_warmup_reliability` (CV of sample intervals after warmup). Complements the existing `throttle_ratio`, which is a min/max metric and so blind to intermittent jitter.
- burst: emit `recovery_time_seconds` and a per-cycle list. Defined as the median time within a post-burst steady window for rolling TTFT p99 to fall back to ≤ 1.5× the long-term steady baseline.
- Migrate the `_run_online`, `_run_burst`, `_run_interactive` sync wrappers to `asyncio.run(...)`. `get_event_loop().run_until_complete(...)` was leaking closed loops across pytest runs, blocking the new tests.
Schema
- `schema/env.schema.json`: add optional `vendor_details` field (`additionalProperties: true` inside), the documented escape hatch for vendor-specific environment data that does not unify across platforms (NVML clocks, ROCm-SMI counters, etc).
Leaderboard generator
- `extract_viz` propagates the new reliability blocks (and burst's recovery time) to each per-suite viz dict. Offline reliability is passed as a parallel array indexed by concurrency level.
- `extract_details` propagates `env.vendor_details` to the row as `env_vendor_details` for flat rendering in the modal.
- Add `from __future__ import annotations` for Py 3.9 compatibility (was using `dict | None` in type hints).
Frontend modal
- New "Reliability" section in the Details tab. Shows worst-case CV per scenario, with stability badge and recovery time for burst.
- New "Vendor-specific environment" section that flattens vendor_details into key→value rows, hiding null/empty entries. No cross-vendor unification attempted.
- Small reliability pill in the modal subtitle showing the worst CV across scenarios. The pill is clickable, so users can drill into the new section to see the per-scenario breakdown. Older results without reliability blocks render exactly as before (pill and section both hide silently).
- CSS for the pill follows the existing `--good/--warn/--bad` tokens.
Docs
- DEVELOPMENT.md: document the warmup contract per scenario and the reliability block shape + stability thresholds.
Tests
- `loadgen/tests/test_reliability.py`: unit tests for the helpers and one integration test per scenario verifying the block shows up and is internally consistent (n equal to `num_runs`, stability label matches CV threshold). 21 loadgen tests pass.
Backward compatibility
- New result fields nest into existing `additionalProperties: true` blocks in `result.schema.json`; no schema bump needed.
- Existing results without reliability blocks render unchanged: the modal pill and Reliability section both gate on a numeric `cv_pct` and silently skip when absent. Older `result.json` files validate identically.
Co-authored-by: Cursor <cursoragent@cursor.com>
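One possible reading of the burst `recovery_time_seconds` definition is sketched below. It is not the `_compute_recovery_time` implementation: the function names, the input shape (one time-ordered list of `(seconds_since_burst_end, rolling_ttft_p99_ms)` pairs per burst cycle), and the handling of cycles that never recover are all assumptions made for illustration.

```python
import statistics
from typing import List, Optional, Sequence, Tuple

# Hypothetical input shape: (seconds since burst end, rolling TTFT p99 in ms)
Point = Tuple[float, float]


def cycle_recovery_time(
    rolling_p99: Sequence[Point], baseline_ms: float, factor: float = 1.5
) -> Optional[float]:
    """First time in the post-burst window at which the rolling p99 is back
    at or below factor x the steady baseline; None if it never recovers."""
    limit = factor * baseline_ms
    for seconds, p99_ms in rolling_p99:
        if p99_ms <= limit:
            return seconds
    return None


def recovery_time_seconds(
    cycles: Sequence[Sequence[Point]], baseline_ms: float, factor: float = 1.5
) -> Tuple[Optional[float], List[Optional[float]]]:
    """Median recovery time across cycles, plus the per-cycle list.

    Cycles that never recover are kept as None in the list and left out of
    the median (an assumption; the source does not say how they are treated).
    """
    per_cycle = [cycle_recovery_time(c, baseline_ms, factor) for c in cycles]
    recovered = [t for t in per_cycle if t is not None]
    median = statistics.median(recovered) if recovered else None
    return median, per_cycle
```

Reporting the per-cycle list next to the median keeps a single slow cycle visible instead of letting the median hide it.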
Populates the new throughput_post_warmup_reliability block on every sustained scenario in pre-existing result.json files so the leaderboard's "Reliability" panel and subtitle pill have data to display for historical submissions. New runs going forward emit this field directly from loadgen.
What got modified
- 255 result.json files (suite-level + per-scenario sustained/result.json pairs across ~127 unique submissions)
- Net change: one ~30-line block per file, no existing fields touched
- Encoding preserved per-file: ascii-only files keep their \u escape style; files containing UTF-8 characters (Ascend submissions etc.) keep them unescaped
How it was computed
- For each sustained scenario, take the existing per-interval samples array under metrics.sustained.samples[], drop is_warmup samples, then compute mean / std / CV / stability over throughput_tokens_per_sec.
- Thresholds: CV ≤ 2% → stable, ≤ 5% → noisy, > 5% → unstable (kept in sync with loadgen.loadgen). Tunable later if the observed distribution skews too heavily into one bucket.
What could not be backfilled
- offline / online / interactive / burst: per-run breakdowns were never persisted to samples.jsonl or result.json, so historical reliability cannot be recovered. The frontend silently hides the badge for these scenarios on old results.
The one-shot backfill script used here was not committed; it lives in local git history if it's ever needed again (see this commit's parent hash if you need to recover it). For new sustained results, loadgen now emits the reliability block natively, so the script will not be re-invoked under normal operation.
Co-authored-by: Cursor <cursoragent@cursor.com>
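The backfill computation described in this commit can be approximated as follows. This is a reconstruction under assumptions, not the uncommitted script: the function names and the exact block keys are hypothetical, the thresholds are passed in as the ≤2% / ≤5% values stated in this commit, and population standard deviation is used because the message does not say which estimator was applied.

```python
import statistics
from typing import Any, Dict, List, Optional


def stability_label(cv_pct: float, stable_max: float = 2.0, noisy_max: float = 5.0) -> str:
    """Map a CV percentage to a label using the thresholds from this commit."""
    if cv_pct <= stable_max:
        return "stable"
    if cv_pct <= noisy_max:
        return "noisy"
    return "unstable"


def reliability_from_samples(samples: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Build a reliability block from per-interval sustained samples.

    Warmup samples are dropped first. Returns None when fewer than two
    post-warmup samples remain or the mean is zero, since CV is undefined.
    """
    runs = [
        float(s["throughput_tokens_per_sec"])
        for s in samples
        if not s.get("is_warmup")
    ]
    if len(runs) < 2:
        return None
    mean = statistics.fmean(runs)
    if mean == 0:
        return None
    std = statistics.pstdev(runs)  # assumption: population std
    cv_pct = std / mean * 100.0
    return {
        "n": len(runs),
        "runs": runs,
        "mean": mean,
        "std": std,
        "cv_pct": cv_pct,
        "stability": stability_label(cv_pct),
    }
```

Returning `None` for too-short sample arrays matches the way the frontend is described as gating on a numeric `cv_pct`: no block, no badge.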
Tightening the labels after looking at real backfilled data. The initial ≤2%/≤5%/>5% thresholds labelled the literal median submission "noisy" and slapped ~30% of submissions with an "unstable ✗" badge, which read as a verdict on the submitter rather than an informational note about the hardware × workload pair.
Empirical distribution from the May-2026 backfill (255 sustained CVs):
median = 3.10 %, p90 = 13.07 %, max = 36.18 %
What changed
- Thresholds: ≤3% stable / ≤8% noisy / >8% high-variance (loadgen/loadgen.py `_STABILITY_THRESHOLD_*`).
- Renamed the third tier "unstable" → "high-variance" everywhere (label string, modal pill class, docs). High CV does not mean the measurement is wrong; it means the hardware × workload combo has irreducible jitter (consumer-card thermal throttle, HCCL noise on ×16 Ascend topologies, speculative-decoding acceptance-rate jitter).
- Dropped the ✗ glyph for the high-variance tier; only stable / noisy retain ✓ / ⚠. The CSS pill uses an orange tone, never pure red, so readers read "look closer" rather than "this is broken".
- DEVELOPMENT.md explains the rebrand: high-variance submitters do not need to re-run; the badge sizes safety margins for downstream hardware shoppers.
Resulting distribution (new thresholds):
stable        : 47.8% (n=122)
noisy         : 35.3% (n=90)
high-variance : 16.9% (n=43)
Tail check: every chip in the 13 worst-CV submissions is a legitimate flag: RTX 5090 / A6000 / RTX 6000 Ada / V100s (consumer/workstation cards lacking datacenter cooling), Ascend ×16 / ×8 distributed (real HCCL jitter), and the H20-3e (lower thermal headroom variant).
Data update
- Re-labelled 136 of the 255 backfilled sustained reliability blocks in place. Only the `stability` string moved; `cv_pct`, `mean`, `std`, and `runs[]` arrays are byte-identical, so the diff per file is a single line.
Tests
- loadgen/tests/test_reliability.py: updated boundary expectations and membership sets to track the new labels. 21/21 still pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
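The re-labelling step in this commit amounts to recomputing one string from an unchanged `cv_pct`. A minimal sketch follows, assuming inclusive upper bounds at 3% and 8%; the constant and function names here are hypothetical and are not the `_STABILITY_THRESHOLD_*` definitions in loadgen/loadgen.py.

```python
from typing import Any, Dict

# Hypothetical names; values are the thresholds stated in this commit.
STABLE_MAX_PCT = 3.0
NOISY_MAX_PCT = 8.0


def stability_label(cv_pct: float) -> str:
    """<=3% stable, <=8% noisy, otherwise high-variance."""
    if cv_pct <= STABLE_MAX_PCT:
        return "stable"
    if cv_pct <= NOISY_MAX_PCT:
        return "noisy"
    return "high-variance"


def relabel(block: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a reliability block with only `stability` recomputed.

    cv_pct, mean, std and runs are carried over untouched, mirroring the
    single-line-per-file diff described above.
    """
    updated = dict(block)
    updated["stability"] = stability_label(block["cv_pct"])
    return updated
```

Deriving the label purely from the stored `cv_pct` is what makes the data update cheap: no samples need to be re-read to move a block between tiers.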
The reliability badge in the modal subtitle showed only `cv X.X%` with a plain `title` tooltip, which most readers never noticed and could not decode (what does "cv" mean? what's a good number? is high-variance bad?).
Three layered fixes give every user type a path to the answer:
1. The pill is now a `<button>` with `cursor: help` and an inline ⓘ glyph, signalling that it is interactive. The native title text is reworded to spell out "Coefficient of variation across runs" and the three thresholds (≤3% / ≤8% / >8%), and ends with "Click for details".
2. Clicking the pill (or activating it via keyboard) scrolls the Details tab to a new Reliability anchor and briefly flashes the section, switching tabs first if needed.
3. The Reliability section in Details gains an inline caption block that explains CV as `std / mean × 100%`, lists the three thresholds with colour-coded legend chips, and clarifies that "high-variance" is informational (natural jitter, not a measurement error). This is always visible, with no hover required.
The detail-section helper grows two optional opts (anchor, caption) so future sections can reuse the same pattern without per-call HTML.
Co-authored-by: Cursor <cursoragent@cursor.com>
✅ AccelMark Validation: All submissions valid. See the workflow run for details.
Summary
Type of change
Testing
# Commands used to verify
Checklist
- `result.json` files (or I have explained the migration path)
- `BenchmarkRunner`, produces valid `result.json`, includes a reference result
- `validate_submission.py` updated and all existing results still validate
- `leaderboard/generate.py` produces correct output on existing results
Related issues