
Feat/warmup fix and reliability stats #53

Merged
JuhaoLiang1997 merged 5 commits into main from feat/warmup-fix-and-reliability-stats
May 19, 2026

Conversation

@JuhaoLiang1997
Collaborator

Summary

Type of change

  • New platform support
  • Bug fix (runner, validator, leaderboard, or tooling)
  • Suite definition change
  • Schema change
  • Leaderboard / UI improvement
  • Documentation
  • Other:

Testing

# Commands used to verify

Checklist

  • I have read CONTRIBUTING.md
  • My change does not break existing result.json files (or I have explained the migration path)
  • If adding a new platform: runner inherits from BenchmarkRunner, produces valid result.json, includes a reference result
  • If changing the schema: validate_submission.py updated and all existing results still validate
  • If changing the leaderboard generator: leaderboard/generate.py produces correct output on existing results
  • I have updated relevant documentation

Related issues

JuhaoLiang1997 and others added 5 commits May 19, 2026 16:32
The online and burst scenarios read `online_warmup_runs` from suite.json
but the value was never applied — every reported TTFT p99 was contaminated
by cold-engine spikes (JIT compile, CUDA-graph allocation, KV cache
priming) at the start of the timed phase. This made the first QPS level's
p99 unreliable and any submission's relative-burst ratio noisy.

Changes
- loadgen/loadgen.py: new `_warmup_requests` helper used by
  `_run_online_async` and `_run_burst_async`. Fires N dummy requests
  sequentially before the timed phase; results are discarded and
  warmup-time exceptions are swallowed (logged via tqdm.write) so a flaky
  engine cannot block a submission.
- suites/suite_{A,B,D,E,F,G}/suite.json: replace dead `online_warmup_runs: 0`
  with `online_warmup_requests: 10`. Add `burst_warmup_requests: 10`
  to suite_A and suite_B (the two suites that include the burst scenario).
- schema/suite.schema.json: declare the new properties with descriptions.
  `online_warmup_runs` kept as deprecated alias to avoid breaking any
  third-party suite still carrying it.
- DEVELOPMENT.md: mirror the new field names in the suite template.
- loadgen/tests/test_warmup.py: regression coverage. Asserts (a) warmup
  fires the configured count of dummy calls, (b) fast warmup latencies
  do NOT leak into recorded p50/p99 distributions, (c) zero warmup is a
  no-op, (d) a failing warmup request does not abort the timed phase.
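
The warmup contract described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in loadgen/loadgen.py: the `send_request` callable, the `log` parameter, the return value, and the function name are assumptions. The commit message only states that N dummy requests fire sequentially, that results are discarded, and that exceptions are logged and swallowed.

```python
import asyncio
from typing import Awaitable, Callable


async def warmup_requests(
    send_request: Callable[[], Awaitable[object]],  # hypothetical: issues one dummy request
    count: int,
    log: Callable[[str], None] = print,  # the PR logs via tqdm.write; print stands in here
) -> int:
    """Fire `count` dummy requests one at a time and return how many succeeded."""
    succeeded = 0
    for i in range(max(count, 0)):  # zero (or negative) warmup is a no-op
        try:
            await send_request()  # the result is intentionally discarded
            succeeded += 1
        except Exception as exc:  # a flaky engine must not abort the timed phase
            log(f"warmup request {i + 1}/{count} failed: {exc!r}")
    return succeeded
```

Because the loop awaits each request before sending the next, warmup latencies never overlap with the timed phase, provided the caller starts its timers only after this coroutine returns.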

Tested locally with `pytest loadgen/tests -q` (8 passed).

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds inter-run variance metrics so leaderboard visitors can judge how
reproducible each submission is, plus an opt-in vendor_details field
for environment data that does not fit any cross-vendor schema.

Loadgen
- New helpers `_cv_pct`, `_stability_label`, `_reliability_block`,
  `_compute_recovery_time` (with regression tests).
- offline: emit `throughput_tokens_per_sec_reliability` per concurrency
  level — list of per-run throughputs + CV + stability label.
- online: emit `ttft_ms_p99_reliability` per QPS — per-run TTFT p99s
  computed independently (the headline pooled p99 is unchanged).
- interactive: emit `ttft_ms_p99_reliability` across `num_runs`.
- sustained: emit `throughput_post_warmup_reliability` (CV of sample
  intervals after warmup). Complements the existing `throttle_ratio`,
  which is a min/max metric and so blind to intermittent jitter.
- burst: emit `recovery_time_seconds` and per-cycle list. Defined as
  the median time within a post-burst steady window for rolling TTFT
  p99 to fall back to ≤ 1.5× the long-term steady baseline.
- Migrate `_run_online`, `_run_burst`, `_run_interactive` sync wrappers
  to `asyncio.run(...)`. `get_event_loop().run_until_complete(...)`
  was leaking closed loops across pytest runs, blocking the new tests.
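
The sync-wrapper migration in the last bullet might look like the sketch below. The function names and the `qps` parameter are hypothetical stand-ins; only the switch from `get_event_loop().run_until_complete(...)` to `asyncio.run(...)` comes from the commit message.

```python
import asyncio


async def run_scenario_async(qps: float) -> dict:
    """Hypothetical stand-in for an async scenario body."""
    await asyncio.sleep(0)  # placeholder for real request traffic
    return {"qps": qps}


def run_scenario(qps: float) -> dict:
    """Sync wrapper: asyncio.run creates a fresh event loop and closes it on exit."""
    # Previous pattern, which the PR reports leaked closed loops across pytest runs:
    #   asyncio.get_event_loop().run_until_complete(run_scenario_async(qps))
    return asyncio.run(run_scenario_async(qps))
```

One trade-off worth noting: `asyncio.run` raises `RuntimeError` when called from a thread that already has a running event loop, so wrappers like this are only suitable as top-level sync entry points.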

Schema
- `schema/env.schema.json`: add optional `vendor_details` field
  (`additionalProperties: true` inside), the documented escape hatch
  for vendor-specific environment data that does not unify across
  platforms (NVML clocks, ROCm-SMI counters, etc).

Leaderboard generator
- `extract_viz` propagates the new reliability blocks (and burst's
  recovery time) to each per-suite viz dict. Offline reliability is
  passed as a parallel array indexed by concurrency level.
- `extract_details` propagates `env.vendor_details` to the row as
  `env_vendor_details` for flat rendering in the modal.
- Add `from __future__ import annotations` for Py 3.9 compatibility
  (was using `dict | None` in type hints).

Frontend modal
- New "Reliability" section in the Details tab. Shows worst-case CV per
  scenario, with stability badge and recovery time for burst.
- New "Vendor-specific environment" section that flattens vendor_details
  into key→value rows, hiding null/empty entries. No cross-vendor
  unification attempted.
- Small reliability pill in the modal subtitle showing the worst CV
  across scenarios. The pill is clickable, so users can drill into the
  new section for the per-scenario breakdown. Older results without
  reliability blocks render exactly as before (pill and section both
  hide silently).
- CSS for the pill follows existing `--good/--warn/--bad` tokens.
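
The flattening rule for the "Vendor-specific environment" section can be illustrated as below. The real logic runs in the frontend; this Python sketch only shows the rule itself. The dotted-path and `[index]` key formats and the function name are assumptions, since the commit message says only that `vendor_details` is flattened into key→value rows with null/empty entries hidden.

```python
from typing import Any, Iterator


def flatten_vendor_details(details: Any, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (key, value) rows, skipping None, empty strings, and empty containers."""
    if isinstance(details, dict):
        for key, value in details.items():
            path = f"{prefix}.{key}" if prefix else str(key)  # dotted path is an assumption
            yield from flatten_vendor_details(value, path)
    elif isinstance(details, (list, tuple)):
        for index, value in enumerate(details):
            yield from flatten_vendor_details(value, f"{prefix}[{index}]")
    elif details is not None and details != "":
        yield prefix, str(details)
```

Empty dicts and lists produce no rows because their loops never yield, which matches the "hide null/empty entries" behaviour without a separate check.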

Docs
- DEVELOPMENT.md: document warmup contract per scenario and the
  reliability block shape + stability thresholds.

Tests
- `loadgen/tests/test_reliability.py`: unit-tests for the helpers and
  one integration test per scenario verifying the block shows up and
  is internally consistent (n equal to `num_runs`, stability label
  matches CV threshold). 21 loadgen tests pass.

Backward compatibility
- New result fields nest into existing `additionalProperties: true`
  blocks in `result.schema.json`; no schema bump needed.
- Existing results without reliability blocks render unchanged: the
  modal pill and Reliability section both gate on a numeric `cv_pct`
  and silently skip when absent. Older `result.json` files validate
  identically.
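
The gating described above (render only when a numeric `cv_pct` exists) can be sketched as follows. The function name and the list-of-blocks input shape are assumptions made for illustration; the frontend itself is not written in Python.

```python
import math
from typing import Optional


def worst_cv_pct(reliability_blocks: list) -> Optional[float]:
    """Return the largest numeric cv_pct, or None when no block carries one."""
    values = []
    for block in reliability_blocks:
        cv = block.get("cv_pct") if isinstance(block, dict) else None
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(cv, (int, float)) and not isinstance(cv, bool) and math.isfinite(cv):
            values.append(float(cv))
    return max(values) if values else None
```

A `None` return is the signal to hide both the pill and the Reliability section, which is how older results without reliability blocks would keep rendering unchanged.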

Co-authored-by: Cursor <cursoragent@cursor.com>
Populates the new throughput_post_warmup_reliability block on every
sustained scenario in pre-existing result.json files so the leaderboard's
"Reliability" panel and subtitle pill have data to display for historical
submissions. New runs going forward emit this field directly from loadgen.

What got modified
- 255 result.json files (suite-level + per-scenario sustained/result.json
  pairs across ~127 unique submissions)
- Net change: one ~30-line block per file, no existing fields touched
- Encoding preserved per-file: ascii-only files keep their \u escape
  style; files containing UTF-8 characters (Ascend submissions etc.)
  keep them unescaped
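
One way to preserve each file's escape style is to inspect the raw text before rewriting it. The sketch below is not the uncommitted backfill script; the function name, the `update` argument, and `indent=2` are assumptions.

```python
import json


def rewrite_preserving_escapes(raw_text: str, update: dict) -> str:
    """Merge `update` into a JSON document, keeping its ASCII-vs-UTF-8 style."""
    data = json.loads(raw_text)
    data.update(update)
    # str.isascii() is True when the file used \uXXXX escapes (or contained no
    # non-ASCII characters at all); in that case keep escaping on output.
    return json.dumps(data, ensure_ascii=raw_text.isascii(), indent=2)
```

A file that mixes `\u` escapes with raw UTF-8 characters would come out fully unescaped under this rule, so it suits repositories where each file consistently uses one style.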

How it was computed
- For each sustained scenario, take the existing per-interval samples
  array under metrics.sustained.samples[], drop is_warmup samples, then
  compute mean / std / CV / stability over throughput_tokens_per_sec.
- Thresholds: CV ≤ 2% → stable, ≤ 5% → noisy, > 5% → unstable
  (kept in sync with loadgen.loadgen). Tunable later if the observed
  distribution skews too heavily into one bucket.
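
The computation above can be sketched as follows, using the thresholds and label names stated in this commit as defaults. The use of population standard deviation, the handling of fewer than two samples, and the function name are assumptions; the field names `runs`, `mean`, `std`, `cv_pct`, and `stability` are the ones the PR mentions.

```python
from statistics import mean, pstdev


def sustained_reliability(samples: list, stable_max: float = 2.0, noisy_max: float = 5.0) -> dict:
    """Build a reliability block from per-interval samples, ignoring warmup."""
    runs = [
        float(s["throughput_tokens_per_sec"])
        for s in samples
        if not s.get("is_warmup", False)
    ]
    if len(runs) < 2 or mean(runs) == 0:
        # Too little data for a meaningful CV; returning None fields is an assumption.
        return {"runs": runs, "n": len(runs), "cv_pct": None, "stability": None}
    mu = mean(runs)
    sigma = pstdev(runs)  # population std is an assumption; the PR does not say which
    cv_pct = sigma / mu * 100.0
    if cv_pct <= stable_max:
        label = "stable"
    elif cv_pct <= noisy_max:
        label = "noisy"
    else:
        label = "unstable"
    return {"runs": runs, "n": len(runs), "mean": mu, "std": sigma,
            "cv_pct": cv_pct, "stability": label}
```

For example, post-warmup throughputs of 97 and 103 tokens/s give a mean of 100, a population std of 3, and a CV of about 3%, which lands in the "noisy" bucket under the 2% / 5% thresholds.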

What could not be backfilled
- offline / online / interactive / burst: per-run breakdowns were never
  persisted to samples.jsonl or result.json, so historical reliability
  cannot be recovered. The frontend silently hides the badge for these
  scenarios on old results.

The one-shot backfill script used here was not committed; it survives
only in local git history (see this commit's parent hash if you need to
recover it). For new sustained results, loadgen now emits the
reliability block natively, so the script will not be re-invoked under
normal operation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Tightening the labels after looking at real backfilled data. The initial
≤2%/≤5%/>5% thresholds labelled the literal median submission "noisy"
and slapped ~30% of submissions with an "unstable ✗" badge, which read
as a verdict on the submitter rather than an informational note about
the hardware × workload pair.

Empirical distribution from the May-2026 backfill (255 sustained CVs):
  median = 3.10 %, p90 = 13.07 %, max = 36.18 %

What changed
- Thresholds: ≤3% stable / ≤8% noisy / >8% high-variance
  (loadgen/loadgen.py `_STABILITY_THRESHOLD_*`).
- Renamed the third tier "unstable" → "high-variance" everywhere
  (label string, modal pill class, docs). High CV does not mean the
  measurement is wrong — it means the hardware × workload combo has
  irreducible jitter (consumer-card thermal throttle, HCCL noise on
  ×16 Ascend topologies, speculative-decoding acceptance-rate jitter).
- Dropped the ✗ glyph for the high-variance tier; only stable / noisy
  retain ✓ / ⚠. The CSS pill uses an orange tone, never pure red, so
  it signals "look closer" rather than "this is broken".
- DEVELOPMENT.md explains the rebrand: high-variance submitters do not
  need to re-run; the badge exists to help downstream hardware shoppers
  size their safety margins.

Resulting distribution (new thresholds):
  stable        : 47.8% (n=122)
  noisy         : 35.3% (n=90)
  high-variance : 16.9% (n=43)

Tail check — every chip in the 13 worst-CV submissions is a legitimate
flag: RTX 5090 / A6000 / RTX 6000 Ada / V100s (consumer/workstation
cards lacking datacenter cooling), Ascend ×16 / ×8 distributed (real
HCCL jitter), and the H20-3e (lower thermal headroom variant).

Data update
- Re-labelled 136 of the 255 backfilled sustained reliability blocks
  in place. Only the `stability` string moved; `cv_pct`, `mean`, `std`,
  and `runs[]` arrays are byte-identical, so the diff per file is a
  single line.
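
The in-place relabel can be sketched as below, using the ≤3% / ≤8% thresholds from this commit. The function names are hypothetical (the PR refers only to `_STABILITY_THRESHOLD_*` constants in loadgen/loadgen.py), and treating the bounds as inclusive follows the "≤" in the commit message.

```python
def stability_label(cv_pct: float) -> str:
    """Thresholds from this commit: <=3% stable, <=8% noisy, otherwise high-variance."""
    if cv_pct <= 3.0:
        return "stable"
    if cv_pct <= 8.0:
        return "noisy"
    return "high-variance"


def relabel(block: dict) -> bool:
    """Refresh block['stability'] in place; return True if the label changed."""
    cv = block.get("cv_pct")
    if not isinstance(cv, (int, float)) or isinstance(cv, bool):
        return False  # nothing to relabel without a numeric CV
    new_label = stability_label(cv)
    changed = block.get("stability") != new_label
    block["stability"] = new_label  # every other key is left untouched
    return changed
```

Because only the `stability` key is written, the numeric fields stay identical, which is consistent with the single-line diff per file described above.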

Tests
- loadgen/tests/test_reliability.py: updated boundary expectations and
  membership sets to track the new labels. 21/21 still pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
The reliability badge in the modal subtitle showed only `cv X.X%` with a
plain `title` tooltip, which most readers never noticed and could not
decode (what does "cv" mean? what's a good number? is high-variance
bad?). Three layered fixes give every user type a path to the answer:

1. The pill is now a <button> with `cursor: help` and an inline ⓘ glyph,
   signalling that it is interactive. The native title text is reworded
   to spell out "Coefficient of variation across runs" and the three
   thresholds (≤3% / ≤8% / >8%), and ends with "Click for details".

2. Clicking the pill (or activating it via keyboard) scrolls the Details
   tab to a new Reliability anchor and briefly flashes the section,
   switching tabs first if needed.

3. The Reliability section in Details gains an inline caption block that
   explains CV as `std / mean × 100%`, lists the three thresholds with
   colour-coded legend chips, and clarifies that "high-variance" is
   informational (natural jitter, not a measurement error). This is
   always visible — no hover required.

The detail-section helper grows two optional opts (anchor, caption) so
future sections can reuse the same pattern without per-call HTML.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

✅ AccelMark Validation: All submissions valid

See the workflow run for details.

@JuhaoLiang1997 JuhaoLiang1997 merged commit 5797fc4 into main May 19, 2026
5 checks passed
@JuhaoLiang1997 JuhaoLiang1997 deleted the feat/warmup-fix-and-reliability-stats branch May 19, 2026 10:43