Feat/warmup fix and reliability stats by JuhaoLiang1997 · Pull Request #53 · FreedomIntelligence/AccelMark

JuhaoLiang1997 · 2026-05-19T10:40:33Z

Summary

Type of change

Testing

# Commands used to verify

Checklist

I have read CONTRIBUTING.md
My change does not break existing result.json files (or I have explained the migration path)
If adding a new platform: runner inherits from BenchmarkRunner, produces valid result.json, includes a reference result
If changing the schema: validate_submission.py updated and all existing results still validate
If changing the leaderboard generator: leaderboard/generate.py produces correct output on existing results
I have updated relevant documentation

Related issues

The online and burst scenarios read `online_warmup_runs` from suite.json but the value was never applied, so every reported TTFT p99 was contaminated by cold-engine spikes (JIT compile, CUDA-graph allocation, KV cache priming) at the start of the timed phase. This made the first QPS level's p99 unreliable and any submission's relative-burst ratio noisy.

Changes
- loadgen/loadgen.py: new `_warmup_requests` helper used by `_run_online_async` and `_run_burst_async`. Fires N dummy requests sequentially before the timed phase; results are discarded and warmup-time exceptions are swallowed (logged via tqdm.write) so a flaky engine cannot block a submission.
- suites/suite_{A,B,D,E,F,G}/suite.json: replace dead `online_warmup_runs: 0` with `online_warmup_requests: 10`. Add `burst_warmup_requests: 10` to suite_A and suite_B (the two suites that include the burst scenario).
- schema/suite.schema.json: declare the new properties with descriptions. `online_warmup_runs` kept as deprecated alias to avoid breaking any third-party suite still carrying it.
- DEVELOPMENT.md: mirror the new field names in the suite template.
- loadgen/tests/test_warmup.py: regression coverage. Asserts (a) warmup fires the configured count of dummy calls, (b) fast warmup latencies do NOT leak into recorded p50/p99 distributions, (c) zero warmup is a no-op, (d) a failing warmup request does not abort the timed phase.

Tested locally with `pytest loadgen/tests -q` (8 passed).

Co-authored-by: Cursor <cursoragent@cursor.com>
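The warmup contract described in this commit can be sketched roughly as follows. This is an illustrative reconstruction, not the code in loadgen/loadgen.py: the names `warmup_requests`, `run_timed_phase`, `send_request`, and the `log` parameter are hypothetical, and `print` stands in for the `tqdm.write` logging the commit mentions.

```python
import asyncio
from typing import Awaitable, Callable, List


async def warmup_requests(
    send_request: Callable[[], Awaitable[object]],
    count: int,
    log: Callable[[str], None] = print,
) -> int:
    """Fire `count` dummy requests one at a time and discard the results.

    Returns how many warmup requests failed. Failures are logged and
    swallowed so a flaky engine cannot abort the timed phase.
    """
    failures = 0
    for i in range(max(count, 0)):
        try:
            await send_request()  # result intentionally discarded
        except Exception as exc:  # warmup must never raise
            failures += 1
            log(f"warmup request {i + 1}/{count} failed: {exc!r}")
    return failures


async def run_timed_phase(
    send_request: Callable[[], Awaitable[object]],
    warmup_count: int,
    timed_count: int,
) -> List[object]:
    """Warm up first, then keep results only from the timed requests."""
    await warmup_requests(send_request, warmup_count, log=lambda _msg: None)
    return [await send_request() for _ in range(timed_count)]
```

Because the warmup results never enter the returned list, cold-start latencies cannot leak into the recorded percentiles, and a warmup count of zero degenerates to a no-op.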

Adds inter-run variance metrics so leaderboard visitors can judge how reproducible each submission is, plus an opt-in vendor_details field for environment data that does not fit any cross-vendor schema.

Loadgen
- New helpers `_cv_pct`, `_stability_label`, `_reliability_block`, `_compute_recovery_time` (with regression tests).
- offline: emit `throughput_tokens_per_sec_reliability` per concurrency level (list of per-run throughputs + CV + stability label).
- online: emit `ttft_ms_p99_reliability` per QPS, with per-run TTFT p99s computed independently (the headline pooled p99 is unchanged).
- interactive: emit `ttft_ms_p99_reliability` across `num_runs`.
- sustained: emit `throughput_post_warmup_reliability` (CV of sample intervals after warmup). Complements the existing `throttle_ratio`, which is a min/max metric and so blind to intermittent jitter.
- burst: emit `recovery_time_seconds` and per-cycle list. Defined as the median time within a post-burst steady window for rolling TTFT p99 to fall back to ≤ 1.5× the long-term steady baseline.
- Migrate `_run_online`, `_run_burst`, `_run_interactive` sync wrappers to `asyncio.run(...)`. `get_event_loop().run_until_complete(...)` was leaking closed loops across pytest runs, blocking the new tests.

Schema
- `schema/env.schema.json`: add optional `vendor_details` field (`additionalProperties: true` inside), the documented escape hatch for vendor-specific environment data that does not unify across platforms (NVML clocks, ROCm-SMI counters, etc).

Leaderboard generator
- `extract_viz` propagates the new reliability blocks (and burst's recovery time) to each per-suite viz dict. Offline reliability is passed as a parallel array indexed by concurrency level.
- `extract_details` propagates `env.vendor_details` to the row as `env_vendor_details` for flat rendering in the modal.
- Add `from __future__ import annotations` for Py 3.9 compatibility (was using `dict | None` in type hints).

Frontend modal
- New "Reliability" section in the Details tab. Shows worst-case CV per scenario, with stability badge and recovery time for burst.
- New "Vendor-specific environment" section that flattens vendor_details into key→value rows, hiding null/empty entries. No cross-vendor unification attempted.
- Small reliability pill in the modal subtitle showing the worst CV across scenarios. The pill is clickable, so users can drill into the new section to see the per-scenario breakdown. Older results without reliability blocks render exactly as before (pill and section both hide silently).
- CSS for the pill follows existing `--good/--warn/--bad` tokens.

Docs
- DEVELOPMENT.md: document warmup contract per scenario and the reliability block shape + stability thresholds.

Tests
- `loadgen/tests/test_reliability.py`: unit tests for the helpers and one integration test per scenario verifying the block shows up and is internally consistent (n equal to `num_runs`, stability label matches CV threshold). 21 loadgen tests pass.

Backward compatibility
- New result fields nest into existing `additionalProperties: true` blocks in `result.schema.json`; no schema bump needed.
- Existing results without reliability blocks render unchanged: the modal pill and Reliability section both gate on a numeric `cv_pct` and silently skip when absent. Older `result.json` files validate identically.

Co-authored-by: Cursor <cursoragent@cursor.com>
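A minimal sketch of how a reliability block of this shape might be assembled from per-run values. The function names, the choice of population standard deviation, and the handling of fewer than two runs are assumptions for illustration; the field names (`runs`, `n`, `mean`, `std`, `cv_pct`, `stability`) are the ones this PR mentions. Thresholds are passed in rather than hard-coded, since the PR tunes them in a later commit.

```python
import statistics
from typing import Optional, Sequence


def cv_pct(values: Sequence[float]) -> Optional[float]:
    """Coefficient of variation in percent: std / mean * 100.

    Returns None when CV is undefined (fewer than two values or zero mean).
    Assumption: population standard deviation.
    """
    if len(values) < 2:
        return None
    mean = statistics.fmean(values)
    if mean == 0:
        return None
    return statistics.pstdev(values) / mean * 100.0


def stability_label(cv: float, stable_max: float, noisy_max: float,
                    worst_label: str = "unstable") -> str:
    """Map a CV percentage onto one of three tiers."""
    if cv <= stable_max:
        return "stable"
    if cv <= noisy_max:
        return "noisy"
    return worst_label


def reliability_block(runs: Sequence[float], stable_max: float,
                      noisy_max: float) -> dict:
    """Bundle per-run values with their summary statistics."""
    cv = cv_pct(runs)
    return {
        "runs": list(runs),
        "n": len(runs),
        "mean": statistics.fmean(runs) if runs else None,
        "std": statistics.pstdev(runs) if len(runs) >= 2 else None,
        "cv_pct": cv,
        "stability": None if cv is None
        else stability_label(cv, stable_max, noisy_max),
    }
```

For example, `reliability_block([100.0, 102.0, 98.0], stable_max=2.0, noisy_max=5.0)` yields a CV of roughly 1.63% and the label `stable`. Leaving `cv_pct` as `None` when it is undefined fits the frontend behaviour described above, which gates on a numeric `cv_pct`.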

Populates the new throughput_post_warmup_reliability block on every sustained scenario in pre-existing result.json files so the leaderboard's "Reliability" panel and subtitle pill have data to display for historical submissions. New runs going forward emit this field directly from loadgen.

What got modified
- 255 result.json files (suite-level + per-scenario sustained/result.json pairs across ~127 unique submissions)
- Net change: one ~30-line block per file, no existing fields touched
- Encoding preserved per-file: ascii-only files keep their \u escape style; files containing UTF-8 characters (Ascend submissions etc.) keep them unescaped

How it was computed
- For each sustained scenario, take the existing per-interval samples array under metrics.sustained.samples[], drop is_warmup samples, then compute mean / std / CV / stability over throughput_tokens_per_sec.
- Thresholds: CV ≤ 2% → stable, ≤ 5% → noisy, > 5% → unstable (kept in sync with loadgen.loadgen). Tunable later if the observed distribution skews too heavily into one bucket.

What could not be backfilled
- offline / online / interactive / burst: per-run breakdowns were never persisted to samples.jsonl or result.json, so historical reliability cannot be recovered. The frontend silently hides the badge for these scenarios on old results.

The one-shot backfill script used here was not committed. It lives in local git history if it's ever needed again (see this commit's parent hash if you need to recover it). For new sustained results, loadgen now emits the reliability block natively, so the script will not be re-invoked under normal operation.

Co-authored-by: Cursor <cursoragent@cursor.com>
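Since the backfill script itself was not committed, the following is only a hedged reconstruction of the computation described above, not the script that was run. The helper names, the use of population standard deviation, the treatment of a missing `is_warmup` key as "not warmup", and the `indent=2` output style are all assumptions; the sample path and the ≤2% / ≤5% thresholds come from this commit message.

```python
import json
import statistics
from typing import Optional


def sustained_reliability(result: dict) -> Optional[dict]:
    """Compute mean / std / CV / stability over post-warmup samples."""
    samples = result["metrics"]["sustained"]["samples"]
    values = [
        s["throughput_tokens_per_sec"]
        for s in samples
        if not s.get("is_warmup", False)  # drop warmup intervals
    ]
    if len(values) < 2:
        return None  # CV is undefined for fewer than two samples
    mean = statistics.fmean(values)
    if mean == 0:
        return None
    std = statistics.pstdev(values)  # assumption: population std
    cv = std / mean * 100.0
    if cv <= 2.0:
        stability = "stable"
    elif cv <= 5.0:
        stability = "noisy"
    else:
        stability = "unstable"
    return {"mean": mean, "std": std, "cv_pct": cv, "stability": stability}


def dump_like_original(original_text: str, obj: dict) -> str:
    """Serialise obj, keeping the file's existing escape style.

    ASCII-only originals keep \\u escapes; originals that already contain
    non-ASCII characters keep them unescaped.
    """
    return json.dumps(obj, ensure_ascii=original_text.isascii(), indent=2)
```

Keying `ensure_ascii` off the original file text is one simple way to keep the per-file diff limited to the newly added block.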

Tightening the labels after looking at real backfilled data. The initial ≤2%/≤5%/>5% thresholds labelled the literal median submission "noisy" and slapped ~30% of submissions with an "unstable ✗" badge, which read as a verdict on the submitter rather than an informational note about the hardware × workload pair.

Empirical distribution from the May-2026 backfill (255 sustained CVs):
  median = 3.10 %, p90 = 13.07 %, max = 36.18 %

What changed
- Thresholds: ≤3% stable / ≤8% noisy / >8% high-variance (loadgen/loadgen.py `_STABILITY_THRESHOLD_*`).
- Renamed the third tier "unstable" → "high-variance" everywhere (label string, modal pill class, docs). High CV does not mean the measurement is wrong; it means the hardware × workload combo has irreducible jitter (consumer-card thermal throttle, HCCL noise on ×16 Ascend topologies, speculative-decoding acceptance-rate jitter).
- Dropped the ✗ glyph for the high-variance tier; only stable / noisy retain ✓ / ⚠. The CSS pill uses an orange tone, never pure red, so readers read "look closer" rather than "this is broken".
- DEVELOPMENT.md explains the rebrand: high-variance submitters do not need to re-run; the badge sizes safety margins for downstream hardware shoppers.

Resulting distribution (new thresholds):
  stable        : 47.8% (n=122)
  noisy         : 35.3% (n=90)
  high-variance : 16.9% (n=43)

Tail check: every chip in the 13 worst-CV submissions is a legitimate flag: RTX 5090 / A6000 / RTX 6000 Ada / V100s (consumer/workstation cards lacking datacenter cooling), Ascend ×16 / ×8 distributed (real HCCL jitter), and the H20-3e (lower thermal headroom variant).

Data update
- Re-labelled 136 of the 255 backfilled sustained reliability blocks in place. Only the `stability` string moved; `cv_pct`, `mean`, `std`, and `runs[]` arrays are byte-identical, so the diff per file is a single line.

Tests
- loadgen/tests/test_reliability.py: updated boundary expectations and membership sets to track the new labels. 21/21 still pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
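The in-place relabel described in this commit can be illustrated with a short sketch. The constant and function names here are hypothetical (the commit only refers to `_STABILITY_THRESHOLD_*` without spelling out the full names); the 3% / 8% cut-offs and the three label strings are taken from the message above.

```python
# Hypothetical names; the values are the retuned thresholds from this commit.
STABLE_MAX_CV_PCT = 3.0
NOISY_MAX_CV_PCT = 8.0


def stability_label(cv_pct: float) -> str:
    """Map a CV percentage onto the retuned three-tier labels."""
    if cv_pct <= STABLE_MAX_CV_PCT:
        return "stable"
    if cv_pct <= NOISY_MAX_CV_PCT:
        return "noisy"
    return "high-variance"


def relabel_in_place(block: dict) -> bool:
    """Rewrite only the `stability` string of a reliability block.

    Every other field (cv_pct, mean, std, runs) is left untouched.
    Returns True if the label actually changed.
    """
    new_label = stability_label(block["cv_pct"])
    changed = block.get("stability") != new_label
    block["stability"] = new_label
    return changed
```

Under this mapping the reported median CV of 3.10% still lands in the noisy tier, just above the 3% boundary, while the reported p90 of 13.07% falls in high-variance.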

The reliability badge in the modal subtitle showed only `cv X.X%` with a plain `title` tooltip, which most readers never noticed and could not decode (what does "cv" mean? what's a good number? is high-variance bad?). Three layered fixes give every user type a path to the answer:

1. The pill is now a <button> with `cursor: help` and an inline ⓘ glyph, signalling that it is interactive. The native title text is reworded to spell out "Coefficient of variation across runs" and the three thresholds (≤3% / ≤8% / >8%), and ends with "Click for details".
2. Clicking the pill (or activating it via keyboard) scrolls the Details tab to a new Reliability anchor and briefly flashes the section, switching tabs first if needed.
3. The Reliability section in Details gains an inline caption block that explains CV as `std / mean × 100%`, lists the three thresholds with colour-coded legend chips, and clarifies that "high-variance" is informational (natural jitter, not a measurement error). This is always visible, with no hover required.

The detail-section helper grows two optional opts (anchor, caption) so future sections can reuse the same pattern without per-call HTML.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-19T10:40:58Z

✅ AccelMark Validation: All submissions valid

See the workflow run for details.

JuhaoLiang1997 and others added 5 commits May 19, 2026 16:32

JuhaoLiang1997 merged commit 5797fc4 into main May 19, 2026
5 checks passed

JuhaoLiang1997 deleted the feat/warmup-fix-and-reliability-stats branch May 19, 2026 10:43
