Background
PR #3168 downgraded the benchmark workflow's "fail on main regression" step (.github/workflows/benchmark.yml step 7c) to a ::warning:: annotation. That stops main from being permanently red, but the underlying flakiness remains: single-run Bencher alerts on GitHub-hosted runners are dominated by environmental noise.
Evidence collected from the Bencher API (across the 6 most recent main reports):
Each run flags a near-disjoint random subset of (benchmark, measure) pairs. The failing run on 4eb83648b alone fired 43 alerts across Core and Pro, including routes wholly unrelated to the committed change.
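One way to put a number on "near-disjoint" is the mean pairwise Jaccard overlap between the alert sets of recent runs. The sketch below is illustrative only: the function names are hypothetical, and it assumes each run's alerts have already been fetched and reduced to a set of (benchmark, measure) tuples. A value near 0 means runs rarely flag the same pair twice, which is what noise would produce; a persistent regression would push it toward 1.

```python
from itertools import combinations

Pair = tuple[str, str]  # (benchmark, measure); hypothetical representation


def jaccard(a: set[Pair], b: set[Pair]) -> float:
    """Jaccard overlap |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(runs: list[set[Pair]]) -> float:
    """Average Jaccard overlap over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```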
Goal
Get the Bencher gate to a state where a fired alert is at least an order of magnitude more likely to reflect a real regression than runner noise, so main can be re-gated without the current churn.
Options to evaluate
Pick one of options 1–3, then combine it with option 4:
1. Widen BOUNDARY from 0.95 → 0.99 in .github/workflows/benchmark.yml (step 7a, which passes several --threshold-*-boundary args). Cheapest option; it reduces the false-positive rate but also reduces sensitivity to small real regressions.
2. Require alerts on N consecutive runs before auto-opening an issue and before the gate fails. This would be implemented in the workflow script (step 7b) by inspecting the history of recent Bencher alerts for the same (benchmark, measure) pair. Better signal; more code.
3. Add --threshold-min-sample-size (e.g., 10–20) so the t-test only fires once enough history exists per (branch, benchmark, measure). Pairs well with option 1.
4. Move the benchmark job to self-hosted or larger GHA runners (ubuntu-latest-16-core or self-hosted). This addresses the root cause: CPU contention and noisy-neighbor effects on shared runners. Cost: infra + spend.
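The consecutive-run option could be sketched roughly as follows. This is a minimal illustration, not the workflow's actual step 7b: the function name and the data shape are hypothetical, and it assumes the alert history has already been fetched from Bencher and reduced to one set of (benchmark, measure) tuples per run, ordered oldest first.

```python
Pair = tuple[str, str]  # (benchmark, measure); hypothetical representation


def persistent_alerts(history: list[set[Pair]], n: int) -> set[Pair]:
    """Return the pairs that alerted in every one of the last n runs.

    `history` holds one alert set per run, oldest first. If fewer than
    n runs exist, nothing is treated as persistent yet.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(history) < n:
        return set()
    # A pair survives only if it appears in each of the n most recent runs.
    return set.intersection(*history[-n:])
```

The gate would then fail (and an issue would be opened) only when the returned set is non-empty, so a pair that alerts once and then disappears never trips it.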
Dependency: PR #3148
PR #3148 fixes Bencher reporting on pushes to main so baselines are no longer lost. Until that merges, many recent main runs are missing from history and any threshold tuning is working against incomplete data. Do not re-enable the hard gate until #3148 lands and at least ~30 post-merge runs have built a clean baseline.
Re-enabling the gate
Once #3148 has been merged for long enough to establish a stable baseline, restore the hard fail in step 7c and verify:
≥ 5 consecutive non-docs pushes to main pass the gate with current code.
A deliberately-introduced perf regression (e.g., a sleep in a controller) is still caught.
Acceptance criteria
Benchmark workflow on main does not fire alerts on unchanged-perf commits more than ~1 in N runs (pick a target, e.g., 1 in 20).
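When picking the target N, it may help to relate the per-pair false-positive rate to the run-level rate. The sketch below is a rough model only: it assumes every (benchmark, measure) pair alerts independently with the same probability on an unchanged commit, which shared-runner noise probably violates (a single noisy run can trip many pairs at once). The function names and the pair count of 100 used in the note afterwards are hypothetical.

```python
def run_false_alarm_rate(per_pair_rate: float, pairs: int) -> float:
    """Probability that at least one of `pairs` independent checks alerts."""
    return 1.0 - (1.0 - per_pair_rate) ** pairs


def max_per_pair_rate(target_run_rate: float, pairs: int) -> float:
    """Largest per-pair rate that keeps the run-level rate at the target."""
    return 1.0 - (1.0 - target_run_rate) ** (1.0 / pairs)
```

Under these assumptions, 100 pairs each alerting 5% of the time give `run_false_alarm_rate(0.05, 100)` of roughly 0.994, so nearly every run would fire something. Hitting a 1-in-20 run-level target would need `max_per_pair_rate(0.05, 100)`, a per-pair rate of roughly 0.0005.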