Benchmark CI: tune thresholds / reduce flakiness so the main gate can be re-enabled #3169

@justin808

Background

PR #3168 downgraded the benchmark workflow's "fail on main regression" step (.github/workflows/benchmark.yml step 7c) to a ::warning:: annotation. That stops main from being permanently red, but the underlying flakiness remains: single-run Bencher alerts on GitHub-hosted runners are dominated by environmental noise.

Evidence collected from the Bencher API (across the 6 most recent main reports):

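| Adjacent main runs | Shared (benchmark, measure) alerts | Total alerts | Jaccard |
|--------------------|------------------------------------|--------------|---------|
| bb0748c / 59f966a  | 0                                  | 42           | 0.00    |
| bb0748c / 4eb8364  | 1                                  | 54           | 0.02    |
| 59f966a / 4eb8364  | 1                                  | 72           | 0.01    |
| 4eb8364 / bc2b9eb  | 3                                  | 70           | 0.04    |
| 1717387 / 4eb8364  | 4                                  | 66           | 0.06    |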

Each run flags a near-disjoint random subset of (benchmark, measure) pairs. The failing run on 4eb83648b alone fired 43 alerts across Core and Pro, including routes wholly unrelated to the committed change.
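For reference, the overlap metric can be sketched as below. This is a minimal illustration, assuming the "Total alerts" column is the size of the union of the two runs' alert sets (the table is consistent with that reading, e.g. 3 / 70 ≈ 0.04). The route names and measures are hypothetical placeholders, not real benchmark names.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical alert sets for two adjacent main runs, keyed by (benchmark, measure).
run_a = {("GET /", "latency"), ("GET /posts", "latency"), ("GET /users", "throughput")}
run_b = {("GET /", "latency"), ("GET /admin", "latency"), ("GET /search", "throughput")}

print(round(jaccard(run_a, run_b), 2))  # 1 shared pair out of 5 in the union -> 0.2
```

A sustained regression would keep the same pairs alerting across adjacent runs and push this value toward 1; values near 0, as in the table, are what independent noise looks like.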

Goal

Get the Bencher gate to a state where a fired alert is at least an order of magnitude more likely to reflect a real regression than runner noise, so main can be re-gated without the current churn.
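As a rough way to put numbers on the noise side of this goal: under an idealized model where each (benchmark, measure) pair is tested independently against a one-sided percentile boundary b and nothing has actually regressed, each pair false-alerts with probability about 1 - b. The sketch below is only that idealized arithmetic. The pair count is a hypothetical placeholder, and shared runners probably violate the independence assumption, so real noise can be considerably worse.

```python
def expected_false_alerts(pairs: int, boundary: float) -> float:
    """Expected number of false alerts per run: pairs * (1 - boundary)."""
    return pairs * (1.0 - boundary)


def prob_any_false_alert(pairs: int, boundary: float) -> float:
    """Probability that at least one pair false-alerts in a run, assuming independence."""
    return 1.0 - boundary ** pairs


PAIRS = 100  # hypothetical number of (benchmark, measure) pairs checked per run

for boundary in (0.95, 0.99):
    print(
        boundary,
        round(expected_false_alerts(PAIRS, boundary), 1),
        round(prob_any_false_alert(PAIRS, boundary), 3),
    )
```

Even at 0.99, with on the order of a hundred pairs a run is still more likely than not to fire at least one false alert under this model, which is why a boundary change alone may not be enough for a hard gate.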

Options to evaluate

Pick one of options 1–3, then combine it with option 4:

  1. Widen BOUNDARY from 0.95 to 0.99 in .github/workflows/benchmark.yml (step 7a, several --threshold-*-boundary args). Cheapest option; it reduces the false-positive rate, but also reduces sensitivity to small real regressions.
  2. Require alerts on N consecutive runs before auto-opening an issue and before the gate fails. Implemented in the workflow script (step 7b) by inspecting the history of recent Bencher alerts for the same (benchmark, measure) pair. Better signal; more code.
  3. Add --threshold-min-sample-size (e.g., 10–20) so the t-test only fires once enough history exists per (branch, benchmark, measure). Pairs well with option 1.
  4. Move the benchmark job to larger GHA runners (e.g., ubuntu-latest-16-core) or self-hosted runners. This addresses the root cause: CPU contention and noisy-neighbor effects on shared runners. Cost: infra + spend.
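The core of option 2 can be sketched as a set intersection over recent runs. This is a minimal illustration, not the workflow's actual script: the function name, the history structure, and the route names are hypothetical, and how alert history is fetched from Bencher is deliberately left out.

```python
from typing import Hashable, Sequence, Set


def sustained_alerts(history: Sequence[Set[Hashable]], n: int) -> Set[Hashable]:
    """Return alert keys that fired in each of the last n runs.

    history is ordered oldest first; each entry is the set of
    (benchmark, measure) pairs that alerted on that run.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(history) < n:
        return set()  # not enough runs yet to call anything sustained
    return set.intersection(*(set(run) for run in history[-n:]))


# Hypothetical history: one pair alerts on every run, the rest is churn.
history = [
    {("GET /", "latency"), ("GET /posts", "latency")},
    {("GET /posts", "latency"), ("GET /users", "throughput")},
    {("GET /posts", "latency"), ("GET /admin", "latency")},
]

sustained = sustained_alerts(history, n=3)
gate_fails = bool(sustained)
print(sustained, gate_fails)
```

With near-disjoint alert sets like those in the evidence table, the intersection would usually be empty, so the gate would fail only when the same pair keeps alerting.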

Dependency: PR #3148

PR #3148 fixes Bencher reporting on pushes to main so baselines are no longer lost. Until that merges, many recent main runs are missing from history and any threshold tuning is working against incomplete data. Do not re-enable the hard gate until #3148 lands and at least ~30 post-merge runs have built a clean baseline.

Re-enabling the gate

Once #3148 has been merged for long enough to establish a stable baseline, restore the hard fail in step 7c and verify:

  • ≥ 5 consecutive non-docs pushes to main pass the gate with current code.
  • A deliberately-introduced perf regression (e.g., a sleep in a controller) is still caught.

Acceptance criteria

  • Benchmark workflow on main does not fire alerts on unchanged-perf commits more than ~1 in N runs (pick a target, e.g., 1 in 20).
  • Tracking issue (currently "Performance Regression Detected on main (6a399b0)" #3116) gains new reports only when the current alert set overlaps with the prior run's (i.e., for actual sustained regressions).
  • Hard gate restored in step 7c.
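The first two criteria can be expressed as simple checks. This is a sketch under stated assumptions: both helper names are hypothetical, the 1-in-20 default is just the example target mentioned above, and "unchanged-perf commits" have to be identified by some means outside this snippet.

```python
from typing import Hashable, Set


def meets_false_alert_target(runs_with_alerts: int, total_runs: int, one_in_n: int = 20) -> bool:
    """True if runs_with_alerts / total_runs is at most 1 / one_in_n."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return runs_with_alerts * one_in_n <= total_runs


def should_update_tracking_issue(current: Set[Hashable], previous: Set[Hashable]) -> bool:
    """True only if at least one (benchmark, measure) pair alerted on both runs."""
    return bool(current & previous)


# With a 30-run baseline and a 1-in-20 target, at most one noisy run is tolerated.
print(meets_false_alert_target(1, 30))  # True
print(meets_false_alert_target(2, 30))  # False
```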
