Benchmark CI: tune thresholds / reduce flakiness so the main gate can be re-enabled

## Background

PR #3168 downgraded the benchmark workflow's "fail on main regression" step (`.github/workflows/benchmark.yml` step 7c) to a `::warning::` annotation. That stops main from being permanently red, but the underlying flakiness remains: single-run Bencher alerts on GitHub-hosted runners are dominated by environmental noise.

Evidence collected from the Bencher API (across the 6 most recent main reports):

| Adjacent main runs   | Shared (benchmark, measure) alerts | Total alerts | Jaccard |
| -------------------- | ---------------------------------- | ------------ | ------- |
| bb0748c0 ↔ 59f966a2 | 0                                  | 42           | 0.00    |
| bb0748c0 ↔ 4eb83648 | 1                                  | 54           | 0.02    |
| 59f966a2 ↔ 4eb83648 | 1                                  | 72           | 0.01    |
| 4eb83648 ↔ bc2b9eb3 | 3                                  | 70           | 0.04    |
| 17173874 ↔ 4eb83648 | 4                                  | 66           | 0.06    |

Each run flags a near-disjoint random subset of (benchmark, measure) pairs. The failing run on [`4eb83648b`](https://github.com/shakacode/react_on_rails/commit/4eb83648b) alone fired 43 alerts across Core and Pro, including routes wholly unrelated to the committed change.
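As a minimal sketch of how the Jaccard column can be reproduced: the snippet below treats each run's alerts as a set of `(benchmark, measure)` pairs and divides the shared count by the size of the union. The route names and set sizes are made up for illustration, and it assumes the "Total alerts" column counts the union of both runs' alerts.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical alert sets: 3 shared pairs, 70 distinct pairs in total.
shared = {(f"route_{i}", "latency") for i in range(3)}
run_a = shared | {(f"route_a{i}", "latency") for i in range(37)}
run_b = shared | {(f"route_b{i}", "latency") for i in range(30)}

print(round(jaccard(run_a, run_b), 2))  # 3 / 70, prints 0.04
```

A value near 0 means two runs flagged almost entirely different pairs, which is what noise would produce; a sustained regression should keep reappearing and push the index up.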

## Goal

Get the Bencher gate to a state where a fired alert is at least an order of magnitude more likely to reflect a real regression than runner noise, so main can be re-gated without the current churn.

## Options to evaluate

Pick one of options 1–3, then combine it with option 4:

1. **Widen `BOUNDARY` from 0.95 to 0.99** in `.github/workflows/benchmark.yml` (step 7a, which passes several `--threshold-*-boundary` args). Cheapest option: it reduces the false-positive rate, but also reduces sensitivity to small real regressions.
2. **Require N-consecutive-run alerts** before auto-opening an issue and before the gate fails. This would be implemented in the workflow script (step 7b) by inspecting the history of recent Bencher alerts for the same `(benchmark, measure)` pair. Better signal; more code.
3. **Add `--threshold-min-sample-size`** (e.g., 10–20) so the t-test only fires once enough history exists per `(branch, benchmark, measure)`. Pairs well with option 1.
4. **Move the benchmark job to self-hosted or larger GHA runners** (e.g., `ubuntu-latest-16-core`). This addresses the root cause: CPU contention and noisy-neighbor effects on shared runners. Cost: infra + spend.
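The core of option 2 can be sketched as a set intersection over recent runs. This is only an illustration, not the step 7b script: `sustained_alerts` and the sample data are hypothetical, and fetching the alert history from the Bencher API is left out.

```python
Pair = tuple[str, str]  # (benchmark, measure)


def sustained_alerts(history: list[set[Pair]], n: int) -> set[Pair]:
    """Return the pairs that alerted in each of the last n runs.

    `history` is ordered oldest to newest, one alert set per main run.
    With fewer than n runs recorded, nothing counts as sustained yet.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(history) < n:
        return set()
    return set.intersection(*history[-n:])


# Hypothetical history: only ("route_a", "latency") alerts in all three runs.
history = [
    {("route_a", "latency"), ("route_b", "throughput")},
    {("route_a", "latency"), ("route_c", "latency")},
    {("route_a", "latency"), ("route_d", "throughput")},
]

gate_should_fail = bool(sustained_alerts(history, n=3))
print(sustained_alerts(history, n=3), gate_should_fail)
```

The trade-off is detection latency: a real regression is only reported after it has persisted for N runs on main.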

## Dependency: PR #3148

PR [#3148](https://github.com/shakacode/react_on_rails/pull/3148) fixes Bencher reporting on pushes to main so baselines are no longer lost. Until that merges, many recent main runs are missing from history and any threshold tuning is working against incomplete data. **Do not re-enable the hard gate until #3148 lands and at least ~30 post-merge runs have built a clean baseline.**

## Re-enabling the gate

Once #3148 has been merged for long enough to establish a stable baseline, restore the hard fail in step 7c and verify:

- [ ] ≥ 5 consecutive non-docs pushes to main pass the gate with current code.
- [ ] A deliberately-introduced perf regression (e.g., a `sleep` in a controller) is still caught.

## Acceptance criteria

- Benchmark workflow on main does not fire alerts on unchanged-perf commits more than ~1 in N runs (pick a target, e.g., 1 in 20).
- The tracking issue (currently #3116) accumulates alerts only when the alert set overlaps with the prior run's (i.e., on actual sustained regressions).
- Hard gate restored in step 7c.
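To help pick the "1 in N" target, here is a rough back-of-the-envelope sketch relating a per-pair false-positive rate to the chance that a whole run fires at least one spurious alert. It assumes the pairs behave independently, which is an idealization; that a boundary of 0.95 corresponds to roughly a 5% per-pair false-positive rate; and a made-up count of 100 `(benchmark, measure)` pairs per run.

```python
def run_false_alarm_rate(p_pair: float, pairs: int) -> float:
    """Probability that at least one of `pairs` independent checks fires by chance."""
    return 1.0 - (1.0 - p_pair) ** pairs


def max_pair_rate(target_run_rate: float, pairs: int) -> float:
    """Largest per-pair false-positive rate that keeps the run-level rate at the target."""
    return 1.0 - (1.0 - target_run_rate) ** (1.0 / pairs)


PAIRS = 100  # hypothetical number of (benchmark, measure) pairs per run

noisy = run_false_alarm_rate(0.05, PAIRS)  # about 0.994: nearly every run alerts
budget = max_pair_rate(1 / 20, PAIRS)      # about 0.0005 per pair for "1 in 20 runs"
print(f"{noisy:.3f} {budget:.5f}")
```

Under these assumptions, a run-level target of 1 in 20 needs a per-pair rate far below what a single-run 0.95 boundary gives, which is why the options above combine a wider boundary with consecutive-run confirmation or quieter runners.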

## Related

- PR #3168 (this change) — warn-not-fail.
- PR #3148 — Bencher reporting baseline fix.
- Issue #3116 — current auto-opened regression tracker; close once this work lands.

