Filter stale Bencher alerts before reporting#3822
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughBencherReport now distinguishes current regression alerts from filtered/stale alerts during initialization by partitioning active alerts into separate collections. A new ChangesBencher Alert Classification and Exit Normalization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Review — Filter stale Bencher alerts before reportingOverviewThis PR correctly addresses a false-positive regression problem: when Bencher carries an The fail-safe philosophy (missing benchmark name, invalid direction, no boundary on the expected side → treat as regression) is the right default for a reporting system where a silent miss is worse than a noisy false-positive. What's solid
Issues1.
|
Greptile SummaryThis PR adds stale-alert filtering to the Bencher benchmark regression reporter: active alerts from Bencher are now cross-referenced against the current report's boundaries before being treated as regressions, and a
Confidence Score: 4/5Safe to merge; the filtering logic is well-guarded with fail-safe defaults and is thoroughly covered by specs. The core No files require special attention; the two style suggestions in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Bencher CLI exits with JSON report] --> B{Parse JSON\nBencherReport.parse}
B -->|FormatError| C[exit 1 — fail loud]
B -->|OK| D[index_boundaries from results]
D --> E[parse_alerts → partition via current_regression_alert?]
E --> F["@alerts (current regressions)"]
E --> G["@filtered_alerts (stale alerts)"]
F --> H{regression?}
H -->|true| J[Real regression — keep exit code]
H -->|false| K{retry_without_start_point_hash?}
K -->|exit≠0 + Head Version not found| L[Retry without --start-point-hash]
L --> M[Re-run Bencher → new report]
M --> N[normalized_bencher_exit_code]
K -->|no| N
N -->|exit≠0 + filtered_alert? + !regression?| O[emit notice → return 0]
N -->|otherwise| P[Return original exit code]
J --> Q[Downstream regression reporting]
O --> Q
P --> Q
Reviews (1): Last reviewed commit: "Filter stale Bencher alerts before repor..." | Re-trigger Greptile |
Pro Node Renderer Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
Core Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
Pro (shard 1/2) Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
Pro (shard 2/2) Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
|
Worker B review-thread triage for Unresolved review threads triaged:
Validation run from the PR head:
Live CI/review state observed:
Full-CI decision: not requested. This PR is benchmark-related, the Worker B lane result: merge-qualified from this lane after waiving optional/noise advisory comments. Coordinator owns final merge sequencing. |
…sitives) (#3829) ## Problem ~99% of recent benchmark "performance-regression" alerts on `main` are false positives from a single **orphaned server-side Bencher threshold**. `p90_latency` (and `p99_latency`) were intentionally dropped from `track_benchmarks.rb` `THRESHOLDS` as too noisy ("tail noise can't meet the 1/20 target"). But Bencher thresholds are persistent server-side objects keyed on (branch, testbed, measure) — removing a measure from the CLI `--threshold-*` args stops *updating* the threshold, it never *deletes* it. The p90 metric is still submitted (for the dashboard), so the orphaned p90 threshold keeps evaluating p90 tail noise and firing alerts on nearly every run. These alerts are doubly bad: - They file a `performance-regression` issue (exit≠0), and - They are **invisible in the summary table** — the p90 column has no `:direction`, so it is never flagged 🔴. The filed issue names nothing actionable (this is the #3782 / #3795 "🔴 only appears in the legend" symptom). ### Evidence (public Bencher API, project `react-on-rails-t8a9ncxo`) - 255 most-recent active alerts: **248 p90-latency, 4 rps, 3 p50-latency**. - `main` branch: **164 active alerts → 162 p90-latency, 2 rps** (98.8% p90). - #3782 (`dec4b8c`) filed on exactly two p90 alerts — `/client_side_hello_world: Core` p90 `24.20 > upper 23.61` and `/rendered_html: Core` p90 — with no rps/p50/failed_pct alert. - `main`/`github-actions` carries thresholds for `rps, p50-latency, p90-latency, p99-latency, failed-pct` — two more than the code manages. (p99 is dormant: no p99 metric is submitted.) ## Fix Thread the measures we actually track (`THRESHOLDS` names) into `BencherReport`. An active alert on any **other** measure is classified as *filtered* rather than a regression. This reuses #3822's existing `filtered_alert?` exit-normalization, so a p90-only run no longer writes a candidate or files an issue. The fix is fully in-repo and makes the orphaned threshold harmless regardless of its server-side state. - Tracked measures (`rps`, `p50_latency`, `failed_pct`) are unaffected — including the "hidden" `failed_pct` regressions #3822 added coverage for. - Measure-less and benchmark-less alerts keep their existing fail-safe (still counted). - `tracked_measures` defaults to `nil` (track every measure) so non-production callers (`BenchmarkTable`, specs) are unchanged. ## Validation - `bundle exec rspec benchmarks/spec` — **234 examples, 0 failures** (5 new specs for tracked-measure filtering). - `bundle exec rubocop benchmarks/...` — no offenses. - **End-to-end against the real #3782 reports** (fetched from the Bencher API): `regression?` flips `true → false` (`filtered_alert? = true`) for both the Core and Pro p90 alerts, while rps/p50 alerts are untouched. - `script/ci-changes-detector origin/main` → Benchmark scripts → Lint (Ruby + JS). ## Companion cleanup (separate, manual — not in this PR) This neutralizes the orphaned threshold in code. To also stop it polluting the Bencher dashboard (162 cosmetic active alerts) and being cloned to every new branch, delete the server-side thresholds (needs `BENCHER_API_TOKEN`): - `main` p90 `51fb6a47-0083-4e84-a745-60ee42e3bba4`, p99 `6faa7a68-1835-4cd7-96dc-959220737172` - `master` p90 `d4ad2066-74cb-41f7-93bb-f0885358c56c`, p99 `61bf5ae6-77fa-474f-b6d8-ded630bd0c20` ## Relationship to other work Completes the noise fix that #3810 (fresh-runner confirmation) and #3822 (stale-alert filtering) started: #3822 only filters alerts whose metric *recovered*; a **live** p90 crossing (the dominant case) still passed through. Substantially removes the p90 tail noise behind #3169 and the issue explosion researched in #3755. Refs #3755, #3169, #3795 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes which Bencher alerts count as regressions and can affect main-push candidate filing and CI exit behavior, though scoped to benchmark reporting with explicit specs and a nil default for other callers. > > **Overview** > Adds optional **`tracked_measures`** to `BencherReport.parse` / `#initialize` so active Bencher alerts on measures the repo no longer tracks (e.g. orphaned server-side **p90_latency** thresholds) are moved to **`filtered_alert?`** instead of **`regression?`**, reusing the existing exit-code normalization for filtered-only runs. > > **`track_benchmarks.rb`** now passes `THRESHOLDS.map(&:first)` when parsing the JSON report, so p90-only false positives no longer write regression candidates or file issues the summary table cannot flag. Tracked measures, measure-less fail-safe alerts, and callers that omit `tracked_measures` stay backward compatible. > > Specs cover orphaned p90 filtering, slug/name normalization, and a small fix for `BencherReport.new` with a Hash root in a perf-links test. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 7693f64. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **Bug Fixes** * Fixed false regression alerts for measures no longer tracked in performance benchmarks. The system now filters regression reports to only flag alerts for actively monitored metrics, preventing issue creation based on orphaned or retired threshold measurements that are no longer part of the current tracking configuration. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* origin/main: Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) Confirm benchmark regressions on a fresh runner before filing the main issue (#3810) Define agent scope and accelerated RC auto-merge policy (#3808) Replace custom MockClient with async-http Mock::Endpoint (#3703) Docs: per-request data sharing in RSC with React.cache() (#3769) Pro RSC: share unstable_cache across renderer workers via Redis (#3705) [codex] Add PR batch planning skill (#3792) Docs: document PR batch operational lessons (#3789) Document dummy Redux state indexing rationale (#3781) Pro RSC: avoid caching failed Flight renders (#3775) # Conflicts: # packages/react-on-rails-pro/tests/getReactServerComponent.client.test.ts
…o-rsc-rspack-ci * origin/main: Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) # Conflicts: # react_on_rails_pro/spec/dummy/config/webpack/clientWebpackConfig.js
* origin/main: (23 commits) Enforce Pro license headers in CI and pre-commit (#3821) Add RSC payload route-data helper (#3783) [Pro] Fix React.cache request dedupe in generated RSC configs (#3813) Docs: clarify RuboCop autofix ownership (#3827) Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) Confirm benchmark regressions on a fresh runner before filing the main issue (#3810) Define agent scope and accelerated RC auto-merge policy (#3808) Replace custom MockClient with async-http Mock::Endpoint (#3703) Docs: per-request data sharing in RSC with React.cache() (#3769) Pro RSC: share unstable_cache across renderer workers via Redis (#3705) [codex] Add PR batch planning skill (#3792) ...
…-floor-fix * origin/main: (29 commits) Docs: align pr-batch closeout confidence handoff (#3835) Align adversarial review CI polling guidance (#3794) CI: add Pro RSC rspack runtime gate (#3817) Make RSCRoute refetch failures recoverable in production (#3786) Fix Pro node renderer license headers (#3834) Docs: fix anti-patterns in RSC tutorials (#3801) fix(pro): add RSC peer compatibility gate (#3831) Enforce Pro license headers in CI and pre-commit (#3821) Add RSC payload route-data helper (#3783) [Pro] Fix React.cache request dedupe in generated RSC configs (#3813) Docs: clarify RuboCop autofix ownership (#3827) Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) ... # Conflicts: # .github/workflows/benchmark.yml
Summary
Fixes #3795.
Refs #3798, #3799, #3807.
Validation
bundle exec rubocop benchmarks/lib/bencher_report.rb benchmarks/track_benchmarks.rb benchmarks/spec/bencher_report_spec.rb benchmarks/spec/track_benchmarks_spec.rb-> 4 files inspected, no offensesbundle exec rspec benchmarks/spec/bencher_report_spec.rb benchmarks/spec/track_benchmarks_spec.rb benchmarks/spec/report_table_integration_spec.rb-> 83 examples, 0 failuresbundle exec rspec benchmarks/spec-> 229 examples, 0 failuresgit diff --check-> passedscript/ci-changes-detector origin/main-> Benchmark scripts; recommendsLint (Ruby + JS)(cd react_on_rails && bundle exec rubocop)-> 214 files inspected, no offensescodex review --base origin/main-> no actionable correctness issues foundgit pushpre-push hook -> branch Ruby RuboCop passed on changed Ruby files; Markdown link check had no Markdown filesLabels: benchmark. Benchmark reporting is performance-sensitive;
full-ciis not recommended because this is a focused benchmark script/parser/spec change and the CI detector only recommends lint.Agent Merge Confidence
Mode: development (no open
Release gate:tracker found by title search; batch assignment:rc-accelerated-2026-06-08-two-machine)Score: 8/10
Auto-merge recommendation: no (user requested no auto-merge; no independent finalizer)
Affected areas: benchmark regression reporting, Bencher report parsing, benchmark confirmation handoff
CI detector:
script/ci-changes-detector origin/main-> Benchmark scripts; recommendsLint (Ruby + JS)Validation run:
bundle exec rubocop benchmarks/lib/bencher_report.rb benchmarks/track_benchmarks.rb benchmarks/spec/bencher_report_spec.rb benchmarks/spec/track_benchmarks_spec.rb-> 4 files inspected, no offensesbundle exec rspec benchmarks/spec/bencher_report_spec.rb benchmarks/spec/track_benchmarks_spec.rb benchmarks/spec/report_table_integration_spec.rb-> 83 examples, 0 failuresbundle exec rspec benchmarks/spec-> 229 examples, 0 failuresgit diff --check-> passed(cd react_on_rails && bundle exec rubocop)-> 214 files inspected, no offensescodex review --base origin/main-> no actionable correctness issues foundReview/check gate:
c84a4be3, no actionable correctness issuesKnown residual risk: No live Bencher service run was available locally; coverage uses pinned JSON fixtures/specs for the Bencher CLI v0.6.2 report shape.
Finalized by: not finalized; authoring agent only
Note
Medium Risk
Changes benchmark regression gating and CI exit codes; incorrect filtering could hide real regressions or clear jobs that should fail, though fail-safe paths and extensive specs mitigate this.
Overview
Bencher regression detection now cross-checks active
alerts[]against current report boundaries before counting a regression. Stale alerts (metrics back within limits) are dropped from#alerts/#regression?but tracked via new#filtered_alert?.Fail-safe behavior is unchanged for ambiguous cases: missing benchmark, unmatchable boundary, unknown limit side, or measure-less alerts with no regression on the alert side still count as regressions so schema drift stays visible.
CI exit handling adds
normalized_bencher_exit_code: after start-point-hash retry, a non-zero Bencher exit with only filtered (stale) alerts and no real regression is normalized to 0 with a::notice::, avoiding false regression filing while preserving the raw exit for retry logic first.Specs cover stale vs current alerts, measure-less alerts, fail-safe paths, and exit normalization.
Reviewed by Cursor Bugbot for commit c84a4be. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests