Filter Bencher alerts to tracked measures (drop orphaned p90 false positives)#3829
Conversation
…sitives) An orphaned server-side p90_latency threshold in Bencher (dropped from track_benchmarks.rb THRESHOLDS but never deleted) keeps firing alerts on p90 tail noise. Those alerts file a "performance regression" issue on main yet are invisible in the summary table (the p90 column has no :direction), so the issue names nothing actionable. They account for ~99% of recent main alerts (162/164). Thread the tracked-measure set (THRESHOLDS names) into BencherReport so an active alert on any other measure is classified as filtered, not a regression. This reuses the #3822 filtered-alert exit-normalization, so a p90-only run no longer writes a candidate or files an issue. Tracked measures (rps, p50, failed_pct) are unaffected; measure-less and benchmark-less alerts keep their existing fail-safe. Verified against the real #3782 reports: regression? flips true -> false (filtered) for both the Core and Pro p90 alerts. Refs #3755, #3169, #3795 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughBencherReport now accepts an optional ChangesTracked Measures Filtering Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR fixes a persistent false-positive alert problem where orphaned server-side Bencher thresholds for
Confidence Score: 5/5Safe to merge — the change is additive and backward-compatible; callers that don't pass The filter is additive: No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Bencher CLI emits JSON report] --> B[BencherReport.parse\ntracked_measures: THRESHOLDS.map first]
B --> C[parse_alerts: collect active alerts]
C --> D{For each alert:\ncurrent_regression_alert?}
D -->|no benchmark name| E[→ @alerts\nfail-safe regression]
D -->|has benchmark name| F{untracked_measure_alert?}
F -->|tracked_measures nil| G{Existing boundary math}
F -->|measure is nil| G
F -->|measure NOT in tracked_measures| H[→ @filtered_alerts\norphaned threshold]
F -->|measure in tracked_measures| G
G -->|no direction / no boundary| E
G -->|boundary confirms regression| E
G -->|boundary says recovered| H
E --> I[regression? = true]
H --> J[regression? = false\nfiltered_alert? = true]
I --> K[Write candidate / file issue]
J --> L[normalized_bencher_exit_code → 0]
Reviews (1): Last reviewed commit: "Filter Bencher alerts to tracked measure..." | Re-trigger Greptile |
Code ReviewSummary: This PR fixes a real, well-evidenced problem — 98%+ of false-positive benchmark alerts were coming from an orphaned server-side Bencher threshold for Code quality ✅The implementation is clean and well-scoped:
Tests ✅Five new specs cover the key scenarios well:
The The fix at One edge case to watch
|
| # An active alert on a measure the caller does not track (an orphaned server-side | ||
| # threshold). Only applies when tracked_measures was given; a measure-less alert can't be | ||
| # classified here, so it falls through to the existing benchmark-level fail-safe. | ||
| def untracked_measure_alert?(alert) |
There was a problem hiding this comment.
Minor footgun: if tracked_measures: is ever passed as [], @tracked_measures is [] (truthy), so !@tracked_measures.include?(...) is true for every measure — all alerts get silently filtered and regression? is always false. The current caller can't hit this (THRESHOLDS has 3 entries), but a future caller could pass an empty list accidentally.
A small guard in initialize would make the invariant explicit:
| def untracked_measure_alert?(alert) | |
| def untracked_measure_alert?(alert) | |
| return false if @tracked_measures.nil? || @tracked_measures.empty? | |
| return false unless alert.measure | |
| !@tracked_measures.include?(normalize(alert.measure)) | |
| end |
Or equivalently, coerce an empty list to nil at construction time:
@tracked_measures = tracked_measures&.then { |m| m.empty? ? nil : m.map { |k| normalize(k) } }Either makes it impossible to silently disable regression detection by passing [].
## Problem A confirmed `performance-regression` issue currently only shows the per-suite summary tables. Those tables flag 🔴 on **rps** and **p50** but never on **failed_pct** (it has no column — it lives in the "Status" cell) and never on any untracked measure. So a reader has to open the tables to see what regressed, and a `failed_pct`-only regression renders with no visible 🔴 at all — the same "the issue names nothing actionable" problem behind #3782/#3795. ## Fix Add a **"What regressed"** section to the issue body, built from the confirmed `ALERTS` payload (#3810's structured `{benchmark, measure}` pairs), aggregated across all suites and computed once in `report_regressions`: ``` ### What regressed The fresh-runner confirmation re-alerted on these benchmark + measure pairs: - `/client_side_log_throw: Pro` — **rps** - `/server_side_redux_app: Core` — **failed_pct** ``` - Threaded through `report_suite` → `RegressionIssueReporter`; only the reporter that *creates* the issue renders the body, so the cross-suite list is complete. - `regressed_overview` defaults to `""` → the section is omitted and the body is byte-for-byte unchanged, so it composes with the existing flow and with older payloads that lack `ALERTS`. ## Validation - `bundle exec rspec benchmarks/spec` — **233 examples, 0 failures** (4 new: helper rendering/dedup + issue-body inclusion/omission). - `bundle exec rubocop benchmarks/...` — no offenses. - Rendered the body locally with and without pairs to confirm clean Markdown spacing and that the no-pairs body is unchanged. ## Relationship to other work Complements #3829 (filter alerts to tracked measures): #3829 stops the p90 false positives from filing at all; this makes the *real* confirmed regressions self-explanatory in the issue body. Refs #3755. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > CI reporting and GitHub issue markdown only; no runtime app or auth changes. > > **Overview** > Adds a **"What regressed"** section to auto-filed `performance-regression` issue bodies, listing confirmed benchmark + measure pairs as markdown bullets (including measures like **failed_pct** that the per-suite tables do not surface clearly). > > `report_regressions` builds the list once from each confirmed payload's structured `ALERTS` via new `regressed_overview_markdown` (deduped across suites) and passes `regressed_overview` into `RegressionIssueReporter`. The section is omitted when the overview is empty, so older payloads without `ALERTS` keep the previous body shape. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 3068c93. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Performance regression reports now include a "What regressed" overview section, displaying all confirmed benchmark and measure pairs that regressed across test suites in a single consolidated view. * **Tests** * Added comprehensive test coverage for the new regression overview markdown rendering logic and deduplication. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* origin/main: Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) Confirm benchmark regressions on a fresh runner before filing the main issue (#3810) Define agent scope and accelerated RC auto-merge policy (#3808) Replace custom MockClient with async-http Mock::Endpoint (#3703) Docs: per-request data sharing in RSC with React.cache() (#3769) Pro RSC: share unstable_cache across renderer workers via Redis (#3705) [codex] Add PR batch planning skill (#3792) Docs: document PR batch operational lessons (#3789) Document dummy Redux state indexing rationale (#3781) Pro RSC: avoid caching failed Flight renders (#3775) # Conflicts: # packages/react-on-rails-pro/tests/getReactServerComponent.client.test.ts
…o-rsc-rspack-ci * origin/main: Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) # Conflicts: # react_on_rails_pro/spec/dummy/config/webpack/clientWebpackConfig.js
* origin/main: (23 commits) Enforce Pro license headers in CI and pre-commit (#3821) Add RSC payload route-data helper (#3783) [Pro] Fix React.cache request dedupe in generated RSC configs (#3813) Docs: clarify RuboCop autofix ownership (#3827) Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) Confirm benchmark regressions on a fresh runner before filing the main issue (#3810) Define agent scope and accelerated RC auto-merge policy (#3808) Replace custom MockClient with async-http Mock::Endpoint (#3703) Docs: per-request data sharing in RSC with React.cache() (#3769) Pro RSC: share unstable_cache across renderer workers via Redis (#3705) [codex] Add PR batch planning skill (#3792) ...
…-floor-fix * origin/main: (29 commits) Docs: align pr-batch closeout confidence handoff (#3835) Align adversarial review CI polling guidance (#3794) CI: add Pro RSC rspack runtime gate (#3817) Make RSCRoute refetch failures recoverable in production (#3786) Fix Pro node renderer license headers (#3834) Docs: fix anti-patterns in RSC tutorials (#3801) fix(pro): add RSC peer compatibility gate (#3831) Enforce Pro license headers in CI and pre-commit (#3821) Add RSC payload route-data helper (#3783) [Pro] Fix React.cache request dedupe in generated RSC configs (#3813) Docs: clarify RuboCop autofix ownership (#3827) Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) ... # Conflicts: # .github/workflows/benchmark.yml
Problem
~99% of recent benchmark "performance-regression" alerts on
mainare false positives from a single orphaned server-side Bencher threshold.p90_latency(andp99_latency) were intentionally dropped fromtrack_benchmarks.rbTHRESHOLDSas too noisy ("tail noise can't meet the 1/20 target"). But Bencher thresholds are persistent server-side objects keyed on (branch, testbed, measure) — removing a measure from the CLI--threshold-*args stops updating the threshold, it never deletes it. The p90 metric is still submitted (for the dashboard), so the orphaned p90 threshold keeps evaluating p90 tail noise and firing alerts on nearly every run.These alerts are doubly bad:
performance-regressionissue (exit≠0), and:direction, so it is never flagged 🔴. The filed issue names nothing actionable (this is the Performance Regression Detected on main (dec4b8c) #3782 / performance-regression issues are being posted when no significant regressions are detected #3795 "🔴 only appears in the legend" symptom).Evidence (public Bencher API, project
react-on-rails-t8a9ncxo)mainbranch: 164 active alerts → 162 p90-latency, 2 rps (98.8% p90).dec4b8c) filed on exactly two p90 alerts —/client_side_hello_world: Corep9024.20 > upper 23.61and/rendered_html: Corep90 — with no rps/p50/failed_pct alert.main/github-actionscarries thresholds forrps, p50-latency, p90-latency, p99-latency, failed-pct— two more than the code manages. (p99 is dormant: no p99 metric is submitted.)Fix
Thread the measures we actually track (
THRESHOLDSnames) intoBencherReport. An active alert on any other measure is classified as filtered rather than a regression. This reuses #3822's existingfiltered_alert?exit-normalization, so a p90-only run no longer writes a candidate or files an issue. The fix is fully in-repo and makes the orphaned threshold harmless regardless of its server-side state.rps,p50_latency,failed_pct) are unaffected — including the "hidden"failed_pctregressions Filter stale Bencher alerts before reporting #3822 added coverage for.tracked_measuresdefaults tonil(track every measure) so non-production callers (BenchmarkTable, specs) are unchanged.Validation
bundle exec rspec benchmarks/spec— 234 examples, 0 failures (5 new specs for tracked-measure filtering).bundle exec rubocop benchmarks/...— no offenses.regression?flipstrue → false(filtered_alert? = true) for both the Core and Pro p90 alerts, while rps/p50 alerts are untouched.script/ci-changes-detector origin/main→ Benchmark scripts → Lint (Ruby + JS).Companion cleanup (separate, manual — not in this PR)
This neutralizes the orphaned threshold in code. To also stop it polluting the Bencher dashboard (162 cosmetic active alerts) and being cloned to every new branch, delete the server-side thresholds (needs
BENCHER_API_TOKEN):mainp9051fb6a47-0083-4e84-a745-60ee42e3bba4, p996faa7a68-1835-4cd7-96dc-959220737172masterp90d4ad2066-74cb-41f7-93bb-f0885358c56c, p9961bf5ae6-77fa-474f-b6d8-ded630bd0c20Relationship to other work
Completes the noise fix that #3810 (fresh-runner confirmation) and #3822 (stale-alert filtering) started: #3822 only filters alerts whose metric recovered; a live p90 crossing (the dominant case) still passed through. Substantially removes the p90 tail noise behind #3169 and the issue explosion researched in #3755.
Refs #3755, #3169, #3795
🤖 Generated with Claude Code
Note
Medium Risk
Changes which Bencher alerts count as regressions and can affect main-push candidate filing and CI exit behavior, though scoped to benchmark reporting with explicit specs and a nil default for other callers.
Overview
Adds optional
tracked_measurestoBencherReport.parse/#initializeso active Bencher alerts on measures the repo no longer tracks (e.g. orphaned server-side p90_latency thresholds) are moved tofiltered_alert?instead ofregression?, reusing the existing exit-code normalization for filtered-only runs.track_benchmarks.rbnow passesTHRESHOLDS.map(&:first)when parsing the JSON report, so p90-only false positives no longer write regression candidates or file issues the summary table cannot flag. Tracked measures, measure-less fail-safe alerts, and callers that omittracked_measuresstay backward compatible.Specs cover orphaned p90 filtering, slug/name normalization, and a small fix for
BencherReport.newwith a Hash root in a perf-links test.Reviewed by Cursor Bugbot for commit 7693f64. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
Release Notes