Skip to content

Confirm benchmark regressions on a fresh runner before filing the main issue#3810

Merged
justin808 merged 2 commits into
mainfrom
justin808/fix-issue-3670
Jun 8, 2026
Merged

Confirm benchmark regressions on a fresh runner before filing the main issue#3810
justin808 merged 2 commits into
mainfrom
justin808/fix-issue-3670

Conversation

@justin808

@justin808 justin808 commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #3670. A single noisy Bencher alert on main could previously create/update a performance-regression issue. This adds a fresh-runner confirmation step between the initial benchmark run and the main issue: a main-push alert is now a non-fatal candidate that must re-alert on an independent rerun before any issue is filed. Requiring a second, independent alert turns the per-commit false-positive rate from ~p into ~p² for runner noise.

Scope is exactly as the issue specifies: only main issue creation is gated. PR summary comments still post unconditionally (initial mode, unchanged).

Pipeline (benchmark.yml)

benchmark (initial)   on a main alert -> write a NON-FATAL candidate artifact
                      (regression-candidate-*); the suite stays GREEN
        │
plan-confirmation     read candidates; short-circuit IGNORED_REGRESSION_BENCHMARKS
                      BEFORE spending a rerun; emit a matrix of just the alerting suite/shard(s)
        │
confirm-regressions   rerun those on fresh runners against main's baseline WITHOUT polluting
                      main's series; write a confirmed artifact only if the SAME
                      benchmark+measure re-alerts
        │
report-regressions    consume confirmed artifacts; file one issue with the first-run and
                      confirmation summaries side by side; fail the workflow (it owns pass/fail)

The per-suite execution is extracted into a reusable workflow (benchmark-suite.yml, modes initial/confirm) so the initial run and the confirmation rerun share one set of steps with no drift.

Behavior

Situation Workflow result Issue filed?
Alert does not reproduce (noise) 🟢 green no
Same benchmark+measure re-alerts 🔴 red yes (first-run + confirmation evidence)
Inconclusive rerun (bench/Bencher failure, corrupt/unmatched candidate) 🔴 red no
Every alert is in IGNORED_REGRESSION_BENCHMARKS 🟢 green (notice) no

The workflow summary states confirmed / cleared-as-noise / inconclusive per alert.

Key implementation notes

  • Don't pollute the baseline: the confirmation submits to a unique synthetic Bencher branch confirm-main-<run_id>-<suite-slug> (never main) with --start-point main --start-point-clone-thresholds --start-point-reset, so the confirmation sample is compared against main's baseline but never appended to main's series — exactly the anchoring the PR path already uses.
  • Same benchmark+measure matching uses the structured alert pairs now carried in the candidate payload (BencherReport::Alert already exposes the measure slug). Falls back to benchmark-name-only when a measure is absent, as the issue permits.
  • Exit-code contract change: the first-run alert no longer fails the suite job; report-regressions owns the final pass/fail. A confirmed regression fails the workflow there.
  • IGNORED_REGRESSION_BENCHMARKS (Suppress regression issues when only /posts_page: Pro regresses #3668) moved to the shared benchmarks/lib/regression_report.rb and is now applied in plan-confirmation, short-circuiting a rerun for benchmarks that would be suppressed anyway.
  • Candidate vs confirmed artifacts/payloads are kept distinct (regression-candidate-* / regression-confirmed-*).

Out of scope (sequenced follow-up, per #3670)

Loosening the THRESHOLDS to buy back the statistical power that double-gating costs is intentionally a separate, empirical follow-up: it requires measuring the post-confirmation false-positive and missed-regression rates on real main data, so it cannot be tuned blindly here.

Known tradeoff (acknowledged in the issue)

The confirmation clones main's baseline after the initial run appended its (regressing) sample to main's series, so a borderline regression can be slightly diluted by its own already-merged sample. This is inherent to confirming post-merge and is what the threshold-loosening follow-up addresses. A future refinement could pin the clone to the commit's parent via --start-point-hash.

Testing

  • bundle exec rspec benchmarks/spec212 examples, 0 failures (26 new: regression_report_spec, plan_confirmation_spec, confirmation-mode cases in track_benchmarks_spec; report_regressions_spec reworked for the confirmed-artifact contract).
  • actionlint clean on both benchmark.yml and benchmark-suite.yml.
  • rubocop clean on all changed/new Ruby files.
  • End-to-end candidate→confirm→report file round-trip verified locally.

Live validation gap: the multi-job GitHub Actions DAG (reusable-workflow calls, dynamic confirmation matrix, fresh-runner reruns, Bencher submissions) cannot be exercised locally — it needs a real main push with a Bencher token. A maintainer should watch the first alerting main run to confirm the candidate→confirm→report handoff end to end. No CHANGELOG entry: per the repo's changelog guidelines this is internal CI/benchmark tooling, not a user-visible source change.

🤖 Generated with Claude Code


Note

Medium Risk
Large CI workflow refactor changes when benchmark workflows fail and when regression issues are filed; mistakes could miss real regressions or spam issues, but scope is internal benchmark tooling only.

Overview
Adds a two-stage main regression pipeline: Bencher alerts on the first main run are non-fatal candidates (regression-candidate.json artifacts) instead of failing the suite or filing immediately. New jobs plan-confirmation (filter ignored benchmarks, build rerun matrix) and confirm-regressions rerun only alerting suite/shards on fresh runners via a shared reusable workflow benchmark-suite.yml (initial / confirm modes).

track_benchmarks.rb writes candidates with structured benchmark+measure alert pairs; confirmation submits to a synthetic Bencher branch against main's baseline without appending to main's series, and emits regression-confirmed.json only when the same pairs re-alert. report_regressions.rb files issues from confirmed payloads only, shows first-run vs confirmation tables, and owns final workflow failure on confirmed regressions. IGNORED_REGRESSION_BENCHMARKS moves to regression_report.rb and short-circuits before confirmation reruns in plan_confirmation.rb.

Reviewed by Cursor Bugbot for commit 985121d. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • New Features

    • Two-stage benchmark regression flow: initial candidate runs then optional confirmation reruns before filing issues.
    • Reports now include side-by-side "First run" vs "Confirmation" summaries for clearer comparison.
  • Refactor

    • Benchmark CI converted to a reusable, modular workflow for shard-level end-to-end runs and confirmation orchestration.
  • Tests

    • Added comprehensive tests covering confirmation planning, regression matching, and reporting behaviors.

…n issue

Implements #3670. A single noisy Bencher alert on main could previously
create/update a `performance-regression` issue. Now a main-push alert is a
non-fatal *candidate* that must be reproduced by an independent rerun on a
fresh GitHub Actions runner before any issue is filed, turning the
per-commit false-positive rate from ~p into ~p² for runner noise.

Pipeline (benchmark.yml):
  benchmark (initial)  -> on a main alert, write a NON-FATAL candidate
                          artifact (regression-candidate-*); the suite stays green
  plan-confirmation    -> read candidates, short-circuit IGNORED_REGRESSION_BENCHMARKS
                          BEFORE spending a rerun, emit a matrix of just the
                          alerting suite/shard(s)
  confirm-regressions  -> rerun those on fresh runners against main's baseline
                          WITHOUT polluting main's series (synthetic Bencher branch
                          + --start-point main --start-point-clone-thresholds
                          --start-point-reset); write a confirmed artifact only if
                          the SAME benchmark+measure re-alerts
  report-regressions   -> consume confirmed artifacts, file one issue with the
                          first-run and confirmation summaries side by side, and
                          fail the workflow (it owns the final pass/fail)

The per-suite execution is extracted into a reusable workflow
(benchmark-suite.yml, modes initial/confirm) so the initial run and the
confirmation rerun share one set of steps with no drift.

Behavior:
- A noisy alert that does not reproduce is cleared as noise (no issue, green).
- A confirmed regression fails the workflow AND files the issue.
- An inconclusive rerun (benchmark execution / Bencher reporting failure, or a
  corrupt/unmatched candidate) fails the workflow WITHOUT filing.
- The workflow summary states confirmed / cleared / inconclusive per alert.
- PR summary comments still post unconditionally (initial mode, unchanged); the
  confirmation machinery is main-push only.

Same benchmark+measure matching uses the structured alert pairs now carried in
the candidate payload (BencherReport::Alert already exposes measure), with a
name-only fallback when a measure is absent. IGNORED_REGRESSION_BENCHMARKS moved
to the shared regression_report.rb and is now applied in plan-confirmation.

Threshold loosening to buy back the power double-gating costs is intentionally a
sequenced, empirical follow-up (see #3670), not part of this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds a two-stage benchmark regression confirmation flow: initial runs emit non-fatal candidate artifacts; a plan step filters/schedules reruns; confirmation runs re-alerting the same benchmark+measure produce confirmed artifacts; reporting files issues only for confirmed regressions. Integrates via a new reusable benchmark-suite workflow and updates tracking/reporting scripts and specs.

Changes

Benchmark regression confirmation workflow

Layer / File(s) Summary
Regression report contract and alert helpers
benchmarks/lib/regression_report.rb
RegressionReport expanded to candidate/confirmed filenames, alert builders, measure normalization, alerts_match? logic, confirmed_alerts, and ignore-list/actionable helpers.
Track benchmarks confirmation mode and artifact writers
benchmarks/track_benchmarks.rb
Adds BENCHMARK_MODE=confirm flow: confirmation branch naming, deduped alert-pairs, candidate loading, confirmation outcome classification, and writers for candidate/confirmed JSON artifacts and summaries.
Plan confirmation candidate filtering and matrix selection
benchmarks/plan_confirmation.rb
New script discovers candidate artifacts, short-circuits fully-ignored candidates, matches remaining candidates to benchmark matrix rows, emits confirmation_matrix or fails on unreadable/unmatched candidates, and announces suppressed benchmarks.
Reusable benchmark-suite workflow
.github/workflows/benchmark-suite.yml
New reusable workflow runs one suite/shard in initial or confirm mode: unpacks matrix_row, installs Bencher/tooling, builds assets and optional Rails Pro DB, starts server/renderer with PID readiness polling, runs benchmarks, validates results/logs, uploads outputs, and in confirm mode downloads candidate artifact and runs benchmarks/track_benchmarks.rb.
Main workflow orchestration
.github/workflows/benchmark.yml
Refactors top-level CI to call the reusable workflow per matrix row and orchestrates: benchmark (initial) → plan-confirmationconfirm-regressions (confirm mode, safe empty-matrix fallback) → report-regressions (consumes confirmed artifacts).
Regression reporting: confirmed-artifact consumption
benchmarks/report_regressions.rb
Now scans confirmed payloads, validates payload shapes, renders first-run vs confirmation shard summaries side-by-side, files issues only for confirmed regressions, and returns :clean/:filed/:error with workflow-failure semantics when confirmed regressions were filed or payloads unreadable.
Specs and test harness
benchmarks/spec/*
Adds RSpec coverage for RegressionReport, plan_confirmation behavior, report_regressions confirmed-artifact handling and exit semantics, and track_benchmarks confirmation-mode helpers and outcomes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

  • #3670: Directly implements the confirm-regressions-before-posting-issue proposal (candidate/confirmed hand-off, fresh-run confirmations, reset baseline, and benchmark+measure matching).

Possibly related PRs

Suggested labels

enhancement, review-needed, full-ci, benchmark, ci-tooling, P2

Suggested reviewers

  • alexeyr-ci2
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: implementing a confirmation step on a fresh runner before filing benchmark regression issues on main.
Linked Issues check ✅ Passed The PR implements all major requirements from #3670: non-fatal candidate artifacts on first alert, plan-confirmation job gating reruns, fresh-runner confirmation without polluting main's baseline using synthetic Bencher branch, benchmark+measure matching with name-only fallback, ignored-benchmark short-circuit, confirmed-only issue filing, and preservation of PR comments.
Out of Scope Changes check ✅ Passed All changes directly support the confirmation-before-filing workflow: new reusable benchmark-suite.yml and related helpers, regression_report.rb expanded with confirmation logic, plan_confirmation.rb and report_regressions.rb refactored for candidate/confirmed artifact handoff, and track_benchmarks.rb updated for initial/confirm modes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch justin808/fix-issue-3670

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

Reviewed by Cursor Bugbot for commit acc4f36. Configure here.

Comment thread benchmarks/track_benchmarks.rb
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

test

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/benchmark-suite.yml:
- Line 83: Replace all GitHub Actions uses: entries that reference tags with
full commit SHAs: update actions/checkout@v6, bencherdev/bencher@v0.6.2,
grafana/setup-k6-action@v1, actions/cache@v5 (both occurrences),
pnpm/action-setup@v6, actions/setup-node@v6, actions/upload-artifact@v7 (all
occurrences), and actions/download-artifact@v7 to owner/repo@<full-commit-sha>
form; locate each occurrence by the exact uses string (e.g.,
"actions/checkout@v6", "bencherdev/bencher@v0.6.2", etc.) and replace the tag
with the corresponding full commit SHA from the action's repository, ensuring
every duplicate occurrence is updated and commit SHAs are immutable values.
- Around line 430-435: The workflow step named "Download first-run candidate"
currently can abort the job on failure and prevents the Ruby confirmation
handler from running; update that step to allow the job to continue on
missing/corrupt artifacts by adding continue-on-error: true (or enabling the
action’s native missing-artifact option) so the workflow proceeds to run
benchmarks/track_benchmarks.rb which expects load_candidate to return [nil, ""]
and then executes run_confirmation.

In @.github/workflows/benchmark.yml:
- Line 210: Replace the floating tag uses: actions/download-artifact@v7 with a
pinned commit SHA for both occurrences where the workflow references the
download-artifact action (search for the literal "uses:
actions/download-artifact@v7"); update each to the specific commit SHA form
actions/download-artifact@<commit-sha> (obtain the correct commit SHA from the
actions/download-artifact repository) so both download-artifact steps are
immutably pinned.
- Around line 179-183: Replace the broad secrets inheritance with an explicit
secrets mapping in the reusable workflow call: remove "secrets: inherit" in
.github/workflows/benchmark.yml (the step using the reusable workflow with with:
matrix_row/mode) and instead provide a limited secrets: mapping that only passes
REACT_ON_RAILS_PRO_LICENSE_V2, BENCHER_API_TOKEN, and GITHUB_TOKEN to the
reusable workflow; ensure the reusable workflow (benchmark-suite.yml) expects
those exact inputs. Additionally, update action pins in both
.github/workflows/benchmark.yml and benchmark-suite.yml (e.g.,
actions/download-artifact@v7, actions/checkout@v6) to use reviewed commit SHAs
rather than tag versions to comply with the supply-chain policy.

In `@benchmarks/plan_confirmation.rb`:
- Around line 33-38: The load_candidate_payload method currently parses any JSON
and returns it, which can later cause build_plan to crash when it expects a
Hash; update load_candidate_payload to validate that the parsed value is a Hash
(and contains required keys used by build_plan such as the matrix/fields you
later index) and return nil if the check fails so the existing
inconclusive/unreadable-candidate handling is used; reference
load_candidate_payload and build_plan to locate where to enforce
parsed.is_a?(Hash) and required key presence and ensure you still rescue
StandardError and warn with the same error message when parsing fails.

In `@benchmarks/report_regressions.rb`:
- Around line 366-373: The load_payload function currently parses JSON but
doesn't validate structure; update load_payload to ensure the parsed value is a
Hash and contains required keys (e.g., "suite_name", plus any other keys
expected later around where payload is used at lines 410-411) and return nil if
the schema is invalid; specifically, after JSON.parse in load_payload check that
the result.is_a?(Hash) and required_keys.all? { |k| result.key?(k) } and if not
warn with the same ::error:: message (including e.class/message when parse
fails) and return nil so schema-invalid payloads follow the existing error path
instead of causing crashes when consumers of load_payload (the code that reads
payload["suite_name"] etc.) assume a Hash.

In `@benchmarks/track_benchmarks.rb`:
- Around line 335-347: The current regressed_alert_pairs method drops alerts
with no benchmark (report.alerts.select(&:benchmark)), which lets incomplete
payloads be treated later as empty matches and classified as :cleared; instead
preserve unmatchable alerts and surface them as inconclusive so they cannot be
noise-cleared. Change regressed_alert_pairs to include alerts with nil benchmark
(don’t filter them out) and convert those to an explicit inconclusive sentinel
(e.g. call or construct a RegressionReport.inconclusive(alert) or return a pair
with a nil benchmark/measure marker) so downstream matching logic can detect and
return :inconclusive; also update the empty-match handling (the classifier
currently emitting :cleared for empty matches) to return :inconclusive when it
sees that sentinel or a nil benchmark/measure pair. Ensure you reference and
modify RegressionReport.alert usage and the empty-match classification code to
recognize the new inconclusive sentinel.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 155fe7b5-0611-4415-ac42-9e1b825cc540

📥 Commits

Reviewing files that changed from the base of the PR and between 2c2cbb7 and acc4f36.

📒 Files selected for processing (10)
  • .github/workflows/benchmark-suite.yml
  • .github/workflows/benchmark.yml
  • benchmarks/lib/regression_report.rb
  • benchmarks/plan_confirmation.rb
  • benchmarks/report_regressions.rb
  • benchmarks/spec/plan_confirmation_spec.rb
  • benchmarks/spec/regression_report_spec.rb
  • benchmarks/spec/report_regressions_spec.rb
  • benchmarks/spec/track_benchmarks_spec.rb
  • benchmarks/track_benchmarks.rb


steps:
- name: Checkout repository
uses: actions/checkout@v6

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
if [ ! -f "$FILE" ]; then
  echo "Missing $FILE"
  exit 1
fi
echo "== uses: lines =="
rg -n "uses:\s*[^@]+@[^ ]+" "$FILE"
echo
echo "== line excerpts around specified lines =="
for L in 83 98 107 114 139 149 156 418 432 451 462; do
  echo "--- $L ---"
  sed -n "$((L-3)),$((L+3))p" "$FILE" | nl -ba
done

Repository: shakacode/react_on_rails

Length of output: 659


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
if [ ! -f "$FILE" ]; then
  echo "Missing $FILE"
  exit 1
fi
echo "== uses: lines =="
rg -n "uses:\s*[^@]+@[^ ]+" "$FILE"
echo
echo "== line excerpts around specified lines =="
for L in 83 98 107 114 139 149 156 418 432 451 462; do
  echo "--- $L ---"
  sed -n "$((L-3)),$((L+3))p" "$FILE" | nl -ba
done

Repository: shakacode/react_on_rails

Length of output: 659


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
rg -n "uses:\s*[^@]+@[^ ]+" "$FILE" || true
for L in 83 98 107 114 139 149 156 418 432 451 462; do
  echo "--- $L ---"
  sed -n "$((L-3)),$((L+3))p" "$FILE" | nl -ba
done

Repository: shakacode/react_on_rails

Length of output: 596


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
[ -f "$FILE" ] || { echo "Missing $FILE"; exit 1; }

echo "== uses: entries =="
rg -n "^\s*uses:\s*[^@]+@[^ ]+" "$FILE"

echo
echo "== Excerpts around lines =="
for L in 83 98 107 114 139 149 156 418 432 451 462; do
  echo "--- $L ---"
  sed -n "$((L-2)),$((L+2))p" "$FILE" | nl -ba
done

Repository: shakacode/react_on_rails

Length of output: 646


Pin GitHub Actions uses: entries to full commit SHAs (not tags) to prevent tag retargeting.
In .github/workflows/benchmark-suite.yml, these uses: lines use tag refs instead of SHAs: actions/checkout@v6 (83), bencherdev/bencher@v0.6.2 (98), grafana/setup-k6-action@v1 (107), actions/cache@v5 (114,139), pnpm/action-setup@v6 (149), actions/setup-node@v6 (156), actions/upload-artifact@v7 (418,451,462), actions/download-artifact@v7 (432). Pin each to owner/repo@<full-commit-sha>.

🧰 Tools
🪛 zizmor (1.25.2)

[error] 83-83: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark-suite.yml at line 83, Replace all GitHub Actions
uses: entries that reference tags with full commit SHAs: update
actions/checkout@v6, bencherdev/bencher@v0.6.2, grafana/setup-k6-action@v1,
actions/cache@v5 (both occurrences), pnpm/action-setup@v6,
actions/setup-node@v6, actions/upload-artifact@v7 (all occurrences), and
actions/download-artifact@v7 to owner/repo@<full-commit-sha> form; locate each
occurrence by the exact uses string (e.g., "actions/checkout@v6",
"bencherdev/bencher@v0.6.2", etc.) and replace the tag with the corresponding
full commit SHA from the action's repository, ensuring every duplicate
occurrence is updated and commit SHAs are immutable values.

Source: Linters/SAST tools

Comment on lines +430 to +435
- name: Download first-run candidate
if: inputs.mode == 'confirm'
uses: actions/download-artifact@v7
with:
name: ${{ env.CANDIDATE_ARTIFACT_NAME }}
path: candidate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the referenced section of the workflow
sed -n '380,480p' .github/workflows/benchmark-suite.yml | cat -n

# Locate and inspect track_benchmarks.rb
fd -a "track_benchmarks\.rb" . || true
TRACK_FILE="$(fd -a "track_benchmarks\.rb" . | head -n 1 || true)"
if [ -n "${TRACK_FILE}" ]; then
  echo "Using: $TRACK_FILE"
  # Search for missing-candidate handling and run_confirmation classification
  rg -n "missing|candidate|inconclusive|run_confirmation|confirmation" "$TRACK_FILE" || true
  # Show surrounding relevant code blocks (best-effort)
  # (Print a few chunks around any matches)
  for pat in "run_confirmation" "inconclusive" "missing" "candidate"; do
    echo "---- Matches for $pat ----"
    rg -n --context 5 "$pat" "$TRACK_FILE" || true
  done
else
  echo "track_benchmarks.rb not found via fd"
fi

Repository: shakacode/react_on_rails

Length of output: 30976


Avoid short-circuiting the confirmation failure path in benchmark-suite.yml

track_benchmarks.rb already treats a missing/corrupt downloaded candidate as :inconclusive (it returns [nil, ""] from load_candidate, writes the inconclusive confirmation summary, and exits 1). If actions/download-artifact@v7 fails instead (e.g., candidate artifact missing), the job would stop before run_confirmation runs, so the Ruby handler can’t apply that logic. Add continue-on-error: true to the download step (or use the step’s native “missing artifact” option) so the workflow can reach ruby benchmarks/track_benchmarks.rb.

Suggested change
       - name: Download first-run candidate
         if: inputs.mode == 'confirm'
+        continue-on-error: true
         uses: actions/download-artifact@v7
         with:
           name: ${{ env.CANDIDATE_ARTIFACT_NAME }}
           path: candidate
🧰 Tools
🪛 zizmor (1.25.2)

[error] 432-432: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark-suite.yml around lines 430 - 435, The workflow
step named "Download first-run candidate" currently can abort the job on failure
and prevents the Ruby confirmation handler from running; update that step to
allow the job to continue on missing/corrupt artifacts by adding
continue-on-error: true (or enabling the action’s native missing-artifact
option) so the workflow proceeds to run benchmarks/track_benchmarks.rb which
expects load_candidate to return [nil, ""] and then executes run_confirmation.

Comment on lines +179 to +183
uses: ./.github/workflows/benchmark-suite.yml
with:
matrix_row: ${{ toJSON(matrix) }}
mode: initial
secrets: inherit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show benchmark.yml around the mentioned lines
echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'

echo "=== benchmark.yml lines 270-330 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,340p'

# Locate all occurrences of "secrets: inherit" in benchmark.yml
echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml

# Inspect reusable workflow benchmark-suite.yml around its workflow_call secrets
if [ -f .github/workflows/benchmark-suite.yml ]; then
  echo "=== benchmark-suite.yml on.workflow_call secrets (search) ==="
  rg -n "workflow_call:|secrets:" .github/workflows/benchmark-suite.yml
  echo "=== benchmark-suite.yml lines 1-220 ==="
  nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,220p'
  echo "=== benchmark-suite.yml lines 220-400 ==="
  nl -ba .github/workflows/benchmark-suite.yml | sed -n '220,400p'
else
  echo "benchmark-suite.yml not found at .github/workflows/benchmark-suite.yml"
fi

# Also check if any other reusable workflow files are used
echo "=== benchmark.yml reusable workflow uses (workflow_call) ==="
rg -n "uses:\s*\./\.github/workflows/" .github/workflows/benchmark.yml

Repository: shakacode/react_on_rails

Length of output: 165


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'

echo "=== benchmark.yml lines 270-360 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,360p'

echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml || true

# Determine exact path of reusable workflow being called
# It's referenced as "./.github/workflows/benchmark-suite.yml" in the snippet.
echo "=== check existence of reusable workflow ==="
ls -la .github/workflows/benchmark-suite.yml .github/workflows/benchmark-suite*.yml .github/workflows/benchmark-suite*.yaml 2>/dev/null || true

# Inspect reusable workflow benchmark-suite.yml
if [ -f .github/workflows/benchmark-suite.yml ]; then
  echo "=== benchmark-suite.yml: header through workflow_call secrets ==="
  # Print first 160 lines and any workflow_call secrets block
  nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,200p'
  echo "=== benchmark-suite.yml: workflow_call secrets section (search) ==="
  rg -n "workflow_call:|secrets:" .github/workflows/benchmark-suite.yml
  echo "=== benchmark-suite.yml: around secrets block ==="
  # Show context around secrets: definitions
  rg -n "on:\s*workflow_call" -n .github/workflows/benchmark-suite.yml || true
  # Use grep for "secrets:" and show 40 lines around each match
  while read -r line; do
    ln=$(echo "$line" | cut -d: -f1)
    start=$((ln-20))
    end=$((ln+60))
    echo "--- benchmark-suite.yml around line $ln ---"
    nl -ba .github/workflows/benchmark-suite.yml | sed -n "${start},${end}p"
  done < <(rg -n "secrets:" .github/workflows/benchmark-suite.yml | head -n 20 || true)

  echo "=== benchmark-suite.yml: search for secret usages ==="
  rg -n "BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2|GITHUB_TOKEN|\bsecrets\." .github/workflows/benchmark-suite.yml || true
fi

# Also check if benchmark-suite.yml uses any action without pinning (sha)
# We'll just check uses lines that reference github actions. Without pinning means using `@v1` or `@v2` etc.
if [ -f .github/workflows/benchmark-suite.yml ]; then
  echo "=== benchmark-suite.yml: action uses without SHA (heuristic) ==="
  rg -n "uses:\s*[^#\n]*`@v`[0-9]|uses:\s*[^#\n]*`@v`[0-9]+|uses:\s*[^#\n]*`@master`|uses:\s*[^#\n]*`@main`" .github/workflows/benchmark-suite.yml || true
fi

Repository: shakacode/react_on_rails

Length of output: 165


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'

echo "=== benchmark.yml lines 270-360 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,360p'

echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml || true

echo "=== check existence of reusable workflow ==="
ls -la .github/workflows/benchmark-suite.yml 2>/dev/null || true

if [ -f .github/workflows/benchmark-suite.yml ]; then
  echo "=== benchmark-suite.yml lines 1-240 ==="
  nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,240p'
  echo "=== benchmark-suite.yml: locate workflow_call secrets definitions ==="
  rg -n "on:\s*workflow_call|secrets:" .github/workflows/benchmark-suite.yml
  echo "=== benchmark-suite.yml: locate secret usages ==="
  rg -n "secrets\.|BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2" .github/workflows/benchmark-suite.yml || true
  echo "=== benchmark-suite.yml: uses lines and sha pinning heuristic ==="
  # Show uses lines (excluding local ./) to judge pinning
  rg -n "uses:\s*(?!\./)([^ ]+)" .github/workflows/benchmark-suite.yml | head -n 80
  echo "=== benchmark-suite.yml: uses lines without commit SHA heuristic ==="
  rg -n "uses:\s*(?!\./).*@((v|V)[0-9]+|[a-zA-Z]+)$" .github/workflows/benchmark-suite.yml || true
else
  echo "benchmark-suite.yml not found"
fi

Repository: shakacode/react_on_rails

Length of output: 165


🏁 Script executed:

# re-run with a single script (avoid duplication) 
set -euo pipefail

echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'

echo "=== benchmark.yml lines 270-360 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,360p'

echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml || true

echo "=== reusable workflow file listing matching benchmark-suite ==="
fd -a "benchmark-suite*.*ya?ml" .github/workflows || true

if [ -f .github/workflows/benchmark-suite.yml ]; then
  echo "=== benchmark-suite.yml lines 1-260 ==="
  nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,260p'
  echo "=== benchmark-suite.yml workflow_call secrets definitions ==="
  rg -n "on:\s*workflow_call|secrets:" .github/workflows/benchmark-suite.yml
  echo "=== benchmark-suite.yml secret usages ==="
  rg -n "secrets\.|BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2" .github/workflows/benchmark-suite.yml || true
else
  echo "benchmark-suite.yml not present at expected path"
fi

Repository: shakacode/react_on_rails

Length of output: 165


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file=".github/workflows/benchmark.yml"

echo "=== benchmark.yml: occurrences of 'secrets: inherit' ==="
rg -n "secrets:\s*inherit" "$file" || true

echo "=== benchmark.yml: show around first referenced area (170-210) ==="
awk 'NR>=170 && NR<=210 {printf "%d\t%s\n", NR, $0}' "$file" || true

echo "=== benchmark.yml: show around second referenced area (230-270) ==="
awk 'NR>=230 && NR<=270 {printf "%d\t%s\n", NR, $0}' "$file" || true

echo "=== benchmark.yml: show reusable workflow call sites ==="
rg -n "uses:\s*\./\.github/workflows/" "$file" || true


suite=".github/workflows/benchmark-suite.yml"
echo "=== check existence of benchmark-suite.yml ==="
ls -la "$suite" 2>/dev/null || true

if [ -f "$suite" ]; then
  echo "=== benchmark-suite.yml: locate on.workflow_call and secrets definitions ==="
  rg -n "on:\s*workflow_call|secrets:" "$suite" || true

  echo "=== benchmark-suite.yml: print beginning (1-220) ==="
  awk 'NR>=1 && NR<=220 {printf "%d\t%s\n", NR, $0}' "$suite" || true

  echo "=== benchmark-suite.yml: search for secret consumption (expected secrets) ==="
  rg -n "BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2|GITHUB_TOKEN|secrets\." "$suite" || true

  echo "=== benchmark-suite.yml: detect third-party action uses (exclude local ./) ==="
  rg -n "uses:\s*(?!\./)([^#\n]+)" "$suite" || true

  echo "=== benchmark-suite.yml: heuristic check for unpinned action refs (`@v1/`@v2/@master) ==="
  rg -n "uses:\s*(?!\./).*@((v)[0-9]+|master|main)" "$suite" || true
fi

Repository: shakacode/react_on_rails

Length of output: 15640


🏁 Script executed:

#!/bin/bash
set -euo pipefail

suite=".github/workflows/benchmark-suite.yml"
file=".github/workflows/benchmark.yml"

echo "=== benchmark-suite.yml: lines containing 'secrets:' (declarations only) ==="
rg -n "secrets:" "$suite" || true

echo "=== benchmark-suite.yml: lines containing 'workflow_call' ==="
rg -n "workflow_call" "$suite" || true

echo "=== benchmark-suite.yml: show around lines ~35-90 (env and secret usage start) ==="
awk 'NR>=40 && NR<=90 {printf "%d\t%s\n", NR, $0}' "$suite"

echo "=== benchmark-suite.yml: show around lines 420-480 (secret env mapping) ==="
awk 'NR>=420 && NR<=480 {printf "%d\t%s\n", NR, $0}' "$suite"

echo "=== benchmark.yml: show around 200-220 (first download-artifact step) ==="
awk 'NR>=195 && NR<=220 {printf "%d\t%s\n", NR, $0}' "$file"

echo "=== benchmark.yml: show around 265-290 (second download-artifact step) ==="
awk 'NR>=260 && NR<=290 {printf "%d\t%s\n", NR, $0}' "$file"

Repository: shakacode/react_on_rails

Length of output: 8503


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for any repo policy about pinning actions to SHAs
rg -n "pin(ned)?( to)? (commit|SHA|sh a|hash)|SHA-pinned|commit SHA|pin to .*@([0-9a-f]{7,40})" .github README* CONTRIBUTING* docs* package* 2>/dev/null || true

# Also look for general "actions are pinned" statements
rg -n "pinned.*action|actions/.*@([0-9a-f]{7,40})|`@v`[0-9]+(?![^#]*[0-9a-f]{7,40})" .github/workflows 2>/dev/null || true

# Check whether benchmark.yml or benchmark-suite.yml contains any action refs that are already commit-SHA pinned
rg -n "uses:\s*([^#\n]+)@([0-9a-f]{7,40})" .github/workflows/benchmark.yml .github/workflows/benchmark-suite.yml || true

Repository: shakacode/react_on_rails

Length of output: 496


Stop passing all secrets to the reusable benchmark workflow and align action pinning with supply-chain policy.

  • In .github/workflows/benchmark.yml, replace secrets: inherit (lines 183 and 245) with an explicit secrets: mapping limited to the secrets benchmark-suite.yml actually uses: REACT_ON_RAILS_PRO_LICENSE_V2, BENCHER_API_TOKEN, and GITHUB_TOKEN (instead of exposing the full repo secret set).
  • .github/workflows/benchmark.yml (and benchmark-suite.yml) use tag-based actions like actions/download-artifact@v7 / actions/checkout@v6; docs/pro/license-ci-integration.md asks for third-party actions to be pinned to reviewed commit SHAs.
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 179-179: secrets unconditionally inherited by called workflow (secrets-inherit): this reusable workflow

(secrets-inherit)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark.yml around lines 179 - 183, Replace the broad
secrets inheritance with an explicit secrets mapping in the reusable workflow
call: remove "secrets: inherit" in .github/workflows/benchmark.yml (the step
using the reusable workflow with with: matrix_row/mode) and instead provide a
limited secrets: mapping that only passes REACT_ON_RAILS_PRO_LICENSE_V2,
BENCHER_API_TOKEN, and GITHUB_TOKEN to the reusable workflow; ensure the
reusable workflow (benchmark-suite.yml) expects those exact inputs.
Additionally, update action pins in both .github/workflows/benchmark.yml and
benchmark-suite.yml (e.g., actions/download-artifact@v7, actions/checkout@v6) to
use reviewed commit SHAs rather than tag versions to comply with the
supply-chain policy.

Source: Linters/SAST tools

if: always()
uses: actions/upload-artifact@v7
- name: Download regression candidates
uses: actions/download-artifact@v7

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE=".github/workflows/benchmark.yml"

echo "== File existence =="
ls -la "$FILE" || exit 1

echo
echo "== Context around line 210 =="
nl -ba "$FILE" | sed -n '195,230p'

echo
echo "== Context around line 272 =="
nl -ba "$FILE" | sed -n '255,290p'

echo
echo "== All uses: entries in this workflow (for tag vs SHA pin check) =="
nl -ba "$FILE" | rg -n "uses:\s*[^#]+" || true

Repository: shakacode/react_on_rails

Length of output: 264


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE=".github/workflows/benchmark.yml"

python3 - <<'PY'
import itertools
path = ".github/workflows/benchmark.yml"
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

def show(start, end):
    start_i = max(1, start)
    end_i = min(len(lines), end)
    for i in range(start_i-1, end_i):
        print(f"{i+1:4d}: {lines[i].rstrip()}")

print("== Context around line 210 ==")
show(195, 230)

print("\n== Context around line 272 ==")
show(255, 290)

print("\n== All uses: entries in this workflow ==")
for i, line in enumerate(lines, start=1):
    if "uses:" in line and not line.strip().startswith("#"):
        # print the uses line and its line number
        print(f"{i}: {line.rstrip()}")
PY

Repository: shakacode/react_on_rails

Length of output: 3424


Pin actions/download-artifact to commit SHAs (not @v7 tags).

Both artifact download steps use actions/download-artifact@v7, so they remain vulnerable to upstream tag moves. Update to actions/download-artifact@<commit-sha>.

Applies to: 210-210, 272-272

🧰 Tools
🪛 zizmor (1.25.2)

[error] 210-210: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/benchmark.yml at line 210, Replace the floating tag uses:
actions/download-artifact@v7 with a pinned commit SHA for both occurrences where
the workflow references the download-artifact action (search for the literal
"uses: actions/download-artifact@v7"); update each to the specific commit SHA
form actions/download-artifact@<commit-sha> (obtain the correct commit SHA from
the actions/download-artifact repository) so both download-artifact steps are
immutably pinned.

Source: Linters/SAST tools

Comment thread benchmarks/plan_confirmation.rb
Comment thread benchmarks/report_regressions.rb
Comment on lines +335 to +347
# The structured benchmark+measure pairs Bencher raised an active alert for, deduped.
# Read from the same alerts[] as #regression?/#regressed_benchmark_names. Handed off in
# the candidate so a confirmation rerun can require the SAME pair to re-alert (not just
# any alert in the suite). An alert with no benchmark name is dropped — it can't be
# matched. measure may be nil; the matcher falls back to name-only for those.
def regressed_alert_pairs(report)
return [] unless report

report.alerts
.select(&:benchmark)
.map { |alert| RegressionReport.alert(alert.benchmark, alert.measure) }
.uniq
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat unmatchable regression alerts as inconclusive instead of cleared.

At Line 344, alerts without a benchmark are dropped, and then Line 387 classifies empty matches as :cleared. That can silently clear a real first-run alert when the payload is incomplete (no benchmark identifier), even though it is not actually confirmable. This path should return :inconclusive (operational failure) rather than noise-cleared.

Also applies to: 379-389

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/track_benchmarks.rb` around lines 335 - 347, The current
regressed_alert_pairs method drops alerts with no benchmark
(report.alerts.select(&:benchmark)), which lets incomplete payloads be treated
later as empty matches and classified as :cleared; instead preserve unmatchable
alerts and surface them as inconclusive so they cannot be noise-cleared. Change
regressed_alert_pairs to include alerts with nil benchmark (don’t filter them
out) and convert those to an explicit inconclusive sentinel (e.g. call or
construct a RegressionReport.inconclusive(alert) or return a pair with a nil
benchmark/measure marker) so downstream matching logic can detect and return
:inconclusive; also update the empty-match handling (the classifier currently
emitting :cleared for empty matches) to return :inconclusive when it sees that
sentinel or a nil benchmark/measure pair. Ensure you reference and modify
RegressionReport.alert usage and the empty-match classification code to
recognize the new inconclusive sentinel.

@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Copy link
Copy Markdown

Greptile Summary

This PR implements a two-run confirmation gate for main-push benchmark regressions: a Bencher alert on the initial run is now a non-fatal candidate that must re-alert on an independent fresh-runner rerun before any GitHub issue is filed. The pipeline is extracted into a reusable workflow (benchmark-suite.yml) shared by both runs to eliminate drift between the initial and confirmation steps.

  • New 4-job pipeline in benchmark.yml: benchmark (initial, non-fatal on alert) → plan-confirmation (gate: drops ignored benchmarks, emits confirmation matrix) → confirm-regressions (fresh-runner rerun on a synthetic Bencher branch that never pollutes main's series) → report-regressions (files issue with side-by-side first-run + confirmation evidence, owns final pass/fail).
  • benchmarks/lib/regression_report.rb becomes the shared hand-off contract (candidate/confirmed filenames, structured alert pairs, IGNORED_REGRESSION_BENCHMARKS moved here from report_regressions.rb), with alerts_match? and confirmed_alerts helpers implementing the same-benchmark+measure matching rule including a nil-measure name-only fallback.
  • 26 new specs across track_benchmarks_spec, plan_confirmation_spec, and regression_report_spec cover the new confirmation mode outcomes; report_regressions_spec is reworked for the confirmed-artifact contract and now asserts that a confirmed regression correctly fails the workflow.

Confidence Score: 3/5

The Ruby scripts and specs are well-structured and the hand-off logic is sound, but both download-artifact glob steps in benchmark.yml lack continue-on-error: true, which will likely cause the confirmation pipeline to fail on every clean main push and every cleared-noise run — the two most common outcomes.

The plan-confirmation and report-regressions jobs each use actions/download-artifact@v7 with a pattern: glob but no fallback for zero matches. On any main push where no benchmarks regressed (or where alerts are cleared on confirmation), those download steps will fail before the handling scripts run, causing the workflow to turn red in all the cases it should be green. This would trigger false broken CI signals on every clean commit to main until fixed.

benchmark.yml lines 209–213 (Download regression candidates) and 271–276 (Download confirmed regressions) need continue-on-error: true; the Ruby scripts are otherwise ready.

Important Files Changed

Filename Overview
.github/workflows/benchmark-suite.yml New reusable workflow extracting all benchmark suite steps into a single callable unit; correctly gates candidate/confirmed upload on mode; the Download first-run candidate step (confirm mode) downloads by exact name and won't fail on an empty match since it's always expected to exist. Step ordering is correct.
.github/workflows/benchmark.yml Orchestration refactored into a 4-job confirmation pipeline; both Download regression candidates (plan-confirmation) and Download confirmed regressions (report-regressions) use pattern: glob downloads without continue-on-error: true, which will likely fail when no matching artifacts exist — the normal case on clean runs and cleared-noise runs.
benchmarks/lib/regression_report.rb Shared module expanded with candidate/confirmed filenames, structured alert helpers, and IGNORED_REGRESSION_BENCHMARKS moved here from report_regressions.rb; logic is clean, helpers are well-tested, and the measure normalisation correctly handles slug vs name drift.
benchmarks/plan_confirmation.rb New gate script that reads candidates, short-circuits ignored benchmarks, and emits a confirmation matrix; handles empty, corrupt, and unmatched candidates correctly. The script itself is robust but depends on the download step in benchmark.yml not failing first.
benchmarks/report_regressions.rb Updated to consume confirmed payloads with first-run/confirmation summaries side by side; old IGNORED_REGRESSION_BENCHMARKS filtering removed (now lives in plan_confirmation); exit-code contract correctly changed so confirmed regressions fail the workflow here.
benchmarks/track_benchmarks.rb Confirmation mode added with clean separation from initial mode; synthetic branch naming prevents main series pollution; edge case where candidate_alerts is empty after filtering causes :cleared instead of :inconclusive (P2 comment left inline).
benchmarks/spec/plan_confirmation_spec.rb New spec providing good coverage of the gate: empty/corrupt/unmatched candidates, ignored benchmark short-circuit, mixed suppressed+confirmed, and GITHUB_OUTPUT emission all tested.
benchmarks/spec/regression_report_spec.rb New spec pins alert normalisation, matching (including slug/name equivalence and nil-measure fallback), confirmed_alerts intersection, and ignore-list filtering.
benchmarks/spec/report_regressions_spec.rb Reworked for confirmed-artifact contract; correctly adds tests asserting exit 1 on confirmed regression and verifying first-run/confirmation evidence rendered side by side in the issue body.
benchmarks/spec/track_benchmarks_spec.rb 26 new examples covering confirmation mode: branch naming, start-point args, alert pair extraction, and all four outcome classifications are unit-tested.

Reviews (1): Last reviewed commit: "Confirm benchmark regressions on a fresh..." | Re-trigger Greptile

Comment on lines +209 to +213
- name: Download regression candidates
uses: actions/download-artifact@v7
with:
name: regression-${{ matrix.artifact_name_prefix }}-${{ github.run_number }}${{ matrix.artifact_name_suffix }}
path: bench_results/regression.json
if-no-files-found: ignore
retention-days: 7

- name: Benchmark workflow summary
if: always()
# Route context/matrix values through env so the shell treats them as data,
# not code (GitHub Actions injection hardening; ref_name can be attacker-set).
path: candidate-artifacts
pattern: regression-candidate-*

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 download-artifact with a glob pattern may fail when no candidates exist

actions/download-artifact@v4+ using pattern: (no name:) fails with an error when the glob matches zero artifacts. On any clean main push where no suite raised a Bencher alert, no regression-candidate-* artifacts are uploaded, so this step would fail before plan_confirmation.rb even runs. The job terminates here, emitting no outputs; confirm-regressions is skipped; report-regressions runs under always(), finds nothing, and exits 0 — but the workflow is already marked red from the failing plan-confirmation step. The same pattern appears in the report-regressions job (line 274–275) for regression-confirmed-*, meaning any run without confirmed regressions also fails the download step.

Adding continue-on-error: true to both download steps lets the scripts proceed and handle the empty-directory case they already cover correctly (plan_confirmation.rb returns true with has_confirmations=false when the directory is empty).

Comment on lines +271 to 276
- name: Download confirmed regressions
uses: actions/download-artifact@v7
with:
path: regression-artifacts
pattern: regression-*
path: confirmed-artifacts
pattern: regression-confirmed-*
# Stdlib-only script, so the runner's system Ruby is enough (no Bundler/Rails boot).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Same empty-pattern failure risk for report-regressions

Like the plan-confirmation download (line 209–213), this step will fail on any run where no regression-confirmed-* artifact was uploaded — i.e. whenever every alert was cleared as noise or the confirm-regressions job was skipped entirely. Both the cleared and the clean cases should produce a green workflow, but the failing download step prevents report_regressions.rb from running and marks the workflow red. Adding continue-on-error: true here (and at line 209) allows the scripts to reach their empty-directory short-circuits as intended.

Comment on lines 332 to +347
report.alerts.filter_map(&:benchmark).uniq
end

# The structured benchmark+measure pairs Bencher raised an active alert for, deduped.
# Read from the same alerts[] as #regression?/#regressed_benchmark_names. Handed off in
# the candidate so a confirmation rerun can require the SAME pair to re-alert (not just
# any alert in the suite). An alert with no benchmark name is dropped — it can't be
# matched. measure may be nil; the matcher falls back to name-only for those.
def regressed_alert_pairs(report)
return [] unless report

report.alerts
.select(&:benchmark)
.map { |alert| RegressionReport.alert(alert.benchmark, alert.measure) }
.uniq
end

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Empty ALERTS in a candidate silently clears the confirmation

load_candidate normalises a missing or non-Array alerts value to []. When candidate_alerts is [], actionable_alerts([]) returns [], and confirmed_alerts([], ...) returns [], so confirmation_outcome returns :cleared — even if the confirmation run itself did re-alert. This path can be reached if all active alerts in the initial run lacked a benchmark name (dropped by regressed_alert_pairs), which would cause the suite to be silently cleared instead of treated as inconclusive. Returning [:inconclusive, []] when candidate_alerts is empty after actionable filtering would fail safer, consistent with the "can't match what we can't name" intent stated in the docs.

Comment thread benchmarks/track_benchmarks.rb Outdated

parsed = JSON.parse(File.read(path))
alerts = parsed[RegressionReport::ALERTS]
alerts = [] unless alerts.is_a?(Array)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When ALERTS is missing from the candidate payload (key absent or non-array), this falls back to []. The problem: confirmation_outcome then calls confirmed_alerts(actionable_alerts([]), ...) which always returns [], and the outcome is :cleared — silently treating the regression as noise.

This is reachable if regressed_alert_pairs returns [] (all Bencher alerts lacked a benchmark name and were dropped by .select(&:benchmark)) while regression?(report) was still true, causing write_candidate to write ALERTS: [].

Consider returning [nil, summary] (inconclusive) instead of [[], summary] when the array is empty/absent — a candidate with no matchable alert pairs cannot be confirmed or cleared on evidence:

Suggested change
alerts = [] unless alerts.is_a?(Array)
alerts = parsed[RegressionReport::ALERTS]
# Treat absent/empty ALERTS as inconclusive: without matchable pairs the
# confirmation can neither confirm nor clear on evidence, only on speculation.
unless alerts.is_a?(Array) && !alerts.empty?
warn "::error::Candidate has empty or missing ALERTS (#{RegressionReport::ALERTS}); " \
"treating the confirmation as inconclusive."
return [nil, parsed[RegressionReport::SUMMARY].to_s]
end

# NO alert is an operational failure, not a regression, and still fails the suite.
def report_main_push_candidate(report, report_markdown, bencher_exit_code, suite_name)
if regression?(report)
exit 1 unless write_candidate(report, report_markdown, suite_name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here is that a successful write_candidate makes the initial run non-fatal (exit 0), but this is subtle: when write_candidate returns true the unless is skipped, the function returns normally, the surrounding if main_push? && bencher_exit_code != 0 block ends, and the script falls through to exit 0. A reader seeing the exit 1 on this line may be confused about how a Bencher regression alert produces a green suite.

Consider making the non-fatal exit explicit:

Suggested change
exit 1 unless write_candidate(report, report_markdown, suite_name)
# write_candidate succeeded: the alert is now a non-fatal candidate awaiting
# fresh-runner confirmation. Exit 0 so this suite stays green; report-regressions
# owns the final pass/fail.
exit 1 unless write_candidate(report, report_markdown, suite_name)
# (write_candidate returned true — fall through to exit 0)


def set_output(name, value)
path = ENV.fetch("GITHUB_OUTPUT", nil)
File.write(path, "#{name}=#{value}\n", mode: "a") if path && !path.empty?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON.generate always produces compact single-line JSON, so there is no newline risk today. For future-proofing, GitHub Actions supports the multiline heredoc format for GITHUB_OUTPUT which handles any embedded newlines:

Suggested change
File.write(path, "#{name}=#{value}\n", mode: "a") if path && !path.empty?
if path && !path.empty?
# Use the multiline heredoc delimiter so embedded newlines in `value`
# (e.g. if JSON ever becomes pretty-printed) don't corrupt the output file.
File.open(path, "a") { |f| f.write("#{name}<<__GHEOF__\n#{value}\n__GHEOF__\n") }
end

Not a bug today, but worth the one-line change to avoid a silent breakage if the matrix JSON ever gains newlines.

runs-on: ubuntu-24.04
permissions:
contents: read
issues: write

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issues: write and pull-requests: write are only needed for initial mode (posting PR comments and — indirectly — enabling the downstream report-regressions job to file issues). In confirm mode this job only submits to Bencher and writes artifacts; it never touches issues or PRs directly.

GitHub Actions does not support conditional permission blocks, so both modes get the full set. A short comment here would make this over-grant intentional and auditable:

Suggested change
issues: write
# issues: write and pull-requests: write are used by initial mode only
# (track_benchmarks.rb posts PR summary comments). confirm mode does not post
# to PRs or file issues, but workflow YAML does not support conditional
# permission blocks, so both modes inherit the full set.
issues: write
pull-requests: write

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benchmarks/spec/track_benchmarks_spec.rb`:
- Around line 187-193: The long test line calling write_candidate with
RegressionReport::ALERTS and RegressionReport::SUMMARY exceeds 120 chars; break
that call into multiple lines using Ruby method continuation (e.g., wrap
arguments in parentheses or pass a multi-line hash) so each hash key-value pair
(RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first
run") is on its own line and the overall line length is under 120 chars; keep
the surrounding Dir.mktmpdir block and the expect(load_candidate(dir)) assertion
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 79d49ece-155d-4367-a262-fa9ace62799f

📥 Commits

Reviewing files that changed from the base of the PR and between acc4f36 and 985121d.

📒 Files selected for processing (6)
  • benchmarks/plan_confirmation.rb
  • benchmarks/report_regressions.rb
  • benchmarks/spec/plan_confirmation_spec.rb
  • benchmarks/spec/report_regressions_spec.rb
  • benchmarks/spec/track_benchmarks_spec.rb
  • benchmarks/track_benchmarks.rb
🚧 Files skipped from review as they are similar to previous changes (4)
  • benchmarks/spec/plan_confirmation_spec.rb
  • benchmarks/report_regressions.rb
  • benchmarks/plan_confirmation.rb
  • benchmarks/track_benchmarks.rb

Comment on lines +187 to +193
it "treats non-array alerts as inconclusive" do
Dir.mktmpdir do |dir|
write_candidate(dir, RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first run")

expect(load_candidate(dir)).to eq([nil, "first run"])
end
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Break the long line to comply with the 120-character limit.

Line 189 exceeds the 120-character limit specified in the coding guidelines for Ruby code. As per coding guidelines, Ruby code should have a maximum line length of 120 characters and long chains should be broken using method continuation syntax.

📏 Proposed fix to break the long line
     it "treats non-array alerts as inconclusive" do
       Dir.mktmpdir do |dir|
-        write_candidate(dir, RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first run")
+        write_candidate(dir,
+                        RegressionReport::ALERTS => "not an array",
+                        RegressionReport::SUMMARY => "first run")

         expect(load_candidate(dir)).to eq([nil, "first run"])
       end
     end
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
it "treats non-array alerts as inconclusive" do
Dir.mktmpdir do |dir|
write_candidate(dir, RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first run")
expect(load_candidate(dir)).to eq([nil, "first run"])
end
end
it "treats non-array alerts as inconclusive" do
Dir.mktmpdir do |dir|
write_candidate(dir,
RegressionReport::ALERTS => "not an array",
RegressionReport::SUMMARY => "first run")
expect(load_candidate(dir)).to eq([nil, "first run"])
end
end
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmarks/spec/track_benchmarks_spec.rb` around lines 187 - 193, The long
test line calling write_candidate with RegressionReport::ALERTS and
RegressionReport::SUMMARY exceeds 120 chars; break that call into multiple lines
using Ruby method continuation (e.g., wrap arguments in parentheses or pass a
multi-line hash) so each hash key-value pair (RegressionReport::ALERTS => "not
an array", RegressionReport::SUMMARY => "first run") is on its own line and the
overall line length is under 120 chars; keep the surrounding Dir.mktmpdir block
and the expect(load_candidate(dir)) assertion unchanged.

Source: Coding guidelines

@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

PR 3810 Code Review - see inline comments for details

strategy:
fail-fast: false
# Fall back to an empty (but valid) matrix if plan-confirmation was skipped/crashed
# before setting the output: its output is then "" and fromJSON("") would error at

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The confirm-regressions job carries issues: write and pull-requests: write permissions, but in confirm mode track_benchmarks.rb explicitly skips replace_pr_comments() and never files issues — that is report-regressions's job. These scopes are unnecessary here and widen the blast radius if the runner token is ever misused. Suggest trimming to contents: read + checks: write (Bencher needs checks: write for a synthetic-branch run; no issue/PR scopes required).

uses: ./.github/workflows/benchmark-suite.yml
with:
matrix_row: ${{ toJSON(matrix) }}
mode: confirm

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The {"include":[]} fallback is correct but the comment ("crashed before setting output") understates why it's needed. GitHub Actions expands the matrix: expression before evaluating the if: condition, so if confirmation_matrix is empty string the runner would error at matrix-construction time even though the if: gate would skip the job. Worth noting that ordering explicitly — otherwise a future reader might remove the fallback as apparently redundant given the if: guard.

Comment on lines +238 to +239
issues: write
pull-requests: write

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issues: write and pull-requests: write are not needed in confirm mode. In confirm mode track_benchmarks.rb explicitly skips replace_pr_comments() and never calls the GitHub issue API — that is report-regressions's responsibility. Consider trimming to contents: read + checks: write here (Bencher needs checks: write for the synthetic-branch submission; no issue/PR scopes are required).

Comment on lines +231 to +235
# Fall back to an empty (but valid) matrix if plan-confirmation was skipped/crashed
# before setting the output: its output is then "" and fromJSON("") would error at
# matrix-construction time even though the if: gate skips this job. The empty matrix
# produces zero jobs; the if: gate is what actually decides whether confirmation runs.
matrix: ${{ fromJSON(needs.plan-confirmation.outputs.confirmation_matrix || '{"include":[]}') }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback is correct and should stay, but the comment only explains what happens ("its output is then \"\""). It's worth also noting why the fallback is required: GitHub Actions expands the matrix: expression before evaluating the if: condition, so a missing or empty confirmation_matrix would cause a matrix-construction error even though the if: gate would have skipped the job. Without that context a future reader may remove the fallback as redundant given the if: guard.

Comment on lines +158 to +162
unless plan[:unmatched].empty?
warn_unmatched(plan[:unmatched])
emit_no_confirmations
return false
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design question worth confirming: when there are both unmatched candidates and confirmable ones, this bails out entirely — no confirmations are emitted, and the workflow goes red without filing an issue. That is conservative and safe, but it means one operationally-broken shard silences a real regression in a different shard. Is that the intended trade-off? If partial confirmation is acceptable, the logic could emit confirmations for the matched rows and only warn (not fail) on the unmatched ones. Either policy is defensible; this just warrants a comment so the intention is clear to future readers.

@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary for PR 3810 - Two-stage benchmark regression confirmation pipeline.

MEDIUM: confirm-regressions job has issues:write and pull-requests:write permissions that are not needed in confirm mode. track_benchmarks.rb skips replace_pr_comments() and never touches the issues API when BENCHMARK_MODE=confirm. Least-privilege fix: trim to contents:read + checks:write. See inline comment at lines 238-239 of benchmark.yml.

LOW: The fallback matrix comment (lines 231-235) explains what happens but not why the fallback is required. GitHub Actions expands matrix: expressions before evaluating if: conditions, so a missing output would error at matrix-construction time regardless of the if: gate. Worth one sentence to prevent future removal.

LOW: The Benchmark workflow summary step in benchmark-suite.yml echoes to stdout, not to GITHUB_STEP_SUMMARY. The meaningful confirmation outcome (confirmed/cleared/inconclusive) is correctly written via append_step_summary in track_benchmarks.rb, so this is informational only.

QUESTION: In plan_confirmation.rb lines 158-162, when there are both unmatched candidates AND confirmable ones the function bails entirely. Safe, but one broken shard silences a real regression in another. See inline comment for detail.

POSITIVE: Alert matching fallback is well-reasoned. The --start-point-reset + unique synthetic branch names correctly isolate the confirmation from mains series. Fail-loud philosophy is consistent throughout. Both write_candidate and write_confirmed have explicit true returns. 212 test examples with good coverage of the new code paths.

Overall: well-engineered. The permissions issue is worth fixing before merge; the rest are minor.

justin808 added a commit that referenced this pull request Jun 9, 2026
…sitives) (#3829)

## Problem

~99% of recent benchmark "performance-regression" alerts on `main` are
false positives from a single **orphaned server-side Bencher
threshold**.

`p90_latency` (and `p99_latency`) were intentionally dropped from
`track_benchmarks.rb` `THRESHOLDS` as too noisy ("tail noise can't meet
the 1/20 target"). But Bencher thresholds are persistent server-side
objects keyed on (branch, testbed, measure) — removing a measure from
the CLI `--threshold-*` args stops *updating* the threshold, it never
*deletes* it. The p90 metric is still submitted (for the dashboard), so
the orphaned p90 threshold keeps evaluating p90 tail noise and firing
alerts on nearly every run.

These alerts are doubly bad:
- They file a `performance-regression` issue (exit≠0), and
- They are **invisible in the summary table** — the p90 column has no
`:direction`, so it is never flagged 🔴. The filed issue names nothing
actionable (this is the #3782 / #3795 "🔴 only appears in the legend"
symptom).

### Evidence (public Bencher API, project `react-on-rails-t8a9ncxo`)

- 255 most-recent active alerts: **248 p90-latency, 4 rps, 3
p50-latency**.
- `main` branch: **164 active alerts → 162 p90-latency, 2 rps** (98.8%
p90).
- #3782 (`dec4b8c`) filed on exactly two p90 alerts —
`/client_side_hello_world: Core` p90 `24.20 > upper 23.61` and
`/rendered_html: Core` p90 — with no rps/p50/failed_pct alert.
- `main`/`github-actions` carries thresholds for `rps, p50-latency,
p90-latency, p99-latency, failed-pct` — two more than the code manages.
(p99 is dormant: no p99 metric is submitted.)

## Fix

Thread the measures we actually track (`THRESHOLDS` names) into
`BencherReport`. An active alert on any **other** measure is classified
as *filtered* rather than a regression. This reuses #3822's existing
`filtered_alert?` exit-normalization, so a p90-only run no longer writes
a candidate or files an issue. The fix is fully in-repo and makes the
orphaned threshold harmless regardless of its server-side state.

- Tracked measures (`rps`, `p50_latency`, `failed_pct`) are unaffected —
including the "hidden" `failed_pct` regressions #3822 added coverage
for.
- Measure-less and benchmark-less alerts keep their existing fail-safe
(still counted).
- `tracked_measures` defaults to `nil` (track every measure) so
non-production callers (`BenchmarkTable`, specs) are unchanged.

## Validation

- `bundle exec rspec benchmarks/spec` — **234 examples, 0 failures** (5
new specs for tracked-measure filtering).
- `bundle exec rubocop benchmarks/...` — no offenses.
- **End-to-end against the real #3782 reports** (fetched from the
Bencher API): `regression?` flips `true → false` (`filtered_alert? =
true`) for both the Core and Pro p90 alerts, while rps/p50 alerts are
untouched.
- `script/ci-changes-detector origin/main` → Benchmark scripts → Lint
(Ruby + JS).

## Companion cleanup (separate, manual — not in this PR)

This neutralizes the orphaned threshold in code. To also stop it
polluting the Bencher dashboard (162 cosmetic active alerts) and being
cloned to every new branch, delete the server-side thresholds (needs
`BENCHER_API_TOKEN`):

- `main` p90 `51fb6a47-0083-4e84-a745-60ee42e3bba4`, p99
`6faa7a68-1835-4cd7-96dc-959220737172`
- `master` p90 `d4ad2066-74cb-41f7-93bb-f0885358c56c`, p99
`61bf5ae6-77fa-474f-b6d8-ded630bd0c20`

## Relationship to other work

Completes the noise fix that #3810 (fresh-runner confirmation) and #3822
(stale-alert filtering) started: #3822 only filters alerts whose metric
*recovered*; a **live** p90 crossing (the dominant case) still passed
through. Substantially removes the p90 tail noise behind #3169 and the
issue explosion researched in #3755.

Refs #3755, #3169, #3795

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes which Bencher alerts count as regressions and can affect
main-push candidate filing and CI exit behavior, though scoped to
benchmark reporting with explicit specs and a nil default for other
callers.
> 
> **Overview**
> Adds optional **`tracked_measures`** to `BencherReport.parse` /
`#initialize` so active Bencher alerts on measures the repo no longer
tracks (e.g. orphaned server-side **p90_latency** thresholds) are moved
to **`filtered_alert?`** instead of **`regression?`**, reusing the
existing exit-code normalization for filtered-only runs.
> 
> **`track_benchmarks.rb`** now passes `THRESHOLDS.map(&:first)` when
parsing the JSON report, so p90-only false positives no longer write
regression candidates or file issues the summary table cannot flag.
Tracked measures, measure-less fail-safe alerts, and callers that omit
`tracked_measures` stay backward compatible.
> 
> Specs cover orphaned p90 filtering, slug/name normalization, and a
small fix for `BencherReport.new` with a Hash root in a perf-links test.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
7693f64. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Fixed false regression alerts for measures no longer tracked in
performance benchmarks. The system now filters regression reports to
only flag alerts for actively monitored metrics, preventing issue
creation based on orphaned or retired threshold measurements that are no
longer part of the current tracking configuration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
justin808 added a commit that referenced this pull request Jun 9, 2026
## Problem

A confirmed `performance-regression` issue currently only shows the
per-suite summary tables. Those tables flag 🔴 on **rps** and **p50** but
never on **failed_pct** (it has no column — it lives in the "Status"
cell) and never on any untracked measure. So a reader has to open the
tables to see what regressed, and a `failed_pct`-only regression renders
with no visible 🔴 at all — the same "the issue names nothing actionable"
problem behind #3782/#3795.

## Fix

Add a **"What regressed"** section to the issue body, built from the
confirmed `ALERTS` payload (#3810's structured `{benchmark, measure}`
pairs), aggregated across all suites and computed once in
`report_regressions`:

```
### What regressed

The fresh-runner confirmation re-alerted on these benchmark + measure pairs:

- `/client_side_log_throw: Pro` — **rps**
- `/server_side_redux_app: Core` — **failed_pct**
```

- Threaded through `report_suite` → `RegressionIssueReporter`; only the
reporter that *creates* the issue renders the body, so the cross-suite
list is complete.
- `regressed_overview` defaults to `""` → the section is omitted and the
body is byte-for-byte unchanged, so it composes with the existing flow
and with older payloads that lack `ALERTS`.

## Validation

- `bundle exec rspec benchmarks/spec` — **233 examples, 0 failures** (4
new: helper rendering/dedup + issue-body inclusion/omission).
- `bundle exec rubocop benchmarks/...` — no offenses.
- Rendered the body locally with and without pairs to confirm clean
Markdown spacing and that the no-pairs body is unchanged.

## Relationship to other work

Complements #3829 (filter alerts to tracked measures): #3829 stops the
p90 false positives from filing at all; this makes the *real* confirmed
regressions self-explanatory in the issue body. Refs #3755.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> CI reporting and GitHub issue markdown only; no runtime app or auth
changes.
> 
> **Overview**
> Adds a **"What regressed"** section to auto-filed
`performance-regression` issue bodies, listing confirmed benchmark +
measure pairs as markdown bullets (including measures like
**failed_pct** that the per-suite tables do not surface clearly).
> 
> `report_regressions` builds the list once from each confirmed
payload's structured `ALERTS` via new `regressed_overview_markdown`
(deduped across suites) and passes `regressed_overview` into
`RegressionIssueReporter`. The section is omitted when the overview is
empty, so older payloads without `ALERTS` keep the previous body shape.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
3068c93. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Performance regression reports now include a "What regressed" overview
section, displaying all confirmed benchmark and measure pairs that
regressed across test suites in a single consolidated view.

* **Tests**
* Added comprehensive test coverage for the new regression overview
markdown rendering logic and deduplication.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
justin808 added a commit that referenced this pull request Jun 9, 2026
* origin/main:
  Add Pro license header checker
  RSC: stop serializing props into embedded payload cache key (#3800)
  Make PR batches skip customer-feedback issues (#3826)
  Name the regressed benchmark+measure pairs in the issue body (#3830)
  Clarify agent batch policy handoffs (#3824)
  Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829)
  Fix auto-bundled component pack normalization (#3818)
  Filter stale Bencher alerts before reporting (#3822)
  Tighten benchmark confirmation workflow permissions (#3819)
  Add issue evaluation skill (#3816)
  Confirm benchmark regressions on a fresh runner before filing the main issue (#3810)
  Define agent scope and accelerated RC auto-merge policy (#3808)
  Replace custom MockClient with async-http Mock::Endpoint (#3703)
  Docs: per-request data sharing in RSC with React.cache() (#3769)
  Pro RSC: share unstable_cache across renderer workers via Redis (#3705)
  [codex] Add PR batch planning skill (#3792)
  Docs: document PR batch operational lessons (#3789)
  Document dummy Redux state indexing rationale (#3781)
  Pro RSC: avoid caching failed Flight renders (#3775)

# Conflicts:
#	packages/react-on-rails-pro/tests/getReactServerComponent.client.test.ts
justin808 added a commit that referenced this pull request Jun 9, 2026
* origin/main: (23 commits)
  Enforce Pro license headers in CI and pre-commit (#3821)
  Add RSC payload route-data helper (#3783)
  [Pro] Fix React.cache request dedupe in generated RSC configs (#3813)
  Docs: clarify RuboCop autofix ownership (#3827)
  Add Pro license header checker
  RSC: stop serializing props into embedded payload cache key (#3800)
  Make PR batches skip customer-feedback issues (#3826)
  Name the regressed benchmark+measure pairs in the issue body (#3830)
  Clarify agent batch policy handoffs (#3824)
  Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829)
  Fix auto-bundled component pack normalization (#3818)
  Filter stale Bencher alerts before reporting (#3822)
  Tighten benchmark confirmation workflow permissions (#3819)
  Add issue evaluation skill (#3816)
  Confirm benchmark regressions on a fresh runner before filing the main issue (#3810)
  Define agent scope and accelerated RC auto-merge policy (#3808)
  Replace custom MockClient with async-http Mock::Endpoint (#3703)
  Docs: per-request data sharing in RSC with React.cache() (#3769)
  Pro RSC: share unstable_cache across renderer workers via Redis (#3705)
  [codex] Add PR batch planning skill (#3792)
  ...
justin808 added a commit that referenced this pull request Jun 10, 2026
## Summary

The revert gate for the temporary `/posts_page: Pro` regression-issue
suppression has been verified PASSED (gate evidence:
#3669 (comment)
— the failure-era samples have rolled out of Bencher's 64-run t-test
window and the baseline reflects real timings). This PR removes the
suppression entry so a confirmed regression on `/posts_page: Pro` files
an issue again.

This is a **minimal revert**: `IGNORED_REGRESSION_BENCHMARKS` is
emptied, but the general ignore machinery is kept (see decision log
below). The issue's original "How to revert" steps are partially stale
after #3670/#3810; this follows the updated plan.

Closes #3669

## Changes

- `benchmarks/lib/regression_report.rb`: `IGNORED_REGRESSION_BENCHMARKS`
is now `[].freeze`; the /posts_page-specific temporary-suppression
comment block is replaced with a short doc of the general-purpose
mechanism (exact Bencher name matching, pre-rerun short-circuit,
tracking-issue requirement for any future entry).
- Specs that read `IGNORED_REGRESSION_BENCHMARKS.first`
(regression_report_spec, plan_confirmation_spec, and the
confirmation-outcome example in track_benchmarks_spec) now `stub_const`
a sample entry (`"/ignored: Pro"`) so they keep pinning the machinery
against the now-empty production list. No coverage was deleted.
- `"/posts_page: Pro"` remains in track_benchmarks_spec only as an inert
fixture name in report-parsing examples.

## Validation

- `bundle exec rspec benchmarks/spec/regression_report_spec.rb
benchmarks/spec/plan_confirmation_spec.rb
benchmarks/spec/track_benchmarks_spec.rb` (root bundle) → 63 examples, 0
failures
- `(cd react_on_rails && BUNDLE_GEMFILE=../Gemfile bundle exec rubocop)`
→ 214 files, no offenses
- `BUNDLE_GEMFILE=Gemfile bundle exec rubocop benchmarks/` → 36 files,
no offenses
- `script/ci-changes-detector origin/main` → Benchmark scripts;
recommended job: Lint (Ruby + JS)
- `git grep "3669"` / `git grep "posts_page: Pro"` → no live suppression
references remain (fixture-only uses kept)
- Codex review: skipped — CLI usage limit reached (resets Jun 10 6:30
PM); self-review performed instead

Labels: none — this changes benchmark regression-reporting tooling only,
fully validated by its specs; it is not a performance-sensitive runtime
path, and a benchmark run would not exercise the suppression branch (it
only activates when a Bencher alert fires on a main push).

## Codex Decision Log

- **Minimal revert scope (coordinator decision):** emptied the ignore
list but KEPT `ignored_benchmark?`, `actionable_alerts`,
`actionable_benchmarks`, and plan_confirmation.rb's
`fully_ignored?`/suppressed handling. Why: #3810 made the ignore list
load-bearing general plumbing in the confirmation gate; full machinery
removal is out of scope for #3669. Maintainers may revisit whether the
machinery should be removed entirely if no new suppression is ever
needed.
- **Specs converted to `stub_const`** instead of deleted: the examples
previously sampled the live list (`.first`), which is now empty.
Stubbing `RegressionReport::IGNORED_REGRESSION_BENCHMARKS` keeps the
machinery pinned independent of production contents. Note:
track_benchmarks_spec.rb line 332 also sampled the live list (the
issue's plan assumed it was fixture-only — stale), so it got the same
conversion to avoid a nil-driven false `:confirmed` outcome.
- **plan_confirmation.rb comments left untouched:** they document the
general mechanism and never claim an active suppression.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Benchmark CI reporting tooling only; no runtime app behavior, and
specs still cover the ignore path via stubs.
> 
> **Overview**
> Reverts the temporary **`/posts_page: Pro`** regression-issue
suppression (#3669): **`IGNORED_REGRESSION_BENCHMARKS`** is now
**`[]`**, so confirmed regressions on that benchmark can open issues
again. The long `/posts_page`-specific comment is replaced with a short
description of the general ignore mechanism (exact Bencher names,
pre-rerun short-circuit, tracking issue required for future entries).
> 
> Specs that previously used **`IGNORED_REGRESSION_BENCHMARKS.first`**
now **`stub_const`** a sample **`"/ignored: Pro"`** so ignore-list
behavior stays covered while production has no active suppressions.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
a79526d. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Cleared the regression benchmark suppression list; all confirmed
regressions will now be reported.
* Updated test setup with deterministic test data to ensure stable
regression report behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

## Agent Merge Confidence

Mode: accelerated-rc
Score: 9/10
Auto-merge recommendation: yes
Affected areas: benchmarks regression-reporting tooling (no runtime/perf
surface)
CI detector: `script/ci-changes-detector origin/main` -> Benchmark
scripts; recommended job: Lint (Ruby + JS)
Validation run:
- `bundle exec rspec benchmarks/spec/regression_report_spec.rb
benchmarks/spec/plan_confirmation_spec.rb
benchmarks/spec/track_benchmarks_spec.rb` -> 63 examples, 0 failures
- `(cd react_on_rails && BUNDLE_GEMFILE=../Gemfile bundle exec rubocop)`
-> no offenses; `BUNDLE_GEMFILE=Gemfile bundle exec rubocop benchmarks/`
-> no offenses
Review/check gate:
- Claude review: complete for a79526d
(claude-review SUCCESS), no confirmed blocker; one OPTIONAL spec-scoping
nit triaged with reply + resolved
- Codex review: skipped — CLI usage limit (resets Jun 10 6:30 PM),
evidence in PR body; claude-review served as the independent gate
- GitHub checks: complete for a79526d, all pass; skips are
path-selection per detector
Known residual risk: none — suppression-list removal is fully covered by
the updated specs; 1-point deduction for the skipped secondary Codex
pass
Finalized by: Claude Code Review check `claude-review` (SUCCESS for head
a79526d, GitHub Checks API record)

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Confirm regressions before posting an issue

1 participant