Confirm benchmark regressions on a fresh runner before filing the main issue#3810
Conversation
…n issue Implements #3670. A single noisy Bencher alert on main could previously create/update a `performance-regression` issue. Now a main-push alert is a non-fatal *candidate* that must be reproduced by an independent rerun on a fresh GitHub Actions runner before any issue is filed, turning the per-commit false-positive rate from ~p into ~p² for runner noise. Pipeline (benchmark.yml): benchmark (initial) -> on a main alert, write a NON-FATAL candidate artifact (regression-candidate-*); the suite stays green plan-confirmation -> read candidates, short-circuit IGNORED_REGRESSION_BENCHMARKS BEFORE spending a rerun, emit a matrix of just the alerting suite/shard(s) confirm-regressions -> rerun those on fresh runners against main's baseline WITHOUT polluting main's series (synthetic Bencher branch + --start-point main --start-point-clone-thresholds --start-point-reset); write a confirmed artifact only if the SAME benchmark+measure re-alerts report-regressions -> consume confirmed artifacts, file one issue with the first-run and confirmation summaries side by side, and fail the workflow (it owns the final pass/fail) The per-suite execution is extracted into a reusable workflow (benchmark-suite.yml, modes initial/confirm) so the initial run and the confirmation rerun share one set of steps with no drift. Behavior: - A noisy alert that does not reproduce is cleared as noise (no issue, green). - A confirmed regression fails the workflow AND files the issue. - An inconclusive rerun (benchmark execution / Bencher reporting failure, or a corrupt/unmatched candidate) fails the workflow WITHOUT filing. - The workflow summary states confirmed / cleared / inconclusive per alert. - PR summary comments still post unconditionally (initial mode, unchanged); the confirmation machinery is main-push only. Same benchmark+measure matching uses the structured alert pairs now carried in the candidate payload (BencherReport::Alert already exposes measure), with a name-only fallback when a measure is absent. IGNORED_REGRESSION_BENCHMARKS moved to the shared regression_report.rb and is now applied in plan-confirmation. Threshold loosening to buy back the power double-gating costs is intentionally a sequenced, empirical follow-up (see #3670), not part of this change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WalkthroughAdds a two-stage benchmark regression confirmation flow: initial runs emit non-fatal candidate artifacts; a plan step filters/schedules reruns; confirmation runs re-alerting the same benchmark+measure produce confirmed artifacts; reporting files issues only for confirmed regressions. Integrates via a new reusable benchmark-suite workflow and updates tracking/reporting scripts and specs. ChangesBenchmark regression confirmation workflow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.
Reviewed by Cursor Bugbot for commit acc4f36. Configure here.
|
test |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/benchmark-suite.yml:
- Line 83: Replace all GitHub Actions uses: entries that reference tags with
full commit SHAs: update actions/checkout@v6, bencherdev/bencher@v0.6.2,
grafana/setup-k6-action@v1, actions/cache@v5 (both occurrences),
pnpm/action-setup@v6, actions/setup-node@v6, actions/upload-artifact@v7 (all
occurrences), and actions/download-artifact@v7 to owner/repo@<full-commit-sha>
form; locate each occurrence by the exact uses string (e.g.,
"actions/checkout@v6", "bencherdev/bencher@v0.6.2", etc.) and replace the tag
with the corresponding full commit SHA from the action's repository, ensuring
every duplicate occurrence is updated and commit SHAs are immutable values.
- Around line 430-435: The workflow step named "Download first-run candidate"
currently can abort the job on failure and prevents the Ruby confirmation
handler from running; update that step to allow the job to continue on
missing/corrupt artifacts by adding continue-on-error: true (or enabling the
action’s native missing-artifact option) so the workflow proceeds to run
benchmarks/track_benchmarks.rb which expects load_candidate to return [nil, ""]
and then executes run_confirmation.
In @.github/workflows/benchmark.yml:
- Line 210: Replace the floating tag uses: actions/download-artifact@v7 with a
pinned commit SHA for both occurrences where the workflow references the
download-artifact action (search for the literal "uses:
actions/download-artifact@v7"); update each to the specific commit SHA form
actions/download-artifact@<commit-sha> (obtain the correct commit SHA from the
actions/download-artifact repository) so both download-artifact steps are
immutably pinned.
- Around line 179-183: Replace the broad secrets inheritance with an explicit
secrets mapping in the reusable workflow call: remove "secrets: inherit" in
.github/workflows/benchmark.yml (the step using the reusable workflow with with:
matrix_row/mode) and instead provide a limited secrets: mapping that only passes
REACT_ON_RAILS_PRO_LICENSE_V2, BENCHER_API_TOKEN, and GITHUB_TOKEN to the
reusable workflow; ensure the reusable workflow (benchmark-suite.yml) expects
those exact inputs. Additionally, update action pins in both
.github/workflows/benchmark.yml and benchmark-suite.yml (e.g.,
actions/download-artifact@v7, actions/checkout@v6) to use reviewed commit SHAs
rather than tag versions to comply with the supply-chain policy.
In `@benchmarks/plan_confirmation.rb`:
- Around line 33-38: The load_candidate_payload method currently parses any JSON
and returns it, which can later cause build_plan to crash when it expects a
Hash; update load_candidate_payload to validate that the parsed value is a Hash
(and contains required keys used by build_plan such as the matrix/fields you
later index) and return nil if the check fails so the existing
inconclusive/unreadable-candidate handling is used; reference
load_candidate_payload and build_plan to locate where to enforce
parsed.is_a?(Hash) and required key presence and ensure you still rescue
StandardError and warn with the same error message when parsing fails.
In `@benchmarks/report_regressions.rb`:
- Around line 366-373: The load_payload function currently parses JSON but
doesn't validate structure; update load_payload to ensure the parsed value is a
Hash and contains required keys (e.g., "suite_name", plus any other keys
expected later around where payload is used at lines 410-411) and return nil if
the schema is invalid; specifically, after JSON.parse in load_payload check that
the result.is_a?(Hash) and required_keys.all? { |k| result.key?(k) } and if not
warn with the same ::error:: message (including e.class/message when parse
fails) and return nil so schema-invalid payloads follow the existing error path
instead of causing crashes when consumers of load_payload (the code that reads
payload["suite_name"] etc.) assume a Hash.
In `@benchmarks/track_benchmarks.rb`:
- Around line 335-347: The current regressed_alert_pairs method drops alerts
with no benchmark (report.alerts.select(&:benchmark)), which lets incomplete
payloads be treated later as empty matches and classified as :cleared; instead
preserve unmatchable alerts and surface them as inconclusive so they cannot be
noise-cleared. Change regressed_alert_pairs to include alerts with nil benchmark
(don’t filter them out) and convert those to an explicit inconclusive sentinel
(e.g. call or construct a RegressionReport.inconclusive(alert) or return a pair
with a nil benchmark/measure marker) so downstream matching logic can detect and
return :inconclusive; also update the empty-match handling (the classifier
currently emitting :cleared for empty matches) to return :inconclusive when it
sees that sentinel or a nil benchmark/measure pair. Ensure you reference and
modify RegressionReport.alert usage and the empty-match classification code to
recognize the new inconclusive sentinel.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 155fe7b5-0611-4415-ac42-9e1b825cc540
📒 Files selected for processing (10)
.github/workflows/benchmark-suite.yml.github/workflows/benchmark.ymlbenchmarks/lib/regression_report.rbbenchmarks/plan_confirmation.rbbenchmarks/report_regressions.rbbenchmarks/spec/plan_confirmation_spec.rbbenchmarks/spec/regression_report_spec.rbbenchmarks/spec/report_regressions_spec.rbbenchmarks/spec/track_benchmarks_spec.rbbenchmarks/track_benchmarks.rb
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v6 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
if [ ! -f "$FILE" ]; then
echo "Missing $FILE"
exit 1
fi
echo "== uses: lines =="
rg -n "uses:\s*[^@]+@[^ ]+" "$FILE"
echo
echo "== line excerpts around specified lines =="
for L in 83 98 107 114 139 149 156 418 432 451 462; do
echo "--- $L ---"
sed -n "$((L-3)),$((L+3))p" "$FILE" | nl -ba
doneRepository: shakacode/react_on_rails
Length of output: 659
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
if [ ! -f "$FILE" ]; then
echo "Missing $FILE"
exit 1
fi
echo "== uses: lines =="
rg -n "uses:\s*[^@]+@[^ ]+" "$FILE"
echo
echo "== line excerpts around specified lines =="
for L in 83 98 107 114 139 149 156 418 432 451 462; do
echo "--- $L ---"
sed -n "$((L-3)),$((L+3))p" "$FILE" | nl -ba
doneRepository: shakacode/react_on_rails
Length of output: 659
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
rg -n "uses:\s*[^@]+@[^ ]+" "$FILE" || true
for L in 83 98 107 114 139 149 156 418 432 451 462; do
echo "--- $L ---"
sed -n "$((L-3)),$((L+3))p" "$FILE" | nl -ba
doneRepository: shakacode/react_on_rails
Length of output: 596
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark-suite.yml"
[ -f "$FILE" ] || { echo "Missing $FILE"; exit 1; }
echo "== uses: entries =="
rg -n "^\s*uses:\s*[^@]+@[^ ]+" "$FILE"
echo
echo "== Excerpts around lines =="
for L in 83 98 107 114 139 149 156 418 432 451 462; do
echo "--- $L ---"
sed -n "$((L-2)),$((L+2))p" "$FILE" | nl -ba
doneRepository: shakacode/react_on_rails
Length of output: 646
Pin GitHub Actions uses: entries to full commit SHAs (not tags) to prevent tag retargeting.
In .github/workflows/benchmark-suite.yml, these uses: lines use tag refs instead of SHAs: actions/checkout@v6 (83), bencherdev/bencher@v0.6.2 (98), grafana/setup-k6-action@v1 (107), actions/cache@v5 (114,139), pnpm/action-setup@v6 (149), actions/setup-node@v6 (156), actions/upload-artifact@v7 (418,451,462), actions/download-artifact@v7 (432). Pin each to owner/repo@<full-commit-sha>.
🧰 Tools
🪛 zizmor (1.25.2)
[error] 83-83: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/benchmark-suite.yml at line 83, Replace all GitHub Actions
uses: entries that reference tags with full commit SHAs: update
actions/checkout@v6, bencherdev/bencher@v0.6.2, grafana/setup-k6-action@v1,
actions/cache@v5 (both occurrences), pnpm/action-setup@v6,
actions/setup-node@v6, actions/upload-artifact@v7 (all occurrences), and
actions/download-artifact@v7 to owner/repo@<full-commit-sha> form; locate each
occurrence by the exact uses string (e.g., "actions/checkout@v6",
"bencherdev/bencher@v0.6.2", etc.) and replace the tag with the corresponding
full commit SHA from the action's repository, ensuring every duplicate
occurrence is updated and commit SHAs are immutable values.
Source: Linters/SAST tools
| - name: Download first-run candidate | ||
| if: inputs.mode == 'confirm' | ||
| uses: actions/download-artifact@v7 | ||
| with: | ||
| name: ${{ env.CANDIDATE_ARTIFACT_NAME }} | ||
| path: candidate |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the referenced section of the workflow
sed -n '380,480p' .github/workflows/benchmark-suite.yml | cat -n
# Locate and inspect track_benchmarks.rb
fd -a "track_benchmarks\.rb" . || true
TRACK_FILE="$(fd -a "track_benchmarks\.rb" . | head -n 1 || true)"
if [ -n "${TRACK_FILE}" ]; then
echo "Using: $TRACK_FILE"
# Search for missing-candidate handling and run_confirmation classification
rg -n "missing|candidate|inconclusive|run_confirmation|confirmation" "$TRACK_FILE" || true
# Show surrounding relevant code blocks (best-effort)
# (Print a few chunks around any matches)
for pat in "run_confirmation" "inconclusive" "missing" "candidate"; do
echo "---- Matches for $pat ----"
rg -n --context 5 "$pat" "$TRACK_FILE" || true
done
else
echo "track_benchmarks.rb not found via fd"
fiRepository: shakacode/react_on_rails
Length of output: 30976
Avoid short-circuiting the confirmation failure path in benchmark-suite.yml
track_benchmarks.rb already treats a missing/corrupt downloaded candidate as :inconclusive (it returns [nil, ""] from load_candidate, writes the inconclusive confirmation summary, and exits 1). If actions/download-artifact@v7 fails instead (e.g., candidate artifact missing), the job would stop before run_confirmation runs, so the Ruby handler can’t apply that logic. Add continue-on-error: true to the download step (or use the step’s native “missing artifact” option) so the workflow can reach ruby benchmarks/track_benchmarks.rb.
Suggested change
- name: Download first-run candidate
if: inputs.mode == 'confirm'
+ continue-on-error: true
uses: actions/download-artifact@v7
with:
name: ${{ env.CANDIDATE_ARTIFACT_NAME }}
path: candidate🧰 Tools
🪛 zizmor (1.25.2)
[error] 432-432: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/benchmark-suite.yml around lines 430 - 435, The workflow
step named "Download first-run candidate" currently can abort the job on failure
and prevents the Ruby confirmation handler from running; update that step to
allow the job to continue on missing/corrupt artifacts by adding
continue-on-error: true (or enabling the action’s native missing-artifact
option) so the workflow proceeds to run benchmarks/track_benchmarks.rb which
expects load_candidate to return [nil, ""] and then executes run_confirmation.
| uses: ./.github/workflows/benchmark-suite.yml | ||
| with: | ||
| matrix_row: ${{ toJSON(matrix) }} | ||
| mode: initial | ||
| secrets: inherit |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show benchmark.yml around the mentioned lines
echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'
echo "=== benchmark.yml lines 270-330 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,340p'
# Locate all occurrences of "secrets: inherit" in benchmark.yml
echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml
# Inspect reusable workflow benchmark-suite.yml around its workflow_call secrets
if [ -f .github/workflows/benchmark-suite.yml ]; then
echo "=== benchmark-suite.yml on.workflow_call secrets (search) ==="
rg -n "workflow_call:|secrets:" .github/workflows/benchmark-suite.yml
echo "=== benchmark-suite.yml lines 1-220 ==="
nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,220p'
echo "=== benchmark-suite.yml lines 220-400 ==="
nl -ba .github/workflows/benchmark-suite.yml | sed -n '220,400p'
else
echo "benchmark-suite.yml not found at .github/workflows/benchmark-suite.yml"
fi
# Also check if any other reusable workflow files are used
echo "=== benchmark.yml reusable workflow uses (workflow_call) ==="
rg -n "uses:\s*\./\.github/workflows/" .github/workflows/benchmark.ymlRepository: shakacode/react_on_rails
Length of output: 165
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'
echo "=== benchmark.yml lines 270-360 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,360p'
echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml || true
# Determine exact path of reusable workflow being called
# It's referenced as "./.github/workflows/benchmark-suite.yml" in the snippet.
echo "=== check existence of reusable workflow ==="
ls -la .github/workflows/benchmark-suite.yml .github/workflows/benchmark-suite*.yml .github/workflows/benchmark-suite*.yaml 2>/dev/null || true
# Inspect reusable workflow benchmark-suite.yml
if [ -f .github/workflows/benchmark-suite.yml ]; then
echo "=== benchmark-suite.yml: header through workflow_call secrets ==="
# Print first 160 lines and any workflow_call secrets block
nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,200p'
echo "=== benchmark-suite.yml: workflow_call secrets section (search) ==="
rg -n "workflow_call:|secrets:" .github/workflows/benchmark-suite.yml
echo "=== benchmark-suite.yml: around secrets block ==="
# Show context around secrets: definitions
rg -n "on:\s*workflow_call" -n .github/workflows/benchmark-suite.yml || true
# Use grep for "secrets:" and show 40 lines around each match
while read -r line; do
ln=$(echo "$line" | cut -d: -f1)
start=$((ln-20))
end=$((ln+60))
echo "--- benchmark-suite.yml around line $ln ---"
nl -ba .github/workflows/benchmark-suite.yml | sed -n "${start},${end}p"
done < <(rg -n "secrets:" .github/workflows/benchmark-suite.yml | head -n 20 || true)
echo "=== benchmark-suite.yml: search for secret usages ==="
rg -n "BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2|GITHUB_TOKEN|\bsecrets\." .github/workflows/benchmark-suite.yml || true
fi
# Also check if benchmark-suite.yml uses any action without pinning (sha)
# We'll just check uses lines that reference github actions. Without pinning means using `@v1` or `@v2` etc.
if [ -f .github/workflows/benchmark-suite.yml ]; then
echo "=== benchmark-suite.yml: action uses without SHA (heuristic) ==="
rg -n "uses:\s*[^#\n]*`@v`[0-9]|uses:\s*[^#\n]*`@v`[0-9]+|uses:\s*[^#\n]*`@master`|uses:\s*[^#\n]*`@main`" .github/workflows/benchmark-suite.yml || true
fiRepository: shakacode/react_on_rails
Length of output: 165
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'
echo "=== benchmark.yml lines 270-360 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,360p'
echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml || true
echo "=== check existence of reusable workflow ==="
ls -la .github/workflows/benchmark-suite.yml 2>/dev/null || true
if [ -f .github/workflows/benchmark-suite.yml ]; then
echo "=== benchmark-suite.yml lines 1-240 ==="
nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,240p'
echo "=== benchmark-suite.yml: locate workflow_call secrets definitions ==="
rg -n "on:\s*workflow_call|secrets:" .github/workflows/benchmark-suite.yml
echo "=== benchmark-suite.yml: locate secret usages ==="
rg -n "secrets\.|BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2" .github/workflows/benchmark-suite.yml || true
echo "=== benchmark-suite.yml: uses lines and sha pinning heuristic ==="
# Show uses lines (excluding local ./) to judge pinning
rg -n "uses:\s*(?!\./)([^ ]+)" .github/workflows/benchmark-suite.yml | head -n 80
echo "=== benchmark-suite.yml: uses lines without commit SHA heuristic ==="
rg -n "uses:\s*(?!\./).*@((v|V)[0-9]+|[a-zA-Z]+)$" .github/workflows/benchmark-suite.yml || true
else
echo "benchmark-suite.yml not found"
fiRepository: shakacode/react_on_rails
Length of output: 165
🏁 Script executed:
# re-run with a single script (avoid duplication)
set -euo pipefail
echo "=== benchmark.yml lines 150-270 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '150,270p'
echo "=== benchmark.yml lines 270-360 ==="
nl -ba .github/workflows/benchmark.yml | sed -n '270,360p'
echo "=== occurrences of 'secrets: inherit' in benchmark.yml ==="
rg -n "secrets:\s*inherit" .github/workflows/benchmark.yml || true
echo "=== reusable workflow file listing matching benchmark-suite ==="
fd -a "benchmark-suite*.*ya?ml" .github/workflows || true
if [ -f .github/workflows/benchmark-suite.yml ]; then
echo "=== benchmark-suite.yml lines 1-260 ==="
nl -ba .github/workflows/benchmark-suite.yml | sed -n '1,260p'
echo "=== benchmark-suite.yml workflow_call secrets definitions ==="
rg -n "on:\s*workflow_call|secrets:" .github/workflows/benchmark-suite.yml
echo "=== benchmark-suite.yml secret usages ==="
rg -n "secrets\.|BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2" .github/workflows/benchmark-suite.yml || true
else
echo "benchmark-suite.yml not present at expected path"
fiRepository: shakacode/react_on_rails
Length of output: 165
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file=".github/workflows/benchmark.yml"
echo "=== benchmark.yml: occurrences of 'secrets: inherit' ==="
rg -n "secrets:\s*inherit" "$file" || true
echo "=== benchmark.yml: show around first referenced area (170-210) ==="
awk 'NR>=170 && NR<=210 {printf "%d\t%s\n", NR, $0}' "$file" || true
echo "=== benchmark.yml: show around second referenced area (230-270) ==="
awk 'NR>=230 && NR<=270 {printf "%d\t%s\n", NR, $0}' "$file" || true
echo "=== benchmark.yml: show reusable workflow call sites ==="
rg -n "uses:\s*\./\.github/workflows/" "$file" || true
suite=".github/workflows/benchmark-suite.yml"
echo "=== check existence of benchmark-suite.yml ==="
ls -la "$suite" 2>/dev/null || true
if [ -f "$suite" ]; then
echo "=== benchmark-suite.yml: locate on.workflow_call and secrets definitions ==="
rg -n "on:\s*workflow_call|secrets:" "$suite" || true
echo "=== benchmark-suite.yml: print beginning (1-220) ==="
awk 'NR>=1 && NR<=220 {printf "%d\t%s\n", NR, $0}' "$suite" || true
echo "=== benchmark-suite.yml: search for secret consumption (expected secrets) ==="
rg -n "BENCHER_API_TOKEN|REACT_ON_RAILS_PRO_LICENSE_V2|GITHUB_TOKEN|secrets\." "$suite" || true
echo "=== benchmark-suite.yml: detect third-party action uses (exclude local ./) ==="
rg -n "uses:\s*(?!\./)([^#\n]+)" "$suite" || true
echo "=== benchmark-suite.yml: heuristic check for unpinned action refs (`@v1/`@v2/@master) ==="
rg -n "uses:\s*(?!\./).*@((v)[0-9]+|master|main)" "$suite" || true
fiRepository: shakacode/react_on_rails
Length of output: 15640
🏁 Script executed:
#!/bin/bash
set -euo pipefail
suite=".github/workflows/benchmark-suite.yml"
file=".github/workflows/benchmark.yml"
echo "=== benchmark-suite.yml: lines containing 'secrets:' (declarations only) ==="
rg -n "secrets:" "$suite" || true
echo "=== benchmark-suite.yml: lines containing 'workflow_call' ==="
rg -n "workflow_call" "$suite" || true
echo "=== benchmark-suite.yml: show around lines ~35-90 (env and secret usage start) ==="
awk 'NR>=40 && NR<=90 {printf "%d\t%s\n", NR, $0}' "$suite"
echo "=== benchmark-suite.yml: show around lines 420-480 (secret env mapping) ==="
awk 'NR>=420 && NR<=480 {printf "%d\t%s\n", NR, $0}' "$suite"
echo "=== benchmark.yml: show around 200-220 (first download-artifact step) ==="
awk 'NR>=195 && NR<=220 {printf "%d\t%s\n", NR, $0}' "$file"
echo "=== benchmark.yml: show around 265-290 (second download-artifact step) ==="
awk 'NR>=260 && NR<=290 {printf "%d\t%s\n", NR, $0}' "$file"Repository: shakacode/react_on_rails
Length of output: 8503
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for any repo policy about pinning actions to SHAs
rg -n "pin(ned)?( to)? (commit|SHA|sh a|hash)|SHA-pinned|commit SHA|pin to .*@([0-9a-f]{7,40})" .github README* CONTRIBUTING* docs* package* 2>/dev/null || true
# Also look for general "actions are pinned" statements
rg -n "pinned.*action|actions/.*@([0-9a-f]{7,40})|`@v`[0-9]+(?![^#]*[0-9a-f]{7,40})" .github/workflows 2>/dev/null || true
# Check whether benchmark.yml or benchmark-suite.yml contains any action refs that are already commit-SHA pinned
rg -n "uses:\s*([^#\n]+)@([0-9a-f]{7,40})" .github/workflows/benchmark.yml .github/workflows/benchmark-suite.yml || trueRepository: shakacode/react_on_rails
Length of output: 496
Stop passing all secrets to the reusable benchmark workflow and align action pinning with supply-chain policy.
- In
.github/workflows/benchmark.yml, replacesecrets: inherit(lines 183 and 245) with an explicitsecrets:mapping limited to the secretsbenchmark-suite.ymlactually uses:REACT_ON_RAILS_PRO_LICENSE_V2,BENCHER_API_TOKEN, andGITHUB_TOKEN(instead of exposing the full repo secret set). .github/workflows/benchmark.yml(andbenchmark-suite.yml) use tag-based actions likeactions/download-artifact@v7/actions/checkout@v6;docs/pro/license-ci-integration.mdasks for third-party actions to be pinned to reviewed commit SHAs.
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 179-179: secrets unconditionally inherited by called workflow (secrets-inherit): this reusable workflow
(secrets-inherit)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/benchmark.yml around lines 179 - 183, Replace the broad
secrets inheritance with an explicit secrets mapping in the reusable workflow
call: remove "secrets: inherit" in .github/workflows/benchmark.yml (the step
using the reusable workflow with with: matrix_row/mode) and instead provide a
limited secrets: mapping that only passes REACT_ON_RAILS_PRO_LICENSE_V2,
BENCHER_API_TOKEN, and GITHUB_TOKEN to the reusable workflow; ensure the
reusable workflow (benchmark-suite.yml) expects those exact inputs.
Additionally, update action pins in both .github/workflows/benchmark.yml and
benchmark-suite.yml (e.g., actions/download-artifact@v7, actions/checkout@v6) to
use reviewed commit SHAs rather than tag versions to comply with the
supply-chain policy.
Source: Linters/SAST tools
| if: always() | ||
| uses: actions/upload-artifact@v7 | ||
| - name: Download regression candidates | ||
| uses: actions/download-artifact@v7 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark.yml"
echo "== File existence =="
ls -la "$FILE" || exit 1
echo
echo "== Context around line 210 =="
nl -ba "$FILE" | sed -n '195,230p'
echo
echo "== Context around line 272 =="
nl -ba "$FILE" | sed -n '255,290p'
echo
echo "== All uses: entries in this workflow (for tag vs SHA pin check) =="
nl -ba "$FILE" | rg -n "uses:\s*[^#]+" || trueRepository: shakacode/react_on_rails
Length of output: 264
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/benchmark.yml"
python3 - <<'PY'
import itertools
path = ".github/workflows/benchmark.yml"
with open(path, "r", encoding="utf-8") as f:
lines = f.readlines()
def show(start, end):
start_i = max(1, start)
end_i = min(len(lines), end)
for i in range(start_i-1, end_i):
print(f"{i+1:4d}: {lines[i].rstrip()}")
print("== Context around line 210 ==")
show(195, 230)
print("\n== Context around line 272 ==")
show(255, 290)
print("\n== All uses: entries in this workflow ==")
for i, line in enumerate(lines, start=1):
if "uses:" in line and not line.strip().startswith("#"):
# print the uses line and its line number
print(f"{i}: {line.rstrip()}")
PYRepository: shakacode/react_on_rails
Length of output: 3424
Pin actions/download-artifact to commit SHAs (not @v7 tags).
Both artifact download steps use actions/download-artifact@v7, so they remain vulnerable to upstream tag moves. Update to actions/download-artifact@<commit-sha>.
Applies to: 210-210, 272-272
🧰 Tools
🪛 zizmor (1.25.2)
[error] 210-210: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/benchmark.yml at line 210, Replace the floating tag uses:
actions/download-artifact@v7 with a pinned commit SHA for both occurrences where
the workflow references the download-artifact action (search for the literal
"uses: actions/download-artifact@v7"); update each to the specific commit SHA
form actions/download-artifact@<commit-sha> (obtain the correct commit SHA from
the actions/download-artifact repository) so both download-artifact steps are
immutably pinned.
Source: Linters/SAST tools
| # The structured benchmark+measure pairs Bencher raised an active alert for, deduped. | ||
| # Read from the same alerts[] as #regression?/#regressed_benchmark_names. Handed off in | ||
| # the candidate so a confirmation rerun can require the SAME pair to re-alert (not just | ||
| # any alert in the suite). An alert with no benchmark name is dropped — it can't be | ||
| # matched. measure may be nil; the matcher falls back to name-only for those. | ||
| def regressed_alert_pairs(report) | ||
| return [] unless report | ||
|
|
||
| report.alerts | ||
| .select(&:benchmark) | ||
| .map { |alert| RegressionReport.alert(alert.benchmark, alert.measure) } | ||
| .uniq | ||
| end |
There was a problem hiding this comment.
Treat unmatchable regression alerts as inconclusive instead of cleared.
At Line 344, alerts without a benchmark are dropped, and then Line 387 classifies empty matches as :cleared. That can silently clear a real first-run alert when the payload is incomplete (no benchmark identifier), even though it is not actually confirmable. This path should return :inconclusive (operational failure) rather than noise-cleared.
Also applies to: 379-389
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/track_benchmarks.rb` around lines 335 - 347, The current
regressed_alert_pairs method drops alerts with no benchmark
(report.alerts.select(&:benchmark)), which lets incomplete payloads be treated
later as empty matches and classified as :cleared; instead preserve unmatchable
alerts and surface them as inconclusive so they cannot be noise-cleared. Change
regressed_alert_pairs to include alerts with nil benchmark (don’t filter them
out) and convert those to an explicit inconclusive sentinel (e.g. call or
construct a RegressionReport.inconclusive(alert) or return a pair with a nil
benchmark/measure marker) so downstream matching logic can detect and return
:inconclusive; also update the empty-match handling (the classifier currently
emitting :cleared for empty matches) to return :inconclusive when it sees that
sentinel or a nil benchmark/measure pair. Ensure you reference and modify
RegressionReport.alert usage and the empty-match classification code to
recognize the new inconclusive sentinel.
Greptile SummaryThis PR implements a two-run confirmation gate for
Confidence Score: 3/5The Ruby scripts and specs are well-structured and the hand-off logic is sound, but both The
Important Files Changed
Reviews (1): Last reviewed commit: "Confirm benchmark regressions on a fresh..." | Re-trigger Greptile |
| - name: Download regression candidates | ||
| uses: actions/download-artifact@v7 | ||
| with: | ||
| name: regression-${{ matrix.artifact_name_prefix }}-${{ github.run_number }}${{ matrix.artifact_name_suffix }} | ||
| path: bench_results/regression.json | ||
| if-no-files-found: ignore | ||
| retention-days: 7 | ||
|
|
||
| - name: Benchmark workflow summary | ||
| if: always() | ||
| # Route context/matrix values through env so the shell treats them as data, | ||
| # not code (GitHub Actions injection hardening; ref_name can be attacker-set). | ||
| path: candidate-artifacts | ||
| pattern: regression-candidate-* |
There was a problem hiding this comment.
download-artifact with a glob pattern may fail when no candidates exist
actions/download-artifact@v4+ using pattern: (no name:) fails with an error when the glob matches zero artifacts. On any clean main push where no suite raised a Bencher alert, no regression-candidate-* artifacts are uploaded, so this step would fail before plan_confirmation.rb even runs. The job terminates here, emitting no outputs; confirm-regressions is skipped; report-regressions runs under always(), finds nothing, and exits 0 — but the workflow is already marked red from the failing plan-confirmation step. The same pattern appears in the report-regressions job (line 274–275) for regression-confirmed-*, meaning any run without confirmed regressions also fails the download step.
Adding continue-on-error: true to both download steps lets the scripts proceed and handle the empty-directory case they already cover correctly (plan_confirmation.rb returns true with has_confirmations=false when the directory is empty).
| - name: Download confirmed regressions | ||
| uses: actions/download-artifact@v7 | ||
| with: | ||
| path: regression-artifacts | ||
| pattern: regression-* | ||
| path: confirmed-artifacts | ||
| pattern: regression-confirmed-* | ||
| # Stdlib-only script, so the runner's system Ruby is enough (no Bundler/Rails boot). |
There was a problem hiding this comment.
Same empty-pattern failure risk for
report-regressions
Like the plan-confirmation download (line 209–213), this step will fail on any run where no regression-confirmed-* artifact was uploaded — i.e. whenever every alert was cleared as noise or the confirm-regressions job was skipped entirely. Both the cleared and the clean cases should produce a green workflow, but the failing download step prevents report_regressions.rb from running and marks the workflow red. Adding continue-on-error: true here (and at line 209) allows the scripts to reach their empty-directory short-circuits as intended.
| report.alerts.filter_map(&:benchmark).uniq | ||
| end | ||
|
|
||
| # The structured benchmark+measure pairs Bencher raised an active alert for, deduped. | ||
| # Read from the same alerts[] as #regression?/#regressed_benchmark_names. Handed off in | ||
| # the candidate so a confirmation rerun can require the SAME pair to re-alert (not just | ||
| # any alert in the suite). An alert with no benchmark name is dropped — it can't be | ||
| # matched. measure may be nil; the matcher falls back to name-only for those. | ||
| def regressed_alert_pairs(report) | ||
| return [] unless report | ||
|
|
||
| report.alerts | ||
| .select(&:benchmark) | ||
| .map { |alert| RegressionReport.alert(alert.benchmark, alert.measure) } | ||
| .uniq | ||
| end |
There was a problem hiding this comment.
Empty
ALERTS in a candidate silently clears the confirmation
load_candidate normalises a missing or non-Array alerts value to []. When candidate_alerts is [], actionable_alerts([]) returns [], and confirmed_alerts([], ...) returns [], so confirmation_outcome returns :cleared — even if the confirmation run itself did re-alert. This path can be reached if all active alerts in the initial run lacked a benchmark name (dropped by regressed_alert_pairs), which would cause the suite to be silently cleared instead of treated as inconclusive. Returning [:inconclusive, []] when candidate_alerts is empty after actionable filtering would fail safer, consistent with the "can't match what we can't name" intent stated in the docs.
|
|
||
| parsed = JSON.parse(File.read(path)) | ||
| alerts = parsed[RegressionReport::ALERTS] | ||
| alerts = [] unless alerts.is_a?(Array) |
There was a problem hiding this comment.
When ALERTS is missing from the candidate payload (key absent or non-array), this falls back to []. The problem: confirmation_outcome then calls confirmed_alerts(actionable_alerts([]), ...) which always returns [], and the outcome is :cleared — silently treating the regression as noise.
This is reachable if regressed_alert_pairs returns [] (all Bencher alerts lacked a benchmark name and were dropped by .select(&:benchmark)) while regression?(report) was still true, causing write_candidate to write ALERTS: [].
Consider returning [nil, summary] (inconclusive) instead of [[], summary] when the array is empty/absent — a candidate with no matchable alert pairs cannot be confirmed or cleared on evidence:
| alerts = [] unless alerts.is_a?(Array) | |
| alerts = parsed[RegressionReport::ALERTS] | |
| # Treat absent/empty ALERTS as inconclusive: without matchable pairs the | |
| # confirmation can neither confirm nor clear on evidence, only on speculation. | |
| unless alerts.is_a?(Array) && !alerts.empty? | |
| warn "::error::Candidate has empty or missing ALERTS (#{RegressionReport::ALERTS}); " \ | |
| "treating the confirmation as inconclusive." | |
| return [nil, parsed[RegressionReport::SUMMARY].to_s] | |
| end |
| # NO alert is an operational failure, not a regression, and still fails the suite. | ||
| def report_main_push_candidate(report, report_markdown, bencher_exit_code, suite_name) | ||
| if regression?(report) | ||
| exit 1 unless write_candidate(report, report_markdown, suite_name) |
There was a problem hiding this comment.
The intent here is that a successful write_candidate makes the initial run non-fatal (exit 0), but this is subtle: when write_candidate returns true the unless is skipped, the function returns normally, the surrounding if main_push? && bencher_exit_code != 0 block ends, and the script falls through to exit 0. A reader seeing the exit 1 on this line may be confused about how a Bencher regression alert produces a green suite.
Consider making the non-fatal exit explicit:
| exit 1 unless write_candidate(report, report_markdown, suite_name) | |
| # write_candidate succeeded: the alert is now a non-fatal candidate awaiting | |
| # fresh-runner confirmation. Exit 0 so this suite stays green; report-regressions | |
| # owns the final pass/fail. | |
| exit 1 unless write_candidate(report, report_markdown, suite_name) | |
| # (write_candidate returned true — fall through to exit 0) |
|
|
||
| def set_output(name, value) | ||
| path = ENV.fetch("GITHUB_OUTPUT", nil) | ||
| File.write(path, "#{name}=#{value}\n", mode: "a") if path && !path.empty? |
There was a problem hiding this comment.
JSON.generate always produces compact single-line JSON, so there is no newline risk today. For future-proofing, GitHub Actions supports the multiline heredoc format for GITHUB_OUTPUT which handles any embedded newlines:
| File.write(path, "#{name}=#{value}\n", mode: "a") if path && !path.empty? | |
| if path && !path.empty? | |
| # Use the multiline heredoc delimiter so embedded newlines in `value` | |
| # (e.g. if JSON ever becomes pretty-printed) don't corrupt the output file. | |
| File.open(path, "a") { |f| f.write("#{name}<<__GHEOF__\n#{value}\n__GHEOF__\n") } | |
| end |
Not a bug today, but worth the one-line change to avoid a silent breakage if the matrix JSON ever gains newlines.
| runs-on: ubuntu-24.04 | ||
| permissions: | ||
| contents: read | ||
| issues: write |
There was a problem hiding this comment.
issues: write and pull-requests: write are only needed for initial mode (posting PR comments and — indirectly — enabling the downstream report-regressions job to file issues). In confirm mode this job only submits to Bencher and writes artifacts; it never touches issues or PRs directly.
GitHub Actions does not support conditional permission blocks, so both modes get the full set. A short comment here would make this over-grant intentional and auditable:
| issues: write | |
| # issues: write and pull-requests: write are used by initial mode only | |
| # (track_benchmarks.rb posts PR summary comments). confirm mode does not post | |
| # to PRs or file issues, but workflow YAML does not support conditional | |
| # permission blocks, so both modes inherit the full set. | |
| issues: write | |
| pull-requests: write |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/spec/track_benchmarks_spec.rb`:
- Around line 187-193: The long test line calling write_candidate with
RegressionReport::ALERTS and RegressionReport::SUMMARY exceeds 120 chars; break
that call into multiple lines using Ruby method continuation (e.g., wrap
arguments in parentheses or pass a multi-line hash) so each hash key-value pair
(RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first
run") is on its own line and the overall line length is under 120 chars; keep
the surrounding Dir.mktmpdir block and the expect(load_candidate(dir)) assertion
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 79d49ece-155d-4367-a262-fa9ace62799f
📒 Files selected for processing (6)
benchmarks/plan_confirmation.rbbenchmarks/report_regressions.rbbenchmarks/spec/plan_confirmation_spec.rbbenchmarks/spec/report_regressions_spec.rbbenchmarks/spec/track_benchmarks_spec.rbbenchmarks/track_benchmarks.rb
🚧 Files skipped from review as they are similar to previous changes (4)
- benchmarks/spec/plan_confirmation_spec.rb
- benchmarks/report_regressions.rb
- benchmarks/plan_confirmation.rb
- benchmarks/track_benchmarks.rb
| it "treats non-array alerts as inconclusive" do | ||
| Dir.mktmpdir do |dir| | ||
| write_candidate(dir, RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first run") | ||
|
|
||
| expect(load_candidate(dir)).to eq([nil, "first run"]) | ||
| end | ||
| end |
There was a problem hiding this comment.
Break the long line to comply with the 120-character limit.
Line 189 exceeds the 120-character limit specified in the coding guidelines for Ruby code. As per coding guidelines, Ruby code should have a maximum line length of 120 characters and long chains should be broken using method continuation syntax.
📏 Proposed fix to break the long line
it "treats non-array alerts as inconclusive" do
Dir.mktmpdir do |dir|
- write_candidate(dir, RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first run")
+ write_candidate(dir,
+ RegressionReport::ALERTS => "not an array",
+ RegressionReport::SUMMARY => "first run")
expect(load_candidate(dir)).to eq([nil, "first run"])
end
end📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| it "treats non-array alerts as inconclusive" do | |
| Dir.mktmpdir do |dir| | |
| write_candidate(dir, RegressionReport::ALERTS => "not an array", RegressionReport::SUMMARY => "first run") | |
| expect(load_candidate(dir)).to eq([nil, "first run"]) | |
| end | |
| end | |
| it "treats non-array alerts as inconclusive" do | |
| Dir.mktmpdir do |dir| | |
| write_candidate(dir, | |
| RegressionReport::ALERTS => "not an array", | |
| RegressionReport::SUMMARY => "first run") | |
| expect(load_candidate(dir)).to eq([nil, "first run"]) | |
| end | |
| end |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmarks/spec/track_benchmarks_spec.rb` around lines 187 - 193, The long
test line calling write_candidate with RegressionReport::ALERTS and
RegressionReport::SUMMARY exceeds 120 chars; break that call into multiple lines
using Ruby method continuation (e.g., wrap arguments in parentheses or pass a
multi-line hash) so each hash key-value pair (RegressionReport::ALERTS => "not
an array", RegressionReport::SUMMARY => "first run") is on its own line and the
overall line length is under 120 chars; keep the surrounding Dir.mktmpdir block
and the expect(load_candidate(dir)) assertion unchanged.
Source: Coding guidelines
|
PR 3810 Code Review - see inline comments for details |
| strategy: | ||
| fail-fast: false | ||
| # Fall back to an empty (but valid) matrix if plan-confirmation was skipped/crashed | ||
| # before setting the output: its output is then "" and fromJSON("") would error at |
There was a problem hiding this comment.
The confirm-regressions job carries issues: write and pull-requests: write permissions, but in confirm mode track_benchmarks.rb explicitly skips replace_pr_comments() and never files issues — that is report-regressions's job. These scopes are unnecessary here and widen the blast radius if the runner token is ever misused. Suggest trimming to contents: read + checks: write (Bencher needs checks: write for a synthetic-branch run; no issue/PR scopes required).
| uses: ./.github/workflows/benchmark-suite.yml | ||
| with: | ||
| matrix_row: ${{ toJSON(matrix) }} | ||
| mode: confirm |
There was a problem hiding this comment.
The {"include":[]} fallback is correct but the comment ("crashed before setting output") understates why it's needed. GitHub Actions expands the matrix: expression before evaluating the if: condition, so if confirmation_matrix is empty string the runner would error at matrix-construction time even though the if: gate would skip the job. Worth noting that ordering explicitly — otherwise a future reader might remove the fallback as apparently redundant given the if: guard.
| issues: write | ||
| pull-requests: write |
There was a problem hiding this comment.
issues: write and pull-requests: write are not needed in confirm mode. In confirm mode track_benchmarks.rb explicitly skips replace_pr_comments() and never calls the GitHub issue API — that is report-regressions's responsibility. Consider trimming to contents: read + checks: write here (Bencher needs checks: write for the synthetic-branch submission; no issue/PR scopes are required).
| # Fall back to an empty (but valid) matrix if plan-confirmation was skipped/crashed | ||
| # before setting the output: its output is then "" and fromJSON("") would error at | ||
| # matrix-construction time even though the if: gate skips this job. The empty matrix | ||
| # produces zero jobs; the if: gate is what actually decides whether confirmation runs. | ||
| matrix: ${{ fromJSON(needs.plan-confirmation.outputs.confirmation_matrix || '{"include":[]}') }} |
There was a problem hiding this comment.
The fallback is correct and should stay, but the comment only explains what happens ("its output is then \"\""). It's worth also noting why the fallback is required: GitHub Actions expands the matrix: expression before evaluating the if: condition, so a missing or empty confirmation_matrix would cause a matrix-construction error even though the if: gate would have skipped the job. Without that context a future reader may remove the fallback as redundant given the if: guard.
| unless plan[:unmatched].empty? | ||
| warn_unmatched(plan[:unmatched]) | ||
| emit_no_confirmations | ||
| return false | ||
| end |
There was a problem hiding this comment.
Design question worth confirming: when there are both unmatched candidates and confirmable ones, this bails out entirely — no confirmations are emitted, and the workflow goes red without filing an issue. That is conservative and safe, but it means one operationally-broken shard silences a real regression in a different shard. Is that the intended trade-off? If partial confirmation is acceptable, the logic could emit confirmations for the matched rows and only warn (not fail) on the unmatched ones. Either policy is defensible; this just warrants a comment so the intention is clear to future readers.
|
Code Review Summary for PR 3810 - Two-stage benchmark regression confirmation pipeline. MEDIUM: confirm-regressions job has issues:write and pull-requests:write permissions that are not needed in confirm mode. track_benchmarks.rb skips replace_pr_comments() and never touches the issues API when BENCHMARK_MODE=confirm. Least-privilege fix: trim to contents:read + checks:write. See inline comment at lines 238-239 of benchmark.yml. LOW: The fallback matrix comment (lines 231-235) explains what happens but not why the fallback is required. GitHub Actions expands matrix: expressions before evaluating if: conditions, so a missing output would error at matrix-construction time regardless of the if: gate. Worth one sentence to prevent future removal. LOW: The Benchmark workflow summary step in benchmark-suite.yml echoes to stdout, not to GITHUB_STEP_SUMMARY. The meaningful confirmation outcome (confirmed/cleared/inconclusive) is correctly written via append_step_summary in track_benchmarks.rb, so this is informational only. QUESTION: In plan_confirmation.rb lines 158-162, when there are both unmatched candidates AND confirmable ones the function bails entirely. Safe, but one broken shard silences a real regression in another. See inline comment for detail. POSITIVE: Alert matching fallback is well-reasoned. The --start-point-reset + unique synthetic branch names correctly isolate the confirmation from mains series. Fail-loud philosophy is consistent throughout. Both write_candidate and write_confirmed have explicit true returns. 212 test examples with good coverage of the new code paths. Overall: well-engineered. The permissions issue is worth fixing before merge; the rest are minor. |
…sitives) (#3829) ## Problem ~99% of recent benchmark "performance-regression" alerts on `main` are false positives from a single **orphaned server-side Bencher threshold**. `p90_latency` (and `p99_latency`) were intentionally dropped from `track_benchmarks.rb` `THRESHOLDS` as too noisy ("tail noise can't meet the 1/20 target"). But Bencher thresholds are persistent server-side objects keyed on (branch, testbed, measure) — removing a measure from the CLI `--threshold-*` args stops *updating* the threshold, it never *deletes* it. The p90 metric is still submitted (for the dashboard), so the orphaned p90 threshold keeps evaluating p90 tail noise and firing alerts on nearly every run. These alerts are doubly bad: - They file a `performance-regression` issue (exit≠0), and - They are **invisible in the summary table** — the p90 column has no `:direction`, so it is never flagged 🔴. The filed issue names nothing actionable (this is the #3782 / #3795 "🔴 only appears in the legend" symptom). ### Evidence (public Bencher API, project `react-on-rails-t8a9ncxo`) - 255 most-recent active alerts: **248 p90-latency, 4 rps, 3 p50-latency**. - `main` branch: **164 active alerts → 162 p90-latency, 2 rps** (98.8% p90). - #3782 (`dec4b8c`) filed on exactly two p90 alerts — `/client_side_hello_world: Core` p90 `24.20 > upper 23.61` and `/rendered_html: Core` p90 — with no rps/p50/failed_pct alert. - `main`/`github-actions` carries thresholds for `rps, p50-latency, p90-latency, p99-latency, failed-pct` — two more than the code manages. (p99 is dormant: no p99 metric is submitted.) ## Fix Thread the measures we actually track (`THRESHOLDS` names) into `BencherReport`. An active alert on any **other** measure is classified as *filtered* rather than a regression. This reuses #3822's existing `filtered_alert?` exit-normalization, so a p90-only run no longer writes a candidate or files an issue. The fix is fully in-repo and makes the orphaned threshold harmless regardless of its server-side state. - Tracked measures (`rps`, `p50_latency`, `failed_pct`) are unaffected — including the "hidden" `failed_pct` regressions #3822 added coverage for. - Measure-less and benchmark-less alerts keep their existing fail-safe (still counted). - `tracked_measures` defaults to `nil` (track every measure) so non-production callers (`BenchmarkTable`, specs) are unchanged. ## Validation - `bundle exec rspec benchmarks/spec` — **234 examples, 0 failures** (5 new specs for tracked-measure filtering). - `bundle exec rubocop benchmarks/...` — no offenses. - **End-to-end against the real #3782 reports** (fetched from the Bencher API): `regression?` flips `true → false` (`filtered_alert? = true`) for both the Core and Pro p90 alerts, while rps/p50 alerts are untouched. - `script/ci-changes-detector origin/main` → Benchmark scripts → Lint (Ruby + JS). ## Companion cleanup (separate, manual — not in this PR) This neutralizes the orphaned threshold in code. To also stop it polluting the Bencher dashboard (162 cosmetic active alerts) and being cloned to every new branch, delete the server-side thresholds (needs `BENCHER_API_TOKEN`): - `main` p90 `51fb6a47-0083-4e84-a745-60ee42e3bba4`, p99 `6faa7a68-1835-4cd7-96dc-959220737172` - `master` p90 `d4ad2066-74cb-41f7-93bb-f0885358c56c`, p99 `61bf5ae6-77fa-474f-b6d8-ded630bd0c20` ## Relationship to other work Completes the noise fix that #3810 (fresh-runner confirmation) and #3822 (stale-alert filtering) started: #3822 only filters alerts whose metric *recovered*; a **live** p90 crossing (the dominant case) still passed through. Substantially removes the p90 tail noise behind #3169 and the issue explosion researched in #3755. Refs #3755, #3169, #3795 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes which Bencher alerts count as regressions and can affect main-push candidate filing and CI exit behavior, though scoped to benchmark reporting with explicit specs and a nil default for other callers. > > **Overview** > Adds optional **`tracked_measures`** to `BencherReport.parse` / `#initialize` so active Bencher alerts on measures the repo no longer tracks (e.g. orphaned server-side **p90_latency** thresholds) are moved to **`filtered_alert?`** instead of **`regression?`**, reusing the existing exit-code normalization for filtered-only runs. > > **`track_benchmarks.rb`** now passes `THRESHOLDS.map(&:first)` when parsing the JSON report, so p90-only false positives no longer write regression candidates or file issues the summary table cannot flag. Tracked measures, measure-less fail-safe alerts, and callers that omit `tracked_measures` stay backward compatible. > > Specs cover orphaned p90 filtering, slug/name normalization, and a small fix for `BencherReport.new` with a Hash root in a perf-links test. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 7693f64. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **Bug Fixes** * Fixed false regression alerts for measures no longer tracked in performance benchmarks. The system now filters regression reports to only flag alerts for actively monitored metrics, preventing issue creation based on orphaned or retired threshold measurements that are no longer part of the current tracking configuration. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## Problem A confirmed `performance-regression` issue currently only shows the per-suite summary tables. Those tables flag 🔴 on **rps** and **p50** but never on **failed_pct** (it has no column — it lives in the "Status" cell) and never on any untracked measure. So a reader has to open the tables to see what regressed, and a `failed_pct`-only regression renders with no visible 🔴 at all — the same "the issue names nothing actionable" problem behind #3782/#3795. ## Fix Add a **"What regressed"** section to the issue body, built from the confirmed `ALERTS` payload (#3810's structured `{benchmark, measure}` pairs), aggregated across all suites and computed once in `report_regressions`: ``` ### What regressed The fresh-runner confirmation re-alerted on these benchmark + measure pairs: - `/client_side_log_throw: Pro` — **rps** - `/server_side_redux_app: Core` — **failed_pct** ``` - Threaded through `report_suite` → `RegressionIssueReporter`; only the reporter that *creates* the issue renders the body, so the cross-suite list is complete. - `regressed_overview` defaults to `""` → the section is omitted and the body is byte-for-byte unchanged, so it composes with the existing flow and with older payloads that lack `ALERTS`. ## Validation - `bundle exec rspec benchmarks/spec` — **233 examples, 0 failures** (4 new: helper rendering/dedup + issue-body inclusion/omission). - `bundle exec rubocop benchmarks/...` — no offenses. - Rendered the body locally with and without pairs to confirm clean Markdown spacing and that the no-pairs body is unchanged. ## Relationship to other work Complements #3829 (filter alerts to tracked measures): #3829 stops the p90 false positives from filing at all; this makes the *real* confirmed regressions self-explanatory in the issue body. Refs #3755. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > CI reporting and GitHub issue markdown only; no runtime app or auth changes. > > **Overview** > Adds a **"What regressed"** section to auto-filed `performance-regression` issue bodies, listing confirmed benchmark + measure pairs as markdown bullets (including measures like **failed_pct** that the per-suite tables do not surface clearly). > > `report_regressions` builds the list once from each confirmed payload's structured `ALERTS` via new `regressed_overview_markdown` (deduped across suites) and passes `regressed_overview` into `RegressionIssueReporter`. The section is omitted when the overview is empty, so older payloads without `ALERTS` keep the previous body shape. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 3068c93. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Performance regression reports now include a "What regressed" overview section, displaying all confirmed benchmark and measure pairs that regressed across test suites in a single consolidated view. * **Tests** * Added comprehensive test coverage for the new regression overview markdown rendering logic and deduplication. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* origin/main: Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) Confirm benchmark regressions on a fresh runner before filing the main issue (#3810) Define agent scope and accelerated RC auto-merge policy (#3808) Replace custom MockClient with async-http Mock::Endpoint (#3703) Docs: per-request data sharing in RSC with React.cache() (#3769) Pro RSC: share unstable_cache across renderer workers via Redis (#3705) [codex] Add PR batch planning skill (#3792) Docs: document PR batch operational lessons (#3789) Document dummy Redux state indexing rationale (#3781) Pro RSC: avoid caching failed Flight renders (#3775) # Conflicts: # packages/react-on-rails-pro/tests/getReactServerComponent.client.test.ts
* origin/main: (23 commits) Enforce Pro license headers in CI and pre-commit (#3821) Add RSC payload route-data helper (#3783) [Pro] Fix React.cache request dedupe in generated RSC configs (#3813) Docs: clarify RuboCop autofix ownership (#3827) Add Pro license header checker RSC: stop serializing props into embedded payload cache key (#3800) Make PR batches skip customer-feedback issues (#3826) Name the regressed benchmark+measure pairs in the issue body (#3830) Clarify agent batch policy handoffs (#3824) Filter Bencher alerts to tracked measures (drop orphaned p90 false positives) (#3829) Fix auto-bundled component pack normalization (#3818) Filter stale Bencher alerts before reporting (#3822) Tighten benchmark confirmation workflow permissions (#3819) Add issue evaluation skill (#3816) Confirm benchmark regressions on a fresh runner before filing the main issue (#3810) Define agent scope and accelerated RC auto-merge policy (#3808) Replace custom MockClient with async-http Mock::Endpoint (#3703) Docs: per-request data sharing in RSC with React.cache() (#3769) Pro RSC: share unstable_cache across renderer workers via Redis (#3705) [codex] Add PR batch planning skill (#3792) ...
## Summary The revert gate for the temporary `/posts_page: Pro` regression-issue suppression has been verified PASSED (gate evidence: #3669 (comment) — the failure-era samples have rolled out of Bencher's 64-run t-test window and the baseline reflects real timings). This PR removes the suppression entry so a confirmed regression on `/posts_page: Pro` files an issue again. This is a **minimal revert**: `IGNORED_REGRESSION_BENCHMARKS` is emptied, but the general ignore machinery is kept (see decision log below). The issue's original "How to revert" steps are partially stale after #3670/#3810; this follows the updated plan. Closes #3669 ## Changes - `benchmarks/lib/regression_report.rb`: `IGNORED_REGRESSION_BENCHMARKS` is now `[].freeze`; the /posts_page-specific temporary-suppression comment block is replaced with a short doc of the general-purpose mechanism (exact Bencher name matching, pre-rerun short-circuit, tracking-issue requirement for any future entry). - Specs that read `IGNORED_REGRESSION_BENCHMARKS.first` (regression_report_spec, plan_confirmation_spec, and the confirmation-outcome example in track_benchmarks_spec) now `stub_const` a sample entry (`"/ignored: Pro"`) so they keep pinning the machinery against the now-empty production list. No coverage was deleted. - `"/posts_page: Pro"` remains in track_benchmarks_spec only as an inert fixture name in report-parsing examples. ## Validation - `bundle exec rspec benchmarks/spec/regression_report_spec.rb benchmarks/spec/plan_confirmation_spec.rb benchmarks/spec/track_benchmarks_spec.rb` (root bundle) → 63 examples, 0 failures - `(cd react_on_rails && BUNDLE_GEMFILE=../Gemfile bundle exec rubocop)` → 214 files, no offenses - `BUNDLE_GEMFILE=Gemfile bundle exec rubocop benchmarks/` → 36 files, no offenses - `script/ci-changes-detector origin/main` → Benchmark scripts; recommended job: Lint (Ruby + JS) - `git grep "3669"` / `git grep "posts_page: Pro"` → no live suppression references remain (fixture-only uses kept) - Codex review: skipped — CLI usage limit reached (resets Jun 10 6:30 PM); self-review performed instead Labels: none — this changes benchmark regression-reporting tooling only, fully validated by its specs; it is not a performance-sensitive runtime path, and a benchmark run would not exercise the suppression branch (it only activates when a Bencher alert fires on a main push). ## Codex Decision Log - **Minimal revert scope (coordinator decision):** emptied the ignore list but KEPT `ignored_benchmark?`, `actionable_alerts`, `actionable_benchmarks`, and plan_confirmation.rb's `fully_ignored?`/suppressed handling. Why: #3810 made the ignore list load-bearing general plumbing in the confirmation gate; full machinery removal is out of scope for #3669. Maintainers may revisit whether the machinery should be removed entirely if no new suppression is ever needed. - **Specs converted to `stub_const`** instead of deleted: the examples previously sampled the live list (`.first`), which is now empty. Stubbing `RegressionReport::IGNORED_REGRESSION_BENCHMARKS` keeps the machinery pinned independent of production contents. Note: track_benchmarks_spec.rb line 332 also sampled the live list (the issue's plan assumed it was fixture-only — stale), so it got the same conversion to avoid a nil-driven false `:confirmed` outcome. - **plan_confirmation.rb comments left untouched:** they document the general mechanism and never claim an active suppression. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Benchmark CI reporting tooling only; no runtime app behavior, and specs still cover the ignore path via stubs. > > **Overview** > Reverts the temporary **`/posts_page: Pro`** regression-issue suppression (#3669): **`IGNORED_REGRESSION_BENCHMARKS`** is now **`[]`**, so confirmed regressions on that benchmark can open issues again. The long `/posts_page`-specific comment is replaced with a short description of the general ignore mechanism (exact Bencher names, pre-rerun short-circuit, tracking issue required for future entries). > > Specs that previously used **`IGNORED_REGRESSION_BENCHMARKS.first`** now **`stub_const`** a sample **`"/ignored: Pro"`** so ignore-list behavior stays covered while production has no active suppressions. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a79526d. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Cleared the regression benchmark suppression list; all confirmed regressions will now be reported. * Updated test setup with deterministic test data to ensure stable regression report behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai --> ## Agent Merge Confidence Mode: accelerated-rc Score: 9/10 Auto-merge recommendation: yes Affected areas: benchmarks regression-reporting tooling (no runtime/perf surface) CI detector: `script/ci-changes-detector origin/main` -> Benchmark scripts; recommended job: Lint (Ruby + JS) Validation run: - `bundle exec rspec benchmarks/spec/regression_report_spec.rb benchmarks/spec/plan_confirmation_spec.rb benchmarks/spec/track_benchmarks_spec.rb` -> 63 examples, 0 failures - `(cd react_on_rails && BUNDLE_GEMFILE=../Gemfile bundle exec rubocop)` -> no offenses; `BUNDLE_GEMFILE=Gemfile bundle exec rubocop benchmarks/` -> no offenses Review/check gate: - Claude review: complete for a79526d (claude-review SUCCESS), no confirmed blocker; one OPTIONAL spec-scoping nit triaged with reply + resolved - Codex review: skipped — CLI usage limit (resets Jun 10 6:30 PM), evidence in PR body; claude-review served as the independent gate - GitHub checks: complete for a79526d, all pass; skips are path-selection per detector Known residual risk: none — suppression-list removal is fully covered by the updated specs; 1-point deduction for the skipped secondary Codex pass Finalized by: Claude Code Review check `claude-review` (SUCCESS for head a79526d, GitHub Checks API record) Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

Summary
Fixes #3670. A single noisy Bencher alert on
maincould previously create/update aperformance-regressionissue. This adds a fresh-runner confirmation step between the initial benchmark run and themainissue: a main-push alert is now a non-fatal candidate that must re-alert on an independent rerun before any issue is filed. Requiring a second, independent alert turns the per-commit false-positive rate from ~p into ~p² for runner noise.Scope is exactly as the issue specifies: only
mainissue creation is gated. PR summary comments still post unconditionally (initial mode, unchanged).Pipeline (
benchmark.yml)The per-suite execution is extracted into a reusable workflow (
benchmark-suite.yml, modesinitial/confirm) so the initial run and the confirmation rerun share one set of steps with no drift.Behavior
IGNORED_REGRESSION_BENCHMARKSThe workflow summary states confirmed / cleared-as-noise / inconclusive per alert.
Key implementation notes
confirm-main-<run_id>-<suite-slug>(nevermain) with--start-point main --start-point-clone-thresholds --start-point-reset, so the confirmation sample is compared against main's baseline but never appended to main's series — exactly the anchoring the PR path already uses.BencherReport::Alertalready exposes the measure slug). Falls back to benchmark-name-only when a measure is absent, as the issue permits.report-regressionsowns the final pass/fail. A confirmed regression fails the workflow there.IGNORED_REGRESSION_BENCHMARKS(Suppress regression issues when only /posts_page: Pro regresses #3668) moved to the sharedbenchmarks/lib/regression_report.rband is now applied inplan-confirmation, short-circuiting a rerun for benchmarks that would be suppressed anyway.regression-candidate-*/regression-confirmed-*).Out of scope (sequenced follow-up, per #3670)
Loosening the
THRESHOLDSto buy back the statistical power that double-gating costs is intentionally a separate, empirical follow-up: it requires measuring the post-confirmation false-positive and missed-regression rates on realmaindata, so it cannot be tuned blindly here.Known tradeoff (acknowledged in the issue)
The confirmation clones main's baseline after the initial run appended its (regressing) sample to main's series, so a borderline regression can be slightly diluted by its own already-merged sample. This is inherent to confirming post-merge and is what the threshold-loosening follow-up addresses. A future refinement could pin the clone to the commit's parent via
--start-point-hash.Testing
bundle exec rspec benchmarks/spec— 212 examples, 0 failures (26 new:regression_report_spec,plan_confirmation_spec, confirmation-mode cases intrack_benchmarks_spec;report_regressions_specreworked for the confirmed-artifact contract).actionlintclean on bothbenchmark.ymlandbenchmark-suite.yml.rubocopclean on all changed/new Ruby files.Live validation gap: the multi-job GitHub Actions DAG (reusable-workflow calls, dynamic confirmation matrix, fresh-runner reruns, Bencher submissions) cannot be exercised locally — it needs a real
mainpush with a Bencher token. A maintainer should watch the first alertingmainrun to confirm the candidate→confirm→report handoff end to end. No CHANGELOG entry: per the repo's changelog guidelines this is internal CI/benchmark tooling, not a user-visible source change.🤖 Generated with Claude Code
Note
Medium Risk
Large CI workflow refactor changes when benchmark workflows fail and when regression issues are filed; mistakes could miss real regressions or spam issues, but scope is internal benchmark tooling only.
Overview
Adds a two-stage main regression pipeline: Bencher alerts on the first
mainrun are non-fatal candidates (regression-candidate.jsonartifacts) instead of failing the suite or filing immediately. New jobsplan-confirmation(filter ignored benchmarks, build rerun matrix) andconfirm-regressionsrerun only alerting suite/shards on fresh runners via a shared reusable workflowbenchmark-suite.yml(initial/confirmmodes).track_benchmarks.rbwrites candidates with structured benchmark+measure alert pairs; confirmation submits to a synthetic Bencher branch againstmain's baseline without appending tomain's series, and emitsregression-confirmed.jsononly when the same pairs re-alert.report_regressions.rbfiles issues from confirmed payloads only, shows first-run vs confirmation tables, and owns final workflow failure on confirmed regressions.IGNORED_REGRESSION_BENCHMARKSmoves toregression_report.rband short-circuits before confirmation reruns inplan_confirmation.rb.Reviewed by Cursor Bugbot for commit 985121d. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit
New Features
Refactor
Tests