docs: document benchmark gate tuning plan#3234
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a "Main Gate Re-Enablement Plan" to internal/planning/library-benchmarking.md describing prerequisites (PR 3148, ≥30 post-merge main Benchmark Workflow runs), a stepwise tuning sequence while keeping main-regression alerts in warning mode, detailed overlap/noise criteria and acceptance criteria for restoring the hard CI gate, and a pointer to Issue 3169. ChangesMain Gate Re-Enablement Plan
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~4 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR extends Confidence Score: 5/5Safe to merge — documentation-only change with no runtime impact. The PR adds only markdown documentation. No code, configuration, or CI workflow files are modified. The content is clear, internally consistent, and correctly references the blocking PR (#3148) and tracking issue (#3169). No issues found. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([PR #3148 merged]) --> B{30+ post-merge\nmain runs accumulated?}
B -- No --> C[Keep gate in warning mode\ncollect baseline history]
C --> B
B -- Yes --> D[Compare adjacent main runs\nfor shared alert pairs]
D --> E{Alert sets\nnear-disjoint?}
E -- Yes, noise dominated --> F[Move to larger/dedicated runners\nor widen Bencher boundary toward 0.99]
F --> D
E -- No, stable alerts --> G[Add --threshold-min-sample-size\nrequire consecutive-run agreement]
G --> H{Acceptance criteria met?\n5 clean runs, tuned false-positive rate}
H -- No --> D
H -- Yes --> I([Restore hard gate on main])
Reviews (1): Last reviewed commit: "Document benchmark gate tuning plan" | Re-trigger Greptile |
Code Review: docs: document benchmark gate tuning planOverall this is a well-structured, operationally useful addition. The three-section layout (Baseline Dependency → Tuning Sequence → Acceptance Criteria) is logical, and the acceptance criteria are concrete and measurable. A few minor clarity suggestions via inline comments below. Strengths
Minor suggestions (see inline comments)
|
Code ReviewOverall: Well-structured, clearly written documentation PR. The plan is logically sequenced — prerequisites before tuning, tuning before restoring the hard gate — and the acceptance criteria are a good start. A few suggestions to tighten clarity and actionability. Strengths
Suggestions
See inline comments for specific line suggestions. |
Code ReviewDocumentation-only PR adding a structured re-enablement plan for the CI benchmark hard gate. The three-section layout (Baseline Dependency → Tuning Sequence → Acceptance Criteria) is logical, and the acceptance criteria are measurable. Three items worth addressing before merge: 1. PR #3148 has already merged (2026-04-23) 2. 3. Consecutive-alert requirement is a configuration step, not a passive criterion See inline comments for line-level suggestions. |
Code ReviewSummary: Documentation-only change adding a concrete re-enablement plan for the Bencher main CI gate. Well-structured overall with measurable criteria and good traceability to related PR/issue. Strengths
Suggestions1. Quantify "near-disjoint" (Tuning Sequence, step 2) "Near-disjoint" is subjective and leaves the reviewer with no decision threshold. Consider adding a concrete overlap ratio, e.g. "fewer than 20% of alerts appear in both adjacent runs" as the threshold for concluding noise still dominates. 2. Missing rollback trigger (Acceptance Criteria) The plan defines when to re-enable the gate but not what happens if it over-fires afterward. Consider a brief note like: "If the false-positive rate exceeds the 1-in-20 target after re-enabling, revert the gate to warning mode and re-tune thresholds before attempting again." Without this, teams may leave a noisy gate in place rather than taking explicit action. 3. Timeline expectation "At least 30 post-merge main runs" gives no sense of wall-clock time. If main gets ~2–3 non-docs pushes per week, that is roughly 10–15 weeks. A rough estimate (even parenthetical) would help set reviewer/team expectations on when this work becomes actionable. Overall this is a solid planning doc and the approach is technically sound. The above are minor improvements rather than blockers. |
Code ReviewOverall: This is a well-structured, actionable planning document. The section fits naturally alongside the existing strategy content, the acceptance criteria are concrete and measurable, and the reference to Issue 3169 gives a live tracking thread. A few small clarity gaps are worth tightening before merge. SuggestionsConsecutive-run count is unspecified (Tuning Sequence step 3) Mixed-content pushes are ambiguous (Acceptance Criteria) No pointer to where the gate actually lives Minor: duplication with Future Considerations Not blockingThese are all documentation-clarity issues. The plan itself is sound and the acceptance criteria are exactly the kind of concrete, falsifiable targets that make re-enablement decisions defensible. |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
internal/planning/library-benchmarking.md (1)
65-68: ⚡ Quick winMinor: Add 1 sentence on failure mode when gate gets flipped back (how quickly to re-tune).
You already say “revert it to warning mode and re-tune thresholds before trying to re-enable again” (Lines 99-101), but it doesn’t say whether re-tuning requires re-collecting history (e.g., “start a fresh baseline window”) or can be done immediately using existing overlap data. Adding a single operational rule reduces repeated churn.
Also applies to: 76-87
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/planning/library-benchmarking.md` around lines 65 - 68, The document lacks an operational rule describing the failure mode when the gate is flipped back to warning mode and how quickly thresholds should be re-tuned; add one clear sentence near the existing guidance that states whether re-tuning must re‑collect history (e.g., "start a fresh baseline window of X runs") or may use the existing overlap data and how long to wait before re-enabling the hard gate—reference the terms "main-regression alerts", "warning mode", "re-tune thresholds", and "baseline window/overlap data" so readers can find and apply the rule consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/planning/library-benchmarking.md`:
- Around line 71-74: Clarify the “30 post-merge main runs” rule: count GitHub
Actions Bencher workflow executions triggered on the main branch after the merge
of PR 3148 (do not count runs from other branches or pre-merge test runs);
include scheduled workflow runs and manual/auto reruns as separate counts (each
rerun increments the count once) but exclude runs triggered by doc-only commits
or changes that skip the benchmark workflow; verify and audit counts by listing
the Bencher workflow run history on the repository’s Actions -> Workflows ->
Bencher benchmark workflow page and cross-reference the corresponding Bencher
dashboard/history for fresh baselines, and annotate each counted run in the
PR/issue with its run-id and timestamp for traceability.
- Around line 99-101: Define "no more than 1 noisy failure in 20
unchanged-performance runs" concretely: classify a "noisy failure" as a
benchmark alert that does not show a reproducible performance delta (absolute
delta below a tolerance and statistically non-significant) and whose alert
window does not overlap with code changes that touch the benchmark owner (i.e.,
alert appears but delta < X% and p-value > 0.05); classify a "real regression"
as an alert with a reproducible delta beyond tolerance (e.g., >1% or
configurable threshold) and statistically significant (p ≤ 0.05) and/or
overlapping with relevant code changes. Define the "unchanged-performance runs"
cohort as the last 20 successful benchmark workflow runs on main that meet
baseline stability criteria (each benchmark's coefficient of variation ≤5% over
that window and no infra/anomaly tags), excluding runs triggered by commits that
modify the benchmark's owner or relevant performance-sensitive files; count "20"
as 20 successful benchmark workflow runs (re-runs allowed and counted separately
only if they were completed successfully and not marked as infra-failure). Apply
these rules in the restore-gate logic (the text around "no more than 1 noisy
failure in 20 unchanged-performance runs") so teams can consistently classify
runs and select the cohort.
- Around line 78-82: The doc currently mentions “shared (benchmark, measure)
alert pairs” and a ~20% overlap cutoff without defining the metric; replace the
vague phrase in bullet 2 with an explicit definition such as: define alert sets
A and B for two adjacent runs and compute alert overlap = |A ∩ B| / |A ∪ B|
(Jaccard index), state the threshold as overlap < 0.20 indicates runner noise,
and add a short note that if Bencher applies any normalization (or an alternate
“shared-alert fraction” is used) you must cite the exact computation and graph
in Issue 3169 (reference the specific section/graph/commit from Issue 3169) so
readers can reproduce the value; also add a one-line concrete example of
computing the metric for clarity.
- Around line 91-93: The acceptance criterion "5 consecutive non-docs main
pushes pass the warning-mode check" is ambiguous; update the document text to
(1) define "non-docs" via an explicit glob set (e.g., exclude changes matching
docs/**, **/*.md, internal/**, .github/**, README.md) and state that any
PR/commit that touches files outside those globs counts as a non-docs push, (2)
name the exact CI job/status referred to by "warning-mode check" (use the job
name or workflow identifier) so readers know what to observe, and (3) add a
fallback clause to handle low traffic such as "or within X main pushes or Y
days" (suggest values like 10 main pushes or 30 days) so the requirement is
robust; revise the sentence in the planning doc accordingly to include these
explicit rules and the fallback.
- Around line 83-86: The doc omits explicit Bencher flag values and misstates
flags; update the "library-benchmarking.md" to add a "Bencher config mapping"
section that copies the exact runtime values used in the workflow (e.g.,
BOUNDARY=0.95, MAX_SAMPLE=64 and the actual threshold flag used in the workflow
such as --threshold-max-sample-size=64) and clarify whether the intended flag is
--threshold-min-sample-size or --threshold-max-sample-size (fix the text to
match the workflow invocation), and explicitly state the consecutive-run
confirmation behavior by naming the Bencher flag or external logic (or note that
none exists) and the required count (e.g., require 2 consecutive alerts) so
acceptance criteria match the actual invocation strings like BOUNDARY,
MAX_SAMPLE and the --threshold-* flags referenced in the repo.
---
Nitpick comments:
In `@internal/planning/library-benchmarking.md`:
- Around line 65-68: The document lacks an operational rule describing the
failure mode when the gate is flipped back to warning mode and how quickly
thresholds should be re-tuned; add one clear sentence near the existing guidance
that states whether re-tuning must re‑collect history (e.g., "start a fresh
baseline window of X runs") or may use the existing overlap data and how long to
wait before re-enabling the hard gate—reference the terms "main-regression
alerts", "warning mode", "re-tune thresholds", and "baseline window/overlap
data" so readers can find and apply the rule consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 87024bba-e5eb-4cef-befd-b7d73bc28c12
📒 Files selected for processing (1)
internal/planning/library-benchmarking.md
Code ReviewDocumentation-only PR adding a "Main Gate Re-Enablement Plan" section to Workflow Accuracy CheckI cross-checked the doc's configuration snapshot against the actual
The snapshot is accurate at time of writing and the IssuesBencher URL discrepancy (minor) Forward reference to "qualifying run" (minor) Manual gate: one-time regressions are intentionally silenced Strengths
|
Code Review — PR #3234: docs: document benchmark gate tuning planOverviewThis PR adds a Main Gate Re-Enablement Plan section to the internal benchmarking planning doc and adds a Bencher dashboard URL exclusion to The planning document is comprehensive and well-reasoned. The use of Jaccard overlap to distinguish noise from signal is a solid empirical approach, and the tuning sequence → escalation → acceptance-criteria structure is clear and actionable. The math example ( IssuesLychee URL exclusion gap (potential CI failure)The newly added document references The fix is either to widen the existing pattern to cover both paths, or to verify that Minor observations
Checklist
Verdict: Approve once the lychee URL coverage gap is resolved (or confirmed not to be a problem). The planning content itself is solid. |
Fold in eight non-blocking suggestions from auto-reviews on PR 3234: archive note for plan lifecycle, manual+auto rerun specificity, operational-flag scope note for the snapshot, "5 pairs = 6 runs" clarification, motivation for the 20% smoke-test magnitude, BENCHER_HAS_ALERT=1 semantics inline, criterion-4 rollback cadence reference, and an explicit rolling-window definition for the 1-in-20 false-positive rate that excludes intentional-perf-change commits from both the numerator and denominator.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Auto-review round summary (post-648a4b70e)I (Claude Code) triaged the 14 auto-generated PR-level reviews that landed after my last batch reply at 08:17 UTC. Most of their substantive feedback was already addressed in commits 4c74827 → 8351e71 (rps upper-boundary mention, "external tracking" → "manual tracking", linear-ladder iteration policy, formatting at the step-3 continuation, Just landed in 029aa38 (this commit)Eight non-blocking polish items pulled from the new round of auto-reviews:
Skipped (with rationale)
I am not posting individual rationale replies on the duplicative LGTM-style summary reviews; their concerns are already addressed (per the mapping above) and threads are not resolvable on PR-level comments. |
Code Review: docs: document benchmark gate tuning planOverall: Solid documentation PR. The technical claims in the new section are accurate — I verified Issues Found1. The new exclude rule targets That said, lychee only scans 2. Tuning Sequence step 2 is very dense Lines 113–126 pack many nested conditions — qualifying-run definition, Jaccard formula, 3. Criterion 5 has a stale line-wrap Line 166 reads Minor Observations
|
- Broaden lychee exclusion to cover both /console/projects/ and /perf/ paths under bencher.dev so future scans of either URL prefix stay defensive without a follow-up edit. - Restructure Tuning Sequence step 2 into sub-bullets (data source, qualifying run, Jaccard formula, when to start, noise floor, signal threshold, small-sample caveat) so each condition is independently scannable. - Rewrap criterion 5 at the sentence boundary after "real or noisy." to fix the 173-character mid-clause line and match the surrounding ~120-character wrap. - Cross-reference the 2-consecutive-run requirement between Tuning Sequence step 3 and Acceptance Criterion 2 so a reader sees they describe the same gate.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Auto-review round summary (post-029aa3893)I (Claude Code) triaged the new claude[bot] review session that landed at 01:30 UTC (PR-level summary #4414146528 plus three inline comments). All four readability suggestions are addressed in commit a897d93; nothing was a blocker (the review itself opens with "Solid documentation PR" and verifies all technical claims against Just landed in a897d93
Skipped (with rationale)
All three new inline review threads are resolved. |
|
Review: docs: document benchmark gate tuning plan - see inline comments for details |
|
Accuracy check passed: all values in Current Bencher Configuration verified against .github/workflows/benchmark.yml (BOUNDARY=0.95 line 558, MAX_SAMPLE=64 line 559, --err line 609, t_test lines 613-635, BENCHER_HAS_ALERT lines 681-692, both warning/error step names, and the re-enable instruction). Minor suggestions posted as inline comments. |
Refactor the dense Main Gate Re-Enablement Plan intro paragraph into a short rationale sentence followed by a bulleted list of standing instructions (history wait, fallback, baseline-reset exception, archival trigger). Make the archival mechanism explicit by spelling out that the entire `## Main Gate Re-Enablement Plan` section (heading through end of file, including the trailing Issue 3169 link) gets deleted once the plan executes. Addresses #3234 (comment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Auto-review round summary (post-a897d9391)I (Claude Code) triaged the new claude[bot] review session that landed at 06:24–06:25 UTC (PR-level pointer #4414622003, accuracy-check ack #4414623593, and two inline comments). One readability suggestion was actionable; everything else was either a positive acknowledgment or an explicit FYI. Just landed in dfd4250
Skipped (with rationale)
Both new inline review threads are resolved. |
Code ReviewOverall: LGTM with minor note — this is a well-structured documentation PR. The Strengths
IssueDuplicated gate condition (Acceptance Criteria 2 = Tuning Sequence step 3): the same Minor observations
No blocking issues. The one real maintenance risk is the duplicated gate condition. |
| Issue 3169 to show the same `(benchmark, measure)` pair alerting on at least 2 consecutive runs before filing or | ||
| failing; a single noisy run does not trigger the gate. This is a manual gate: Bencher still alerts on the first run, | ||
| and the requirement is that a reviewer confirms recurrence in Issue 3169 before acting on it. (This is the same gate | ||
| stated in Tuning Sequence step 3; restated here because it is also a pre-condition for the 5-run clean-run count.) |
There was a problem hiding this comment.
The gate condition here (same (benchmark, measure) pair must alert on ≥ 2 consecutive runs) is intentionally duplicated from Tuning Sequence step 3. The duplication is acknowledged, but if one copy is updated during active tuning, both need to stay in sync. Consider adding a bold cross-reference in step 3 — e.g., "Keep in sync with Acceptance Criteria 2" — so editors are warned before they touch only one copy.
| invalidate the old history. | ||
| - **Archive when done.** Once the hard gate has been restored and held for 30 or more qualifying runs without breaching | ||
| the false-positive target, delete the entire `## Main Gate Re-Enablement Plan` section (this heading and every | ||
| subsection through the end of the file, including the trailing Issue 3169 link) in a follow-up commit; the executed |
There was a problem hiding this comment.
"The trailing Issue 3169 link" is unambiguous today because this section ends the file, but if content is ever appended below this heading the phrase becomes stale. A slightly more durable phrasing: "delete from this ## Main Gate Re-Enablement Plan heading through the last line of the file as it stands at deletion time".
…y-patch-DrhDk * origin/main: docs: document legacy Webpacker compatibility shims (#3225) chore: sync address-review command and import prompt variant (#3271) feat: add REACT_ON_RAILS_BASE_PORT for concurrent worktree port management (#3142) feat: align renderer cache staging with bundle-hash layout (#3124) docs: document benchmark gate tuning plan (#3234) docs: document Gumroad RSC benchmark signal (#3233) docs: document migration proof artifact template (#3227) fix: support nested JavaScript package roots in doctor (#3220) docs: add Claude verify command (#3235) fix: serialize generated pack regeneration (#3231) docs: amend CHANGELOG entry for #3229 scope drift fix: validate selected JavaScript package manager fix: preserve causes when wrapping render errors (#3230) # Conflicts: # CHANGELOG.md
* origin/main: docs: document RSC node renderer test setup (#3223) Clarify Pro pricing and license guidance (#3272) docs: add AI security scanner evaluation plan (#3236) fix: pin pnpm version in scaffolded CI when packageManager is absent (#3172) (#3174) docs: document RSC HTTP response ownership (#3222) docs: document legacy Webpacker compatibility shims (#3225) chore: sync address-review command and import prompt variant (#3271) feat: add REACT_ON_RAILS_BASE_PORT for concurrent worktree port management (#3142) feat: align renderer cache staging with bundle-hash layout (#3124) docs: document benchmark gate tuning plan (#3234) docs: document Gumroad RSC benchmark signal (#3233) docs: document migration proof artifact template (#3227) fix: support nested JavaScript package roots in doctor (#3220) docs: add Claude verify command (#3235) fix: serialize generated pack regeneration (#3231) # Conflicts: # .gitignore # react_on_rails/spec/dummy/Procfile.dev
Summary
Refs #3169
Test Plan
pnpm exec prettier --check internal/planning/library-benchmarking.mdgit diff --checkNote
Low Risk
Low risk: changes are documentation plus a link-checker exclusion for Bencher URLs; no runtime or CI behavior is modified.
Overview
Adds a detailed “Main Gate Re-Enablement Plan” to
internal/planning/library-benchmarking.md, defining baseline prerequisites (post-PR-3148 history), an alert-overlap based tuning sequence, and acceptance criteria/false-positive targets before switching Bencher regressions from warnings back to a hard CI gate.Updates
.lychee.tomlto exclude authenticated Bencher dashboard/perf URLs from link checking to avoid CI false failures.Reviewed by Cursor Bugbot for commit dfd4250. Bugbot is set up for automated code reviews on this repo. Configure here.
Summary by CodeRabbit