Skip to content

docs: document benchmark gate tuning plan#3234

Merged
justin808 merged 24 commits into
mainfrom
codex/plan-benchmark-gate-tuning
May 10, 2026
Merged

docs: document benchmark gate tuning plan#3234
justin808 merged 24 commits into
mainfrom
codex/plan-benchmark-gate-tuning

Conversation

@justin808

@justin808 justin808 commented Apr 30, 2026

Copy link
Copy Markdown
Member

Summary

  • extend the library benchmarking strategy with a main-gate re-enable plan
  • capture the baseline dependency on PR Fix Bencher reporting permanently broken on pushes to main #3148 and the need for about 30 clean main runs
  • define threshold, sample-size, overlap, and runner-noise acceptance criteria before restoring a hard gate

Refs #3169

Test Plan

  • pnpm exec prettier --check internal/planning/library-benchmarking.md
  • git diff --check
  • pre-commit/pre-push hooks

Note

Low Risk
Low risk: changes are documentation plus a link-checker exclusion for Bencher URLs; no runtime or CI behavior is modified.

Overview
Adds a detailed “Main Gate Re-Enablement Plan” to internal/planning/library-benchmarking.md, defining baseline prerequisites (post-PR-3148 history), an alert-overlap based tuning sequence, and acceptance criteria/false-positive targets before switching Bencher regressions from warnings back to a hard CI gate.

Updates .lychee.toml to exclude authenticated Bencher dashboard/perf URLs from link checking to avoid CI false failures.

Reviewed by Cursor Bugbot for commit dfd4250. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • Documentation
    • Added a "Main Gate Re-Enablement Plan" in internal planning: describes baseline run requirements, a stepwise tuning sequence for alert sensitivity, guidance to escalate to dedicated benchmarking runners if noise persists, and acceptance criteria (consecutive-run checks, induced-regression verification, and false-positive targets) for restoring the CI main gate.
  • No end-user facing changes in this release.

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a "Main Gate Re-Enablement Plan" to internal/planning/library-benchmarking.md describing prerequisites (PR 3148, ≥30 post-merge main Benchmark Workflow runs), a stepwise tuning sequence while keeping main-regression alerts in warning mode, detailed overlap/noise criteria and acceptance criteria for restoring the hard CI gate, and a pointer to Issue 3169.

Changes

Main Gate Re-Enablement Plan

Layer / File(s) Summary
Context / Intent
internal/planning/library-benchmarking.md
Adds the "Main Gate Re-Enablement Plan" noting main-regression alerts are currently in warning mode due to Bencher noise on GitHub-hosted runners (new section lines ~62–68).
Baseline Dependency
internal/planning/library-benchmarking.md
Specifies dependency on PR 3148 (Bencher baseline fix) and requires at least 30 successful post-merge main Benchmark Workflow runs (counting only qualifying runs) before tuning begins (lines ~69–75).
Current Config Snapshot
internal/planning/library-benchmarking.md
Documents current Bencher settings (e.g., BOUNDARY=0.95, MAX_SAMPLE=64) and that regression alerts are in warning mode on main (lines ~76–78).
Tuning Sequence / Metrics
internal/planning/library-benchmarking.md
Defines tuning steps while in warning mode: collect baseline; compute Jaccard overlap of shared (benchmark, measure) alert pairs between adjacent main runs (treat overlap <0.20 as noise); prefer widening BOUNDARY toward 0.99; align --threshold-max-sample-size to available history; optionally add a minimum-sample rule only if supported; require the same (benchmark, measure) to alert on ≥2 consecutive qualifying runs before failing; consider larger/dedicated runners if noise persists (lines ~79–95).
Acceptance Criteria / Validation
internal/planning/library-benchmarking.md
Defines acceptance: ≥5 consecutive qualifying non-docs main Benchmark Workflow runs pass the warning-mode check; demonstrate sensitivity via an intentionally introduced local regression (temporary controller delay) that triggers an alert then is reverted; only restore the hard gate when noisy-failure rate ≤1 per 20 unchanged runs (where "noisy" = non-recurrence next qualifying run and no matching perf-sensitive code change), otherwise revert to warning mode and re-tune (lines ~96–125).
Tracking / Links
internal/planning/library-benchmarking.md
Adds pointer to Issue 3169 for tracking overlap evidence and tuning discussion (lines ~103–104).

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Poem

🐰 I watch the gate sleep, numbers quiet as dew,
Thirty runs I count, patience sows what's true.
Tune the thresholds, listen for the same bell,
When echoes match, we'll lift the guarded swell.
A carrot of certainty—hop, test, then dwell.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'docs: document benchmark gate tuning plan' accurately and concisely describes the main change: adding documentation for a benchmark gate tuning plan to the library-benchmarking.md file.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/plan-benchmark-gate-tuning

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Copy link
Copy Markdown

Greptile Summary

This PR extends internal/planning/library-benchmarking.md with a new "Main Gate Re-Enablement Plan" section that documents the prerequisites (PR #3148 landing and 30 clean main runs), a four-step tuning sequence for Bencher threshold adjustments, and explicit acceptance criteria before the hard benchmark gate can be restored.

Confidence Score: 5/5

Safe to merge — documentation-only change with no runtime impact.

The PR adds only markdown documentation. No code, configuration, or CI workflow files are modified. The content is clear, internally consistent, and correctly references the blocking PR (#3148) and tracking issue (#3169). No issues found.

No files require special attention.

Important Files Changed

Filename Overview
internal/planning/library-benchmarking.md Adds a "Main Gate Re-Enablement Plan" section with baseline dependency, tuning sequence, and acceptance criteria for restoring the hard benchmark gate.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([PR #3148 merged]) --> B{30+ post-merge\nmain runs accumulated?}
    B -- No --> C[Keep gate in warning mode\ncollect baseline history]
    C --> B
    B -- Yes --> D[Compare adjacent main runs\nfor shared alert pairs]
    D --> E{Alert sets\nnear-disjoint?}
    E -- Yes, noise dominated --> F[Move to larger/dedicated runners\nor widen Bencher boundary toward 0.99]
    F --> D
    E -- No, stable alerts --> G[Add --threshold-min-sample-size\nrequire consecutive-run agreement]
    G --> H{Acceptance criteria met?\n5 clean runs, tuned false-positive rate}
    H -- No --> D
    H -- Yes --> I([Restore hard gate on main])
Loading

Reviews (1): Last reviewed commit: "Document benchmark gate tuning plan" | Re-trigger Greptile

@claude

claude Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Code Review: docs: document benchmark gate tuning plan

Overall this is a well-structured, operationally useful addition. The three-section layout (Baseline Dependency → Tuning Sequence → Acceptance Criteria) is logical, and the acceptance criteria are concrete and measurable. A few minor clarity suggestions via inline comments below.

Strengths

Minor suggestions (see inline comments)

  1. "Hard gate" is undefined for a reader who hasn't seen the CI config — a brief parenthetical would help.
  2. "30 post-merge main runs" — clarify whether these must be CI-passing runs or just completed runs regardless of benchmark alert status.
  3. Step 2 (overlap comparison) — describes what to check but not how; a pointer to the Bencher dashboard or the specific query method would make this actionable.
  4. Deliberate regression criterion — note who should verify it and roughly how (e.g. "the developer re-enabling the gate introduces a temporary delay, confirms an alert fires, then reverts").

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
@claude

claude Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Overall: Well-structured, clearly written documentation PR. The plan is logically sequenced — prerequisites before tuning, tuning before restoring the hard gate — and the acceptance criteria are a good start. A few suggestions to tighten clarity and actionability.

Strengths

Suggestions

  1. "Non-docs" is undefined — the acceptance criterion "5 consecutive non-docs main pushes" relies on a category that isn't defined. Does the existing CI label or path filter already gate this? If not, a brief definition or pointer would help whoever applies the plan later.

  2. False-positive target uses example language — "such as no more than 1 noisy failure in 20 unchanged-performance runs" reads as illustrative, not binding. If this is the actual target, drop "such as" and make it the definitive number. If the project hasn't decided yet, call it out explicitly as a TBD.

  3. No tracking hook for the 30-run prerequisite — counting post-merge main runs is a manual check with no mention of where to track progress. A one-liner pointing to Issue Benchmark CI: tune thresholds / reduce flakiness so the main gate can be re-enabled #3169 (or a Bencher dashboard query) as the canonical count would reduce the risk of the plan sitting ignored until someone rediscovers it.

  4. No owner or review trigger — the plan has no named owner and no signal for when to revisit it. Adding a short note like "Re-evaluate after the 30-run milestone; ping @justin808" would keep it from becoming a zombie planning doc.

See inline comments for specific line suggestions.

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
@claude

claude Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Documentation-only PR adding a structured re-enablement plan for the CI benchmark hard gate. The three-section layout (Baseline Dependency → Tuning Sequence → Acceptance Criteria) is logical, and the acceptance criteria are measurable. Three items worth addressing before merge:

1. PR #3148 has already merged (2026-04-23)
The Baseline Dependency section reads as a future condition ("Do not re-enable the hard gate until the Bencher reporting baseline fix from PR 3148 has landed"), but that prerequisite is already satisfied. The plan should be updated to reflect the current state: the project is now in the 30-run accumulation window, not waiting for the merge. Leaving it as-is risks future readers treating an already-satisfied condition as still pending.

2. non-docs path exclusion has no automated tracking
The acceptance criterion "5 consecutive non-docs main pushes" depends on correctly categorising runs by path exclusion (docs/, *.md, internal/), but there is no CI counter, dashboard query, or hook that tracks this automatically. If the existing benchmark CI workflow already uses a paths-ignore filter that segments these runs, point to it. Otherwise, acknowledge this is a manual count and specify where it is tracked (e.g. Issue #3169 comments).

3. Consecutive-alert requirement is a configuration step, not a passive criterion
"The regression tracker accumulates only sustained or overlapping alerts, not a new random alert set on each run" reads as if this is existing Bencher behaviour to verify. It is actually a configuration change that needs to be made (e.g. requiring N consecutive alerts before failing). It belongs in the Tuning Sequence as a step to implement, not in the Acceptance Criteria as a condition to observe.

See inline comments for line-level suggestions.

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
@claude

claude Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Summary: Documentation-only change adding a concrete re-enablement plan for the Bencher main CI gate. Well-structured overall with measurable criteria and good traceability to related PR/issue.

Strengths

  • Clear prerequisite chain: baseline fix → history accumulation → threshold tuning → acceptance criteria → re-enable
  • Quantified where it matters most (30 runs, 5 clean pushes, 1-in-20 FP target)
  • Deliberate regression test as a true-positive sanity check is a nice touch
  • internal/ exclusion from "non-docs" is appropriate since the directory is all planning docs

Suggestions

1. Quantify "near-disjoint" (Tuning Sequence, step 2)

"Near-disjoint" is subjective and leaves the reviewer with no decision threshold. Consider adding a concrete overlap ratio, e.g. "fewer than 20% of alerts appear in both adjacent runs" as the threshold for concluding noise still dominates.

2. Missing rollback trigger (Acceptance Criteria)

The plan defines when to re-enable the gate but not what happens if it over-fires afterward. Consider a brief note like: "If the false-positive rate exceeds the 1-in-20 target after re-enabling, revert the gate to warning mode and re-tune thresholds before attempting again." Without this, teams may leave a noisy gate in place rather than taking explicit action.

3. Timeline expectation

"At least 30 post-merge main runs" gives no sense of wall-clock time. If main gets ~2–3 non-docs pushes per week, that is roughly 10–15 weeks. A rough estimate (even parenthetical) would help set reviewer/team expectations on when this work becomes actionable.

Overall this is a solid planning doc and the approach is technically sound. The above are minor improvements rather than blockers.

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
@claude

claude Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Overall: This is a well-structured, actionable planning document. The section fits naturally alongside the existing strategy content, the acceptance criteria are concrete and measurable, and the reference to Issue 3169 gives a live tracking thread. A few small clarity gaps are worth tightening before merge.

Suggestions

Consecutive-run count is unspecified (Tuning Sequence step 3)
Step 3 says "require the same (benchmark, measure) pair to alert across consecutive runs" without saying how many consecutive runs. Two? Three? Without a number this is difficult to implement or audit later. Recommend picking a specific count (e.g. 2) and wiring it to whatever Bencher flag or custom heuristic will enforce it.

Mixed-content pushes are ambiguous (Acceptance Criteria)
The first bullet defines a qualifying push as one that "modifies files outside docs/, *.md, and internal/". A push that touches both lib/react_on_rails/foo.rb and docs/bar.md — a very common combination — is not clearly in or out. Recommend adding: "A push counts if it modifies at least one file outside those directories" to make the intent unambiguous.

No pointer to where the gate actually lives
The Tuning Sequence tells someone what to change (boundary from 0.95 toward 0.99, --threshold-min-sample-size, etc.) but not where. The relevant settings are in .github/workflows/benchmark.yml (around lines 558–636). Adding a single-line pointer — e.g., "All threshold flags are set in .github/workflows/benchmark.yml under the run_bencher function" — would save the re-enabler a file hunt.

Minor: duplication with Future Considerations
Option 4 in the existing Future Considerations section already mentions dedicated benchmark runners. The new Tuning Sequence step 4 repeats this. No action required, just noting the overlap in case a future pass wants to consolidate.

Not blocking

These are all documentation-clarity issues. The plan itself is sound and the acceptance criteria are exactly the kind of concrete, falsifiable targets that make re-enablement decisions defensible.

Comment thread internal/planning/library-benchmarking.md
Comment thread internal/planning/library-benchmarking.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
internal/planning/library-benchmarking.md (1)

65-68: ⚡ Quick win

Minor: Add 1 sentence on failure mode when gate gets flipped back (how quickly to re-tune).

You already say “revert it to warning mode and re-tune thresholds before trying to re-enable again” (Lines 99-101), but it doesn’t say whether re-tuning requires re-collecting history (e.g., “start a fresh baseline window”) or can be done immediately using existing overlap data. Adding a single operational rule reduces repeated churn.

Also applies to: 76-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/planning/library-benchmarking.md` around lines 65 - 68, The document
lacks an operational rule describing the failure mode when the gate is flipped
back to warning mode and how quickly thresholds should be re-tuned; add one
clear sentence near the existing guidance that states whether re-tuning must
re‑collect history (e.g., "start a fresh baseline window of X runs") or may use
the existing overlap data and how long to wait before re-enabling the hard
gate—reference the terms "main-regression alerts", "warning mode", "re-tune
thresholds", and "baseline window/overlap data" so readers can find and apply
the rule consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/planning/library-benchmarking.md`:
- Around line 71-74: Clarify the “30 post-merge main runs” rule: count GitHub
Actions Bencher workflow executions triggered on the main branch after the merge
of PR 3148 (do not count runs from other branches or pre-merge test runs);
include scheduled workflow runs and manual/auto reruns as separate counts (each
rerun increments the count once) but exclude runs triggered by doc-only commits
or changes that skip the benchmark workflow; verify and audit counts by listing
the Bencher workflow run history on the repository’s Actions -> Workflows ->
Bencher benchmark workflow page and cross-reference the corresponding Bencher
dashboard/history for fresh baselines, and annotate each counted run in the
PR/issue with its run-id and timestamp for traceability.
- Around line 99-101: Define "no more than 1 noisy failure in 20
unchanged-performance runs" concretely: classify a "noisy failure" as a
benchmark alert that does not show a reproducible performance delta (absolute
delta below a tolerance and statistically non-significant) and whose alert
window does not overlap with code changes that touch the benchmark owner (i.e.,
alert appears but delta < X% and p-value > 0.05); classify a "real regression"
as an alert with a reproducible delta beyond tolerance (e.g., >1% or
configurable threshold) and statistically significant (p ≤ 0.05) and/or
overlapping with relevant code changes. Define the "unchanged-performance runs"
cohort as the last 20 successful benchmark workflow runs on main that meet
baseline stability criteria (each benchmark's coefficient of variation ≤5% over
that window and no infra/anomaly tags), excluding runs triggered by commits that
modify the benchmark's owner or relevant performance-sensitive files; count "20"
as 20 successful benchmark workflow runs (re-runs allowed and counted separately
only if they were completed successfully and not marked as infra-failure). Apply
these rules in the restore-gate logic (the text around "no more than 1 noisy
failure in 20 unchanged-performance runs") so teams can consistently classify
runs and select the cohort.
- Around line 78-82: The doc currently mentions “shared (benchmark, measure)
alert pairs” and a ~20% overlap cutoff without defining the metric; replace the
vague phrase in bullet 2 with an explicit definition such as: define alert sets
A and B for two adjacent runs and compute alert overlap = |A ∩ B| / |A ∪ B|
(Jaccard index), state the threshold as overlap < 0.20 indicates runner noise,
and add a short note that if Bencher applies any normalization (or an alternate
“shared-alert fraction” is used) you must cite the exact computation and graph
in Issue 3169 (reference the specific section/graph/commit from Issue 3169) so
readers can reproduce the value; also add a one-line concrete example of
computing the metric for clarity.
- Around line 91-93: The acceptance criterion "5 consecutive non-docs main
pushes pass the warning-mode check" is ambiguous; update the document text to
(1) define "non-docs" via an explicit glob set (e.g., exclude changes matching
docs/**, **/*.md, internal/**, .github/**, README.md) and state that any
PR/commit that touches files outside those globs counts as a non-docs push, (2)
name the exact CI job/status referred to by "warning-mode check" (use the job
name or workflow identifier) so readers know what to observe, and (3) add a
fallback clause to handle low traffic such as "or within X main pushes or Y
days" (suggest values like 10 main pushes or 30 days) so the requirement is
robust; revise the sentence in the planning doc accordingly to include these
explicit rules and the fallback.
- Around line 83-86: The doc omits explicit Bencher flag values and misstates
flags; update the "library-benchmarking.md" to add a "Bencher config mapping"
section that copies the exact runtime values used in the workflow (e.g.,
BOUNDARY=0.95, MAX_SAMPLE=64 and the actual threshold flag used in the workflow
such as --threshold-max-sample-size=64) and clarify whether the intended flag is
--threshold-min-sample-size or --threshold-max-sample-size (fix the text to
match the workflow invocation), and explicitly state the consecutive-run
confirmation behavior by naming the Bencher flag or external logic (or note that
none exists) and the required count (e.g., require 2 consecutive alerts) so
acceptance criteria match the actual invocation strings like BOUNDARY,
MAX_SAMPLE and the --threshold-* flags referenced in the repo.

---

Nitpick comments:
In `@internal/planning/library-benchmarking.md`:
- Around line 65-68: The document lacks an operational rule describing the
failure mode when the gate is flipped back to warning mode and how quickly
thresholds should be re-tuned; add one clear sentence near the existing guidance
that states whether re-tuning must re‑collect history (e.g., "start a fresh
baseline window of X runs") or may use the existing overlap data and how long to
wait before re-enabling the hard gate—reference the terms "main-regression
alerts", "warning mode", "re-tune thresholds", and "baseline window/overlap
data" so readers can find and apply the rule consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87024bba-e5eb-4cef-befd-b7d73bc28c12

📥 Commits

Reviewing files that changed from the base of the PR and between 3417b09 and 07ceacd.

📒 Files selected for processing (1)
  • internal/planning/library-benchmarking.md

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md
Comment thread internal/planning/library-benchmarking.md
Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
@claude

claude Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Code Review

Documentation-only PR adding a "Main Gate Re-Enablement Plan" section to internal/planning/library-benchmarking.md and a lychee exclusion for the authenticated Bencher dashboard URL.

Workflow Accuracy Check

I cross-checked the doc's configuration snapshot against the actual .github/workflows/benchmark.yml:

Claim Status
BOUNDARY=0.95, MAX_SAMPLE=64 ✅ Matches
--threshold-test t_test for all metrics ✅ Matches
rps uses --threshold-lower-boundary $BOUNDARY --threshold-upper-boundary _ ✅ Matches
Latency/failed_pct use --threshold-upper-boundary $BOUNDARY ✅ Matches
BENCHER_EXIT_CODE / BENCHER_HAS_ALERT env vars ✅ Matches
Step names ("Warn if Bencher detected regression on main", "Fail on non-regression Bencher error on main") ✅ Matches
performance-regression label ✅ Matches
Hard-gate change = exit with $BENCHER_EXIT_CODE, not removing --err ✅ Matches

The snapshot is accurate at time of writing and the > **Note:** caveat is the right way to handle its eventual staleness.

Issues

Bencher URL discrepancy (minor)
The doc links to https://bencher.dev/console/projects/react-on-rails-t8a9ncxo/ (the authenticated project console, returns 403 without login). The workflow's regression issue template uses https://bencher.dev/perf/react-on-rails-t8a9ncxo (a separate URL that links directly to the public performance view). A maintainer following the plan and clicking the doc's dashboard link will hit a 403. Inline comment below.

Forward reference to "qualifying run" (minor)
The Tuning Sequence (step 2, line 117–118) says "Use the qualifying-run definition from acceptance criterion 3" but the definition doesn't appear for another ~25 lines. A reader following the sequence top-to-bottom has to jump ahead to learn what counts as a qualifying run before they can execute step 2. Inline comment below.

Manual gate: one-time regressions are intentionally silenced
Criterion 2 requires the same (benchmark, measure) pair to alert on 2 consecutive runs before action. This means a genuine regression that appears in one run and then disappears (e.g., intermittent but real) will be classified as noise and silently ignored. The tradeoff is acknowledged and reasonable given the noisy runner environment — just worth calling out explicitly in case future maintainers wonder why a one-off alert was not investigated.

Strengths

  • Acceptance criterion 1 (deliberately introduce a 20% SSR delay to confirm the gate fires before re-enabling it) is excellent discipline — verifying that the gate can catch real regressions before trusting it.
  • The Jaccard overlap metric for noise analysis is well-chosen; the 0.20/0.40 band with empirical rationale is documented clearly.
  • The lychee exclusion regex '^https://bencher\.dev/console/projects/react-on-rails-t8a9ncxo(/.*)?$' correctly anchors both the base URL and any sub-paths.
  • Two escalation cycles (10 extra qualifying runs) before unconditionally moving to dedicated runners is a concrete, unambiguous criterion.

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
@claude

claude Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Code Review — PR #3234: docs: document benchmark gate tuning plan

Overview

This PR adds a Main Gate Re-Enablement Plan section to the internal benchmarking planning doc and adds a Bencher dashboard URL exclusion to .lychee.toml. Documentation-only changes with no runtime or CI behaviour impact.

The planning document is comprehensive and well-reasoned. The use of Jaccard overlap to distinguish noise from signal is a solid empirical approach, and the tuning sequence → escalation → acceptance-criteria structure is clear and actionable. The math example (2/10 = 20%) checks out.


Issues

Lychee URL exclusion gap (potential CI failure)

The newly added document references https://bencher.dev/perf/react-on-rails-t8a9ncxo in a backtick code span (line 111 of the md file). Lychee checks URLs inside code spans, and internal/planning/ is not in exclude_path, so this file is scanned. The exclusion added in .lychee.toml only covers the /console/projects/ path — if the /perf/ URL also returns 403 for automated requests, lychee CI will fail.

The fix is either to widen the existing pattern to cover both paths, or to verify that bencher.dev/perf/ is publicly accessible. See inline comment.


Minor observations

  • Acceptance criterion 2 is purely manual — the doc correctly calls this out as a "manual gate", but a reviewer must actively track recurrences in Issue 3169 with no automation fallback. Fine for an internal planning doc, but increases the burden on whoever owns the re-enablement process.
  • 30-run counting methodology — the criteria for which runs count are well-documented, but tracking is entirely manual in Issue 3169. A simple table template in that issue might reduce off-by-one errors when counting across days or weeks.

Checklist

  • lychee exclusion is correctly anchored with ^ and allows a path suffix with (/.*)?$
  • BENCHER_HAS_ALERT reference matches the actual workflow variable (lines 680-692 of benchmark.yml)
  • Boundary tuning direction is correct (widening 0.95 toward 0.99 makes the gate more lenient)
  • Escalation path (threshold tuning to dedicated runners) is sound
  • bencher.dev/perf/ URL in the doc needs either a lychee exclusion or a confirmed-public status check

Verdict: Approve once the lychee URL coverage gap is resolved (or confirmed not to be a problem). The planning content itself is solid.

Comment thread .lychee.toml Outdated
Fold in eight non-blocking suggestions from auto-reviews on PR 3234:
archive note for plan lifecycle, manual+auto rerun specificity,
operational-flag scope note for the snapshot, "5 pairs = 6 runs"
clarification, motivation for the 20% smoke-test magnitude,
BENCHER_HAS_ALERT=1 semantics inline, criterion-4 rollback cadence
reference, and an explicit rolling-window definition for the 1-in-20
false-positive rate that excludes intentional-perf-change commits from
both the numerator and denominator.
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@justin808

Copy link
Copy Markdown
Member Author

Auto-review round summary (post-648a4b70e)

I (Claude Code) triaged the 14 auto-generated PR-level reviews that landed after my last batch reply at 08:17 UTC. Most of their substantive feedback was already addressed in commits 4c748278351e71 (rps upper-boundary mention, "external tracking" → "manual tracking", linear-ladder iteration policy, formatting at the step-3 continuation, _ sentinel explanation, Jaccard provenance citing Issue 3169, acceptance-criteria ordering, GitHub-issue-creation step, manual-enforcement clarification, asymmetric threshold description, and the lychee (/.*)?$ anchor). All 56 inline review threads were already resolved.

Just landed in 029aa38 (this commit)

Eight non-blocking polish items pulled from the new round of auto-reviews:

  1. Archive-note for plan lifecycle ("archive once the hard gate has held for 30+ qualifying runs without breaching the false-positive target"). [#4412325697]
  2. Criterion 4 now explicitly references criterion 5's review cadence as the rollback trigger. [#4412372977]
  3. Explicit rolling-window definition for the 1-in-20 false-positive target, with intentional-perf-change commits excluded from both numerator and denominator. [#4412432492]
  4. Baseline Dependency now distinguishes "manual workflow reruns" from "automatic GitHub retries" (both excluded). [#4412432492]
  5. Snapshot note now flags that operational flags (--quiet, --format html) are intentionally omitted because they don't affect threshold behavior. [#4412451464]
  6. Acceptance criterion 1 now justifies the 20% smoke-test magnitude (matches the project's stated detection goal in "Why We Chose Max Rate" and the runner-noise floor under BOUNDARY=0.95 with the t-test) and notes when to recalibrate it. [#4412480644, #4412514641]
  7. "5 adjacent qualifying-run pairs" now adds the parenthetical "(i.e., at least 6 qualifying runs because each pair shares one run with its neighbor)". [#4412514641]
  8. BENCHER_HAS_ALERT semantics now inline: 0 is clean, 1 is what triggers the warning step and the regression-issue update. [#4412467068]

Skipped (with rationale)

  • lychee URL exclusion gap (#4412609718): Factually incorrect. The lychee workflow at .github/workflows/check-markdown-links.yml only scans docs/, root *.md, and react_on_rails_pro/*.md (see line 40 of that file). internal/planning/library-benchmarking.md is not in those paths, so the https://bencher.dev/perf/react-on-rails-t8a9ncxo URL added in 8351e71 is not subject to lychee link-checking. No exclusion change is needed.
  • "placeholder" (#4412449638), "test comment - will delete" (#4412501024), "Code Review - PR 3234 (DELETED - test comment)" (#4412502471): Bot self-test/placeholder posts with no actionable content.

I am not posting individual rationale replies on the duplicative LGTM-style summary reviews; their concerns are already addressed (per the mapping above) and threads are not resolvable on PR-level comments.

@claude

claude Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Code Review: docs: document benchmark gate tuning plan

Overall: Solid documentation PR. The technical claims in the new section are accurate — I verified BOUNDARY, MAX_SAMPLE, BENCHER_EXIT_CODE, BENCHER_HAS_ALERT, step names (Warn if Bencher detected regression on main, Fail on non-regression Bencher error on main), the performance-regression label, and the per-measure threshold direction (lower/upper boundary assignments) all match .github/workflows/benchmark.yml exactly. The Note callout pointing back to the workflow as source of truth is good practice.


Issues Found

1. .lychee.toml exclusion doesn't match the URL in the new doc

The new exclude rule targets bencher.dev/console/projects/... but the new documentation (tuning sequence step 2, line 115) references https://bencher.dev/perf/react-on-rails-t8a9ncxo. Those are two different URL prefixes on bencher.dev, and only the console/projects variant is excluded.

That said, lychee only scans docs/ *.md react_on_rails_pro/*.md (per .github/workflows/check-markdown-links.yml), so internal/planning/library-benchmarking.md isn't checked today — the exclusion is purely defensive. But if the intent is to guard against future lychee failures on bencher.dev URLs, the pattern should be broadened (e.g., '^https://bencher\.dev') or the perf path should be added as a second entry. See inline comment.

2. Tuning Sequence step 2 is very dense

Lines 113–126 pack many nested conditions — qualifying-run definition, Jaccard formula, 0.20/0.40 thresholds, the consecutive-pair requirement, and small-sample caveats — into a single paragraph. This is the hardest part of the plan for a future reader to execute correctly. Sub-bullets (when/how to begin analysis, the overlap thresholds and their rationale, qualifying-run definition, small-sample escape hatch) would make the conditions easier to follow and audit. See inline comment.

3. Criterion 5 has a stale line-wrap

Line 166 reads ...rather than counting them as either real or noisy. Review the running rate after every 5 gate-triggering runs or at least with the sentence breaking awkwardly mid-clause. The surrounding lines wrap at ~120 characters; this one runs to ~155 before breaking. Minor, but inconsistent with the rest of the file.


Minor Observations

  • The 2-consecutive-run requirement is stated in both Tuning Sequence step 3 and Acceptance Criterion 2 with slightly different wording. This isn't wrong — one is the configuration goal, the other is the pre-condition gate — but a cross-reference between them would help a reader see they're the same requirement.
  • Criterion 4 introduces the 1-in-20 false-positive target before Criterion 5 defines who tracks it. The forward reference is fine but worth a parenthetical like "(tracked in criterion 5)" to prevent a reader from hunting backward.
  • The plan is entirely manual-process-dependent on Issue 3169. That's explicitly acknowledged and appropriate given the noise context — just noting it as a risk if Issue 3169 becomes stale.

Comment thread .lychee.toml Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread internal/planning/library-benchmarking.md Outdated
- Broaden lychee exclusion to cover both /console/projects/ and /perf/
  paths under bencher.dev so future scans of either URL prefix stay
  defensive without a follow-up edit.
- Restructure Tuning Sequence step 2 into sub-bullets (data source,
  qualifying run, Jaccard formula, when to start, noise floor, signal
  threshold, small-sample caveat) so each condition is independently
  scannable.
- Rewrap criterion 5 at the sentence boundary after "real or noisy."
  to fix the 173-character mid-clause line and match the surrounding
  ~120-character wrap.
- Cross-reference the 2-consecutive-run requirement between Tuning
  Sequence step 3 and Acceptance Criterion 2 so a reader sees they
  describe the same gate.
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@justin808

Copy link
Copy Markdown
Member Author

Auto-review round summary (post-029aa3893)

I (Claude Code) triaged the new claude[bot] review session that landed at 01:30 UTC (PR-level summary #4414146528 plus three inline comments). All four readability suggestions are addressed in commit a897d93; nothing was a blocker (the review itself opens with "Solid documentation PR" and verifies all technical claims against .github/workflows/benchmark.yml).

Just landed in a897d93

  1. .lychee.toml exclusion broadened [#3214146419] — regex now covers (console/projects|perf)/react-on-rails-t8a9ncxo so both URL prefixes are excluded defensively. The doc itself isn't in the lychee scan paths today (docs/ *.md react_on_rails_pro/*.md) — this is purely future-proofing.
  2. Tuning Sequence step 2 split into sub-bullets [#3214147330] — seven sub-bullets (Data source, Qualifying run, Jaccard formula, When to start, Noise floor 0.20, Signal threshold 0.40, Small-sample caveat). Substance preserved including the hysteresis explanation and Issue 3169 provenance.
  3. Criterion 5 line 166 reflowed [#3214147729] — wraps now at the sentence boundary after real or noisy. so the previously 173-character mid-clause break is gone; longest line in the criterion is now 124 characters.
  4. Cross-reference between Tuning Sequence step 3 and Acceptance Criterion 2 [#4414146528 "Minor Observations" TODO for first version #1] — both manual-tracking statements now point to each other so a reader sees they describe the same gate.

Skipped (with rationale)

  • PR-level summary review #4414146528 — its three inline issues are addressed in items 1-3 above; one of its two "Minor Observations" (the Criterion 4 forward reference) was already addressed in 029aa38 per my prior summary; the other (cross-reference between step 3 and Criterion 2) is item 4 above. The remaining "minor observation" (manual-process dependence on Issue 3169 as a stale-history risk) is an observation rather than an actionable change. The PR-level comment is not separately resolvable — its substance is captured by the now-resolved inline threads.

All three new inline review threads are resolved.

@claude

claude Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Review: docs: document benchmark gate tuning plan - see inline comments for details

@claude

claude Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Accuracy check passed: all values in Current Bencher Configuration verified against .github/workflows/benchmark.yml (BOUNDARY=0.95 line 558, MAX_SAMPLE=64 line 559, --err line 609, t_test lines 613-635, BENCHER_HAS_ALERT lines 681-692, both warning/error step names, and the re-enable instruction). Minor suggestions posted as inline comments.

Comment thread internal/planning/library-benchmarking.md Outdated
Comment thread .lychee.toml
Refactor the dense Main Gate Re-Enablement Plan intro paragraph into a
short rationale sentence followed by a bulleted list of standing
instructions (history wait, fallback, baseline-reset exception, archival
trigger). Make the archival mechanism explicit by spelling out that the
entire `## Main Gate Re-Enablement Plan` section (heading through end
of file, including the trailing Issue 3169 link) gets deleted once the
plan executes.

Addresses #3234 (comment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@justin808

Copy link
Copy Markdown
Member Author

Auto-review round summary (post-a897d9391)

I (Claude Code) triaged the new claude[bot] review session that landed at 06:24–06:25 UTC (PR-level pointer #4414622003, accuracy-check ack #4414623593, and two inline comments). One readability suggestion was actionable; everything else was either a positive acknowledgment or an explicit FYI.

Just landed in dfd4250

  1. Re-enablement plan intro split into intro + bulleted list [#3214433527] — the dense paragraph at lines 65–72 is now a one-line rationale followed by four bulleted standing instructions: Wait for history before tuning, Fallback if the gate flips back, Baseline reset exception, Archive when done. The archival mechanism is now explicit: delete the entire ## Main Gate Re-Enablement Plan section (heading and every subsection through end of file, including the trailing Issue 3169 link) in a follow-up commit once the gate has held for 30+ qualifying runs without breaching the false-positive target.

Skipped (with rationale)

  • .lychee.toml future-prefix FYI [#3214433549] — explicit FYI, not a request for change. The current regex '^https://bencher\.dev/(console/projects|perf)/react-on-rails-t8a9ncxo(/.*)?$' covers both URL shapes the PR ships. If future links use other prefixes (/console/organizations/..., /bencher/...), maintainers can add a sibling entry or broaden the alternation group then. Replied inline; thread resolved.
  • PR-level pointer comment [#4414622003] — pure pointer to the inline comments; no separate response needed.
  • PR-level accuracy-check ack [#4414623593] — positive verification that all BOUNDARY, MAX_SAMPLE, --err, t_test, BENCHER_HAS_ALERT, and step-name values match .github/workflows/benchmark.yml. No action required.

Both new inline review threads are resolved.

@claude

claude Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Code Review

Overall: LGTM with minor note — this is a well-structured documentation PR. The BENCHER_HAS_ALERT, BENCHER_EXIT_CODE, BOUNDARY, MAX_SAMPLE, and per-measure threshold configurations in the doc were all spot-checked against the actual .github/workflows/benchmark.yml run_bencher function and are accurate. The lychee exclusion regex is correctly scoped.

Strengths

  • The plan is concrete and measurable: 30-run baseline, explicit Jaccard overlap thresholds (0.20 noise floor / 0.40 signal threshold), 5-consecutive-clean-run bar, and 1-in-20 false-positive target.
  • The baseline snapshot note is a good defensive habit: the doc explicitly says "verify against the workflow" rather than claiming itself as the source of truth.
  • The "Archive when done" instruction keeps the plan self-cleaning — it won't linger indefinitely in the file.
  • The escalation path (boundary widening → 5 more runs → repeat × 2 → dedicated runners) avoids leaving the gate in warning mode indefinitely.

Issue

Duplicated gate condition (Acceptance Criteria 2 = Tuning Sequence step 3): the same (benchmark, measure) pair must alert on at least 2 consecutive runs requirement appears verbatim in both places. The doc acknowledges this ("restated here because it is also a pre-condition"), but if one copy is updated during tuning and the other is not, they will silently diverge and the plan becomes internally inconsistent. I'd suggest adding a short cross-reference marker in step 3 (e.g., "Keep in sync with Acceptance Criteria 2") and/or making one location authoritative by turning the other into a one-line forward reference.

Minor observations

  • The --threshold-upper-boundary _ notation is described as "disables the upper bound" — accurate per the workflow, but worth adding a parenthetical like "(Bencher treats underscore as a sentinel for disabled)" so future readers don't mistake it for a shell variable.
  • The archive instruction references "the trailing Issue 3169 link" — that link is currently the very last line of the file, so it is unambiguous now, but if content is ever appended below this section the instruction becomes stale. Consider phrasing it as "delete from this ## Main Gate Re-Enablement Plan heading through the last line of this file at the time of deletion" to be slightly more durable.

No blocking issues. The one real maintenance risk is the duplicated gate condition.

Issue 3169 to show the same `(benchmark, measure)` pair alerting on at least 2 consecutive runs before filing or
failing; a single noisy run does not trigger the gate. This is a manual gate: Bencher still alerts on the first run,
and the requirement is that a reviewer confirms recurrence in Issue 3169 before acting on it. (This is the same gate
stated in Tuning Sequence step 3; restated here because it is also a pre-condition for the 5-run clean-run count.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gate condition here (same (benchmark, measure) pair must alert on ≥ 2 consecutive runs) is intentionally duplicated from Tuning Sequence step 3. The duplication is acknowledged, but if one copy is updated during active tuning, both need to stay in sync. Consider adding a bold cross-reference in step 3 — e.g., "Keep in sync with Acceptance Criteria 2" — so editors are warned before they touch only one copy.

invalidate the old history.
- **Archive when done.** Once the hard gate has been restored and held for 30 or more qualifying runs without breaching
the false-positive target, delete the entire `## Main Gate Re-Enablement Plan` section (this heading and every
subsection through the end of the file, including the trailing Issue 3169 link) in a follow-up commit; the executed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The trailing Issue 3169 link" is unambiguous today because this section ends the file, but if content is ever appended below this heading the phrase becomes stale. A slightly more durable phrasing: "delete from this ## Main Gate Re-Enablement Plan heading through the last line of the file as it stands at deletion time".

@justin808 justin808 merged commit e63cc09 into main May 10, 2026
17 of 18 checks passed
@justin808 justin808 deleted the codex/plan-benchmark-gate-tuning branch May 10, 2026 07:34
justin808 added a commit that referenced this pull request May 11, 2026
…y-patch-DrhDk

* origin/main:
  docs: document legacy Webpacker compatibility shims (#3225)
  chore: sync address-review command and import prompt variant (#3271)
  feat: add REACT_ON_RAILS_BASE_PORT for concurrent worktree port management (#3142)
  feat: align renderer cache staging with bundle-hash layout (#3124)
  docs: document benchmark gate tuning plan (#3234)
  docs: document Gumroad RSC benchmark signal (#3233)
  docs: document migration proof artifact template (#3227)
  fix: support nested JavaScript package roots in doctor (#3220)
  docs: add Claude verify command (#3235)
  fix: serialize generated pack regeneration (#3231)
  docs: amend CHANGELOG entry for #3229 scope drift
  fix: validate selected JavaScript package manager
  fix: preserve causes when wrapping render errors (#3230)

# Conflicts:
#	CHANGELOG.md
justin808 added a commit that referenced this pull request May 12, 2026
* origin/main:
  docs: document RSC node renderer test setup (#3223)
  Clarify Pro pricing and license guidance (#3272)
  docs: add AI security scanner evaluation plan (#3236)
  fix: pin pnpm version in scaffolded CI when packageManager is absent (#3172) (#3174)
  docs: document RSC HTTP response ownership (#3222)
  docs: document legacy Webpacker compatibility shims (#3225)
  chore: sync address-review command and import prompt variant (#3271)
  feat: add REACT_ON_RAILS_BASE_PORT for concurrent worktree port management (#3142)
  feat: align renderer cache staging with bundle-hash layout (#3124)
  docs: document benchmark gate tuning plan (#3234)
  docs: document Gumroad RSC benchmark signal (#3233)
  docs: document migration proof artifact template (#3227)
  fix: support nested JavaScript package roots in doctor (#3220)
  docs: add Claude verify command (#3235)
  fix: serialize generated pack regeneration (#3231)

# Conflicts:
#	.gitignore
#	react_on_rails/spec/dummy/Procfile.dev
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant