Split benchmark tracking script#4176
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
ChangesModular TrackBenchmarks extraction
Sequence Diagram(s)sequenceDiagram
participant Cli
participant BranchArgs
participant GithubCli
participant BencherRun
participant BencherRunner
Cli->>BranchArgs: branch_and_start_point_args(env)
alt confirmation mode
BranchArgs-->>Cli: [synthetic_branch, main start-point args]
else push
BranchArgs-->>Cli: ["main", []]
else pull_request
BranchArgs-->>Cli: [head_ref, base_sha start-point args]
else workflow_dispatch (not main)
BranchArgs->>GithubCli: capture compare main...branch
GithubCli-->>BranchArgs: merge-base SHA or failure
BranchArgs-->>Cli: [branch, start-point args ± hash]
end
Cli->>BencherRun: run_bencher!(runner, branch, start_point_args)
BencherRun->>BencherRunner: run(branch, start_point_args)
BencherRunner-->>BencherRun: exit_code, report, stderr
alt retry_without_start_point_hash?
BencherRun->>BencherRun: without_start_point_hash(args)
BencherRun->>BencherRunner: run(branch, stripped_args)
BencherRunner-->>BencherRun: retry_exit_code, report
end
BencherRun->>BencherRun: normalized_exit_code(exit_code, report)
BencherRun-->>Cli: final_exit_code, report
sequenceDiagram
participant Script as track_benchmarks.rb
participant Cli
participant BranchArgs
participant BencherRun
participant Summary
participant PrComments
participant PrReportPoster
participant Confirmation
participant RegressionPayloads
Script->>Cli: new(suite_name, report_marker).run
Cli->>BranchArgs: branch_and_start_point_args
BranchArgs-->>Cli: branch, start_point_args
Cli->>BencherRun: run_bencher!(runner, branch, args)
BencherRun-->>Cli: exit_code, report, stderr
alt retry eligibility met
Cli->>BencherRun: retry_without_start_point_hash
BencherRun-->>Cli: retry_exit_code, report
end
Cli->>BencherRun: normalized_exit_code(exit_code, report)
Cli->>Summary: rendered_report(report)
Cli->>Summary: post_report_to_summary(markdown, suite_name)
alt confirmation mode
Cli->>Confirmation: run(report, exit_code, markdown, suite_name)
Confirmation-->>Script: exit(status)
else PR mode
Cli->>PrComments: replace(markdown)
PrComments->>PrReportPoster: replace(markdown) if pull_request
end
alt main-push with nonzero exit
Cli->>RegressionPayloads: report_main_push_candidate
RegressionPayloads->>RegressionPayloads: write_candidate
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
+ci-status |
|
+ci-run-hosted |
CI StatusHead SHA: Only the required gate is active unless hosted CI is requested. |
Hosted CI RequestedTriggered 9 workflow(s) for View progress in the Actions tab. |
Greptile SummaryThis PR splits the monolithic
Confidence Score: 5/5Safe to merge — this is a pure structural refactoring with no logic changes and the added specs exercise the newly isolated units. All extracted modules faithfully reproduce the original script's logic. The load order in the new loader file respects every inter-module dependency. The compatibility shims in the entrypoint preserve the existing top-level API for require-only callers, and lazy evaluation of PrReportPoster env vars on non-PR runs is correctly maintained via block-yielded poster in PrComments.replace. The retry-without-start-point-hash path, the confirmation flow, and the candidate/confirmed hand-off payloads all match the pre-split behaviour line-for-line. Three well-structured specs are added that cover the units most amenable to unit testing. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["track_benchmarks.rb\n(__FILE__ == $PROGRAM_NAME)"] -->|"require_relative 'lib/track_benchmarks'"| B["lib/track_benchmarks.rb\n(loader)"]
B --> C["Config"]
B --> D["BranchArgs"]
B --> E["Summary"]
B --> F["BencherRun"]
B --> G["PrComments"]
B --> H["RegressionPayloads"]
B --> I["Confirmation"]
B --> J["Cli"]
A -->|"Cli.new(suite_name:, report_marker:).run"| J
J --> D
J --> F
J --> E
J --> G
J --> H
J --> I
H --> E
H --> C
I --> H
I --> E
I --> C
F --> E
subgraph "Compatibility shims (top-level methods)"
K["confirmation_mode?\nslugify\nrun_bencher!\nmain_push?\n..."]
end
A --> K
K -->|"delegates"| D
K -->|"delegates"| F
K -->|"delegates"| H
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A["track_benchmarks.rb\n(__FILE__ == $PROGRAM_NAME)"] -->|"require_relative 'lib/track_benchmarks'"| B["lib/track_benchmarks.rb\n(loader)"]
B --> C["Config"]
B --> D["BranchArgs"]
B --> E["Summary"]
B --> F["BencherRun"]
B --> G["PrComments"]
B --> H["RegressionPayloads"]
B --> I["Confirmation"]
B --> J["Cli"]
A -->|"Cli.new(suite_name:, report_marker:).run"| J
J --> D
J --> F
J --> E
J --> G
J --> H
J --> I
H --> E
H --> C
I --> H
I --> E
I --> C
F --> E
subgraph "Compatibility shims (top-level methods)"
K["confirmation_mode?\nslugify\nrun_bencher!\nmain_push?\n..."]
end
A --> K
K -->|"delegates"| D
K -->|"delegates"| F
K -->|"delegates"| H
Reviews (1): Last reviewed commit: "Split benchmark tracking script" | Re-trigger Greptile |
Code Review — PR #4176: Split benchmark tracking scriptSummary: This PR cleanly refactors a 511-line monolithic script into focused modules under 1. Orchestration path (
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/lib/track_benchmarks/confirmation.rb`:
- Around line 13-18: The current implementation at line 13 blindly takes the
first glob result when searching for the candidate filename, which can select
the wrong file if multiple candidate files exist under the artifact directory
tree. Instead of using .first on the Dir.glob result, implement logic to
disambiguate between multiple matches by validating there is exactly one
candidate file or by selecting the correct one based on appropriate criteria
(such as shard or suite information). Add appropriate handling for the ambiguous
case where multiple candidates are found, either by raising an error or treating
it as inconclusive.
In `@benchmarks/lib/track_benchmarks/summary.rb`:
- Around line 21-35: The code in the display_rows method checks file existence
before reading, but a race condition can occur where the file is deleted or
becomes unreadable between the File.exist? check on line 21 and the File.read
call on line 23, causing unhandled Errno exceptions to crash the path. Extend
the rescue clause to catch file I/O errors (Errno exceptions) in addition to
JSON::ParserError, logging a warning and returning an empty array when file read
failures occur, ensuring graceful degradation with a warning instead of a crash.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e4f23a64-28a1-45a3-80b4-c3addffc697f
📒 Files selected for processing (13)
benchmarks/lib/track_benchmarks.rbbenchmarks/lib/track_benchmarks/bencher_run.rbbenchmarks/lib/track_benchmarks/branch_args.rbbenchmarks/lib/track_benchmarks/cli.rbbenchmarks/lib/track_benchmarks/config.rbbenchmarks/lib/track_benchmarks/confirmation.rbbenchmarks/lib/track_benchmarks/pr_comments.rbbenchmarks/lib/track_benchmarks/regression_payloads.rbbenchmarks/lib/track_benchmarks/summary.rbbenchmarks/spec/track_benchmarks/branch_args_spec.rbbenchmarks/spec/track_benchmarks/regression_payloads_spec.rbbenchmarks/spec/track_benchmarks/summary_spec.rbbenchmarks/track_benchmarks.rb
|
Address-review summary for PR #4176 Scan scope: full history (no prior address-review checkpoint). Outcomes:
Validation after fixes:
Remaining: current-head hosted CI, benchmark CI, and bot review reruns are pending after this push. |
|
+ci-status |
|
+ci-run-hosted |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6443ebc6f8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Code ReviewOverall: Clean, well-structured refactor. The module boundaries are sensible, the naming is consistent, and the new specs cover the right edge cases. A few things to resolve before merge: Two undocumented behavior changesThe PR description says "no intended behavior changes" but two real ones exist:
Minor observations
|
CI StatusHead SHA: Optimized hosted CI is enabled for this PR. |
|
+ci-status |
|
Address-review update for head
Validated locally after the final fix:
Confidence note:
|
|
+ci-run-hosted |
Hosted CI RequestedTriggered 9 workflow(s) for View progress in the Actions tab. |
Code ReviewGood structural work — the split into focused modules is clean and the new specs cover the previously un-tested confirmation/payload paths well. Two correctness issues and one maintenance trap to address before merge. Bug:
|
|
+ci-run-hosted |
|
Address-review update for head
Validated locally after this fix:
Confidence note:
|
|
+ci-status |
CI StatusHead SHA: Optimized hosted CI is enabled for this PR. |
Hosted CI RequestedTriggered 9 workflow(s) for View progress in the Actions tab. |
|
The modularization is clean and the intent (behavior-preserving split) is well-executed. Two correctness improvements over the original worth noting: the multiple-candidate inconclusive guard and the failure_context: 'during confirmation' fix in write_confirmed (the original would have said 'on main' in a confirmation-run error, which was misleading). Three things to flag:
|
Pro Node Renderer Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
Core Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
|
Address-review update for head Resolved current-head Claude threads without code changes:
Readiness evidence:
|
Pro (shard 2/2) Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
Pro (shard 1/2) Benchmark Summary
▲/▼ non-zero change vs baseline · 0.0% exact/near-zero match · 🔴 significant regression · 🟢 significant improvement (tracked measures) · (n) = baseline |
|
Batch closeout confidence note for head
Final batch state: ready-no-merge-authority. The batch goal has |
Summary
Fixes #3459 by splitting
benchmarks/track_benchmarks.rbinto focused modules without intended behavior changes:benchmarks/track_benchmarks.rb.benchmarks/lib/track_benchmarks/**.benchmarks/spec/track_benchmarks/**.Validation
script/ci-changes-detector origin/mainbundle exec rspec benchmarks/spec/track_benchmarks_spec.rb benchmarks/spec/track_benchmarksbundle exec rspec benchmarks/specbenchmarks/track_benchmarks.rb,benchmarks/lib/track_benchmarks.rb,benchmarks/lib/track_benchmarks/**/*.rb, and focused specs.bundle exec rubocop benchmarks/track_benchmarks.rb benchmarks/lib/track_benchmarks.rb benchmarks/lib/track_benchmarks benchmarks/spec/track_benchmarksBENCHMARK_JSON=/tmp/codex-missing-benchmark.json bundle exec ruby benchmarks/track_benchmarks.rbexits 1 as expected.Review / Churn Notes
codex review --base origin/maincompleted with no correctness/security/maintainability findings.claude -p '/code-review origin/main'completed with no correctness findings; it noted optional maintainability cleanup around compatibility shims and memoization, intentionally kept out of scope for a behavior-preserving split.bin/ci-local --changedwas attempted by the worker and failed in pre-existing repo-wide RuboCop offenses underreact_on_rails/spike/3313_prism_gemfile_rewriter/**, outside this lane's file-touch map.claude -p '/simplify origin/main'was run under a 180s timeout and exited with no output.Codex Decision Log
Dir.glob(...).first..firstcould confirm or clear the wrong benchmark payload when the artifact tree is malformed.Changelog
Not user-visible; no changelog entry.
Confidence note:
codex/3459-track-benchmarks-split.Summary by CodeRabbit
Merge Rationale
Maintainer merge authority was granted on 2026-06-24 after this PR reached ready state.
Reasons to merge:
benchmarks/track_benchmarks.rbinto focused modules/specs without intended benchmark behavior changes.d30b12e8186ca2f7bc983b5c76d8f8f957031e46is mergeable (mergeStateStatus: CLEAN) with 44 checks passed, 0 pending, and 0 failing.complete_allowed: true, no unknown fields, no violations), and unresolved review threads are 0.benchmarks/track_benchmarks.rb,benchmarks/lib/**, andbenchmarks/spec/**.Confidence note:
pr-ci-readiness, full hosted checks, review-thread triage, andscript/pr-merge-ledger 4176 --strict --pretty --changelog-classification not_user_visible./tmp/pr-4176-merge-ledger-premerge.json.