Skip to content

Commit 5c8cdcb

Browse files
authored
[codex] Split benchmark tracking helpers (#3791)
## Summary - Extract Bencher CLI command/report handling from `benchmarks/track_benchmarks.rb` into `BencherRunner`. - Extract PR benchmark report comment posting and stale-comment cleanup into `PrReportPoster`. - Add focused specs for the extracted helpers so the benchmark tracking workflow can be changed with less script-level risk. - Harden stale benchmark comment cleanup by filtering jq output to numeric GitHub comment IDs and warning on unexpected tokens before deleting anything. ## Scope Part of #3459. This intentionally does not close the issue because the workflow-matrix and k6/server-monitoring follow-ups remain. No `.github/workflows/*` files were changed, so this avoids the active #3785 / #3149 workflow lane. ## Validation - `bundle exec rspec benchmarks/spec` - `bundle exec rubocop benchmarks` - `(cd react_on_rails && bundle exec rubocop)` - `codex review --base origin/main` - `git diff --check` <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Touches CI benchmark reporting and GitHub PR comment deletion; behavior is largely preserved but report I/O and gh API paths changed, with regression risk mitigated by extensive specs and no workflow YAML edits. > > **Overview** > Refactors benchmark tracking by moving Bencher CLI execution and report handling out of `track_benchmarks.rb` into **`BencherRunner`**, and PR Markdown comment posting plus stale-comment cleanup into **`PrReportPoster`**. > > **`BencherRunner`** owns threshold/CLI args, runs `bencher run`, persists JSON via **tmp → atomic move**, and returns a structured result. Parse or persistence failures raise typed errors; malformed output is logged with **`Github.debug`** and bad report files are removed. **`run_bencher!`** in the script maps those failures to `::error::` and exits. > > **`PrReportPoster`** posts suite reports over **stdin** (avoids arg limits), sweeps older marker-matched comments using a pre-post cutoff, and **validates `owner/repo` and numeric PR numbers** before calling `gh`. Stale-ID listing now **keeps only numeric comment IDs** and warns on unexpected jq tokens so cleanup cannot delete with bogus paths. > > Non-PR runs avoid requiring `PR_NUMBER` via lazy poster init and `GITHUB_EVENT_NAME` defaulting to nil. Specs for the extracted helpers replace inlined script tests; **`BencherRunner::THRESHOLDS`** is the single source for threshold/table alignment checks. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 7867e98. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Automated benchmark runs now produce and persist JSON reports, post per-suite Markdown to pull requests, and remove stale benchmark comments. * **Bug Fixes** * CI now surfaces clear errors for malformed benchmark output, warns when perf-link context is missing, and reports failures when comment deletions fail. * **Tests** * New specs cover benchmark run/report parsing, CLI behaviors, retry paths, and PR comment lifecycle. * **Refactor** * Benchmark execution, report parsing, and PR-commenting responsibilities have been separated for clearer flow. <!-- end of auto-generated comment: release notes by coderabbit.ai --> ## Agent Merge Confidence Mode: accelerated-rc Score: 9/10 Auto-merge recommendation: yes Affected areas: benchmarks, CI benchmark reporting CI detector: `script/ci-changes-detector origin/main` -> Benchmark scripts; recommended lint. Validation run: - `bundle exec rspec benchmarks/spec/bencher_runner_spec.rb benchmarks/spec/pr_report_poster_spec.rb benchmarks/spec/track_benchmarks_spec.rb` -> 71 examples, 0 failures - `bundle exec rubocop benchmarks/lib/bencher_runner.rb benchmarks/lib/pr_report_poster.rb benchmarks/track_benchmarks.rb benchmarks/spec/bencher_runner_spec.rb benchmarks/spec/pr_report_poster_spec.rb benchmarks/spec/track_benchmarks_spec.rb` -> no offenses - `git diff --check` -> passed - `script/ci-changes-detector origin/main` -> benchmark scripts; recommended lint Review/check gate: - Claude review: complete for `7867e984dbe919d1d3faee4f6ac69120bdc6925b`, no unresolved current threads - CodeRabbit: approved/no actionable current blockers; docstring coverage warning treated as non-blocking for benchmark helper scripts - GitHub checks: complete for `7867e984dbe919d1d3faee4f6ac69120bdc6925b`; expected skips are non-selected/confirmation-only jobs Known residual risk: low benchmark-comment lifecycle risk, mitigated by focused helper specs and full current-head CI. Finalized by: Claude Code Review check `claude-review` for head `7867e984dbe919d1d3faee4f6ac69120bdc6925b`
1 parent e73221d commit 5c8cdcb

8 files changed

Lines changed: 934 additions & 243 deletions

benchmarks/lib/bencher_runner.rb

Lines changed: 171 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,171 @@
1+
# frozen_string_literal: true
2+
3+
require "fileutils"
4+
require "json"
5+
require "open3"
6+
7+
require_relative "bencher_report"
8+
require_relative "github"
9+
10+
# Builds and runs the Bencher CLI invocation for benchmark tracking.
11+
class BencherRunner
12+
class ReportParseError < StandardError; end
13+
class PersistenceError < StandardError; end
14+
15+
Result = Struct.new(:stderr, :exit_code, :report, keyword_init: true)
16+
private_constant :Result
17+
18+
# Bencher dashboard project for React on Rails benchmark runs.
19+
PROJECT_SLUG = "react-on-rails-t8a9ncxo"
20+
private_constant :PROJECT_SLUG
21+
MAX_SAMPLE = "64" # String because it is passed verbatim as a CLI argument.
22+
private_constant :MAX_SAMPLE
23+
24+
# Per-measure t-test boundaries (the confidence level Bencher uses for its
25+
# prediction interval). Tuned from a sweep of recent main-branch reports so fewer
26+
# than 1/20 commits raise a false regression across all benchmarks: rps and p50
27+
# individually need ~0.9995 / ~0.9999 to clear that bar. failed_pct stays at 0.95
28+
# because healthy runs sit at ~0 with near-zero variance, so its boundary rarely
29+
# matters.
30+
# Bencher's t-test threshold is a prediction interval, so each one-sided boundary B
31+
# gives a per-test false-positive rate of ~(1 - B):
32+
# https://bencher.dev/docs/explanation/thresholds/
33+
# Direction: :lower for "regression = drop" measures (rps), :upper for
34+
# "regression = climb" measures (latency, failure rate).
35+
# p90/p99/max are intentionally NOT tracked: their tail noise can't meet the 1/20
36+
# target at any usable boundary. p90 stays in the summary table for visibility only.
37+
THRESHOLDS = [
38+
["rps", :lower, "0.9995"],
39+
["p50_latency", :upper, "0.9999"],
40+
["failed_pct", :upper, "0.95"]
41+
].freeze
42+
43+
def initialize(benchmark_json:, report_json:)
44+
@benchmark_json = benchmark_json
45+
@report_json = report_json
46+
end
47+
48+
# Returns a Result with :stderr, :exit_code, and :report accessors. The
49+
# private constant keeps callers from depending on the struct class name.
50+
# Raises PersistenceError on I/O failure, ReportParseError on malformed JSON output.
51+
def run(branch:, start_point_args:)
52+
# This Bencher CLI call is not wrapped in Timeout.timeout because that can leak
53+
# child processes. In CI it is bounded by the GitHub Actions job timeout for
54+
# .github/workflows/benchmark-suite.yml; the benchmark execution step has its
55+
# own narrower timeout-minutes before this reporting step runs.
56+
stdout, stderr, status = Open3.capture3(*args(branch, start_point_args))
57+
emit_stderr(stderr)
58+
report = persist_report(stdout)
59+
warn_on_missing_perf_link_context(report) if report
60+
Result.new(stderr:, exit_code: status.exitstatus, report:)
61+
end
62+
63+
private
64+
65+
attr_reader :benchmark_json, :report_json
66+
67+
def emit_stderr(stderr)
68+
return if stderr.empty?
69+
70+
warn stderr
71+
end
72+
73+
def threshold_args(measure, direction, boundary)
74+
# "_" is Bencher's sentinel for "no boundary on this side".
75+
lower, upper = direction == :lower ? [boundary, "_"] : ["_", boundary]
76+
[
77+
"--threshold-measure", measure,
78+
"--threshold-test", "t_test",
79+
"--threshold-max-sample-size", MAX_SAMPLE,
80+
"--threshold-lower-boundary", lower,
81+
"--threshold-upper-boundary", upper
82+
]
83+
end
84+
85+
def args(branch, start_point_args)
86+
[
87+
"bencher", "run",
88+
"--project", PROJECT_SLUG,
89+
"--branch", branch,
90+
*start_point_args,
91+
"--testbed", "github-actions",
92+
"--adapter", "json",
93+
"--file", benchmark_json,
94+
"--err",
95+
"--quiet",
96+
"--format", "json",
97+
*THRESHOLDS.flat_map { |measure, direction, boundary| threshold_args(measure, direction, boundary) }
98+
]
99+
end
100+
101+
# Writes Bencher stdout to disk atomically (tmp -> mv), then parses it.
102+
# On write/move failure the prior report at report_json is left untouched.
103+
# Empty Bencher stdout removes any stale prior report because there is no new output to preserve.
104+
# On parse failure the newly-written malformed report is removed so a future
105+
# retry starts clean rather than re-posting garbage.
106+
def persist_report(stdout)
107+
if stdout.empty?
108+
begin
109+
FileUtils.rm_f(report_json)
110+
rescue SystemCallError, IOError => e
111+
raise PersistenceError,
112+
"#{e.message} (Bencher produced no output; see stderr above for the run failure)"
113+
end
114+
return nil
115+
end
116+
117+
tmp_report_json = "#{report_json}.tmp"
118+
begin
119+
File.write(tmp_report_json, stdout)
120+
FileUtils.mv(tmp_report_json, report_json)
121+
rescue SystemCallError, IOError => e
122+
raise PersistenceError, e.message
123+
ensure
124+
# Always runs, including for exceptions that bypass the rescue block. After a successful mv the tmp
125+
# file no longer exists, so rm_f is a no-op; if write or mv raised it performs the cleanup.
126+
safe_remove_tmp(tmp_report_json)
127+
end
128+
129+
parse_and_cleanup_report(stdout)
130+
end
131+
132+
def parse_and_cleanup_report(stdout)
133+
parse_report(stdout)
134+
rescue ReportParseError
135+
Github.debug("Malformed Bencher output (first 300 chars): #{stdout.slice(0, 300).inspect}")
136+
# Only ReportParseError is cleaned up here. Unexpected parser bugs should propagate unchanged.
137+
# Remove malformed output so a future retry starts clean; the raw debugging
138+
# artifact is lost, but a bad report file is worse than no report file.
139+
begin
140+
FileUtils.rm_f(report_json)
141+
rescue StandardError => e
142+
Github.warning("Could not remove malformed Bencher report #{report_json}: #{e.message}")
143+
end
144+
raise
145+
end
146+
147+
def parse_report(stdout)
148+
BencherReport.parse(stdout, tracked_measures: THRESHOLDS.map(&:first))
149+
rescue BencherReport::FormatError, JSON::ParserError => e
150+
raise ReportParseError,
151+
"Bencher JSON report has an unexpected shape — re-verify against " \
152+
"benchmarks/spec/bencher_report_spec.rb before bumping the CLI pin. #{e.message}"
153+
end
154+
155+
def safe_remove_tmp(path)
156+
FileUtils.rm_f(path)
157+
rescue StandardError => e
158+
# Cleanup failures are non-fatal, so keep this broader than the persistence rescue.
159+
Github.warning("Could not remove temporary Bencher report #{path}: #{e.message}")
160+
end
161+
162+
def warn_on_missing_perf_link_context(report)
163+
return unless report.perf_links_unavailable?
164+
165+
Github.warning(
166+
"Bencher report listed benchmarks but no perf-link context " \
167+
"(project/branch/testbed uuid); benchmark names will render unlinked. Re-verify the " \
168+
"report shape against benchmarks/spec/bencher_perf_url_spec.rb before bumping the CLI pin."
169+
)
170+
end
171+
end

benchmarks/lib/github.rb

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,10 @@ def notice(message)
1616
$stdout.puts "::notice::#{escape_workflow_command_data(message)}"
1717
end
1818

19+
def debug(message)
20+
$stdout.puts "::debug::#{escape_workflow_command_data(message)}"
21+
end
22+
1923
def escape_workflow_command_data(value)
2024
value.to_s
2125
# Escape percent first so the percent signs introduced below are not double-encoded.

benchmarks/lib/pr_report_poster.rb

Lines changed: 140 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,140 @@
1+
# frozen_string_literal: true
2+
3+
require "time"
4+
5+
require_relative "github"
6+
require_relative "github_cli"
7+
8+
# Posts the per-suite Bencher Markdown report to a pull request and cleans up
9+
# older comments with the same marker.
10+
class PrReportPoster
11+
REPOSITORY_SLUG_PATTERN = %r{\A[A-Za-z0-9][A-Za-z0-9_.-]*/[A-Za-z0-9][A-Za-z0-9_.-]*\z}
12+
private_constant :REPOSITORY_SLUG_PATTERN
13+
14+
def initialize(repository:, pr_number:, suite_name:, marker:)
15+
normalized_repository = repository.to_s
16+
# The regex allows ".." within component names, so reject any embedded path traversal.
17+
unless normalized_repository.match?(REPOSITORY_SLUG_PATTERN) && !normalized_repository.include?("..")
18+
raise ArgumentError, "repository must be in owner/repo format, got: #{normalized_repository.inspect}"
19+
end
20+
21+
normalized_pr_number = pr_number.to_s
22+
unless normalized_pr_number.match?(/\A\d+\z/)
23+
raise ArgumentError, "pr_number must be numeric, got: #{normalized_pr_number.inspect}"
24+
end
25+
26+
@repository = normalized_repository
27+
@pr_number = normalized_pr_number
28+
@suite_name = suite_name
29+
@marker = marker
30+
end
31+
32+
# GitHub Actions sets GITHUB_REPOSITORY natively. The workflow step must set
33+
# PR_NUMBER from the pull request event.
34+
def self.from_env(suite_name:, marker:)
35+
new(
36+
repository: required_repository,
37+
pr_number: required_pr_number,
38+
suite_name:,
39+
marker:
40+
)
41+
end
42+
43+
def self.required_repository
44+
ENV.fetch("GITHUB_REPOSITORY") do
45+
raise KeyError, "GITHUB_REPOSITORY env var is required (set by GitHub Actions)"
46+
end
47+
end
48+
private_class_method :required_repository
49+
50+
def self.required_pr_number
51+
ENV.fetch("PR_NUMBER") do
52+
raise KeyError, "PR_NUMBER env var is required (set it from the pull_request event in the workflow step)"
53+
end
54+
end
55+
private_class_method :required_pr_number
56+
57+
def replace(markdown)
58+
# Guard callers that use the poster without the script-level empty-report check.
59+
return if markdown.empty?
60+
61+
# Capture cutoff before posting so the stale-comment sweep only hits pre-existing
62+
# comments with the same marker, not the one this run is about to create.
63+
cutoff_ts = Time.now.utc.iso8601
64+
if post_comment(markdown)
65+
delete_stale_comments(before: cutoff_ts)
66+
else
67+
Github.warning("Failed to post #{suite_name} benchmark report comment; keeping prior comments in place.")
68+
end
69+
end
70+
71+
private
72+
73+
attr_reader :repository, :pr_number, :suite_name, :marker
74+
75+
def delete_stale_comments(before:)
76+
failed = 0
77+
stale_comment_ids(before:).each do |comment_id|
78+
$stdout.puts "Deleting stale #{suite_name} Bencher report comment #{comment_id}"
79+
failed += 1 unless GithubCli.run(
80+
"gh", "api", "-X", "DELETE", "repos/#{repository}/issues/comments/#{comment_id}",
81+
error_message: "Failed to delete stale #{suite_name} Bencher report comment #{comment_id}"
82+
)
83+
end
84+
return if failed.zero?
85+
86+
Github.warning(
87+
"Failed to delete #{failed} stale #{suite_name} Bencher report comment(s); " \
88+
"they may remain visible."
89+
)
90+
end
91+
92+
def stale_comment_ids(before:)
93+
# Marker + cutoff are passed via env so the jq filter reads them through `env.X`,
94+
# avoiding Ruby/JQ escaping mismatches from interpolated strings.
95+
stdout, status = GithubCli.capture(
96+
"gh", "api", "repos/#{repository}/issues/#{pr_number}/comments",
97+
"--paginate",
98+
# gh api --paginate applies --jq to each page independently, then concatenates stdout.
99+
# GitHub timestamps are fixed-width ISO-8601 strings, so lexical ordering matches time ordering.
100+
"--jq", ".[] | select(.body | startswith(env.MARKER)) | select(.created_at < env.CUTOFF_TS) | .id",
101+
env: { "MARKER" => marker, "CUTOFF_TS" => before }
102+
)
103+
unless status.success?
104+
# Cleanup is best-effort: stale comments should not fail an otherwise valid benchmark job.
105+
Github.warning("Failed to list stale #{suite_name} Bencher report comments; skipping cleanup.")
106+
return []
107+
end
108+
109+
comment_ids = stdout.lines.map(&:strip).reject(&:empty?)
110+
numeric_comment_ids = comment_ids.grep(/\A\d+\z/)
111+
non_numeric_comment_ids = comment_ids.grep_v(/\A\d+\z/)
112+
if non_numeric_comment_ids.any?
113+
if numeric_comment_ids.empty?
114+
Github.warning(
115+
"Stale #{suite_name} Bencher report comment listing returned no numeric IDs " \
116+
"(#{non_numeric_comment_ids.size} non-numeric token(s), " \
117+
"e.g. #{non_numeric_comment_ids.first.slice(0, 120).inspect}); skipping cleanup."
118+
)
119+
return []
120+
end
121+
122+
Github.warning(
123+
"Stale #{suite_name} Bencher report comment listing returned " \
124+
"#{non_numeric_comment_ids.size} non-numeric ID(s); ignoring those entries."
125+
)
126+
end
127+
128+
numeric_comment_ids
129+
end
130+
131+
def post_comment(markdown)
132+
# Send the body over stdin (--body-file -) rather than as a CLI argument so a
133+
# large report can't hit the OS argument-length limit.
134+
GithubCli.run(
135+
"gh", "pr", "comment", pr_number, "--repo", repository, "--body-file", "-",
136+
error_message: "Failed to post #{suite_name} benchmark report comment",
137+
stdin_data: "#{marker}\n#{markdown}"
138+
)
139+
end
140+
end

0 commit comments

Comments
 (0)