ci: integrate CodSpeed continuous benchmarking #9975
Wire the existing criterion benches into CodSpeed (https://codspeed.io)
for continuous performance tracking. CodSpeed runs benches under CPU
simulation in CI and posts per-PR comparison reports vs. the base
branch's latest main run.

Highlights
==========

- `criterion` workspace dependency renamed to
  `codspeed-criterion-compat`: a drop-in passthrough that wraps real
  criterion when running outside cargo-codspeed, so bench source code
  needs no changes (`use criterion::*` keeps working) and `cargo bench`
  locally is unaffected.

- Two workflows:
  - `.github/workflows/codspeed.yml` runs on every push to main and
    populates the base-branch baseline.
  - `.github/workflows/codspeed-pr.yml` runs on PRs when a `bench:*`
    label is attached, so external contributors don't blindly burn CI
    capacity. Labels are namespaced per crate:

        bench:all                        # whole workspace
        bench:arrow                      # all of arrow's benches
        bench:parquet bench:arrow-cast   # union

- Sharded one job per `[[bench]]` target (~78 shards after exclusions).
  Required because (a) the full workspace produces >1000 individual
  benchmarks per upload, and (b) the parquet crate alone produces >1000
  due to heavy criterion parameterization, both of which exceed
  CodSpeed's per-upload limit. Jobs in the same workflow are
  auto-aggregated by CodSpeed into a single report.
  Ref: https://codspeed.io/docs/features/sharded-benchmarks

- Build-once / run-many topology:

      setup ─┐
             ├──→ bench (matrix, N shards)
      build ─┘

  `build` does the full-workspace `cargo codspeed build` exactly once
  and uploads `target/codspeed/` as a tar artifact (tar preserves the
  +x bit, which `actions/upload-artifact` strips otherwise). Each bench
  shard downloads the artifact and invokes
  `cargo codspeed run -p <crate> --bench <bench>`. No rebuild per
  shard, so CI cost scales with N shards × ~2 min instead of ×10 min.

- Dynamic matrix: `setup` parses every workspace member's Cargo.toml
  for `[[bench]]` entries with awk + jq and emits a JSON
  `{crate, bench}` array, so new bench targets are picked up
  automatically without touching the workflow.

- Auth: GitHub OIDC. No `CODSPEED_TOKEN` secret needed for the public
  repo; the workflow's `id-token` claim is what CodSpeed verifies.

Exclusions
==========

Ten bench targets currently fail at runtime (e.g. `merge_kernels`
panics in `arrow-data/src/transform/primitive.rs:31`); these are
pre-existing issues in the benches themselves, not the integration.
They're listed in an `EXCLUDED_BENCHES` env in both workflows so the
remaining ~78 shards run clean. Each excluded target should be fixed
and removed from the list one by one.

Prerequisites for activation
============================

- Install the CodSpeed GitHub App on `apache/arrow-rs`:
  https://github.com/apps/codspeed
- Enroll the repository at https://codspeed.io (the OIDC integration is
  automatic for public repos; no secret token configuration required)

Once both are done, the first push to main will populate the baseline
and PRs labeled `bench:*` will receive automated CodSpeed comparison
comments.
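
As a rough illustration of the tar remark in the build-once bullet, the sketch below round-trips an executable file through a tar archive with Python's standard `tarfile` module and checks that the owner-execute bit survives. The file name `bench_bin` and the helper `roundtrip_keeps_exec_bit` are invented for this example; the actual workflow presumably uses the `tar` CLI on `target/codspeed/`, which is not shown here.

```python
import os
import stat
import tarfile
import tempfile


def roundtrip_keeps_exec_bit() -> bool:
    """Pack an executable file into a tar archive, unpack it elsewhere,
    and report whether the owner-execute bit is still set."""
    with tempfile.TemporaryDirectory() as tmp:
        src_dir = os.path.join(tmp, "src")
        out_dir = os.path.join(tmp, "out")
        os.makedirs(src_dir)
        os.makedirs(out_dir)

        # Hypothetical stand-in for a built bench binary.
        binary = os.path.join(src_dir, "bench_bin")
        with open(binary, "w") as f:
            f.write("#!/bin/sh\necho bench\n")
        os.chmod(binary, 0o755)

        # The file mode is recorded in the tar header for each member.
        archive = os.path.join(tmp, "codspeed.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(binary, arcname="bench_bin")

        # Newer Python versions accept an extraction filter; the "data"
        # filter keeps the execute bit when the owner already has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive, "r") as tar:
            tar.extractall(out_dir, **kwargs)

        mode = os.stat(os.path.join(out_dir, "bench_bin")).st_mode
        return bool(mode & stat.S_IXUSR)


if __name__ == "__main__":
    print("exec bit preserved:", roundtrip_keeps_exec_bit())
```

This only demonstrates the tar half of the claim on a POSIX filesystem; it says nothing about how `actions/upload-artifact` handles permissions.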
…anifests
Replace the per-workflow awk/jq Cargo.toml parsing and hardcoded crate
lists with a single shared script (.github/workflows/codspeed-matrix.sh)
that discovers every [[bench]] target across the workspace via
`cargo metadata`. New crates and bench targets are picked up automatically.
The known-broken exclusion list, previously duplicated as an
EXCLUDED_BENCHES env in both workflows, now lives next to each bench in
its crate's Cargo.toml:

    [package.metadata.codspeed.benches]
    merge_kernels = { skip = true }

cargo surfaces that table at .packages[].metadata.codspeed.benches and the
script drops any target flagged `skip = true`. Unknown `bench:<crate>`
label suffixes now error against the workspace member list rather than a
loose regex.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
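
The shared script itself is not reproduced in this thread, so the following is only a sketch, in Python rather than shell, of the logic the commit message describes: discover bench targets from `cargo metadata` output, drop the ones flagged `skip = true`, and resolve `bench:<crate>` labels against the workspace member list. The function names (`load_metadata`, `member_names`, `discover_benches`, `select_for_labels`) are hypothetical; the JSON fields used (`workspace_members`, `packages[].id`, `packages[].targets[].kind`, `packages[].metadata`) are the ones `cargo metadata --format-version 1` emits.

```python
import json
import subprocess


def load_metadata() -> dict:
    """Run `cargo metadata` for the current workspace (not called below)."""
    out = subprocess.run(
        ["cargo", "metadata", "--format-version", "1", "--no-deps"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def member_names(metadata: dict) -> set:
    """Names of the packages that are workspace members."""
    ids = set(metadata["workspace_members"])
    return {p["name"] for p in metadata["packages"] if p["id"] in ids}


def discover_benches(metadata: dict) -> list:
    """One {crate, bench} entry per bench target not flagged skip = true."""
    ids = set(metadata["workspace_members"])
    matrix = []
    for pkg in metadata["packages"]:
        if pkg["id"] not in ids:
            continue
        # `metadata` is null when the crate has no [package.metadata] table.
        codspeed = (pkg.get("metadata") or {}).get("codspeed") or {}
        skips = codspeed.get("benches") or {}
        for target in pkg["targets"]:
            if "bench" not in target["kind"]:
                continue
            if (skips.get(target["name"]) or {}).get("skip") is True:
                continue
            matrix.append({"crate": pkg["name"], "bench": target["name"]})
    return matrix


def select_for_labels(matrix: list, members: set, labels: list) -> list:
    """Filter the matrix by bench:* labels; unknown crates are an error."""
    wanted = set()
    for label in labels:
        if not label.startswith("bench:"):
            continue  # unrelated labels are ignored
        suffix = label[len("bench:"):]
        if suffix == "all":
            return list(matrix)
        if suffix not in members:
            raise ValueError(f"unknown crate in label {label!r}")
        wanted.add(suffix)
    return [entry for entry in matrix if entry["crate"] in wanted]
```

A workflow step could then do something like `json.dumps(select_for_labels(discover_benches(md), member_names(md), labels))` and hand the result to `fromJson` as the job matrix.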
191da5f to c4e266f
Jefffrey left a comment
regarding the failing benches, i have a PR to fix one of them:
for the others i wasn't able to reproduce a failure locally; does runtime error potentially indicate the benchmarks took too long to run?
for adding the app, it looks like opendal has done this so we can follow their lead:
fyi looks like there was a previous attempt to add to arrow-rs here:
but i guess we were stuck waiting on infra, which should be resolved now
    # Fork PR caveat: workflows triggered by `pull_request` from fork PRs
    # do not get an OIDC token. For benches on fork PRs, push the branch
    # to this repo and label it there.
i suppose this is an important caveat considering essentially all our PRs come from forks 🤔
    # Continuous benchmarking on CodSpeed.
    #
    # Runs the full workspace bench suite on every push to main, sharded
    # one job per `[[bench]]` target. Sharding at this granularity is
    # required because both crate-level shards (e.g. parquet alone has 16
    # bench targets producing >1000 benchmarks) and the workspace as a
    # whole exceed CodSpeed's 1000-benchmark per-upload limit. Jobs in the
    # same workflow are auto-aggregated by CodSpeed into one report.
    # https://codspeed.io/docs/features/sharded-benchmarks
I do wonder how heavy this would be to run entire benchmark suite on every commit 🤔
      matrix:
        config: ${{ fromJson(needs.setup.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v6
minor note: we should change the versions to hashes, but can do in a followup
…#10199) As identified by #9975, this benchmark was panicking when run because it passed a 1-length array as an `ArrayRef` instead of a `Scalar`, so indexing the 2nd/3rd element caused an out-of-bounds panic. When wrapped in `Scalar`, the single element is repeated, so indexing no longer goes out of bounds; it just returns the first element.
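
To make that failure mode concrete, here is a toy Python model of the two behaviours. It is not the arrow-rs API: `ArrayLike`, `ScalarLike`, and `gather` are invented names that only mimic "index must exist" versus "single element logically repeated".

```python
class ArrayLike:
    """Toy stand-in for a plain array argument: index i must exist."""

    def __init__(self, values):
        self.values = list(values)

    def value(self, i):
        return self.values[i]  # raises IndexError past the end


class ScalarLike:
    """Toy stand-in for a scalar wrapper: one element, repeated for any index."""

    def __init__(self, array):
        if len(array.values) != 1:
            raise ValueError("a scalar wraps exactly one element")
        self.array = array

    def value(self, i):
        return self.array.values[0]


def gather(datum, indices):
    """Read one value per index, the way a kernel walks its inputs."""
    return [datum.value(i) for i in indices]
```

With these definitions, `gather(ArrayLike([7]), [0, 1, 2])` fails on the second index, while `gather(ScalarLike(ArrayLike([7])), [0, 1, 2])` returns the single element three times.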
Summary
Wires the existing criterion benches in this workspace into CodSpeed for continuous performance tracking. CodSpeed runs benches under CPU simulation in CI and posts per-PR comparison reports vs. the base branch's latest main run.
This PR is opt-in once activated: the PR workflow only fires when a maintainer adds a `bench:*` label, so external contributors don't blindly burn CI capacity. The main-push workflow keeps the baseline current.

The integration has been validated end-to-end on a fork (`pydantic/arrow-rs`): 3031 benchmarks captured from a single main run, and PR runs produce clean comparison comments (e.g. "Merging this PR will not alter performance: ✅ 7 untouched benchmarks, ⏩ 3024 skipped benchmarks, comparing codspeed-smoke-test (5b1320a) with main (fcbe248)"). Public dashboard: https://codspeed.io/pydantic/arrow-rs

Design
Drop-in shim, no bench source changes
The `criterion` workspace dependency is renamed (via the `[package]` cargo trick) to `codspeed-criterion-compat`. This is a CodSpeed-maintained passthrough: when not running under `cargo codspeed`, it forwards to real criterion, so `cargo bench` locally is unchanged and every existing `use criterion::*` in every bench source file compiles unmodified.

Sharded one job per `[[bench]]` target

Required for two reasons:

- The full workspace produces more than 1000 benchmarks, which is over CodSpeed's 1000-benchmark per-upload limit.
- The `parquet` crate alone exceeds 1000, so per-crate sharding wasn't fine enough.

Jobs within a single workflow are auto-aggregated by CodSpeed into one unified report.
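
The sharding argument can be checked with a small calculation. The sketch below uses made-up per-target benchmark counts (the real numbers are not listed in this PR) and invented helper names to show how a workspace-level or crate-level shard can exceed the 1000-benchmark limit while per-target shards stay under it.

```python
from collections import defaultdict

UPLOAD_LIMIT = 1000  # per-upload benchmark limit quoted in this PR


def shard_sizes(counts, granularity):
    """Sum benchmark counts per shard.

    `counts` maps (crate, bench_target) -> number of benchmarks.
    `granularity` is "workspace", "crate", or "target".
    """
    sizes = defaultdict(int)
    for (crate, bench), n in counts.items():
        if granularity == "workspace":
            key = "workspace"
        elif granularity == "crate":
            key = crate
        elif granularity == "target":
            key = f"{crate}/{bench}"
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        sizes[key] += n
    return dict(sizes)


def oversized(counts, granularity, limit=UPLOAD_LIMIT):
    """Shards whose benchmark count exceeds the upload limit."""
    sizes = shard_sizes(counts, granularity)
    return sorted(key for key, n in sizes.items() if n > limit)


# Hypothetical counts, chosen only to illustrate the shape of the problem.
EXAMPLE_COUNTS = {
    ("parquet", "reader_bench"): 600,
    ("parquet", "writer_bench"): 500,
    ("arrow", "cast_bench"): 300,
}
```

Under these assumed counts, `oversized(EXAMPLE_COUNTS, "workspace")` and `oversized(EXAMPLE_COUNTS, "crate")` each report an oversized shard, while `oversized(EXAMPLE_COUNTS, "target")` reports none.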
Build once, run many
- `setup` discovers every `[[bench]]` target across the workspace via `cargo metadata` (no `Cargo.toml` text parsing, no hardcoded crate list) and emits a JSON `{crate, bench}` array; new crates and bench targets are picked up automatically. Discovery and the skip filter live in a single shared script, `.github/workflows/codspeed-matrix.sh`, used by both workflows.
- `build` runs the full-workspace `cargo codspeed build` exactly once, packs `target/codspeed/` into a tarball (tar preserves the +x bit; `actions/upload-artifact` strips it otherwise), and uploads it as a 1-day artifact.
- Each bench shard downloads the artifact and invokes `cargo codspeed run -p <crate> --bench <bench>`. No per-shard rebuild: CI cost scales with N × ~2 min instead of N × full build.

Label-gated PRs
`codspeed-pr.yml` fires on `pull_request: [labeled, synchronize, opened, reopened]` and only runs when the PR has at least one `bench:*` label:

| Label | Benches run |
| --- | --- |
| `bench:all` | every `[[bench]]` in the workspace |
| `bench:<crate>` | every `[[bench]]` in that crate |
| `bench:<crate-a> bench:<crate-b>` | union of both crates' benches |

`bench:<crate>` suffixes are resolved to workspace members by `codspeed-matrix.sh`, which errors on an unknown crate name. Authorization is implicit: only users with write access can add labels.

While the label is attached, every push to the PR re-runs the suite (`synchronize` event); re-runs cancel in-progress shards via `concurrency: cancel-in-progress: true`.

OIDC auth
Public repo, no `CODSPEED_TOKEN` secret required: the workflow's `id-token: write` permission lets the job request the OIDC token that CodSpeed verifies. Workflows are repo-agnostic.

Exclusions
Ten bench targets currently fail at runtime in this workspace. These are pre-existing issues in the bench targets themselves, not the integration. Each is marked skipped in its own crate's `Cargo.toml`, next to where the bench is declared, so the skip list lives with the code that owns it, as a `[package.metadata.codspeed.benches]` table with entries such as `merge_kernels = { skip = true }`.

`cargo metadata` surfaces that table at `.packages[].metadata.codspeed.benches`, and `codspeed-matrix.sh` drops any target flagged `skip = true`, leaving the remaining ~78 shards to run clean. Fix a target and delete its entry to bring it back; the list shrinks to zero over time. Current entries:

- `arrow` / `merge_kernels` (panics at `arrow-data/src/transform/primitive.rs:31:43`)
- `arrow` / `buffer_bit_ops`
- `arrow` / `buffer_create`
- `arrow` / `sort_kernel`
- `arrow` / `string_run_builder`
- `arrow` / `primitive_run_accessor`
- `arrow-array` / `union_array`
- `arrow-cast` / `parse_date`
- `parquet` / `row_selection_cursor`
- `parquet-variant-compute` / `variant_kernels`

I'm happy to file separate upstream issues for each if helpful, or to drop the exclusion list entirely if maintainers prefer to investigate them all at once. The same `merge_kernels` exclusion was added by the official CodSpeed wizard's auto-generated PR (https://codspeed.io/docs/get-started/wizard), so this is consistent prior art.

Prerequisites for activation
This PR adds the workflow files, but they're inert until two repo-admin actions land:

- Install the CodSpeed GitHub App on `apache/arrow-rs`. This is what posts the PR comparison comment + status check.
- Enroll the repository at https://codspeed.io.

Once both are done, the first push to `main` will populate the baseline and PRs labeled `bench:*` will receive automated comparison comments.

CI cost notes
- … `build`, but only the bench shards for the labeled crates. A typical `bench:arrow-cast` run is build + 3 shards.

Test plan
- `cargo check --workspace --benches --features arrow/test_utils,arrow-schema/ffi,parquet/test_common,parquet/experimental,parquet/async,parquet/object_store` passes against this branch
- … `pydantic/arrow-rs`: main baseline run captured 3031 benchmarks; PR run posts comparison comment correctly; per-shard sharding stays under the 1000-benchmark limit
- … `apache/arrow-rs`, first main run populates baseline at https://codspeed.io/apache/arrow-rs
- … `bench:all` and per-crate `bench:<crate>` labels in repo settings
- … `bench:<crate>` to a real PR; confirm comparison comment + status check appear
🤖 Generated with Claude Code