ci: integrate CodSpeed continuous benchmarking by adriangb · Pull Request #9975 · apache/arrow-rs

adriangb · 2026-05-14T18:40:10Z

Closes #6149 (Track performance changes in CI)

Summary

Wires the existing criterion benches in this workspace into CodSpeed for continuous performance tracking. CodSpeed runs benches under CPU simulation in CI and posts per-PR comparison reports vs. the base branch's latest main run.

This PR is opt-in once activated: the PR workflow only fires when a maintainer adds a bench:* label, so external contributors don't blindly burn CI capacity. The main-push workflow keeps the baseline current.

The integration has been validated end-to-end on a fork (pydantic/arrow-rs): 3031 benchmarks captured from a single main run, PR runs produce clean comparison comments (e.g. "Merging this PR will not alter performance — ✅ 7 untouched benchmarks, ⏩ 3024 skipped benchmarks, comparing codspeed-smoke-test (5b1320a) with main (fcbe248)"). Public dashboard: https://codspeed.io/pydantic/arrow-rs

Design

Drop-in shim, no bench source changes

The workspace dependency named criterion now points at the codspeed-criterion-compat package (via Cargo's `package` key for renaming dependencies, not the `[package]` table). This is a CodSpeed-maintained passthrough: when not running under cargo codspeed, it forwards to real criterion, so cargo bench locally is unchanged and every existing use criterion::* in the bench source files compiles unmodified.

# Cargo.toml (workspace)
criterion = { package = "codspeed-criterion-compat", version = "4.6", default-features = false }

Sharded one job per `[[bench]]` target

Required for two reasons:

The full workspace produces well over 1000 individual benchmarks (criterion parameterizes heavily), which exceeds CodSpeed's per-upload limit.
Even the parquet crate alone exceeds 1000 — per-crate sharding wasn't fine enough.

Jobs within a single workflow are auto-aggregated by CodSpeed into one unified report.

Build once, run many

setup ─┐
       ├──→ bench (matrix, ~78 shards)
build ─┘

setup discovers every [[bench]] target across the workspace via cargo metadata (no Cargo.toml text parsing, no hardcoded crate list) and emits a JSON {crate, bench} array; new crates and bench targets are picked up automatically. Discovery and the skip filter live in a single shared script, .github/workflows/codspeed-matrix.sh, used by both workflows.
build runs the full-workspace cargo codspeed build exactly once, packs target/codspeed/ into a tarball (tar preserves the +x bit; actions/upload-artifact strips it otherwise), uploads as a 1-day artifact.
Each bench shard downloads the artifact, unpacks it, runs cargo codspeed run -p <crate> --bench <bench>. No per-shard rebuild — CI cost scales with N × ~2 min instead of N × full build.
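As a rough illustration of the discovery step described above, here is a minimal Python sketch that turns `cargo metadata` JSON into a `{crate, bench}` array. It is not the contents of codspeed-matrix.sh (which is a shell script); the function name `discover_benches` and the sample data are hypothetical, and the only thing assumed about `cargo metadata` is that each package lists its targets with a `kind` array.

```python
import json


def discover_benches(metadata: dict) -> list[dict]:
    """Return one {crate, bench} entry per bench target found in the metadata."""
    shards = []
    for package in metadata["packages"]:
        for target in package["targets"]:
            # A [[bench]] target is reported with "bench" in its kind list.
            if "bench" in target["kind"]:
                shards.append({"crate": package["name"], "bench": target["name"]})
    # Sort so the matrix is stable from run to run.
    return sorted(shards, key=lambda s: (s["crate"], s["bench"]))


# Hand-written, trimmed stand-in for `cargo metadata --format-version 1 --no-deps`.
# The target names are illustrative, not a real listing of the workspace.
SAMPLE_METADATA = {
    "packages": [
        {
            "name": "arrow-cast",
            "targets": [
                {"name": "arrow_cast", "kind": ["lib"]},
                {"name": "parse_time", "kind": ["bench"]},
                {"name": "parse_date", "kind": ["bench"]},
            ],
        },
        {
            "name": "arrow-schema",
            "targets": [{"name": "arrow_schema", "kind": ["lib"]}],
        },
    ]
}

matrix = discover_benches(SAMPLE_METADATA)
print(json.dumps(matrix))
```

Because the list is derived from the metadata rather than from a hardcoded crate list, a crate that gains a new `[[bench]]` entry would show up in the matrix without a workflow change.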

Label-gated PRs

codspeed-pr.yml fires on pull_request: [labeled, synchronize, opened, reopened] and only runs when the PR has at least one bench:* label:

| Label | Effect |
| --- | --- |
| `bench:all` | Every `[[bench]]` in the workspace |
| `bench:<crate>` | Every `[[bench]]` in that crate |
| `bench:<crate-a> bench:<crate-b>` | Union |

bench:<crate> suffixes are resolved to workspace members by codspeed-matrix.sh, which errors on an unknown crate name. Authorization is implicit: only users with write access can add labels.

While the label is attached, every push to the PR re-runs the suite (synchronize event); re-runs cancel in-progress shards via concurrency: cancel-in-progress: true.
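The label rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour (no `bench:*` label means no run, `bench:all` selects everything, several labels form a union, unknown crates are an error); the function name `resolve_labels` and the member list are hypothetical, and the real logic lives in the shell script.

```python
def resolve_labels(labels, workspace_members):
    """Map PR labels to the set of crates whose benches should run.

    Returns None when no bench:* label is present, meaning the
    workflow should not run any shards.
    """
    suffixes = [label.split(":", 1)[1] for label in labels if label.startswith("bench:")]
    if not suffixes:
        return None

    members = set(workspace_members)
    # Validate every suffix except the special "all" selector.
    unknown = sorted(set(suffixes) - members - {"all"})
    if unknown:
        raise ValueError(f"unknown crate in bench label: {', '.join(unknown)}")

    if "all" in suffixes:
        return members
    return set(suffixes)


# Hypothetical subset of workspace members, for illustration only.
MEMBERS = ["arrow", "arrow-cast", "parquet"]

print(resolve_labels(["bench:arrow-cast", "bench:parquet"], MEMBERS))
```

Validating against the actual member list, rather than a pattern match on the label text, means a typo in a label fails loudly instead of silently selecting nothing.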

OIDC auth

Public repo, no CODSPEED_TOKEN secret required — the workflow's id-token: write claim is what CodSpeed verifies. Workflows are repo-agnostic.

Exclusions

Ten bench targets currently fail at runtime in this workspace — pre-existing issues in the bench targets themselves, not the integration. Each is marked skipped in its own crate's Cargo.toml, next to where the bench is declared, so the skip list lives with the code that owns it:

# arrow/Cargo.toml
[package.metadata.codspeed.benches]
merge_kernels = { skip = true }   # broken at runtime, fix and remove

cargo metadata surfaces that table at .packages[].metadata.codspeed.benches, and codspeed-matrix.sh drops any target flagged skip = true, leaving the remaining ~78 shards to run clean. Fix a target and delete its entry to bring it back; the list shrinks to zero over time. Current entries:

| Target | Observed failure mode |
| --- | --- |
| `arrow / merge_kernels` | panics at `arrow-data/src/transform/primitive.rs:31:43` |
| `arrow / buffer_bit_ops` | runtime error |
| `arrow / buffer_create` | runtime error |
| `arrow / sort_kernel` | runtime error |
| `arrow / string_run_builder` | runtime error |
| `arrow / primitive_run_accessor` | runtime error |
| `arrow-array / union_array` | runtime error |
| `arrow-cast / parse_date` | runtime error |
| `parquet / row_selection_cursor` | runtime error |
| `parquet-variant-compute / variant_kernels` | intermittent |
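To make the skip mechanism concrete, here is a small Python sketch of dropping targets flagged `skip = true` under `metadata.codspeed.benches`. It assumes only what is described above plus the fact that `cargo metadata` reports a package's `metadata` as null when the table is absent; the function name `drop_skipped` and the sample shards (including `example_kernel`) are hypothetical.

```python
def drop_skipped(shards, metadata):
    """Remove shards whose bench is flagged skip = true in its crate's metadata."""
    skipped = set()
    for package in metadata["packages"]:
        # "metadata" may be null (None) when a crate declares no metadata table.
        codspeed = (package.get("metadata") or {}).get("codspeed") or {}
        for bench, options in (codspeed.get("benches") or {}).items():
            if options.get("skip") is True:
                skipped.add((package["name"], bench))
    return [s for s in shards if (s["crate"], s["bench"]) not in skipped]


# Hand-written stand-in for the relevant part of `cargo metadata` output.
SAMPLE_METADATA = {
    "packages": [
        {
            "name": "arrow",
            "metadata": {"codspeed": {"benches": {"merge_kernels": {"skip": True}}}},
        },
        {"name": "arrow-schema", "metadata": None},
    ]
}

# Hypothetical shard list; example_kernel is a made-up bench name.
SHARDS = [
    {"crate": "arrow", "bench": "merge_kernels"},
    {"crate": "arrow", "bench": "example_kernel"},
]

remaining = drop_skipped(SHARDS, SAMPLE_METADATA)
print(remaining)
```

Deleting the `merge_kernels` entry from the crate's Cargo.toml would be enough to bring the target back into the matrix under this scheme.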

I'm happy to file separate upstream issues for each if helpful — or to drop the exclusion list entirely if maintainers prefer to investigate them all at once. The same merge_kernels exclusion was added by the official CodSpeed wizard's auto-generated PR (https://codspeed.io/docs/get-started/wizard), so this is consistent prior art.

Prerequisites for activation

This PR adds the workflow files but they're inert until two repo-admin actions land:

Install the CodSpeed GitHub App on apache/arrow-rs. This is what posts the PR comparison comment + status check.
Enroll the repository at https://codspeed.io. OIDC is automatic for public repos — no secret token configuration required.

Once both are done, the first push to main will populate the baseline and PRs labeled bench:* will receive automated comparison comments.

CI cost notes

Main-push workflow: 1 build + 78 shards. Build job dominates wallclock (~10 min); shards run in parallel and download from one artifact, ~2 min each.
PR workflow: same build, but only the bench shards for the labeled crates. A typical bench:arrow-cast run is build + 3 shards.
Per-target bench binaries are bundled in one ~1-2 GB artifact (well under GitHub's 5 GB free-tier limit).
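A back-of-the-envelope comparison of total runner minutes, using the approximate figures quoted above (~10 min build, ~2 min per shard, 78 shards). This is only a sketch: it ignores the setup job, artifact upload/download time, and runner queueing, and the function names are made up for illustration.

```python
BUILD_MIN = 10  # approximate full-workspace build time quoted above
SHARD_MIN = 2   # approximate per-shard run time quoted above


def build_once_minutes(shards: int) -> int:
    """Total runner minutes when the build happens once and shards only run."""
    return BUILD_MIN + shards * SHARD_MIN


def rebuild_per_shard_minutes(shards: int) -> int:
    """Total runner minutes if every shard rebuilt the workspace itself."""
    return shards * (BUILD_MIN + SHARD_MIN)


main_push = build_once_minutes(78)          # 10 + 78 * 2 = 166
naive_main = rebuild_per_shard_minutes(78)  # 78 * 12 = 936
arrow_cast_pr = build_once_minutes(3)       # 10 + 3 * 2 = 16

print(main_push, naive_main, arrow_cast_pr)
```

Under these assumptions the build-once layout costs roughly 166 runner minutes per main push versus roughly 936 if each shard rebuilt, and a `bench:arrow-cast` PR run costs roughly 16.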

Test plan

cargo check --workspace --benches --features arrow/test_utils,arrow-schema/ffi,parquet/test_common,parquet/experimental,parquet/async,parquet/object_store passes against this branch
End-to-end validation on pydantic/arrow-rs: main baseline run captured 3031 benchmarks; PR run posts comparison comment correctly; per-shard sharding stays under the 1000-benchmark limit
After merge and CodSpeed-App install on apache/arrow-rs, first main run populates baseline at https://codspeed.io/apache/arrow-rs
Create the bench:all and per-crate bench:<crate> labels in repo settings
Add bench:<crate> to a real PR; confirm comparison comment + status check appear

References

CodSpeed docs: https://codspeed.io/docs
Sharded benchmarks: https://codspeed.io/docs/features/sharded-benchmarks
Compat shim source: https://github.com/CodSpeedHQ/codspeed-rust
Prior auto-generated wizard PR on pydantic fork: Add CodSpeed continuous performance benchmarking pydantic/arrow-rs#11 (single-shard; hit the >1000 limit, which this PR resolves)

🤖 Generated with Claude Code

Wire the existing criterion benches into CodSpeed (https://codspeed.io) for continuous performance tracking. CodSpeed runs benches under CPU simulation in CI and posts per-PR comparison reports vs. the base branch's latest main run.

Highlights
==========

- `criterion` workspace dependency renamed to `codspeed-criterion-compat`: a drop-in passthrough that wraps real criterion when running outside cargo-codspeed, so bench source code needs no changes (`use criterion::*` keeps working) and `cargo bench` locally is unaffected.

- Two workflows:
  - `.github/workflows/codspeed.yml` runs on every push to main and populates the base-branch baseline.
  - `.github/workflows/codspeed-pr.yml` runs on PRs when a `bench:*` label is attached, so external contributors don't blindly burn CI capacity. Labels are namespaced per crate:

      bench:all                        # whole workspace
      bench:arrow                      # all of arrow's benches
      bench:parquet bench:arrow-cast   # union

- Sharded one job per `[[bench]]` target (~78 shards after exclusions). Required because (a) the full workspace produces >1000 individual benchmarks per upload, and (b) the parquet crate alone produces >1000 due to heavy criterion parameterization, both of which exceed CodSpeed's per-upload limit. Jobs in the same workflow are auto-aggregated by CodSpeed into a single report. Ref: https://codspeed.io/docs/features/sharded-benchmarks

- Build-once / run-many topology:

      setup ─┐
             ├──→ bench (matrix, N shards)
      build ─┘

  `build` does the full-workspace `cargo codspeed build` exactly once and uploads `target/codspeed/` as a tar artifact (tar preserves the +x bit, which `actions/upload-artifact` strips otherwise). Each bench shard downloads the artifact and invokes `cargo codspeed run -p <crate> --bench <bench>`. No rebuild per shard, so CI cost scales with N shards × ~2 min instead of ×10 min.

- Dynamic matrix: `setup` parses every workspace member's Cargo.toml for `[[bench]]` entries with awk + jq and emits a JSON `{crate, bench}` array, so new bench targets are picked up automatically without touching the workflow.

- Auth: GitHub OIDC. No `CODSPEED_TOKEN` secret needed for the public repo; the workflow's `id-token` claim is what CodSpeed verifies.

Exclusions
==========

Ten bench targets currently fail at runtime (e.g. `merge_kernels` panics in `arrow-data/src/transform/primitive.rs:31`); these are pre-existing issues in the benches themselves, not the integration. They're listed in an `EXCLUDED_BENCHES` env in both workflows so the remaining ~78 shards run clean. Each excluded target should be fixed and removed from the list one by one.

Prerequisites for activation
============================

- Install the CodSpeed GitHub App on `apache/arrow-rs`: https://github.com/apps/codspeed
- Enroll the repository at https://codspeed.io (the OIDC integration is automatic for public repos; no secret token configuration required)

Once both are done, the first push to main will populate the baseline and PRs labeled `bench:*` will receive automated CodSpeed comparison comments.

…anifests

Replace the per-workflow awk/jq Cargo.toml parsing and hardcoded crate lists with a single shared script (.github/workflows/codspeed-matrix.sh) that discovers every [[bench]] target across the workspace via `cargo metadata`. New crates and bench targets are picked up automatically.

The known-broken exclusion list, previously duplicated as an EXCLUDED_BENCHES env in both workflows, now lives next to each bench in its crate's Cargo.toml:

    [package.metadata.codspeed.benches]
    merge_kernels = { skip = true }

cargo surfaces that table at .packages[].metadata.codspeed.benches and the script drops any target flagged `skip = true`.

Unknown `bench:<crate>` label suffixes now error against the workspace member list rather than a loose regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Jefffrey

regarding the failing benches, i have a PR to fix one of them:

#10199

for the others i wasn't able to reproduce a failure locally; does runtime error potentially indicate the benchmarks took too long to run?

for adding the app, it looks like opendal has done this so we can follow their lead:

apache/opendal#5280

fyi looks like there was a previous attempt to add to arrow-rs here:

#6150

but i guess we were stuck waiting on infra, which should be resolved now

Jefffrey · 2026-06-23T13:50:34Z

+# Fork PR caveat: workflows triggered by `pull_request` from fork PRs
+# do not get an OIDC token. For benches on fork PRs, push the branch
+# to this repo and label it there.


i suppose this is an important caveat considering essentially all our PRs come from forks 🤔

Jefffrey · 2026-06-23T13:51:45Z

+# Continuous benchmarking on CodSpeed.
+#
+# Runs the full workspace bench suite on every push to main, sharded
+# one job per `[[bench]]` target. Sharding at this granularity is
+# required because both crate-level shards (e.g. parquet alone has 16
+# bench targets producing >1000 benchmarks) and the workspace as a
+# whole exceed CodSpeed's 1000-benchmark per-upload limit. Jobs in the
+# same workflow are auto-aggregated by CodSpeed into one report.
+# https://codspeed.io/docs/features/sharded-benchmarks


I do wonder how heavy this would be to run entire benchmark suite on every commit 🤔

Jefffrey · 2026-06-23T13:52:04Z

+      matrix:
+        config: ${{ fromJson(needs.setup.outputs.matrix) }}
+    steps:
+      - uses: actions/checkout@v6


minor note: we should change the versions to hashes, but can do in a followup

…#10199) As identified by #9975, this benchmark panicked when run because it passed a 1-length array as an `ArrayRef` instead of a `Scalar`, so indexing the 2nd/3rd element caused an out-of-bounds panic. When wrapped in `Scalar`, the single element is repeated, so indexing no longer goes out of bounds and simply returns the first element.

adriangb marked this pull request as ready for review May 26, 2026 23:24

adriangb and others added 2 commits May 26, 2026 18:24

adriangb force-pushed the codspeed-integration branch from 191da5f to c4e266f Compare May 26, 2026 23:25

github-actions Bot added the parquet (Changes to the parquet crate), arrow (Changes to the arrow crate), and parquet-variant (parquet-variant* crates) labels May 26, 2026

Jefffrey mentioned this pull request Jun 23, 2026

Fix merge_kernels benchmark panic due to not wrapping with Scalar #10199

Merged

Jefffrey reviewed Jun 23, 2026

adriangb wants to merge 2 commits into apache:main from pydantic:codspeed-integration

adriangb commented May 14, 2026 •

edited by Jefffrey
