ci: Fix bencher.dev thresholds #459
Merged
arthurpaulino
approved these changes
Jun 30, 2026
Fixes erroneous alerts for the bencher.dev benchmarks on main, which were caused by a brittle mechanism for resetting thresholds between Lean toolchain bumps. This PR replaces it with a manual threshold reset action, triggered either by a `!bencher-thresholds-reset <ix-compile|aiur|all>` PR comment or by a manual `workflow_dispatch` action after merge.

These should be used whenever a PR is expected to set a new baseline, whether from a performance improvement (e.g. lowering Aiur FFTs) or a regression (e.g. more constants added to Mathlib on a toolchain bump). The baselines are split by workload, so changes to `ix-compile` need not affect the baseline for `aiur`. Currently all thresholds for a given workload are reset together, so any new metrics from the first few runs on a new baseline should be carefully reviewed for any performance changes.

Future work: Add Zisk and SP1 benchmarks to bencher.dev and integrate with the new threshold/alert system.
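
For illustration only, the PR-comment trigger described above could be parsed roughly as follows. This is a minimal sketch and not the workflow's actual implementation: the function name `parse_reset_comment`, the rule that the command must sit on its own line, and the expansion of `all` to exactly `ix-compile` and `aiur` are assumptions made for the example.

```python
import re

# Workload names come from the PR description; treating "all" as exactly
# these two workloads is an assumption of this sketch.
WORKLOADS = ("ix-compile", "aiur")

# Assumed convention: the command occupies a line of its own in the comment.
_COMMAND = re.compile(
    r"^!bencher-thresholds-reset[ \t]+(\S+)[ \t\r]*$", re.MULTILINE
)


def parse_reset_comment(body: str) -> list[str]:
    """Return the workloads whose thresholds a PR comment asks to reset.

    Returns an empty list when the comment has no reset command or names
    an unknown workload (hypothetical behaviour, chosen for the sketch).
    """
    match = _COMMAND.search(body)
    if match is None:
        return []
    target = match.group(1)
    if target == "all":
        return list(WORKLOADS)
    if target in WORKLOADS:
        return [target]
    return []
```

Returning a list of workloads rather than a single name mirrors the per-workload baselines: a reset for `aiur` leaves `ix-compile` untouched, while `all` resets both.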
Note
Before merge, we'll need to update the testbed for Ix compilation from `warp-ubuntu-x64-32x` to `ix-compile-x64-32x` via the bencher API/web console to ensure the history is preserved.