Skip to content

Commit 4b0af6c

Browse files
ci: Fix bencher.dev thresholds (#459)
1 parent c88df18 commit 4b0af6c

4 files changed

Lines changed: 164 additions & 67 deletions

File tree

Lines changed: 44 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,18 @@
11
name: Track benchmarks on bencher
22
description: >-
3-
Run `bencher run` for the `ix` project, windowing deterministic "stepping"
4-
measures to commits since the last lean-toolchain bump (and dropping them on
5-
the bump commit via --thresholds-reset, so a toolchain step lands without a
6-
spurious alert). The calling job must check out with fetch-depth: 0.
3+
Run `bencher run` for the `ix` project. Every measure's baseline is windowed
4+
to the data points recorded since the workload's `baseline-reset-<workload>`
5+
tag, so a toolchain bump (or any permanent shift) is recalibrated by moving
6+
that one tag — via the !bencher-thresholds-reset PR comment or the
7+
bench-thresholds-reset workflow. The calling job must check out with
8+
fetch-depth: 0 and fetch-tags: true (the tag is read here).
79
810
inputs:
911
testbed:
10-
description: Bencher testbed name.
12+
description: Bencher testbed slug.
13+
required: true
14+
workload:
15+
description: Workload key for the `baseline-reset-<workload>` tag (e.g. ix-compile, aiur).
1116
required: true
1217
file:
1318
description: Bencher Metric Format JSON file to upload.
@@ -18,56 +23,52 @@ inputs:
1823
github-token:
1924
description: GITHUB_TOKEN, for the --github-actions comment.
2025
required: true
21-
always-thresholds:
22-
description: >-
23-
Raw `--threshold-measure …` flags for measures tracked on every commit
24-
(non-stepping). Word-split, so keep each flag a single token.
25-
required: false
26-
default: ""
27-
stepping-measures:
26+
thresholds:
2827
description: >-
29-
Space-separated measure names that step on a toolchain bump (e.g.
30-
"file-size constants" or "fft-cost"). Windowed to commits-since-bump and
31-
omitted on the bump commit.
32-
required: false
33-
default: ""
28+
`--threshold-*` flags for every tracked measure. Use the literal token
29+
`__WINDOW__` in place of each `--threshold-max-sample-size` value; it is
30+
replaced with the data-point count since the baseline anchor. Word-split,
31+
so keep each flag a single token.
32+
required: true
3433

3534
runs:
3635
using: composite
3736
steps:
38-
- uses: bencherdev/bencher@v0.6.7
37+
- uses: bencherdev/bencher@v0.6.8
3938
- shell: bash
4039
env:
4140
BENCHER_KEY: ${{ inputs.key }}
4241
GH_TOKEN: ${{ inputs.github-token }}
4342
TESTBED: ${{ inputs.testbed }}
43+
WORKLOAD: ${{ inputs.workload }}
4444
FILE: ${{ inputs.file }}
45-
ALWAYS_THRESHOLDS: ${{ inputs.always-thresholds }}
46-
STEPPING_MEASURES: ${{ inputs.stepping-measures }}
45+
THRESHOLDS: ${{ inputs.thresholds }}
4746
run: |
48-
# Sample only commits since lean-toolchain last changed, so a stepping
49-
# measure's baseline window never straddles a toolchain bump.
50-
last_bump=$(git log -1 --format=%H -- lean-toolchain)
51-
sample=$(git rev-list --count "${last_bump}..HEAD")
52-
[ "$sample" -gt 64 ] && sample=64
53-
echo "commits since last lean-toolchain bump (capped 64): $sample"
54-
# On the bump commit (sample=0) omit the stepping measures entirely;
55-
# --thresholds-reset then clears their models so the step doesn't alert.
56-
# Between bumps these measures are deterministic, so pin them exactly
57-
# (0% on both sides): any change in either direction is a real circuit
58-
# or closure change, not noise, and should alert.
59-
step_thresholds=()
60-
if [ "$sample" -gt 0 ]; then
61-
for m in $STEPPING_MEASURES; do
62-
step_thresholds+=(
63-
--threshold-measure "$m"
64-
--threshold-test percentage
65-
--threshold-max-sample-size "$sample"
66-
--threshold-upper-boundary 0
67-
--threshold-lower-boundary 0
68-
)
69-
done
47+
# Baseline window: data points for this benchmark since the workload's
48+
# `baseline-reset-<workload>` tag (moved on a toolchain bump or any
49+
# permanent shift). The same window applies to every measure. We count
50+
# actual data points, not commits, so the baseline never reaches past
51+
# the anchor (a failed run produces no metric, so a commit count would
52+
# overshoot). The matrix uploads one report per benchmark, so count this
53+
# benchmark's reports (its name is the upload's top-level key). The ix
54+
# project is public, so no token is needed. With no tag yet, fall back
55+
# to Bencher's 64-point window.
56+
sample=64
57+
if reset_epoch=$(git log -1 --format=%ct "baseline-reset-$WORKLOAD" 2>/dev/null); then
58+
bench=$(jq -r '[keys[]][0]' "$FILE")
59+
sample=$(curl -fsS "https://api.bencher.dev/v0/projects/ix/reports?branch=${GITHUB_REF_NAME}&testbed=${TESTBED}&start_time=$((reset_epoch * 1000))&per_page=255" \
60+
| jq --arg b "$bench" '[.[] | select([.results[]?[]?.benchmark.name] | index($b))] | length')
7061
fi
62+
sample=${sample:-0}
63+
[ "$sample" -gt 64 ] && sample=64
64+
echo "baseline window for '$WORKLOAD': $sample data point(s)"
65+
66+
# No data points since the reset → omit thresholds so the reset commit
67+
# lands without an alert (--thresholds-reset clears stale models); they
68+
# re-arm next run. Otherwise substitute the window into each measure.
69+
thresholds=()
70+
[ "$sample" -gt 0 ] && thresholds=( ${THRESHOLDS//__WINDOW__/$sample} )
71+
7172
bencher run \
7273
--project ix \
7374
--key "$BENCHER_KEY" \
@@ -76,7 +77,6 @@ runs:
7677
--testbed "$TESTBED" \
7778
--adapter json \
7879
--github-actions "$GH_TOKEN" \
79-
$ALWAYS_THRESHOLDS \
80-
"${step_thresholds[@]}" \
80+
"${thresholds[@]}" \
8181
--thresholds-reset \
8282
--file "$FILE"
Lines changed: 37 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,7 @@ jobs:
8181
- uses: actions/checkout@v6
8282
with:
8383
fetch-depth: 0
84+
fetch-tags: true # bencher-track reads the baseline-reset tag
8485
- uses: actions/cache/restore@v5
8586
with:
8687
path: ~/.local/bin
@@ -138,24 +139,30 @@ jobs:
138139
}
139140
}
140141
EOF
141-
# Upload compile metrics. file-size/constants step on a toolchain bump, so
142-
# they're the bump-windowed stepping measures; compile-time/throughput ride
143-
# their normal 64-window (throughput's regression is a drop, hence its
144-
# lower boundary).
142+
# Upload compile metrics. Every measure shares the per-workload baseline
143+
# window (data points since the ix-compile reset tag): file-size/constants
144+
# are deterministic, pinned exactly (0/0); compile-time rides a 5% upper
145+
# bound; throughput a 5% lower bound (its regression is a drop).
145146
- uses: ./.github/actions/bencher-track
146147
with:
147-
testbed: warp-ubuntu-x64-32x
148+
testbed: ix-compile-x64-32x
149+
workload: ix-compile
148150
file: benchmark.json
149151
key: ${{ secrets.BENCHER_API_KEY }}
150152
github-token: ${{ secrets.GITHUB_TOKEN }}
151-
stepping-measures: file-size constants
152-
always-thresholds: |
153+
thresholds: |
153154
--threshold-measure compile-time --threshold-test percentage
154-
--threshold-max-sample-size 64 --threshold-upper-boundary 0.05
155+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0.05
155156
--threshold-lower-boundary _
156157
--threshold-measure throughput --threshold-test percentage
157-
--threshold-max-sample-size 64 --threshold-upper-boundary _
158+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary _
158159
--threshold-lower-boundary 0.05
160+
--threshold-measure file-size --threshold-test percentage
161+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0
162+
--threshold-lower-boundary 0
163+
--threshold-measure constants --threshold-test percentage
164+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0
165+
--threshold-lower-boundary 0
159166
160167
# Restore each matrix job's `.ixe` from the cache and run the Aiur execute + prove
161168
# benchmark over selected constants. No compiler run here.
@@ -174,7 +181,8 @@ jobs:
174181
steps:
175182
- uses: actions/checkout@v6
176183
with:
177-
fetch-depth: 0 # full history for the toolchain-bump sample-size query
184+
fetch-depth: 0 # full history for the baseline-anchor lookup
185+
fetch-tags: true # bencher-track reads the baseline-reset tag
178186
- uses: actions/cache/restore@v5
179187
with:
180188
path: ~/.local/bin
@@ -230,30 +238,37 @@ jobs:
230238
map_values(to_entries | map({(.key): {value: .value}}) | add)
231239
' results.json > aiur.json
232240
cat aiur.json
233-
# Upload Aiur metrics. fft-cost and constants are deterministic and step on a
234-
# toolchain bump (a different stdlib changes the constants' circuits and
235-
# closure sizes), so they're the bump-windowed stepping measures — same
236-
# treatment compile gives file-size/constants. prove-time/execute-time,
241+
# Upload Aiur metrics. Every measure shares the per-workload baseline
242+
# window (data points since the aiur reset tag). constants is deterministic
243+
# → pinned exactly (0/0). fft-cost is deterministic but only ever drops on
244+
# a real Aiur win, so it rides an upper-only 5% bound (flag a regression,
245+
# let wins through) rather than a hard pin. prove-time/execute-time,
237246
# peak-rss (texray's proving RSS high-water mark), and throughput
238-
# (constants/prove-time, like compile's, where a drop is the regression)
239-
# are noisy wall-clock and ride their normal 64-window.
247+
# (constants/prove-time, where a drop is the regression) are noisy
248+
# wall-clock and ride percentage bounds.
240249
- uses: ./.github/actions/bencher-track
241250
with:
242251
testbed: aiur-typecheck-x64-32x
252+
workload: aiur
243253
file: aiur.json
244254
key: ${{ secrets.BENCHER_API_KEY }}
245255
github-token: ${{ secrets.GITHUB_TOKEN }}
246-
stepping-measures: fft-cost constants
247-
always-thresholds: |
256+
thresholds: |
257+
--threshold-measure constants --threshold-test percentage
258+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0
259+
--threshold-lower-boundary 0
260+
--threshold-measure fft-cost --threshold-test percentage
261+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0.05
262+
--threshold-lower-boundary _
248263
--threshold-measure prove-time --threshold-test percentage
249-
--threshold-max-sample-size 64 --threshold-upper-boundary 0.10
264+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0.10
250265
--threshold-lower-boundary _
251266
--threshold-measure execute-time --threshold-test percentage
252-
--threshold-max-sample-size 64 --threshold-upper-boundary 0.10
267+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0.10
253268
--threshold-lower-boundary _
254269
--threshold-measure peak-rss --threshold-test percentage
255-
--threshold-max-sample-size 64 --threshold-upper-boundary 0.10
270+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary 0.10
256271
--threshold-lower-boundary _
257272
--threshold-measure throughput --threshold-test percentage
258-
--threshold-max-sample-size 64 --threshold-upper-boundary _
273+
--threshold-max-sample-size __WINDOW__ --threshold-upper-boundary _
259274
--threshold-lower-boundary 0.10
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
name: Bencher thresholds reset
2+
3+
# Move `baseline-reset-<workload>` tag(s) so bencher-track re-windows those
4+
# measures from a chosen commit — after a toolchain bump or any permanent shift.
5+
# Other workloads are unaffected; historical data stays in Bencher, only the
6+
# alert baseline is re-anchored. Two ways to trigger:
7+
#
8+
# - Merged PR carrying a maintainer comment `!bencher-thresholds-reset <workload|all>`:
9+
# fires automatically on merge, anchored at the merge commit — no need to
10+
# race a manual run afterward. A bare command (no workload) is ignored.
11+
# - workflow_dispatch: reset the chosen workload at a given commit (default
12+
# HEAD); does not read PR contents. For ad-hoc or bootstrap resets.
13+
#
14+
# pull_request_target runs in the base-repo context with a write token (works
15+
# for fork PRs too). Safe here because the job runs no PR-provided code — only
16+
# `gh api` calls against the trusted event payload.
17+
on:
18+
workflow_dispatch:
19+
inputs:
20+
workload:
21+
description: Workload baseline to reset
22+
required: true
23+
type: choice
24+
options: [ix-compile, aiur, all]
25+
sha:
26+
description: "Commit to anchor to (default: HEAD)"
27+
required: false
28+
pull_request_target:
29+
types: [closed]
30+
31+
permissions:
32+
contents: write
33+
pull-requests: write
34+
35+
jobs:
36+
reset:
37+
if: github.event_name == 'workflow_dispatch' || github.event.pull_request.merged == true
38+
runs-on: ubuntu-latest
39+
env:
40+
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
41+
REPO: ${{ github.repository }}
42+
EVENT: ${{ github.event_name }}
43+
INPUT_WORKLOAD: ${{ inputs.workload }}
44+
INPUT_SHA: ${{ inputs.sha }}
45+
HEAD_SHA: ${{ github.sha }}
46+
PR: ${{ github.event.pull_request.number }}
47+
MERGE_SHA: ${{ github.event.pull_request.merge_commit_sha }}
48+
steps:
49+
- run: |
50+
valid="ix-compile aiur"
51+
if [ "$EVENT" = workflow_dispatch ]; then
52+
# Reset the chosen workload(s) at the given commit; no PR scan.
53+
sha="${INPUT_SHA:-$HEAD_SHA}"
54+
[ "$INPUT_WORKLOAD" = all ] && workloads="$valid" || workloads="$INPUT_WORKLOAD"
55+
else
56+
# Reset workloads named in maintainer `!bencher-thresholds-reset`
57+
# comments, anchored at the merge commit. Arg required, so a bare
58+
# command matches nothing and is ignored.
59+
sha="$MERGE_SHA"
60+
workloads=$(gh api "repos/$REPO/issues/$PR/comments" --paginate \
61+
--jq '.[] | select(.author_association | IN("OWNER","MEMBER","COLLABORATOR")) | .body' \
62+
| grep -oiE '!bencher-thresholds-reset[[:space:]]+[a-z0-9 -]+' \
63+
| sed -E 's/^!bencher-thresholds-reset[[:space:]]+//I' \
64+
| tr '[:upper:] ' '[:lower:]\n' | sort -u)
65+
echo "$workloads" | grep -qx all && workloads="$valid"
66+
fi
67+
68+
done_list=""
69+
for w in $workloads; do
70+
case " $valid " in *" $w "*) ;; *) echo "skip unknown workload: $w"; continue ;; esac
71+
tag="baseline-reset-$w"
72+
echo "Anchoring $tag -> $sha"
73+
gh api -X PATCH "repos/$REPO/git/refs/tags/$tag" -f sha="$sha" -F force=true \
74+
|| gh api -X POST "repos/$REPO/git/refs" -f ref="refs/tags/$tag" -f sha="$sha"
75+
done_list="$done_list $w"
76+
done
77+
[ -z "$done_list" ] && { echo "Nothing to reset."; exit 0; }
78+
79+
# Confirm on the PR for the merge-triggered path.
80+
[ "$EVENT" = pull_request_target ] && gh pr comment "$PR" --repo "$REPO" \
81+
--body "♻️ Baseline reset to \`${sha:0:7}\` for:$done_list"
82+
true

Benchmarks/Typecheck.lean

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ limit; bound a run with an external `timeout` if needed.
5959
The JSON is a neutral, flat shape (`{ "<name>": { "constants": …, "fft-cost": …,
6060
"execute-time": …, "prove-time": …, "throughput": … } }`, where `prove-time` and
6161
`throughput` appear only for proven constants); any bencher-specific reshaping
62-
is the caller's job (see `.github/workflows/aiur-bench.yml`).
62+
is the caller's job (see `.github/workflows/bench-main.yml`).
6363
-/
6464

6565
open Lean (Json Name)

0 commit comments

Comments
 (0)