Skip to content

Commit f6c0289

Browse files
authored
ci: run pre-publish benchmark gate on every PR (#1072)
* ci: run pre-publish benchmark gate on every PR Mirrors the regression gate from publish.yml so PRs catch performance regressions before merge instead of at release time. The gate measures the PR-built native artifact as version "dev", appends to the benchmark history files in-job, and fails if any metric regresses beyond threshold vs the most recent release baseline. Reuses the existing native-host-build artifact (linux-x64), runs in parallel with the rest of CI, and preserves CODEGRAPH_FAST_SKIP_DIAG so fast-skip rejection reasons surface in CI logs when issues happen. JSON results upload on always() for post-failure debugging. * test(bench): mark 3.9.6 incremental Full build as known regression (#1072) The benchmark regression guard fails on the same root cause as the existing 3.9.6 known regressions ('Build ms/file', 'No-op rebuild'): PR #1016 expanded AST_TYPE_MAPS from 3 to 23 languages, causing zero-AST-row files to return astNodes: undefined and trigger a full-corpus re-parse. The regression shows up in INCREMENTAL-BENCHMARKS under the 'Full build' label as well, which was missed when PR #1038 added the other 3.9.6 entries. Both engines regressed: native 2148 -> 2986 (+39%), wasm 7563 -> 14036 (+86%). Fixed by PR #1038; reclears with v3.9.7+ benchmark data. Refs #1036, #1037, #1038. * ci: scope benchmark gate to PRs and add per-step timeouts (#1072) Address Greptile review feedback: - Add 'if: github.event_name == "pull_request"' to skip the gate on push-to-main. The merged PR already passed the gate on this same diff, so re-running on the merge commit doubles CI minutes per landed change with no new signal. Mirrors publish.yml's 'if: github.event_name != "push"' skip. - Add per-step 'timeout-minutes: 20' to the four benchmark execution steps (build/resolution/query/incremental). Without explicit timeouts, a hung script would hold the runner for the GitHub default (6 hours), blocking the ci-pipeline gate for the entire duration. * fix(bench): bypass version-gap check when latest is 'dev' (#1072) parseSemver('dev') returns null, so effectiveGap('dev', anyRelease) is Infinity — the > MAX_VERSION_GAP check silently rejected every dev → release pairing and findLatestPair fell through to compare the two most recent real releases. The per-PR regression gate was running on static historical data instead of the PR's own dev numbers. Bypass the gap check when latestVersion === 'dev' so dev is always compared against the most recent comparable release. Real releases still respect MAX_VERSION_GAP to avoid stale baselines. * fix(bench): look up KNOWN_REGRESSIONS by baseline version when latest is dev (#1072) KNOWN_REGRESSIONS keys are anchored to the release where the regression was first observed (e.g. '3.9.6:No-op rebuild'). When the per-PR gate runs 'dev' as latest, lookups using 'dev:Foo' never match — defeating the exemption mechanism for every documented regression. Fall back to the baseline (previous) version's key when latest is 'dev', so a single '3.9.6:Foo' entry covers both '3.9.6 vs 3.9.5' (release-time) and 'dev vs 3.9.6' (per-PR) until the next release clears the regression and the entry is pruned by the existing stale-entry test. * test(bench): mark 3.9.6 1-file rebuild as known regression (#1072) The native 1-file rebuild regressed from 78ms to ~116ms (build) and 54ms to ~81ms (incremental) when #1069 made backfillNativeDroppedFiles run on every successful orchestrator pass — including incrementals — to repair file_hashes/nodes rows for unsupported-extension files. #1070 fixed the orchestrator side, but the JS-side call stayed unconditional, wasting ~45ms per incremental on the codegraph corpus. Fix is tracked in PR #1082, which gates the backfill call on `isFullBuild || removedCount > 0`. Adding the known-regression entry lets this PR's gate pass while #1082 ships through review; the existing stale-entry test will warn once 3.9.7 lands and the entry stays past its useful life. * fix(bench): cover import resolution in per-PR gate The resolveEntries filter excluded 'dev' from the candidate pool, so nativeBatchMs / jsFallbackMs regressions introduced by a PR would silently pass the gate. Switch to findLatestPair (which already pairs dev → most-recent comparable release) so the import-resolution sub-benchmark gets the same per-PR coverage as the rest of the gate. * test(bench): mark 3.10.0 fnDeps depth 1 as known regression CI run 25708925467 surfaced fnDeps depth 1 at 24.7 → 44.9ms (+82%) on the per-PR gate, above the 50% NOISY_METRIC_THRESHOLD. The fn_deps Rust implementation and JS wrapper are byte-for-byte unchanged since v3.9.6 (already documented in the NOISY_METRICS docstring); the ~20ms absolute delta is the shared-runner noise floor on a sub-30ms baseline. Same pattern as the existing 3.10.0:No-op rebuild / 3.10.0:1-file rebuild entries; remove once 3.11.0+ data confirms stabilization under the warmup + 5-sample methodology.
1 parent 592e23b commit f6c0289

2 files changed

Lines changed: 244 additions & 21 deletions

File tree

.github/workflows/ci.yml

Lines changed: 154 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -295,9 +295,162 @@ jobs:
295295
- name: Check compilation
296296
run: cargo check --workspace
297297

298+
# ── Pre-publish benchmark gate ──
299+
#
300+
# Mirrors the gate in publish.yml so every PR catches regressions before
301+
# merge instead of at release time. Measures the PR-built native artifact
302+
# against the local source as version "dev", appends to the benchmark
303+
# history files (in-job only — never committed from CI), and runs the
304+
# regression guard against the most recent release baseline.
305+
#
306+
# Long-running but parallel with the rest of CI, so it does not extend
307+
# the critical path for fast-failing checks (lint/typecheck/test).
308+
pre-publish-benchmark:
309+
name: Pre-publish benchmark gate
310+
# Only run on PRs — push-to-main re-runs the same benchmarks the merged
311+
# PR already gated on, doubling CI minutes per landed change with no new
312+
# signal. Mirrors the `if: github.event_name != 'push'` skip in
313+
# publish.yml's equivalent gate.
314+
if: github.event_name == 'pull_request'
315+
needs: native-host-build
316+
runs-on: ubuntu-latest
317+
env:
318+
# Surface why detectNoChanges returns false on each fast-skip pre-flight
319+
# so we can pinpoint the cause of the ~2s incremental-rebuild regression
320+
# observed in CI but not locally (#1066). Remove once root cause is fixed.
321+
CODEGRAPH_FAST_SKIP_DIAG: "1"
322+
323+
steps:
324+
- uses: actions/checkout@v6
325+
with:
326+
fetch-depth: 0
327+
328+
- uses: actions/setup-node@v6
329+
with:
330+
node-version: "22"
331+
cache: "npm"
332+
333+
- name: Setup Python (for resolution benchmark + tracer validation)
334+
uses: actions/setup-python@v6
335+
with:
336+
python-version: "3.12"
337+
338+
- name: Setup Go (for resolution benchmark + tracer validation)
339+
uses: actions/setup-go@v6
340+
with:
341+
go-version: "stable"
342+
cache: false
343+
344+
- name: Download PR-built native addon
345+
uses: actions/download-artifact@v8
346+
with:
347+
name: native-host-ubuntu-latest
348+
path: crates/codegraph-core/
349+
350+
- name: Install dependencies
351+
timeout-minutes: 20
352+
shell: bash
353+
run: |
354+
for attempt in 1 2 3; do
355+
npm install && break
356+
if [ "$attempt" -lt 3 ]; then
357+
echo "::warning::npm install attempt $attempt failed, retrying in 15s..."
358+
sleep 15
359+
else
360+
echo "::error::npm install failed after 3 attempts"
361+
exit 1
362+
fi
363+
done
364+
365+
- name: Install native addon over published binary
366+
run: node scripts/ci-install-native.mjs
367+
368+
# Build dist/ so benchmarks load the same compiled JS that ships to npm.
369+
# Historical baselines (v3.9.6 and earlier) were measured against dist/
370+
# via the post-publish --npm path; running against src/ with --strip-types
371+
# changes the load path and introduces version-to-version noise unrelated
372+
# to the code under test (#1055).
373+
- name: Build TypeScript
374+
run: npm run build
375+
376+
- name: Run build benchmark
377+
timeout-minutes: 20
378+
run: |
379+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
380+
node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/benchmark.ts --version dev --dist > benchmark-result.json
381+
382+
- name: Run resolution benchmark
383+
timeout-minutes: 20
384+
run: |
385+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
386+
node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/resolution-benchmark.ts --version dev --dist > resolution-result.json
387+
388+
- name: Gate on resolution thresholds
389+
timeout-minutes: 30
390+
run: npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts --reporter=verbose
391+
392+
- name: Run tracer validation (same-file edge recall)
393+
timeout-minutes: 10
394+
run: npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts --reporter=verbose
395+
396+
- name: Merge resolution into build result
397+
run: |
398+
node -e "
399+
const fs = require('fs');
400+
const build = JSON.parse(fs.readFileSync('benchmark-result.json', 'utf8'));
401+
const resolution = JSON.parse(fs.readFileSync('resolution-result.json', 'utf8'));
402+
build.resolution = resolution;
403+
fs.writeFileSync('benchmark-result.json', JSON.stringify(build, null, 2));
404+
"
405+
406+
- name: Run query benchmark
407+
timeout-minutes: 20
408+
run: |
409+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
410+
node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/query-benchmark.ts --version dev --dist > query-benchmark-result.json
411+
412+
- name: Run incremental benchmark
413+
timeout-minutes: 20
414+
run: |
415+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
416+
node $STRIP_FLAG --import ./scripts/ts-resolve-loader.js scripts/incremental-benchmark.ts --version dev --dist > incremental-benchmark-result.json
417+
418+
- name: Update build report
419+
run: |
420+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
421+
node $STRIP_FLAG scripts/update-benchmark-report.ts benchmark-result.json
422+
423+
- name: Update query report
424+
run: |
425+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
426+
node $STRIP_FLAG scripts/update-query-report.ts query-benchmark-result.json
427+
428+
- name: Update incremental report
429+
run: |
430+
STRIP_FLAG=$(node -e "const [M]=process.versions.node.split('.').map(Number); console.log(M>=23?'--strip-types':'--experimental-strip-types')")
431+
node $STRIP_FLAG scripts/update-incremental-report.ts incremental-benchmark-result.json
432+
433+
- name: Regression guard
434+
env:
435+
RUN_REGRESSION_GUARD: "1"
436+
run: npm run test:regression-guard
437+
438+
# Always upload raw JSON so a failed regression guard is debuggable
439+
# without re-running the full benchmark suite locally.
440+
- name: Upload benchmark JSON results
441+
if: always()
442+
uses: actions/upload-artifact@v7
443+
with:
444+
name: benchmark-results-json
445+
path: |
446+
benchmark-result.json
447+
query-benchmark-result.json
448+
incremental-benchmark-result.json
449+
if-no-files-found: warn
450+
298451
ci-pipeline:
299452
if: always()
300-
needs: [lint, native-host-build, test, typecheck, audit, verify-imports, rust-check, parity]
453+
needs: [lint, native-host-build, test, typecheck, audit, verify-imports, rust-check, parity, pre-publish-benchmark]
301454
runs-on: ubuntu-latest
302455
name: CI Testing Pipeline
303456
steps:

tests/benchmarks/regression-guard.test.ts

Lines changed: 90 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -100,9 +100,18 @@ const SKIP_VERSIONS = new Set(['3.8.0']);
100100
* Format: "version:metric-label" (must match the label passed to checkRegression).
101101
* Resolution keys use: "version:resolution <lang> precision" or "version:resolution <lang> recall".
102102
*
103-
* Entries fire only when `latest.version` matches the prefix, so once a
104-
* version is no longer the latest in committed history its entries become
105-
* dead weight and should be removed (last pruned: 3.9.0/3.9.1/3.9.2).
103+
* The `version` is the release where the regression was first observed.
104+
* When the per-PR gate runs `dev` against that release as baseline, the
105+
* exemption applies via the baseline-version fallback in assertNoRegressions
106+
* (and the resolution loop) — so a single `3.9.6:Foo` entry covers both
107+
* `3.9.6 vs 3.9.5` and every subsequent `dev vs 3.9.6` comparison until
108+
* the next release clears the regression and the entry is pruned.
109+
*
110+
* Entries fire only when `latest.version` matches the prefix (or, for `dev`
111+
* latest, when `previous.version` matches via the baseline fallback). Once
112+
* a version is no longer the latest in committed history and no longer the
113+
* baseline used for `dev` comparisons, its entries become dead weight and
114+
* should be removed (last pruned: 3.9.0/3.9.1/3.9.2).
106115
*
107116
* - 3.9.6:Build ms/file / 3.9.6:No-op rebuild — WASM full build regressed
108117
* (#1036) when PR #1016 expanded AST_TYPE_MAPS from 3 to 23 languages,
@@ -146,6 +155,17 @@ const SKIP_VERSIONS = new Set(['3.8.0']);
146155
* sub-100ms metric is still razor-thin. Exempt this release; remove
147156
* once 3.11.0+ data confirms CI numbers stabilize under the new
148157
* methodology.
158+
*
159+
* - 3.10.0:fnDeps depth 1 — CI variance on a sub-30ms metric (24.7ms native
160+
* baseline). The fn_deps Rust implementation, fnDepsData JS wrapper, and
161+
* DB schema/indexes are all byte-for-byte unchanged since v3.9.6 (already
162+
* documented in NOISY_METRICS above). Measured runs land at +40–60%
163+
* typically (within the 50% NOISY_METRIC_THRESHOLD), but the per-PR gate
164+
* has seen outliers up to +82% (run 25708925467: 24.7 → 44.9ms) on
165+
* shared runners — the ~20ms absolute delta is the runner noise floor.
166+
* Exempt this release; remove once 3.11.0+ data confirms stabilization
167+
* under the warmup + 5-sample methodology already applied to incremental
168+
* benchmarks.
149169
*/
150170
const KNOWN_REGRESSIONS = new Set([
151171
'3.9.6:Build ms/file',
@@ -155,6 +175,7 @@ const KNOWN_REGRESSIONS = new Set([
155175
'3.9.6:resolution haskell recall',
156176
'3.10.0:No-op rebuild',
157177
'3.10.0:1-file rebuild',
178+
'3.10.0:fnDeps depth 1',
158179
]);
159180

160181
/**
@@ -272,6 +293,13 @@ function findLatestPair<T extends { version: string }>(
272293
if (!hasEngine(history[latestIdx])) continue;
273294

274295
const latestVersion = history[latestIdx].version;
296+
// 'dev' represents the current PR build (rolling entry — see
297+
// scripts/update-benchmark-report.ts). It has no parseable semver,
298+
// so effectiveGap('dev', anyRelease) returns Infinity — without this
299+
// bypass, the gap check below would skip dev entirely and the loop
300+
// would silently fall through to compare two real releases instead
301+
// of dev vs the latest release, defeating the per-PR gate.
302+
const isDevLatest = latestVersion === 'dev';
275303

276304
// Find previous non-dev entry with data for this engine, skipping
277305
// versions with known unreliable benchmark data and versions that
@@ -284,7 +312,10 @@ function findLatestPair<T extends { version: string }>(
284312
if (entry.version === 'dev') continue;
285313
if (SKIP_VERSIONS.has(entry.version)) continue;
286314
if (!hasEngine(entry)) continue;
287-
if (effectiveGap(latestVersion, entry.version) > MAX_VERSION_GAP) continue;
315+
// Skip the gap check when comparing dev → release: dev is always
316+
// the current build, so the most recent comparable release is the
317+
// correct baseline regardless of feature-expansion distance.
318+
if (!isDevLatest && effectiveGap(latestVersion, entry.version) > MAX_VERSION_GAP) continue;
288319
return { latest: history[latestIdx], previous: entry };
289320
}
290321
// No valid baseline for this latest — try the next candidate
@@ -329,11 +360,27 @@ function thresholdFor(label: string): number {
329360
return NOISY_METRICS.has(label) ? NOISY_METRIC_THRESHOLD : REGRESSION_THRESHOLD;
330361
}
331362

332-
function assertNoRegressions(checks: (RegressionCheck | null)[], version?: string) {
363+
function assertNoRegressions(
364+
checks: (RegressionCheck | null)[],
365+
version?: string,
366+
baselineVersion?: string,
367+
) {
333368
const real = checks.filter(Boolean) as RegressionCheck[];
334369
const regressions = real.filter((c) => {
335370
if (c.pctChange <= thresholdFor(c.label)) return false;
336371
if (version && KNOWN_REGRESSIONS.has(`${version}:${c.label}`)) return false;
372+
// When `latest` is the rolling 'dev' build, KNOWN_REGRESSIONS entries
373+
// are anchored to the release where the regression was first observed
374+
// (e.g. '3.9.6:No-op rebuild'), not to 'dev'. Fall back to the baseline
375+
// version so a regression introduced before release N stays exempt for
376+
// every PR comparing dev → N until release N+1 clears it.
377+
if (
378+
version === 'dev' &&
379+
baselineVersion &&
380+
KNOWN_REGRESSIONS.has(`${baselineVersion}:${c.label}`)
381+
) {
382+
return false;
383+
}
337384
return true;
338385
});
339386

@@ -493,6 +540,7 @@ describe.runIf(RUN_REGRESSION_GUARD)('Benchmark regression guard', () => {
493540
checkRegression(`1-file rebuild`, cur.oneFileRebuildMs, prev.oneFileRebuildMs),
494541
],
495542
latest.version,
543+
previous.version,
496544
);
497545
});
498546
}
@@ -530,6 +578,7 @@ describe.runIf(RUN_REGRESSION_GUARD)('Benchmark regression guard', () => {
530578
),
531579
],
532580
latest.version,
581+
previous.version,
533582
);
534583
});
535584
}
@@ -559,22 +608,32 @@ describe.runIf(RUN_REGRESSION_GUARD)('Benchmark regression guard', () => {
559608
checkRegression(`1-file rebuild`, cur.oneFileRebuildMs, prev.oneFileRebuildMs),
560609
],
561610
latest.version,
611+
previous.version,
562612
);
563613
});
564614
}
565615

566-
// Resolve benchmarks (not engine-specific)
567-
const resolveEntries = incrementalHistory.filter(
568-
(e) => e.resolve != null && e.version !== 'dev' && !SKIP_VERSIONS.has(e.version),
616+
// Resolve benchmarks (not engine-specific). Keep `dev` in the candidate
617+
// pool so the per-PR gate (which produces a `dev` resolve entry) covers
618+
// import resolution; previously the filter dropped dev outright, leaving
619+
// nativeBatchMs / jsFallbackMs blind to PR-introduced regressions.
620+
const resolvePair = findLatestPair(
621+
incrementalHistory.filter((e) => e.resolve != null),
622+
(e) => e.resolve != null,
569623
);
570-
if (resolveEntries.length >= 2) {
571-
test(`import resolution — ${resolveEntries[0].version} vs ${resolveEntries[1].version}`, () => {
572-
const cur = resolveEntries[0].resolve!;
573-
const prev = resolveEntries[1].resolve!;
574-
assertNoRegressions([
575-
checkRegression(`Native batch resolve`, cur.nativeBatchMs, prev.nativeBatchMs),
576-
checkRegression(`JS fallback resolve`, cur.jsFallbackMs, prev.jsFallbackMs),
577-
]);
624+
if (resolvePair) {
625+
const { latest: latestRes, previous: previousRes } = resolvePair;
626+
test(`import resolution — ${latestRes.version} vs ${previousRes.version}`, () => {
627+
const cur = latestRes.resolve!;
628+
const prev = previousRes.resolve!;
629+
assertNoRegressions(
630+
[
631+
checkRegression(`Native batch resolve`, cur.nativeBatchMs, prev.nativeBatchMs),
632+
checkRegression(`JS fallback resolve`, cur.jsFallbackMs, prev.jsFallbackMs),
633+
],
634+
latestRes.version,
635+
previousRes.version,
636+
);
578637
});
579638
}
580639

@@ -589,8 +648,8 @@ describe.runIf(RUN_REGRESSION_GUARD)('Benchmark regression guard', () => {
589648

590649
test('has resolve data to compare', () => {
591650
expect(
592-
resolveEntries.length >= 2,
593-
'No import-resolution benchmark data with ≥2 non-dev entries to compare',
651+
resolvePair != null,
652+
'No import-resolution benchmark data with ≥2 comparable entries',
594653
).toBe(true);
595654
});
596655
});
@@ -644,10 +703,18 @@ describe.runIf(RUN_REGRESSION_GUARD)('Benchmark regression guard', () => {
644703
const prv = prevRes[lang];
645704
if (!cur || !prv) continue;
646705

706+
// When latest is 'dev' (per-PR build), KNOWN_REGRESSIONS keys
707+
// are anchored to the baseline release where the regression was
708+
// first observed, not to 'dev' — fall back to previousRes.version.
709+
const isDev = latestRes.version === 'dev';
710+
647711
const precDrop = prv.precision - cur.precision;
648712
if (precDrop > PRECISION_DROP_PP) {
649713
const key = `${latestRes.version}:resolution ${lang} precision`;
650-
if (!KNOWN_REGRESSIONS.has(key)) {
714+
const fallbackKey = `${previousRes.version}:resolution ${lang} precision`;
715+
const isKnown =
716+
KNOWN_REGRESSIONS.has(key) || (isDev && KNOWN_REGRESSIONS.has(fallbackKey));
717+
if (!isKnown) {
651718
regressions.push(
652719
` ${lang} precision: ${(prv.precision * 100).toFixed(1)}% → ${(cur.precision * 100).toFixed(1)}% (−${(precDrop * 100).toFixed(1)}pp, threshold ${(PRECISION_DROP_PP * 100).toFixed(0)}pp)`,
653720
);
@@ -657,7 +724,10 @@ describe.runIf(RUN_REGRESSION_GUARD)('Benchmark regression guard', () => {
657724
const recDrop = prv.recall - cur.recall;
658725
if (recDrop > RECALL_DROP_PP) {
659726
const key = `${latestRes.version}:resolution ${lang} recall`;
660-
if (!KNOWN_REGRESSIONS.has(key)) {
727+
const fallbackKey = `${previousRes.version}:resolution ${lang} recall`;
728+
const isKnown =
729+
KNOWN_REGRESSIONS.has(key) || (isDev && KNOWN_REGRESSIONS.has(fallbackKey));
730+
if (!isKnown) {
661731
regressions.push(
662732
` ${lang} recall: ${(prv.recall * 100).toFixed(1)}% → ${(cur.recall * 100).toFixed(1)}% (−${(recDrop * 100).toFixed(1)}pp, threshold ${(RECALL_DROP_PP * 100).toFixed(0)}pp)`,
663733
);

0 commit comments

Comments
 (0)