Skip to content

Commit cc330d6

Browse files
authored
internal(bench): Reduce benchmark variance for tighter CI results (#3880)
* internal(bench-react): Reduce benchmark variance for tighter CI results Tighten convergent config (15/10 warmup, 80/60 max iterations, 2%/3% CI targets), add Chromium stability flags, double-GC between scenarios with longer pauses, tune CI system (CPU governor, swap off, robust server wait). Made-with: Cursor * internal(bench): Add system tuning to Node benchmark CI Same CPU governor and swap tuning as bench-react for consistent results. Made-with: Cursor * internal(bench): Pin benchmarks to CPU cores via taskset Config tuning alone didn't reduce variance — CI runner noise from CPU migration and shared-infrastructure scheduling is the dominant factor. Pin benchmark processes to cores 0,1 via taskset to eliminate L1/L2 cache thrashing from core migration. Moderate warmup/iteration counts back to reasonable levels since extra iterations can't fix environmental noise. Made-with: Cursor
1 parent f57e925 commit cc330d6

6 files changed

Lines changed: 74 additions & 37 deletions

File tree

.cursor/rules/benchmarking.mdc

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -63,11 +63,11 @@ Use this mapping when deciding which React benchmark scenarios are relevant to a
6363

6464
| Category | Scenarios | Typical run-to-run spread |
6565
|---|---|---|
66-
| **Stable** | `getlist-*`, `update-entity`, `ref-stability-*` | 2–5% |
67-
| **Moderate** | `update-user-*`, `update-entity-sorted` | 5–10% |
68-
| **Volatile** | `memory-mount-unmount-cycle`, `startup-*`, `(react commit)` suffixes | 10–25% |
66+
| **Stable** | `getlist-*`, `update-entity`, `ref-stability-*` | <2% |
67+
| **Moderate** | `update-user-*`, `update-entity-sorted`, `update-entity-multi-view`, `move-item` | 2–4% |
68+
| **Volatile** | `memory-mount-unmount-cycle`, `startup-*`, `(react commit)` suffixes | 5–15% |
6969

70-
Regressions >5% on stable scenarios or >15% on volatile scenarios are worth investigating.
70+
CI convergence targets: 2% (small scenarios), 3% (large scenarios). Reported margins should not exceed 5%. Regressions >5% on stable scenarios or >10% on moderate scenarios are worth investigating.
7171

7272
### Profiling / tracing (opt + deopt investigation)
7373

.github/workflows/benchmark-react.yml

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -48,11 +48,22 @@ jobs:
4848
run: npx playwright install chromium --with-deps
4949
- name: Build packages
5050
run: yarn build:benchmark-react
51+
- name: Tune system for benchmarking
52+
run: |
53+
# Pin CPU governor to performance mode (reduces frequency scaling jitter)
54+
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
55+
echo performance | sudo tee "$gov" 2>/dev/null || true
56+
done
57+
# Disable swap to prevent memory pressure variance
58+
sudo swapoff -a || true
5159
- name: Run benchmark
5260
run: |
5361
yarn workspace example-benchmark-react preview &
54-
sleep 10
55-
cd examples/benchmark-react && yarn bench | tee react-bench-output.json
62+
for i in $(seq 1 30); do
63+
curl -sf http://localhost:5173/ > /dev/null && break
64+
sleep 1
65+
done
66+
cd examples/benchmark-react && taskset -c 0,1 yarn bench | tee react-bench-output.json
5667
5768
# PR comments on changes
5869
- name: Store benchmark result (PR)

.github/workflows/benchmark.yml

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,8 +44,16 @@ jobs:
4444
run: ./scripts/ci-install.sh examples/benchmark
4545
- name: Build packages
4646
run: yarn build:benchmark
47+
- name: Tune system for benchmarking
48+
run: |
49+
# Pin CPU governor to performance mode (reduces frequency scaling jitter)
50+
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
51+
echo performance | sudo tee "$gov" 2>/dev/null || true
52+
done
53+
# Disable swap to prevent memory pressure variance
54+
sudo swapoff -a || true
4755
- name: Run benchmark
48-
run: yarn workspace example-benchmark start | tee output.txt
56+
run: taskset -c 0,1 yarn workspace example-benchmark start | tee output.txt
4957

5058
# PR comments on changes
5159
- name: Store benchmark result (PR)

examples/benchmark-react/README.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ The repo has two benchmark suites:
1414
- **What we measure:** Wall-clock time from triggering an action (e.g. `init(100)` or `updateUser('user0')`) until a MutationObserver detects the expected DOM change in the benchmark container. Optionally we also record React Profiler commit duration and, with `BENCH_TRACE=true`, Chrome trace duration.
1515
- **Why:** Scenarios are chosen to exercise areas where caching strategies differ: shared-entity updates, referential stability, and derived-view memoization. See [js-framework-benchmark "How the duration is measured"](https://github.com/krausest/js-framework-benchmark/wiki/How-the-duration-is-measured) for a similar timeline-based approach.
1616
- **Statistical:** Warmup runs are discarded; we report median and 95% CI (as percentage of median). Timing scenarios (navigation and mutation) use **convergent mode**: a single page load per scenario, with warmup iterations followed by adaptive measurement iterations where each iteration produces one sample and convergence is checked inline. This eliminates page-reload overhead between samples for faster, lower-variance results. Deterministic scenarios (ref-stability) run once. Memory scenarios use a separate outer loop with a fresh page per round.
17-
- **No CPU throttling:** Runs at native speed with more samples for statistical significance rather than artificial slowdown. Convergent timing scenarios use 5 warmup + up to 50 measurement iterations (small) or 3 warmup + up to 40 (large). Early stopping triggers when 95% CI margin drops below the target percentage.
17+
- **No CPU throttling:** Runs at native speed with more samples for statistical significance rather than artificial slowdown. Convergent timing scenarios use 8 warmup + up to 60 measurement iterations (small) or 5 warmup + up to 50 (large). Early stopping triggers when 95% CI margin drops below the target percentage (2% small / 3% large in CI). CI pins the benchmark to dedicated CPU cores via `taskset` to reduce scheduling noise.
1818

1919
## Comparison philosophy
2020

@@ -98,11 +98,11 @@ Run: **2026-03-22**, Linux (WSL2), `yarn build:benchmark-react`, static preview
9898

9999
| Category | Scenarios | Typical run-to-run spread |
100100
|---|---|---|
101-
| **Stable** | `getlist-*`, `update-entity`, `update-entity-sorted`, `ref-stability-*` | 2-5% |
102-
| **Moderate** | `update-user-*`, `update-entity-multi-view`, `list-detail-switch-10` | 5-10% |
103-
| **Volatile** | `memory-mount-unmount-cycle`, `startup-*`, `(react commit)` suffixes | 10-25% |
101+
| **Stable** | `getlist-*`, `update-entity`, `update-entity-sorted`, `ref-stability-*` | <2% |
102+
| **Moderate** | `update-user-*`, `update-entity-multi-view`, `list-detail-switch-10`, `move-item` | 2-4% |
103+
| **Volatile** | `memory-mount-unmount-cycle`, `startup-*`, `(react commit)` suffixes | 5-15% |
104104

105-
Regressions >5% on stable scenarios or >15% on volatile scenarios are worth investigating.
105+
CI convergence targets: 2% (small scenarios), 3% (large scenarios). Reported margins should not exceed 5%. Regressions >5% on stable scenarios or >10% on moderate scenarios are worth investigating.
106106

107107
## Interpreting results
108108

@@ -197,9 +197,9 @@ Regressions >5% on stable scenarios or >15% on volatile scenarios are worth inve
197197

198198
Scenarios are classified as `small` or `large` based on their cost:
199199

200-
- **Small** (convergent: 5 warmup + 5–50 measurement iterations): `getlist-100`, `update-entity`, `invalidate-and-resolve`, `unshift-item`, `delete-item`
200+
- **Small** (convergent: 8 warmup + 10–60 measurement iterations): `getlist-100`, `update-entity`, `invalidate-and-resolve`, `unshift-item`, `delete-item`
201201
- **Small** (deterministic, single run): `ref-stability-*`
202-
- **Large** (convergent: 3 warmup + 5–40 measurement iterations): `getlist-500`, `getlist-500-sorted`, `update-user`, `update-user-10000`, `update-entity-sorted`, `update-entity-multi-view`, `list-detail-switch-10`
202+
- **Large** (convergent: 5 warmup + 10–50 measurement iterations): `getlist-500`, `getlist-500-sorted`, `update-user`, `update-user-10000`, `update-entity-sorted`, `update-entity-multi-view`, `list-detail-switch-10`
203203
- **Memory** (opt-in, 1 warmup + 3 measurement rounds): `memory-mount-unmount-cycle` — run with `--action memory`
204204

205205
Timing scenarios use convergent mode (single page load, inline convergence per scenario). Each group uses its own warmup/measurement config. Use `--size` to run only one group.

examples/benchmark-react/bench/runner.ts

Lines changed: 27 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -541,7 +541,7 @@ async function runScenario(
541541
// Convergent scenario runner (single page load, inline stat-sig convergence)
542542
// ---------------------------------------------------------------------------
543543

544-
const CONVERGENT_GC_INTERVAL = 15;
544+
const CONVERGENT_GC_INTERVAL = 8;
545545

546546
async function runScenarioConvergent(
547547
page: Page,
@@ -575,8 +575,12 @@ async function runScenarioConvergent(
575575
const isWarmup = subIdx < config.warmup;
576576
const measureIdx = subIdx - config.warmup;
577577

578-
// Periodic GC to prevent heap pressure accumulation on long runs
578+
// Periodic double-GC to prevent heap pressure accumulation on long runs
579579
if (cdp && subIdx > 0 && subIdx % CONVERGENT_GC_INTERVAL === 0) {
580+
try {
581+
await cdp.send('HeapProfiler.collectGarbage');
582+
} catch {}
583+
await page.waitForTimeout(30);
580584
try {
581585
await cdp.send('HeapProfiler.collectGarbage');
582586
} catch {}
@@ -667,7 +671,7 @@ async function launchBenchChromium(): Promise<{
667671
}> {
668672
const launchOpts = {
669673
headless: true,
670-
args: buildV8LaunchArgs(),
674+
args: buildLaunchArgs(),
671675
};
672676

673677
if (BENCH_V8_TRACE) {
@@ -709,7 +713,13 @@ async function launchBenchChromium(): Promise<{
709713
};
710714
}
711715

712-
function buildV8LaunchArgs(): string[] {
716+
function buildLaunchArgs(): string[] {
717+
const args = [
718+
'--disable-background-timer-throttling',
719+
'--disable-renderer-backgrounding',
720+
'--disable-backgrounding-occluded-windows',
721+
'--disable-hang-monitor',
722+
];
713723
const jsFlags: string[] = [];
714724
if (BENCH_V8_TRACE) {
715725
jsFlags.push('--trace-opt', '--trace-deopt');
@@ -719,8 +729,8 @@ function buildV8LaunchArgs(): string[] {
719729
fs.mkdirSync(V8_LOG_DIR, { recursive: true });
720730
jsFlags.push('--prof', `--logfile=${V8_LOG_DIR}/v8-%p.log`);
721731
}
722-
if (jsFlags.length === 0) return [];
723-
return [`--js-flags=${jsFlags.join(' ')}`];
732+
if (jsFlags.length > 0) args.push(`--js-flags=${jsFlags.join(' ')}`);
733+
return args;
724734
}
725735

726736
function reportV8Logs(): void {
@@ -798,11 +808,15 @@ async function runRound(
798808
const cdp = await context.newCDPSession(page);
799809

800810
for (const scenario of libScenarios) {
801-
// Force GC before each scenario to reduce variance from prior allocations
811+
// Double-GC before each scenario to reduce variance from prior allocations
802812
try {
803813
await cdp.send('HeapProfiler.collectGarbage');
804814
} catch {}
805-
await page.waitForTimeout(200);
815+
await page.waitForTimeout(100);
816+
try {
817+
await cdp.send('HeapProfiler.collectGarbage');
818+
} catch {}
819+
await page.waitForTimeout(400);
806820

807821
done++;
808822
const prefix = opts.showProgress ? `[${done}/${total}] ` : '';
@@ -924,7 +938,11 @@ async function main() {
924938
try {
925939
await cdp.send('HeapProfiler.collectGarbage');
926940
} catch {}
927-
await page.waitForTimeout(200);
941+
await page.waitForTimeout(100);
942+
try {
943+
await cdp.send('HeapProfiler.collectGarbage');
944+
} catch {}
945+
await page.waitForTimeout(400);
928946

929947
process.stderr.write(` ${scenario.name}...\n`);
930948
try {

examples/benchmark-react/bench/scenarios.ts

Lines changed: 14 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -22,16 +22,16 @@ const defaultOpsPerRound = parseInt(process.env.BENCH_OPS_PER_ROUND ?? '5', 10);
2222
export const RUN_CONFIG: Record<ScenarioSize, RunProfile> = {
2323
small: {
2424
warmup: 2,
25-
minMeasurement: 3,
26-
maxMeasurement: 15,
27-
targetMarginPct: process.env.CI ? 4 : 6,
25+
minMeasurement: 5,
26+
maxMeasurement: 20,
27+
targetMarginPct: process.env.CI ? 2 : 6,
2828
opsPerRound: defaultOpsPerRound,
2929
},
3030
large: {
3131
warmup: 1,
32-
minMeasurement: 3,
33-
maxMeasurement: 10,
34-
targetMarginPct: process.env.CI ? 6 : 10,
32+
minMeasurement: 5,
33+
maxMeasurement: 15,
34+
targetMarginPct: process.env.CI ? 3 : 10,
3535
opsPerRound: defaultOpsPerRound,
3636
},
3737
};
@@ -47,16 +47,16 @@ export interface ConvergentProfile {
4747

4848
export const CONVERGENT_CONFIG: Record<ScenarioSize, ConvergentProfile> = {
4949
small: {
50-
warmup: 5,
51-
minMeasurement: 5,
52-
maxMeasurement: 50,
53-
targetMarginPct: process.env.CI ? 4 : 6,
50+
warmup: 8,
51+
minMeasurement: 10,
52+
maxMeasurement: 60,
53+
targetMarginPct: process.env.CI ? 2 : 6,
5454
},
5555
large: {
56-
warmup: 3,
57-
minMeasurement: 5,
58-
maxMeasurement: 40,
59-
targetMarginPct: process.env.CI ? 6 : 10,
56+
warmup: 5,
57+
minMeasurement: 10,
58+
maxMeasurement: 50,
59+
targetMarginPct: process.env.CI ? 3 : 10,
6060
},
6161
};
6262

0 commit comments

Comments
 (0)