Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, it downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares the two.
| Flag | Default | Description |
| --- | --- | --- |
| `--build <path\|ver>` | local dev | Build to test. Accepts path or version (`1.110.0`, `insiders`). |
| `--baseline-build <ver>` | `1.115.0` | Version to download and compare against. |
| `--resume <path>` | — | Resume a previous run, adding more iterations to increase confidence. |
| `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
| `--verbose` | — | Print per-run details including response content. |
```bash
npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
```
### Resuming a run for more confidence
When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:
```bash
# Initial run with 3 iterations — may be inconclusive:
npm run perf:chat -- --scenario text-only --runs 3

# Add 3 more runs to the same results file (both test + baseline):
npm run perf:chat -- --resume .chat-perf-data/2026-04-14T02-15-14/results.json --runs 3

# Keep adding until confidence is reached:
npm run perf:chat -- --resume .chat-perf-data/2026-04-14T02-15-14/results.json --runs 5
```
`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges `rawRuns`, recomputes stats, and re-runs the comparison. The updated files are written back in place. You can resume multiple times — samples accumulate.
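The merge can be sketched as follows; the `ResultsFile` shape and names like `mergeRuns` are illustrative assumptions, not the tool's actual schema:

```typescript
// Sketch of the --resume merge: append the new samples to the stored
// rawRuns, then recompute summary stats. The ResultsFile shape here is
// an assumption; the real schema may differ.
interface ResultsFile {
  rawRuns: number[]; // one timing sample per iteration
  stats: { n: number; median: number };
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function mergeRuns(prev: ResultsFile, newRuns: number[]): ResultsFile {
  const rawRuns = [...prev.rawRuns, ...newRuns]; // samples accumulate
  return { rawRuns, stats: { n: rawRuns.length, median: median(rawRuns) } };
}
```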
### Statistical significance
Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is only flagged as `REGRESSION` when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.
With typical variance (cv ≈ 20%), you need:
- **n ≥ 5** per build to detect a 35% regression at 95% confidence
- **n ≥ 10** per build to detect a 20% regression reliably
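A minimal sketch of the statistic behind this gate, assuming plain two-sample input arrays; the final p-value step needs the Student-t CDF, which a stats library would provide and is omitted here:

```typescript
// Sketch of Welch's t statistic and the Welch–Satterthwaite degrees
// of freedom used for the significance gate. Turning (t, df) into a
// p-value requires the Student-t CDF, not shown here.
function meanVar(xs: number[]): { mean: number; variance: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  // Sample (n - 1) variance
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (xs.length - 1);
  return { mean, variance };
}

function welch(a: number[], b: number[]): { t: number; df: number } {
  const A = meanVar(a);
  const B = meanVar(b);
  const sa = A.variance / a.length;
  const sb = B.variance / b.length;
  const t = (A.mean - B.mean) / Math.sqrt(sa + sb);
  // Welch–Satterthwaite approximation for the degrees of freedom
  const df = (sa + sb) ** 2 / (sa ** 2 / (a.length - 1) + sb ** 2 / (b.length - 1));
  return { t, df };
}
```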
Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
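The outlier filtering and cv check can be sketched as follows; the linear-interpolation quantile is one common choice, and the tool's exact method is an assumption:

```typescript
// Sketch of IQR-based outlier removal and the coefficient-of-variation
// stability check described above. Quantile interpolation details are
// an assumption; implementations vary.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function dropOutliers(xs: number[]): number[] {
  const s = [...xs].sort((a, b) => a - b);
  const q1 = quantile(s, 0.25);
  const q3 = quantile(s, 0.75);
  const iqr = q3 - q1;
  // Standard 1.5 × IQR fences
  return s.filter((x) => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
}

function cv(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const sd = Math.sqrt(xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (xs.length - 1));
  return sd / mean; // values above 0.15 would get the warning
}
```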
Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute per-message growth rate, which is compared against a threshold.
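The per-message growth rate reduces to an ordinary least-squares slope of the samples against message index; a sketch with illustrative names:

```typescript
// Sketch of the growth-rate estimate: OLS slope of heap (or DOM-node)
// samples versus message index. Names are illustrative, not the tool's
// actual API.
function growthRate(samples: number[]): number {
  const n = samples.length;
  const xMean = (n - 1) / 2; // mean of indices 0..n-1
  const yMean = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (samples[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den; // growth per message
}
```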