Commit a4a562b (1 parent: 620c954)

4 files changed: 663 additions & 25 deletions

.github/skills/chat-perf/SKILL.md
Lines changed: 33 additions & 5 deletions
```diff
@@ -28,7 +28,7 @@ npm run perf:chat-leak -- --messages 20 --verbose
 
 ## Perf regression test
 
-**Script:** `scripts/chat-perf/test-chat-perf-regression.js`
+**Script:** `scripts/chat-perf/test-chat-perf-regression.js`
 **npm:** `npm run perf:chat`
 
 Launches VS Code via Playwright Electron, opens the chat panel, sends a message with a mock LLM response, and measures timing, layout, and rendering metrics. By default, downloads VS Code 1.115.0 as a baseline, benchmarks it, then benchmarks the local dev build and compares.
```
```diff
@@ -42,6 +42,7 @@ Launches VS Code via Playwright Electron, opens the chat panel, sends a message
 | `--build <path\|ver>` | local dev | Build to test. Accepts path or version (`1.110.0`, `insiders`). |
 | `--baseline-build <ver>` | `1.115.0` | Version to download and compare against. |
 | `--no-baseline` || Skip baseline comparison entirely. |
+| `--resume <path>` || Resume a previous run, adding more iterations to increase confidence. |
 | `--threshold <frac>` | `0.2` | Regression threshold (0.2 = flag if 20% slower). |
 | `--verbose` || Print per-run details including response content. |
```

````diff
@@ -52,10 +53,37 @@ Launches VS Code via Playwright Electron, opens the chat panel, sends a message
 npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
 ```
 
+### Resuming a run for more confidence
+
+When results exceed the threshold but aren't statistically significant, the tool prints a `--resume` hint. Use it to add more iterations to an existing run:
+
+```bash
+# Initial run with 3 iterations — may be inconclusive:
+npm run perf:chat -- --scenario text-only --runs 3
+
+# Add 3 more runs to the same results file (both test + baseline):
+npm run perf:chat -- --resume .chat-perf-data/2026-04-14T02-15-14/results.json --runs 3
+
+# Keep adding until confidence is reached:
+npm run perf:chat -- --resume .chat-perf-data/2026-04-14T02-15-14/results.json --runs 5
+```
+
+`--resume` loads the previous `results.json` and its associated `baseline-*.json`, runs N more iterations for both builds, merges rawRuns, recomputes stats, and re-runs the comparison. The updated files are written back in place. You can resume multiple times — samples accumulate.
````
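The merge that `--resume` performs can be sketched roughly as follows. Note this is illustrative only: the `rawRuns` field name follows the description above, but the rest of the shape is a hypothetical stand-in, not the tool's actual results schema.

```javascript
// Pool old and new samples, then recompute order statistics on the union.
function mergeResumed(prev, fresh) {
  const rawRuns = [...prev.rawRuns, ...fresh.rawRuns]; // samples accumulate
  const sorted = [...rawRuns].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  return { rawRuns, median, n: rawRuns.length };
}

// 3 initial runs + 3 resumed runs: stats are recomputed over all 6 samples.
const merged = mergeResumed({ rawRuns: [120, 118, 140] },
                            { rawRuns: [119, 121, 122] });
console.log(merged.n, merged.median);
```

Because samples only accumulate, each resume tightens the estimate rather than starting over.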
```diff
+
+### Statistical significance
+
+Regression detection uses **Welch's t-test** to avoid false positives from noisy measurements. A metric is flagged as `REGRESSION` only when it both exceeds the threshold AND is statistically significant (p < 0.05). Otherwise it's reported as `(likely noise — p=X, not significant)`.
+
+With typical variance (cv ≈ 20%), you need:
+- **n ≥ 5** per build to detect a 35% regression at 95% confidence
+- **n ≥ 10** per build to detect a 20% regression reliably
+
+Confidence levels reported: `high` (p < 0.01), `medium` (p < 0.05), `low` (p < 0.1), `none`.
+
 ### Exit codes
 
-- `0` — all metrics within threshold
-- `1` — regression detected or runs failed
+- `0` — all metrics within threshold, or exceeding threshold but not statistically significant
+- `1` — statistically significant regression detected, or all runs failed
 
 ### Scenarios
```
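To make the Welch gating above concrete, here is a small self-contained sketch of the t-statistic and Welch–Satterthwaite degrees of freedom it relies on. The sample numbers are made up, and `welchStats` is an illustrative helper, not the repository's code.

```javascript
// Hypothetical timings (ms): a stable baseline vs. a clearly slower build.
const baseline = [100, 102, 98, 101, 99];
const current = [130, 128, 133, 131, 129];

// Welch's t-statistic and Welch–Satterthwaite df for unequal variances.
function welchStats(a, b) {
  const mean = xs => xs.reduce((s, v) => s + v, 0) / xs.length;
  const variance = xs => {
    const m = mean(xs);
    return xs.reduce((s, v) => s + (v - m) ** 2, 0) / (xs.length - 1);
  };
  const seA = variance(a) / a.length; // squared standard errors of the means
  const seB = variance(b) / b.length;
  const t = (mean(b) - mean(a)) / Math.sqrt(seA + seB);
  const df = (seA + seB) ** 2 /
    (seA ** 2 / (a.length - 1) + seB ** 2 / (b.length - 1));
  return { t, df };
}

const { t, df } = welchStats(baseline, current);
console.log(t.toFixed(2), df.toFixed(1)); // a |t| this large is significant at any reasonable df
```

With 5 runs per build the df lands near 8 here; the real implementation then converts (t, df) into a p-value via the incomplete beta function before deciding whether to flag a regression.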

```diff
@@ -80,11 +108,11 @@ npm run perf:chat -- --build 1.110.0 --baseline-build 1.115.0 --runs 5
 
 ### Statistics
 
-Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Use 5+ runs to get stable results.
+Results use **IQR-based outlier removal** and **median** (not mean) to handle startup jitter. The **coefficient of variation (cv)** is reported — under 15% is stable, over 15% gets a ⚠ warning. Baseline comparison uses **Welch's t-test** on raw run values to determine statistical significance before flagging regressions. Use 5+ runs to get stable results.
 
```
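The IQR/median/cv pipeline that paragraph describes can be sketched as follows. This is illustrative only: the quantile method and 1.5×IQR fences are standard Tukey conventions, assumed rather than taken from the script.

```javascript
// Remove IQR outliers, then summarize with median and coefficient of variation.
function robustSummary(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q = p => sorted[Math.floor(p * (sorted.length - 1))]; // crude quantile
  const q1 = q(0.25), q3 = q(0.75);
  const iqr = q3 - q1;
  // Keep values within 1.5×IQR of the quartiles (Tukey's fences)
  const kept = sorted.filter(v => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr);
  const mid = Math.floor(kept.length / 2);
  const median = kept.length % 2 ? kept[mid] : (kept[mid - 1] + kept[mid]) / 2;
  const mean = kept.reduce((s, v) => s + v, 0) / kept.length;
  const sd = Math.sqrt(kept.reduce((s, v) => s + (v - mean) ** 2, 0) / kept.length);
  return { median, cv: sd / mean };
}

// One wild startup-jitter sample (900) is dropped before summarizing:
const { median, cv } = robustSummary([110, 112, 108, 111, 900]);
console.log(median, (cv * 100).toFixed(1) + '%');
```

This is why the median stays near 110 even with a 900 ms outlier in the raw runs, and why cv is reported on the cleaned sample.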
```diff
 ## Memory leak check
 
-**Script:** `scripts/chat-perf/test-chat-mem-leaks.js`
+**Script:** `scripts/chat-perf/test-chat-mem-leaks.js`
 **npm:** `npm run perf:chat-leak`
 
 Launches one VS Code session, sends N messages sequentially, forces GC between each, and measures renderer heap and DOM node count. Uses **linear regression** on the samples to compute per-message growth rate, which is compared against a threshold.
```
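The growth-rate fit that sentence describes is an ordinary least-squares slope over message index. A minimal sketch, with hypothetical heap samples rather than the script's actual data:

```javascript
// Least-squares slope of y over x = 0, 1, ..., n-1 (per-message growth rate).
function slope(ys) {
  const n = ys.length;
  const mx = (n - 1) / 2; // mean of 0..n-1
  const my = ys.reduce((s, v) => s + v, 0) / n;
  const num = ys.reduce((s, y, i) => s + (i - mx) * (y - my), 0);
  const den = ys.reduce((s, _, i) => s + (i - mx) ** 2, 0);
  return num / den;
}

// Renderer heap (MB) after each of 5 messages: ~2 MB growth per message
// despite per-sample noise, which a simple delta would miss.
console.log(slope([100, 102.1, 103.9, 106, 108.2]).toFixed(2));
```

Fitting a slope instead of comparing first and last samples makes the check robust to a single noisy GC reading.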

.github/workflows/chat-perf.yml
Lines changed: 144 additions & 0 deletions (new file)
@@ -0,0 +1,144 @@
```yaml
name: Chat Performance Comparison

on:
  workflow_dispatch:
    inputs:
      baseline_commit:
        description: 'Baseline commit SHA or version (e.g. "1.115.0", "abc1234")'
        required: true
        type: string
      test_commit:
        description: 'Test commit SHA or version (e.g. "main", "abc1234")'
        required: true
        type: string
      runs:
        description: 'Runs per scenario (default: 7 for statistical significance)'
        required: false
        type: number
        default: 7
      scenarios:
        description: 'Comma-separated scenario list (empty = all)'
        required: false
        type: string
        default: ''
      threshold:
        description: 'Regression threshold fraction (default: 0.2 = 20%)'
        required: false
        type: number
        default: 0.2

permissions:
  contents: read

concurrency:
  group: chat-perf-${{ github.run_id }}
  cancel-in-progress: true

jobs:
  chat-perf:
    name: Chat Perf – ${{ inputs.baseline_commit }} vs ${{ inputs.test_commit }}
    runs-on: ubuntu-latest
    timeout-minutes: 120
    steps:
      - name: Checkout test commit
        uses: actions/checkout@v6
        with:
          ref: ${{ inputs.test_commit }}

      - name: Setup Node.js
        uses: actions/setup-node@v6
        with:
          node-version-file: .nvmrc

      - name: Install system dependencies
        run: |
          sudo apt update -y
          sudo apt install -y \
            build-essential pkg-config \
            libx11-dev libx11-xcb-dev libxkbfile-dev \
            libnotify-bin libkrb5-dev \
            xvfb sqlite3 \
            libnss3 libatk1.0-0 libatk-bridge2.0-0 \
            libcups2 libdrm2 libxcomposite1 libxdamage1 \
            libxrandr2 libgbm1 libpango-1.0-0 libcairo2 \
            libasound2 libxshmfence1 libgtk-3-0

      - name: Install dependencies
        run: npm ci
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Install build dependencies
        run: npm ci
        working-directory: build

      - name: Transpile source
        run: npm run transpile-client

      - name: Install Playwright Chromium
        run: npx playwright install chromium

      - name: Run chat perf comparison
        id: perf
        run: |
          SCENARIO_ARGS=""
          if [[ -n "${{ inputs.scenarios }}" ]]; then
            IFS=',' read -ra SCENS <<< "${{ inputs.scenarios }}"
            for s in "${SCENS[@]}"; do
              SCENARIO_ARGS="$SCENARIO_ARGS --scenario $(echo "$s" | xargs)"
            done
          fi

          xvfb-run node scripts/chat-perf/test-chat-perf-regression.js \
            --baseline-build "${{ inputs.baseline_commit }}" \
            --build "${{ inputs.test_commit }}" \
            --runs ${{ inputs.runs }} \
            --threshold ${{ inputs.threshold }} \
            --ci \
            $SCENARIO_ARGS \
            2>&1 | tee perf-output.log

          # tee masks the script's exit code; recover it from PIPESTATUS
          exit ${PIPESTATUS[0]}
        continue-on-error: true

      - name: Write job summary
        if: always()
        run: |
          if [[ -f .chat-perf-data/ci-summary.md ]]; then
            cat .chat-perf-data/ci-summary.md >> "$GITHUB_STEP_SUMMARY"
          else
            echo "⚠️ No summary file generated. Check perf-output.log artifact." >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: Zip diagnostic outputs
        if: always()
        run: |
          # Find the most recent timestamped run directory
          RUN_DIR=$(ls -td .chat-perf-data/20*/ 2>/dev/null | head -1)
          if [[ -n "$RUN_DIR" ]]; then
            # Zip everything: results JSON, CPU profiles, traces, heap snapshots
            cd .chat-perf-data
            zip -r ../chat-perf-artifacts.zip \
              "$(basename "$RUN_DIR")"/ \
              ci-summary.md \
              baseline-*.json \
              2>/dev/null || true
            cd ..
          fi

      - name: Upload perf artifacts
        if: always()
        uses: actions/upload-artifact@v7
        with:
          name: chat-perf-${{ inputs.baseline_commit }}-vs-${{ inputs.test_commit }}
          path: |
            chat-perf-artifacts.zip
            perf-output.log
          retention-days: 30

      - name: Fail on regression
        if: steps.perf.outcome == 'failure'
        run: |
          echo "::error::Chat performance regression detected. See job summary for details."
          exit 1
```
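The scenario-splitting shell fragment in the run step maps a comma-separated input onto repeated `--scenario` flags. The same transform, sketched in Node for clarity (the function name is hypothetical):

```javascript
// Turn "a, b ,c" into ["--scenario", "a", "--scenario", "b", "--scenario", "c"].
function scenarioArgs(input) {
  return input
    .split(',')
    .map(s => s.trim())   // `xargs` plays this trimming role in the shell version
    .filter(Boolean)      // empty input means "all scenarios": no flags at all
    .flatMap(s => ['--scenario', s]);
}

console.log(scenarioArgs('text-only, markdown').join(' '));
```

Trimming matters because workflow_dispatch users tend to type spaces after commas; the `filter(Boolean)` step mirrors the shell's behavior of emitting no flags for an empty input.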

scripts/chat-perf/common/utils.js
Lines changed: 104 additions & 0 deletions
```diff
@@ -399,6 +399,108 @@ function removeOutliers(values)
 	return sorted.filter(v => v >= lo && v <= hi);
 }
 
+/**
+ * Regularized incomplete beta function I_x(a, b) via continued fraction.
+ * Used for computing t-distribution CDF / p-values.
+ * @param {number} x
+ * @param {number} a
+ * @param {number} b
+ * @returns {number}
+ */
+function betaIncomplete(x, a, b) {
+	if (x <= 0) { return 0; }
+	if (x >= 1) { return 1; }
+	// Use symmetry relation when x > (a+1)/(a+b+2) for better convergence
+	if (x > (a + 1) / (a + b + 2)) {
+		return 1 - betaIncomplete(1 - x, b, a);
+	}
+	// Log-beta via lnGamma (Lanczos): lnBeta(a,b) = lnGamma(a)+lnGamma(b)-lnGamma(a+b)
+	const lnBeta = lnGamma(a) + lnGamma(b) - lnGamma(a + b);
+	const front = Math.exp(Math.log(x) * a + Math.log(1 - x) * b - lnBeta) / a;
+	// Lentz's continued fraction
+	const maxIter = 200;
+	const eps = 1e-14;
+	let c = 1, d = 1 - (a + b) * x / (a + 1);
+	if (Math.abs(d) < eps) { d = eps; }
+	d = 1 / d;
+	let result = d;
+	for (let m = 1; m <= maxIter; m++) {
+		// Even step
+		let num = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m));
+		d = 1 + num * d; if (Math.abs(d) < eps) { d = eps; } d = 1 / d;
+		c = 1 + num / c; if (Math.abs(c) < eps) { c = eps; }
+		result *= d * c;
+		// Odd step
+		num = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1));
+		d = 1 + num * d; if (Math.abs(d) < eps) { d = eps; } d = 1 / d;
+		c = 1 + num / c; if (Math.abs(c) < eps) { c = eps; }
+		const delta = d * c;
+		result *= delta;
+		if (Math.abs(delta - 1) < eps) { break; }
+	}
+	return front * result;
+}
+
+/**
+ * Log-gamma via Lanczos approximation.
+ * @param {number} z
+ * @returns {number}
+ */
+function lnGamma(z) {
+	const g = 7;
+	const coef = [0.99999999999980993, 676.5203681218851, -1259.1392167224028,
+		771.32342877765313, -176.61502916214059, 12.507343278686905,
+		-0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7];
+	if (z < 0.5) {
+		// Reflection formula for z < 0.5
+		return Math.log(Math.PI / Math.sin(Math.PI * z)) - lnGamma(1 - z);
+	}
+	z -= 1;
+	let x = coef[0];
+	for (let i = 1; i < g + 2; i++) { x += coef[i] / (z + i); }
+	const t = z + g + 0.5;
+	return 0.5 * Math.log(2 * Math.PI) + (z + 0.5) * Math.log(t) - t + Math.log(x);
+}
+
+/**
+ * Two-tailed p-value from t-distribution.
+ * @param {number} t - t-statistic
+ * @param {number} df - degrees of freedom
+ * @returns {number}
+ */
+function tDistPValue(t, df) {
+	const x = df / (df + t * t);
+	return betaIncomplete(x, df / 2, 0.5);
+}
+
+/**
+ * Welch's t-test for two independent samples (unequal variance).
+ * @param {number[]} a - Sample 1 (e.g., baseline values)
+ * @param {number[]} b - Sample 2 (e.g., current values)
+ * @returns {{ t: number, df: number, pValue: number, significant: boolean, confidence: string } | null}
+ */
+function welchTTest(a, b) {
+	if (a.length < 2 || b.length < 2) { return null; }
+	const meanA = a.reduce((s, v) => s + v, 0) / a.length;
+	const meanB = b.reduce((s, v) => s + v, 0) / b.length;
+	const varA = a.reduce((s, v) => s + (v - meanA) ** 2, 0) / (a.length - 1);
+	const varB = b.reduce((s, v) => s + (v - meanB) ** 2, 0) / (b.length - 1);
+	// Squared standard errors of the two means
+	const seA = varA / a.length;
+	const seB = varB / b.length;
+	const seDiff = Math.sqrt(seA + seB);
+	if (seDiff === 0) { return null; }
+	const t = (meanB - meanA) / seDiff;
+	// Welch-Satterthwaite degrees of freedom
+	const df = (seA + seB) ** 2 / ((seA ** 2) / (a.length - 1) + (seB ** 2) / (b.length - 1));
+	const pValue = tDistPValue(t, df);
+	const significant = pValue < 0.05;
+	let confidence;
+	if (pValue < 0.01) { confidence = 'high'; }
+	else if (pValue < 0.05) { confidence = 'medium'; }
+	else if (pValue < 0.1) { confidence = 'low'; }
+	else { confidence = 'none'; }
+	return { t: Math.round(t * 100) / 100, df: Math.round(df * 10) / 10, pValue: Math.round(pValue * 1000) / 1000, significant, confidence };
+}
+
 /**
  * Compute robust stats for a metric array.
  * @param {number[]} raw
```
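One cheap sanity check for the additions above (not part of the commit): with df = 1 the t-distribution is the standard Cauchy, whose two-tailed p-value has a closed form, so `tDistPValue(t, 1)` can be compared against it directly.

```javascript
// Closed-form two-tailed Cauchy p-value: P(|T| > t) = 1 - (2/pi) * atan(|t|).
function cauchyTwoTailedP(t) {
  return 1 - (2 / Math.PI) * Math.atan(Math.abs(t));
}

// tDistPValue(1, 1) should agree with this: both give 0.5.
console.log(cauchyTwoTailedP(1));
```

Checking the continued-fraction implementation against a closed form at a known df is a quick way to catch sign or convergence bugs in `betaIncomplete`.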
```diff
@@ -482,6 +584,7 @@ const METRIC_DEFS = [
 	['instructionCollectionTime', 'timing', 'ms'],
 	['agentInvokeTime', 'timing', 'ms'],
 	['heapDelta', 'memory', 'MB'],
+	['gcDurationMs', 'memory', 'ms'],
 	['layoutCount', 'rendering', ''],
 	['recalcStyleCount', 'rendering', ''],
 	['forcedReflowCount', 'rendering', ''],
```
```diff
@@ -504,6 +607,7 @@ module.exports = {
 	median,
 	removeOutliers,
 	robustStats,
+	welchTTest,
 	linearRegressionSlope,
 	summarize,
 	markDuration,
```
