Stabilize benchmark scoring with robust runtime aggregation, deterministic spec ordering, and higher CI sampling (#4402)
Benchmark output was overly sensitive to run-to-run jitter, causing
inconsistent values even when commits did not change benchmark-relevant
code. This updates the benchmark engine to produce a more stable central
value per metric.
- **Aggregation model update**
  - Replaced simple arithmetic averaging of runtime samples with an
    outlier-resistant estimator:
    - **5+ iterations**: trimmed mean (drop min/max)
    - **1–4 iterations**: median
  - Applied consistently across top-level runtime stages and nested
    per-validator / per-rule / per-emitter metrics.
- **Deterministic execution order**
  - Spec discovery now sorts directories before execution to remove
    ordering variance from run output.
- **CI sampling update**
  - Increased benchmark workflow measured iterations from **5** to **15**
    (warmup remains **1**) to align with higher sample-count benchmarking
    recommendations and reduce noise in comparisons.
- **Benchmark package coverage/docs**
  - Added focused unit tests for aggregation behavior.
  - Documented the new aggregation strategy in benchmark README.
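The deterministic-order change can be illustrated with a minimal sketch. The commit text does not show how the sort is implemented, so the function name `sortSpecDirs` and the choice of a plain code-unit comparison are assumptions made for illustration only. Directory listing order can vary by platform and filesystem, so sorting the discovered names before running them fixes the execution order.

```ts
// Hypothetical helper, not taken from the commit: returns spec directory
// names in a fixed order so every run executes specs in the same sequence.
function sortSpecDirs(dirs: string[]): string[] {
  // Compare by UTF-16 code units rather than localeCompare, so the result
  // does not depend on the machine's locale settings.
  return [...dirs].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// The input order stands in for whatever order the filesystem returned.
const discovered = ["petstore", "azure", "Widget"];
const ordered = sortSpecDirs(discovered);
// Uppercase letters sort before lowercase in code-unit order:
// ordered is ["Widget", "azure", "petstore"]
```

A code-unit comparison is only one option; any total order works as long as it is the same on every machine that runs the benchmark.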
```ts
// New runtime aggregation behavior
export function aggregateDurations(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length >= 5) return average(sorted.slice(1, -1)); // trimmed mean
  return median(sorted); // small-sample robust center
}
```
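The snippet above relies on `average` and `median` helpers that the commit message does not show. The following self-contained sketch fills them in with assumed implementations (the helper bodies, the `aggregateDurationsSketch` name, the empty-input handling, and the sample values are all hypothetical) to show how a single slow iteration affects a plain mean versus the trimmed mean.

```ts
// Assumed helper: arithmetic mean of a non-empty array.
function average(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumed helper: median of an array that is already sorted ascending.
function median(sorted: number[]): number {
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Same shape as the commit's aggregateDurations, renamed to mark it as a sketch.
function aggregateDurationsSketch(values: number[]): number {
  if (values.length === 0) return NaN; // assumption: the original does not cover this case
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length >= 5) return average(sorted.slice(1, -1)); // drop min and max
  return median(sorted);
}

// Made-up durations in milliseconds with one outlier (250).
const samples = [10, 11, 12, 11, 250];
const plainMean = average(samples); // 294 / 5 = 58.8
const robust = aggregateDurationsSketch(samples); // (11 + 11 + 12) / 3, about 11.33
```

Dropping only the single minimum and maximum removes one outlier per side; with 15 measured iterations, two slow runs in the same direction would still shift the result, though far less than they would shift a plain mean.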
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
Co-authored-by: Timothee Guerin <tiguerin@microsoft.com>
Improve benchmark result stability by using outlier-resistant runtime aggregation (trimmed mean for 5+ iterations and median for smaller samples), run specs in a deterministic order, and increase CI benchmark measured iterations from 5 to 15 for stronger statistical confidence.