- all implementations must expose equivalent API behavior validated by parity tests
- identical endpoint semantics: status, headers, and body contract
- identical seed state before parity and benchmark runs
- identical load profile for each benchmark target
- Start target service
- Wait for readiness endpoint
- Run parity checks against target
- Only if parity passes, run load benchmark
- Persist raw metrics and summary artifacts
- Docker + Docker Compose for service orchestration
- Go parity runner (
cmd/parity-test) - shell scripts in
scripts/for orchestration hyperfinebenchmark engine (optional viaBENCH_ENGINE=hyperfine)benchstatstatistical comparison for quality gates- policy file:
stats-policy.json - Python 3 report and normalization tooling in
scripts/
- warmup requests: 100 (legacy engine path)
- benchmark requests per run: 300
- runs per target: 3 (median reported)
- thresholds and required metrics are defined in
stats-policy.json make ci-benchmark-quality-checkenforces policy locally and in CI- benchstat comparisons are evaluated against policy baseline framework (
baselineby default) - manual CI benchmark runs use bounded workflow inputs (
frameworkssubset,runs1..10,benchmark_requests50..1000)
- raw run outputs:
results/latest/raw/ - normalized summary:
results/latest/summary.json - markdown report:
results/latest/report.md - quality summary:
results/latest/benchmark-quality-summary.json - optional tool artifacts:
results/latest/tooling/benchstat/*.txt
- update this changelog whenever benchmark process, tooling, schema, thresholds, runtime constraints, or interpretation rules change
- classify each entry as
comparability-impactingornon-comparability-impacting - for
comparability-impactingchanges, include migration notes and baseline reset guidance - do not publish new benchmark claims without a corresponding changelog entry when methodology or version changed
Use one row per change with required fields:
version | date (UTC) | change_type | summary | comparability_impact | required_action
| version | date (UTC) | change_type | summary | comparability_impact | required_action |
|---|---|---|---|---|---|
| 1.1.0 | 2026-02-07 | policy | Added publication fairness disclaimer template and README/report sync policy checks | comparability-impacting | Rebaseline external comparisons and reference this version in publication notes |
| 1.0.0 | 2026-02-05 | baseline | Established parity-gated benchmark workflow, schema validation, and quality gates | comparability-impacting | Treat pre-1.0 outputs as non-comparable to current policy |
- treat parity failures as correctness blockers, not performance regressions
- compare medians first, then inspect distribution variance
- use benchstat deltas and policy thresholds for pass/fail interpretation
- annotate environment drift (host type, CPU, memory, Docker version) in report notes