The benchmark must produce public-facing output that is credible, boringly auditable, and impossible to confuse with hand-picked marketing numbers.
Goal
Add benchmark reporting that converts raw run artifacts into a full report plus a README-safe summary block gated on reproducible data.
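The gating in this goal could be sketched roughly as follows. This is a minimal illustration, not the project's actual generator: the metadata key names in `REQUIRED_METADATA`, the `NotReproducibleError` class, and both function names are hypothetical, since the runner's real output schema is not specified here.

```python
from pathlib import Path

# Hypothetical metadata keys; the real names come from the runner's output.
REQUIRED_METADATA = ("corpus_hash", "repo_commit", "methodology_link")


class NotReproducibleError(RuntimeError):
    """Raised when a run lacks the data needed for a README-ready block."""


def missing_requirements(metadata: dict, raw_files: list[Path]) -> list[str]:
    """Return a readable list of everything that blocks README output."""
    problems = [f"metadata: {key}" for key in REQUIRED_METADATA if not metadata.get(key)]
    if not raw_files:
        problems.append("raw result files: none listed")
    problems += [f"raw result file missing: {p}" for p in raw_files if not p.is_file()]
    return problems


def require_reproducible(metadata: dict, raw_files: list[Path]) -> None:
    """Refuse (by raising) instead of emitting a partial README block."""
    problems = missing_requirements(metadata, raw_files)
    if problems:
        raise NotReproducibleError(
            "refusing to write README block; " + "; ".join(problems)
        )
```

Collecting every problem before raising, rather than failing on the first one, means a single refusal message lists everything that has to be fixed before the summary can be published.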
Acceptance criteria
README-safe summaries present strict tool + model pairings, for example python-docs-mcp-server + gpt-4o, instead of tool-only rows.
Generated tables include an error/timeout rate column so unstable tools cannot hide behind surviving-query correctness.
A report generator reads raw benchmark artifacts and writes docs/benchmarks/results/<run-id>/REPORT.md.
The report includes a methodology link, the corpus hash, the repo commit, the model/client matrix, the competitor manifest, correctness by category, token counts after client rewrap, latency median/p95, failures/exclusions, and environment metadata.
A README summary template is generated separately and includes only compact tables plus links to the methodology and raw result bundle.
The generator refuses to produce a README-ready block if required metadata, corpus hash, or raw result files are missing.
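As an illustration of the table-related criteria above, the sketch below groups per-query records by strict tool + model pairing and reports an error/timeout rate next to correctness and latency. The record fields (`tool`, `model`, `status`, `correct`, `latency_ms`), the sample data, and the function names are all assumptions made for the example; the real runner schema may differ. Correctness here is computed over completed queries only, which is exactly why the error/timeout column has to sit beside it.

```python
from collections import defaultdict
from statistics import median, quantiles

# Hypothetical per-query records; field names are assumptions, not the runner's schema.
SAMPLE = [
    {"tool": "python-docs-mcp-server", "model": "gpt-4o", "status": "ok", "correct": True, "latency_ms": 100},
    {"tool": "python-docs-mcp-server", "model": "gpt-4o", "status": "ok", "correct": True, "latency_ms": 200},
    {"tool": "python-docs-mcp-server", "model": "gpt-4o", "status": "ok", "correct": False, "latency_ms": 300},
    {"tool": "python-docs-mcp-server", "model": "gpt-4o", "status": "timeout", "correct": False, "latency_ms": None},
]


def summarize(records):
    """Group by (tool, model) pair and compute one summary row per pair."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["tool"], rec["model"])].append(rec)

    rows = []
    for (tool, model), recs in sorted(groups.items()):
        done = [r for r in recs if r["status"] == "ok"]
        latencies = [r["latency_ms"] for r in done]
        rows.append({
            "pairing": f"{tool} + {model}",
            "queries": len(recs),
            # Correctness over completed queries only; failures show up in the next column.
            "correct_completed": sum(r["correct"] for r in done) / len(done) if done else None,
            "error_timeout_rate": (len(recs) - len(done)) / len(recs),
            "latency_median_ms": median(latencies) if latencies else None,
            # n=20 yields 19 cut points; the last is the 95th percentile.
            # method="inclusive" keeps the estimate within the observed range.
            "latency_p95_ms": (
                quantiles(latencies, n=20, method="inclusive")[-1]
                if len(latencies) >= 2 else None
            ),
        })
    return rows


def pct(value):
    """Format a rate as a percentage, or n/a when it is undefined."""
    return "n/a" if value is None else f"{value:.0%}"


def to_markdown(rows):
    """Render the rows as a compact Markdown table for a README block."""
    lines = [
        "| Tool + model | Queries | Correct (completed) | Error/timeout rate | Median ms | p95 ms |",
        "|---|---|---|---|---|---|",
    ]
    for r in rows:
        lines.append(
            f"| {r['pairing']} | {r['queries']} | {pct(r['correct_completed'])} "
            f"| {pct(r['error_timeout_rate'])} | {r['latency_median_ms']} | {r['latency_p95_ms']} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_markdown(summarize(SAMPLE)))
```

Keying the groups on the `(tool, model)` tuple makes a tool-only row impossible to produce by accident, since every row label is built from both halves of the pair.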
Context
Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.
Scope boundaries
In scope:
Out of scope:
Forbidden-territory reminder
Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, or SECURITY.md, and do not weaken or delete assertions in existing tests.
README edits, if any, must be below the install/tooling sections and must not touch the hero.
Validation commands
Run any new reporting tests directly as well.
PR template
Use Refs #63, not Closes #63.
The PR must include:
Recovery
If report generation needs result fields not specified by the runner, stop and comment with the missing fields so the runner issue can be corrected.
Effort estimate
4-6 hours.