[v0.5.0] benchmark reporting — generate raw report and README-safe summary #74

@ayhammouda

Description

Context

Parent: #63. Methodology: docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md.

The benchmark must produce public-facing output that is credible, boringly auditable, and impossible to confuse with hand-picked marketing numbers.

Goal

Add benchmark reporting that converts raw run artifacts into a full report plus a README-safe summary block gated on reproducible data.

Acceptance criteria

  • README-safe summaries present strict tool + model pairings, for example python-docs-mcp-server + gpt-4o, instead of tool-only rows.
  • Generated tables include an error/timeout rate column so unstable tools cannot hide behind surviving-query correctness.
  • A report generator reads raw benchmark artifacts and writes docs/benchmarks/results/<run-id>/REPORT.md.
  • The report includes methodology link, corpus hash, repo commit, model/client matrix, competitor manifest, correctness by category, token counts after client rewrap, latency median/p95, failures/exclusions, and environment metadata.
  • A README summary template is generated separately and includes only compact tables plus links to the methodology and raw result bundle.
  • The generator refuses to produce a README-ready block if required metadata, corpus hash, or raw result files are missing.
  • Tests cover report generation, missing metadata failure, failed competitor disclosure, and aggregate/per-model separation.
  • README itself is not updated with benchmark claims until real data exists.
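The gating and table rules above could be sketched roughly as follows. This is a minimal illustration, not the project's actual generator: the QueryResult shape, the status values, the REQUIRED_METADATA keys, and every function name here are assumptions made for the example, and correctness is assumed to be computed over surviving (non-failed) queries only, with failures shown in their own column.

```python
from dataclasses import dataclass
from statistics import median, quantiles


class ReadmeGateError(RuntimeError):
    """Raised when a README-ready block must not be produced."""


# Hypothetical metadata keys; the real list should come from the runner's schema.
REQUIRED_METADATA = ("methodology_link", "corpus_hash", "repo_commit")


@dataclass(frozen=True)
class QueryResult:
    tool: str
    model: str
    status: str  # assumed values: "ok", "error", "timeout"
    correct: bool
    latency_ms: float


def p95(values):
    """95th percentile; falls back to the single value (or None) for tiny samples."""
    if len(values) < 2:
        return values[0] if values else None
    return quantiles(values, n=20, method="inclusive")[-1]


def summarize(results):
    """Group by strict (tool, model) pairing and compute one summary row each."""
    groups = {}
    for r in results:
        groups.setdefault((r.tool, r.model), []).append(r)
    rows = []
    for (tool, model), items in sorted(groups.items()):
        ok = [r for r in items if r.status == "ok"]
        latencies = [r.latency_ms for r in ok]
        rows.append({
            "pairing": f"{tool} + {model}",
            "queries": len(items),
            # Correctness over surviving queries only (an assumption of this sketch).
            "correct_pct": 100.0 * sum(r.correct for r in ok) / len(ok) if ok else 0.0,
            # Failures get their own column so they cannot be hidden.
            "error_timeout_pct": 100.0 * (len(items) - len(ok)) / len(items),
            "latency_median_ms": median(latencies) if latencies else None,
            "latency_p95_ms": p95(latencies),
        })
    return rows


def _fmt(value):
    return "n/a" if value is None else f"{value:.0f}"


def render_readme_block(metadata, results, raw_files_exist):
    """Return a compact table, or refuse if the data is not reproducible."""
    missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
    if missing:
        raise ReadmeGateError(f"missing metadata: {', '.join(missing)}")
    if not raw_files_exist or not results:
        raise ReadmeGateError("raw result files are missing")
    lines = [
        "| Tool + model | Queries | Correct (surviving) | Error/timeout rate | Latency median / p95 (ms) |",
        "|---|---|---|---|---|",
    ]
    for row in summarize(results):
        lines.append(
            f"| {row['pairing']} | {row['queries']} | {row['correct_pct']:.1f}% "
            f"| {row['error_timeout_pct']:.1f}% "
            f"| {_fmt(row['latency_median_ms'])} / {_fmt(row['latency_p95_ms'])} |"
        )
    lines.append(f"Methodology: {metadata['methodology_link']}")
    return "\n".join(lines)
```

Raising an exception rather than returning an empty block keeps the refusal loud: a caller that forgets to handle the gate fails in CI instead of silently publishing a partial table.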

Scope boundaries

In scope:

  • Report generator.
  • README-safe summary template generation.
  • Tests for gating and disclosure rules.

Out of scope:

  • Running providers.
  • Scoring answer correctness manually.
  • Editing README with final benchmark results.

Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, schema.sql, .github/workflows/, pyproject.toml project metadata, .planning/POSITIONING.md, the README hero section, LICENSE, or SECURITY.md. Do not weaken or delete assertions in existing tests.

README edits, if any, must be below the install/tooling sections and must not touch the hero.

Validation commands

uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor

Run any new reporting tests directly as well.

PR template

Use Refs #63, not Closes #63.

The PR must include:

  • Example generated report from fixture data.
  • Test output.
  • Confirmation that README claim publication is still gated on real benchmark data.

Recovery

If report generation needs result fields not specified by the runner, stop and comment with the missing fields so the runner issue can be corrected.
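One way to make this stop condition mechanical is to compare each raw record against the fields the report needs and turn the gaps into a ready-to-post comment. The sketch below is only illustrative: the field names in REQUIRED_RESULT_FIELDS and both helper functions are hypothetical, since the runner's actual schema is defined elsewhere.

```python
# Hypothetical field names; the authoritative list belongs to the runner issue.
REQUIRED_RESULT_FIELDS = frozenset(
    {"tool", "model", "category", "status", "tokens_after_rewrap", "latency_ms"}
)


def find_missing_fields(records, required=REQUIRED_RESULT_FIELDS):
    """Map each absent field to the indexes of the records that lack it."""
    missing = {}
    for index, record in enumerate(records):
        for field in sorted(required - set(record)):
            missing.setdefault(field, []).append(index)
    return missing


def format_issue_comment(missing):
    """Build the comment text to post on the runner issue; empty if nothing is missing."""
    if not missing:
        return ""
    lines = ["Report generation is blocked; the runner output lacks these fields:"]
    for field in sorted(missing):
        lines.append(f"- {field} (missing in {len(missing[field])} record(s))")
    return "\n".join(lines)
```

A generator built this way would call find_missing_fields before any aggregation and exit with the formatted comment instead of guessing at defaults for absent values.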

Effort estimate

4-6 hours.

Metadata

Assignees

No one assigned

    Labels

    agent-ready (Issue passed AGENT-EXECUTION-PIPELINE.md §10 pre-flight; scoped for an autonomous agent)
    enhancement (New feature or request)
    priority:P2 (Medium priority)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
