[v0.5.0] benchmark reporting — generate raw report and README-safe summary #74

## Context

Parent: #63. Methodology: `docs/benchmarks/PUBLIC-BENCHMARK-METHODOLOGY.md`.

The benchmark must produce public-facing output that is credible, boringly auditable, and impossible to confuse with hand-picked marketing numbers.

## Goal

Add benchmark reporting that converts raw run artifacts into a full report plus a README-safe summary block gated on reproducible data.
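As a rough illustration of the conversion step, the sketch below groups raw run records by strict tool + model pairing and derives the per-pair columns a summary table would need. The record fields (`tool`, `model`, `status`, `correct`, `latency_ms`), the function names, and the nearest-rank p95 definition are all assumptions made for this example; the real artifact schema is whatever the runner and the methodology document specify.

```python
import math
from collections import defaultdict
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile (an assumed definition, not the methodology's)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(runs):
    """Group hypothetical raw run records by (tool, model) and compute per-pair stats."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["tool"], run["model"])].append(run)

    rows = []
    for (tool, model), items in sorted(groups.items()):
        ok = [r for r in items if r["status"] == "ok"]
        failed = len(items) - len(ok)
        latencies = [r["latency_ms"] for r in ok]
        rows.append({
            "pairing": f"{tool} + {model}",
            "queries": len(items),
            # Correctness only covers surviving queries, so the error/timeout
            # rate is always reported next to it.
            "correct_pct": 100 * sum(r["correct"] for r in ok) / len(ok) if ok else None,
            "error_timeout_pct": 100 * failed / len(items),
            "latency_median_ms": median(latencies) if latencies else None,
            "latency_p95_ms": p95(latencies) if latencies else None,
        })
    return rows


def render_table(rows):
    """Render the summary rows as a compact Markdown table."""
    def fmt(value, suffix=""):
        return "n/a" if value is None else f"{value:.1f}{suffix}"

    lines = [
        "| Tool + model | Queries | Correct (surviving) | Error/timeout rate | Median ms | p95 ms |",
        "|---|---|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            f"| `{row['pairing']}` | {row['queries']} | {fmt(row['correct_pct'], '%')} | "
            f"{fmt(row['error_timeout_pct'], '%')} | {fmt(row['latency_median_ms'])} | "
            f"{fmt(row['latency_p95_ms'])} |"
        )
    return "\n".join(lines)
```

Keeping the pairing as the row key means a tool never appears without the model it was measured with, and a pair with no surviving queries renders `n/a` rather than a misleading zero.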

## Acceptance criteria

- [ ] README-safe summaries present strict tool + model pairings, for example `python-docs-mcp-server + gpt-4o`, instead of tool-only rows.
- [ ] Generated tables include an error/timeout rate column so unstable tools cannot hide behind surviving-query correctness.
- [ ] A report generator reads raw benchmark artifacts and writes `docs/benchmarks/results/<run-id>/REPORT.md`.
- [ ] The report includes methodology link, corpus hash, repo commit, model/client matrix, competitor manifest, correctness by category, token counts after client rewrap, latency median/p95, failures/exclusions, and environment metadata.
- [ ] A README summary template is generated separately and includes only compact tables plus links to the methodology and raw result bundle.
- [ ] The generator refuses to produce a README-ready block if required metadata, corpus hash, or raw result files are missing.
- [ ] Tests cover report generation, missing metadata failure, failed competitor disclosure, and aggregate/per-model separation.
- [ ] README itself is not updated with benchmark claims until real data exists.
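
The gating criterion above could be approached along these lines. This is a minimal sketch, not the intended implementation: the bundle layout (`metadata.json` plus a `raw/` directory of `*.jsonl` files), the metadata key names, and `ReadmeGateError` are hypothetical and should be replaced by whatever the runner actually emits.

```python
import json
from pathlib import Path

# Hypothetical key names; the methodology document defines the real ones.
REQUIRED_METADATA = ("corpus_hash", "repo_commit", "methodology_link")


class ReadmeGateError(RuntimeError):
    """Raised when a run bundle is too incomplete for a README-ready block."""


def check_readme_gate(run_dir: Path) -> dict:
    """Return the run metadata, or raise if anything required is missing."""
    problems = []
    metadata = {}

    metadata_path = run_dir / "metadata.json"
    if not metadata_path.is_file():
        problems.append("metadata.json is missing")
    else:
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
        for key in REQUIRED_METADATA:
            if not metadata.get(key):
                problems.append(f"metadata field '{key}' is missing or empty")

    raw_dir = run_dir / "raw"
    if not raw_dir.is_dir() or not any(raw_dir.glob("*.jsonl")):
        problems.append("no raw result files found under raw/")

    # Collect every problem before failing so one run surfaces all gaps.
    if problems:
        raise ReadmeGateError("; ".join(problems))
    return metadata
```

Raising instead of returning a partial block keeps the failure loud: a caller cannot accidentally publish a summary built from an incomplete bundle.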

## Scope boundaries

In scope:
- Report generator.
- README-safe summary template generation.
- Tests for gating and disclosure rules.

Out of scope:
- Running providers.
- Scoring answer correctness manually.
- Editing README with final benchmark results.

## Forbidden-territory reminder

Do not modify MCP tool names, parameters, return shapes, `schema.sql`, `.github/workflows/`, `pyproject.toml` project metadata, `.planning/POSITIONING.md`, the README hero section, `LICENSE`, or `SECURITY.md`. Do not weaken or delete assertions in existing tests.

README edits, if any, must be below the install/tooling sections and must not touch the hero.

## Validation commands

```bash
uv run ruff check src/ tests/
uv run pyright src/
uv run pytest --tb=short -q
uv run python-docs-mcp-server doctor
```

Run any new reporting tests directly as well.

## PR template

Use `Refs #63`, not `Closes #63`.

The PR must include:
- Example generated report from fixture data.
- Test output.
- Confirmation that README claim publication is still gated on real benchmark data.

## Recovery

If report generation needs result fields not specified by the runner, stop and comment with the missing fields so the runner issue can be corrected.
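
One way to make that stop condition mechanical is to check the raw records against the fields the report needs and format the gaps for the issue comment. The field names in `REQUIRED_RUN_FIELDS` are placeholders chosen for this sketch, not the runner's actual schema.

```python
# Hypothetical field set; align it with the runner issue before relying on it.
REQUIRED_RUN_FIELDS = {"tool", "model", "status", "latency_ms", "tokens_after_rewrap"}


def missing_fields(records):
    """Return a sorted list of required fields absent from at least one record."""
    missing = set()
    for record in records:
        # Iterating a dict yields its keys, so this compares field names only.
        missing |= REQUIRED_RUN_FIELDS.difference(record)
    return sorted(missing)


def format_comment(missing):
    """Build the text of a comment listing the fields the runner must add."""
    bullets = "\n".join(f"- `{name}`" for name in missing)
    return (
        "Report generation is blocked; the runner output lacks these fields:\n"
        f"{bullets}"
    )
```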

## Effort estimate

4-6 hours.

