| 1 | +# Public Benchmark Methodology |
| 2 | + |
| 3 | +**Status:** Draft methodology for issue |
| 4 | +[#63](https://github.com/ayhammouda/python-docs-mcp-server/issues/63). |
| 5 | +Do not publish comparative claims until the harness has produced reproducible |
| 6 | +data from this methodology. |
| 7 | + |
| 8 | +## Purpose |
| 9 | + |
| 10 | +The v0.5.0 public benchmark compares `python-docs-mcp-server` with eligible |
| 11 | +Python documentation MCPs and a no-MCP baseline on a fixed 50-question Python |
| 12 | +documentation evaluation. The benchmark reports correctness, token cost, and |
| 13 | +latency, with enough detail for a clean clone to reproduce the run. |
| 14 | + |
| 15 | +This is an evidence artifact, not marketing copy. If the data is boring or |
| 16 | +unfavorable, publish the data honestly and adjust the product claims. |
| 17 | + |
| 18 | +## Evidence Flow |
| 19 | + |
| 20 | +```mermaid |
| 21 | +flowchart LR |
| 22 | + Corpus[50-question corpus] --> Runner[Benchmark runner] |
| 23 | + Competitors[Pinned MCP competitors] --> Runner |
| 24 | + Baseline[No-MCP baseline] --> Runner |
| 25 | + Runner --> Transcripts[Raw transcripts] |
| 26 | + Transcripts --> Tokens[Token counting after client rewrap] |
| 27 | + Transcripts --> Rubric[Correctness scoring] |
| 28 | + Transcripts --> Latency[Latency summary] |
| 29 | + Tokens --> Report[Public report] |
| 30 | + Rubric --> Report |
| 31 | + Latency --> Report |
| 32 | +``` |
| 33 | + |
| 34 | +## Systems Under Test |
| 35 | + |
| 36 | +The final competitor set is locked at execution time. A competitor is eligible |
| 37 | +only if all of these are true: |
| 38 | + |
| 39 | +- It exposes Python standard library documentation retrieval or search. |
| 40 | +- It can be run or queried reproducibly from a clean clone. |
| 41 | +- Its version, package, image, endpoint, or commit can be pinned. |
| 42 | +- Its terms allow benchmark use. |
| 43 | +- It does not require private, undocumented access. |
| 44 | + |
| 45 | +The initial candidate matrix is: |
| 46 | + |
| 47 | +- `python-docs-mcp-server` |
| 48 | +- Context7 |
| 49 | +- GitMCP |
| 50 | +- DeepWiki |
| 51 | +- Ref.tools |
| 52 | +- no-MCP baseline |
| 53 | + |
| 54 | +The no-MCP baseline uses the same model and question prompts, but no retrieved |
| 55 | +documentation context. It measures parametric model behavior, not another docs |
| 56 | +tool. |
| 57 | + |
| 58 | +## Corpus Design |
| 59 | + |
| 60 | +The corpus contains exactly 50 questions. Each question must include: |
| 61 | + |
| 62 | +- Stable question ID. |
| 63 | +- Category. |
| 64 | +- Python version or version pair. |
| 65 | +- Prompt shown to the model. |
| 66 | +- Official-docs answer key. |
| 67 | +- Required citations or source sections. |
| 68 | +- Expected answer properties. |
| 69 | +- Known ambiguity notes, if any. |
| 70 | + |
| 71 | +Distribution: |
| 72 | + |
| 73 | +- 15 exact-symbol questions. |
| 74 | +- 10 concept or API-usage questions. |
| 75 | +- 15 cross-version questions, led by `compare_versions`-style diffs. |
| 76 | +- 5 PEP-adjacent questions where the official stdlib docs or "What's New" |
| 77 | + pages contain the required answer. |
| 78 | +- 5 applied questions that require selecting the right stdlib API from the |
| 79 | + documentation. |
| 80 | + |
| 81 | +The corpus must avoid questions whose answer requires private knowledge, |
| 82 | +external package documentation, non-stdlib behavior, or unreleased CPython |
| 83 | +changes. |
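
The required fields and the category distribution above can be checked mechanically before any run. The following is a minimal sketch, not the harness implementation: the `Question` field names, the category labels, and `validate_corpus` are hypothetical names chosen for illustration. Only the per-category counts come from this methodology.

```python
from collections import Counter
from dataclasses import dataclass

# Category labels are hypothetical; the counts mirror the distribution above.
EXPECTED_DISTRIBUTION = {
    "exact_symbol": 15,
    "concept_or_usage": 10,
    "cross_version": 15,
    "pep_adjacent": 5,
    "applied": 5,
}


@dataclass(frozen=True)
class Question:
    """One corpus entry; field names are illustrative assumptions."""

    question_id: str
    category: str
    python_versions: tuple[str, ...]  # a single version or a version pair
    prompt: str
    answer_key: str
    required_citations: tuple[str, ...]
    expected_properties: tuple[str, ...]
    ambiguity_notes: str = ""


def validate_corpus(questions: list[Question]) -> list[str]:
    """Return a list of problems; an empty list means the corpus passes."""
    problems = []

    # Question IDs must be stable and therefore unique.
    id_counts = Counter(q.question_id for q in questions)
    for question_id, count in sorted(id_counts.items()):
        if count > 1:
            problems.append(f"duplicate question ID: {question_id}")

    # Per-category counts must match the fixed distribution.
    category_counts = Counter(q.category for q in questions)
    for category in sorted(set(EXPECTED_DISTRIBUTION) | set(category_counts)):
        expected = EXPECTED_DISTRIBUTION.get(category, 0)
        actual = category_counts.get(category, 0)
        if actual != expected:
            problems.append(f"{category}: expected {expected}, found {actual}")

    # Each question names one Python version or a version pair.
    for q in questions:
        if len(q.python_versions) not in (1, 2):
            problems.append(f"{q.question_id}: expected 1 or 2 Python versions")

    return problems
```

Returning a list of problems rather than raising on the first one lets a fixture loader report every corpus defect in a single pass.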
| 84 | + |
| 85 | +## Source Of Truth |
| 86 | + |
| 87 | +Correctness is scored against official Python documentation generated from |
| 88 | +CPython source at pinned tags. When a question concerns version behavior, the |
| 89 | +answer key must cite the exact relevant version or version pair. |
| 90 | + |
| 91 | +Allowed truth sources: |
| 92 | + |
| 93 | +- CPython documentation source at pinned commit or tag. |
| 94 | +- Generated official docs for the same Python version. |
| 95 | +- Official "What's New" pages for PEP-adjacent behavior. |
| 96 | + |
| 97 | +Disallowed truth sources: |
| 98 | + |
| 99 | +- Blog posts. |
| 100 | +- Search snippets. |
| 101 | +- LLM-generated explanations. |
| 102 | +- Third-party mirrors unless used only as a convenience link and verified |
| 103 | + against CPython source. |
| 104 | + |
| 105 | +## Prompting Rules |
| 106 | + |
| 107 | +Every system under test receives the same user question. The only allowed |
| 108 | +difference is the documentation context supplied by that system. |
| 109 | + |
| 110 | +The model prompt must require: |
| 111 | + |
| 112 | +- A concise answer. |
| 113 | +- Version-specific wording when the question names a version. |
| 114 | +- No unsupported claims. |
| 115 | +- A short citation to the retrieved section when the system provides one. |
| 116 | + |
| 117 | +The prompt must not reveal the answer key, rubric, or expected winning system. |
| 118 | + |
| 119 | +## Correctness Scoring |
| 120 | + |
| 121 | +Each answer receives one score: |
| 122 | + |
| 123 | +- `1.0`: Correct, version-aware, and includes all required answer properties. |
| 124 | +- `0.5`: Partially correct, but misses a required nuance, version condition, or |
| 125 | + citation. |
| 126 | +- `0.0`: Incorrect, unsupported, materially incomplete, or answers the wrong |
| 127 | + version. |
| 128 | + |
| 129 | +For public reporting, include both: |
| 130 | + |
| 131 | +- Mean correctness score. |
| 132 | +- Per-category correctness score. |
| 133 | + |
| 134 | +Any answer that appears correct but lacks evidence from the supplied docs is |
| 135 | +flagged as ungrounded in the raw results and discussed separately. The |
| 136 | +benchmark should reward grounded answers, not confident autocomplete. |
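
The two reported aggregates could be computed along these lines. This is a sketch under stated assumptions: the record shape `(category, score, grounded)` and the function name `summarize_scores` are hypothetical, and the `grounded` flag stands in for whatever marker the raw results use for answers that look correct without evidence from the supplied docs.

```python
from collections import defaultdict
from statistics import fmean

# The only scores the rubric above allows.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def summarize_scores(records):
    """Aggregate (category, score, grounded) tuples into report figures."""
    by_category = defaultdict(list)
    ungrounded_correct = 0

    for category, score, grounded in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score {score!r} is not one of 0.0, 0.5, 1.0")
        by_category[category].append(score)
        # Count answers scored correct without evidence so they can be
        # discussed separately from the headline numbers.
        if score == 1.0 and not grounded:
            ungrounded_correct += 1

    all_scores = [s for scores in by_category.values() for s in scores]
    if not all_scores:
        raise ValueError("no scoring records supplied")

    return {
        "mean": fmean(all_scores),
        "per_category": {c: fmean(s) for c, s in sorted(by_category.items())},
        "ungrounded_correct": ungrounded_correct,
    }
```

The sketch only counts ungrounded answers; whether they also keep their `1.0` in the mean is a reporting decision this methodology leaves to the separate discussion.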
| 137 | + |
| 138 | +## Token Measurement |
| 139 | + |
| 140 | +Token methodology follows roadmap decision 5.8 and ADR-006: |
| 141 | + |
| 142 | +- Use Claude token counting as the primary metric. |
| 143 | +- Measure after client-side rewrap, not only raw MCP payload bytes. |
| 144 | +- Record raw payload tokens separately as diagnostic data. |
| 145 | +- Report serialization latency alongside token counts. |
| 146 | + |
| 147 | +Primary token count: |
| 148 | + |
| 149 | +1. Capture the MCP tool response or baseline prompt context. |
| 150 | +2. Pass it through the same client-side wrapping path used by the benchmark |
| 151 | + client. |
| 152 | +3. Count the resulting message envelope with Claude token counting. |
| 153 | + |
| 154 | +If a client cannot expose its exact wrapped message envelope, the report must |
| 155 | +say so and mark that result as an approximation. Approximate counts must not be |
| 156 | +used for headline claims. |
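
The three-step primary count and the approximation rule can be sketched as below. Everything named here is an assumption for illustration: `TokenRecord`, `measure_tokens`, and `headline_eligible` are hypothetical, `rewrap` stands in for the benchmark client's wrapping path, and `count_tokens` stands in for a call to Claude token counting. The counter is injected so the sketch makes no claim about any particular SDK signature.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TokenRecord:
    question_id: str
    system: str
    wrapped_tokens: int      # primary metric, counted after client rewrap
    raw_payload_tokens: int  # diagnostic only
    approximate: bool        # True when the wrapped envelope was unavailable


def measure_tokens(
    question_id: str,
    system: str,
    raw_payload: str,
    rewrap: Optional[Callable[[str], str]],
    count_tokens: Callable[[str], int],
) -> TokenRecord:
    """Count tokens for one captured tool response or baseline context."""
    raw_tokens = count_tokens(raw_payload)
    if rewrap is None:
        # The client cannot expose its wrapped envelope: fall back to the raw
        # count and flag the record so it never feeds a headline claim.
        return TokenRecord(question_id, system, raw_tokens, raw_tokens, True)
    wrapped_tokens = count_tokens(rewrap(raw_payload))
    return TokenRecord(question_id, system, wrapped_tokens, raw_tokens, False)


def headline_eligible(records: list[TokenRecord]) -> list[TokenRecord]:
    """Keep only exact counts; approximations are reported but not headlined."""
    return [r for r in records if not r.approximate]
```

Keeping the counter and the wrapper as parameters also makes the measurement testable offline with a stub counter, which helps when the harness must run from a clean clone.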
| 157 | + |
| 158 | +## Latency Measurement |
| 159 | + |
| 160 | +Latency is wall-clock time measured per question from request dispatch to final |
| 161 | +answer receipt. |
| 162 | + |
| 163 | +Report: |
| 164 | + |
| 165 | +- Median. |
| 166 | +- p95. |
| 167 | +- Minimum and maximum. |
| 168 | +- Cold-run and warm-run separation where the system has a cache or index. |
| 169 | + |
| 170 | +Index build time is not part of per-query latency. It may be reported as setup |
| 171 | +cost in a separate section. |
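
A minimal sketch of the per-question timing and the reported summary follows. The helper names are hypothetical, and the nearest-rank definition of p95 is an assumption: this methodology does not specify a percentile method, so the report should state whichever one the harness uses.

```python
import time
from statistics import median


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def summarize_latency(samples_s):
    """Summarize per-question latencies given in seconds."""
    if not samples_s:
        raise ValueError("no latency samples supplied")
    ordered = sorted(samples_s)
    # Nearest-rank p95: the smallest sample with at least 95% of samples at
    # or below it. Integer arithmetic avoids float rounding in the rank.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "median": median(ordered),
        "p95": ordered[rank - 1],
        "min": ordered[0],
        "max": ordered[-1],
    }
```

For systems with a cache or index, calling `summarize_latency` once on the cold-run samples and once on the warm-run samples keeps the two populations separate. Index build time would be timed outside `time_call` and reported as setup cost.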
| 172 | + |
| 173 | +## Reproducibility |
| 174 | + |
| 175 | +The public harness must run from a clean clone with one command after dependency |
| 176 | +installation. It must write: |
| 177 | + |
| 178 | +- Competitor manifest with pinned versions. |
| 179 | +- Corpus file. |
| 180 | +- Raw transcripts. |
| 181 | +- Raw scoring records. |
| 182 | +- Token-count records. |
| 183 | +- Latency records. |
| 184 | +- Summary report. |
| 185 | + |
| 186 | +Result files must include enough metadata to rerun or audit them: |
| 187 | + |
| 188 | +- Repository commit. |
| 189 | +- Python version. |
| 190 | +- Operating system. |
| 191 | +- Model name and provider. |
| 192 | +- MCP client or adapter version. |
| 193 | +- Competitor versions. |
| 194 | +- Timestamp. |
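
The audit metadata listed above could be collected in one place and embedded in every result file. This is a sketch only: the key names and the `build_run_metadata` signature are hypothetical, and the values passed in the usage line are placeholders, not real pins.

```python
import json
import platform
import subprocess
from datetime import datetime, timezone


def repository_commit() -> str:
    """Return the current git commit hash, or "unknown" outside a checkout."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return completed.stdout.strip()


def build_run_metadata(model, provider, client_version, competitor_versions):
    """Collect the metadata needed to rerun or audit a result file."""
    return {
        "repository_commit": repository_commit(),
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "model": model,
        "provider": provider,
        "mcp_client_version": client_version,
        "competitor_versions": dict(competitor_versions),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Placeholder values; a real run would read these from the manifest.
    metadata = build_run_metadata(
        "example-model", "example-provider", "0.0.0", {"example-competitor": "0.0.0"}
    )
    print(json.dumps(metadata, indent=2))
```

Recording the timestamp in UTC keeps result files comparable across machines in different time zones.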
| 195 | + |
| 196 | +## Honesty Rules |
| 197 | + |
| 198 | +- No comparative claim enters README, PyPI, launch copy, or social posts before |
| 199 | + public results exist. |
| 200 | +- Do not drop failed systems silently. If a competitor cannot run, report the |
| 201 | + failure reason and exclude it from scored comparisons. |
| 202 | +- Do not change the corpus after seeing results unless the change is documented |
| 203 | + and the whole benchmark is rerun. |
| 204 | +- Do not tune prompts per competitor. |
| 205 | +- Do not report approximate token counts as exact. |
| 206 | + |
| 207 | +## Harness Work Packages |
| 208 | + |
| 209 | +Once this methodology is accepted, the implementation can be split into smaller |
| 210 | +agent-ready issues: |
| 211 | + |
| 212 | +1. Corpus schema and fixture loader. |
| 213 | +2. Baseline runner and transcript format. |
| 214 | +3. `python-docs-mcp-server` runner. |
| 215 | +4. Competitor manifest and adapter skeletons. |
| 216 | +5. Correctness scorer with manual-adjudication hooks. |
| 217 | +6. Claude token-count integration after client rewrap. |
| 218 | +7. Latency recorder and report generator. |
| 219 | + |
| 220 | +Each work package should reference this methodology and use `Refs #63`, not |
| 221 | +`Closes #63`, until the full benchmark has produced public data. |