Commit 80a093c

[v0.5.0] benchmark — public token/correctness/latency harness vs docs MCPs (human-led) (#70)
* agent: docs: define public benchmark methodology
* agent: deps: refresh pyjwt lock

---------

Co-authored-by: bluecloud-gilfoyle[bot] <262642412+bluecloud-gilfoyle[bot]@users.noreply.github.com>
1 parent 8601093 commit 80a093c

2 files changed

Lines changed: 224 additions & 3 deletions

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
# Public Benchmark Methodology

**Status:** Draft methodology for issue
[#63](https://github.com/ayhammouda/python-docs-mcp-server/issues/63).
Do not publish comparative claims until the harness has produced reproducible
data from this methodology.

## Purpose

The v0.5.0 public benchmark compares `python-docs-mcp-server` with eligible
Python documentation MCPs and a no-MCP baseline on a fixed 50-question Python
documentation evaluation. The benchmark reports correctness, token cost, and
latency, with enough detail for a clean clone to reproduce the run.

This is an evidence artifact, not marketing copy. If the data is boring or
unfavorable, publish the data honestly and adjust the product claims.

## Evidence Flow

```mermaid
flowchart LR
    Corpus[50-question corpus] --> Runner[Benchmark runner]
    Competitors[Pinned MCP competitors] --> Runner
    Baseline[No-MCP baseline] --> Runner
    Runner --> Transcripts[Raw transcripts]
    Transcripts --> Tokens[Token counting after client rewrap]
    Transcripts --> Rubric[Correctness scoring]
    Transcripts --> Latency[Latency summary]
    Tokens --> Report[Public report]
    Rubric --> Report
    Latency --> Report
```

## Systems Under Test

The final competitor set is locked at execution time. A competitor is eligible
only if all of these are true:

- It exposes Python standard library documentation retrieval or search.
- It can be run or queried reproducibly from a clean clone.
- Its version, package, image, endpoint, or commit can be pinned.
- Its terms allow benchmark use.
- It does not require private, undocumented access.

The initial candidate matrix is:

- `python-docs-mcp-server`
- Context7
- GitMCP
- DeepWiki
- Ref.tools
- no-MCP baseline

The no-MCP baseline uses the same model and question prompts, but no retrieved
documentation context. It measures parametric model behavior, not another docs
tool.

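One way the eligibility rules in this section might be encoded in a competitor manifest is sketched below. The `Candidate` record, its field names, and `is_eligible` are hypothetical and are not part of the harness. The sketch only shows that eligibility is a conjunction of all five criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical manifest entry; field names are illustrative only."""

    name: str
    serves_stdlib_docs: bool      # exposes stdlib documentation retrieval or search
    reproducible: bool            # can be run or queried from a clean clone
    pinned_ref: Optional[str]     # version, package, image, endpoint, or commit
    terms_allow_benchmark: bool
    needs_private_access: bool


def is_eligible(candidate: Candidate) -> bool:
    """A competitor is eligible only if every criterion holds."""
    return (
        candidate.serves_stdlib_docs
        and candidate.reproducible
        and bool(candidate.pinned_ref)
        and candidate.terms_allow_benchmark
        and not candidate.needs_private_access
    )
```

A candidate that fails any single check, such as one that cannot be pinned, would stay out of the scored comparison.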
## Corpus Design

The corpus contains exactly 50 questions. Each question must include:

- Stable question ID.
- Category.
- Python version or version pair.
- Prompt shown to the model.
- Official-docs answer key.
- Required citations or source sections.
- Expected answer properties.
- Known ambiguity notes, if any.

Distribution:

- 15 exact-symbol questions.
- 10 concept or API-usage questions.
- 15 cross-version questions, led by `compare_versions`-style diffs.
- 5 PEP-adjacent questions where the official stdlib docs or "What's New"
  pages contain the required answer.
- 5 applied questions that require selecting the right stdlib API from the
  documentation.

The corpus must avoid questions whose answer requires private knowledge,
external package documentation, non-stdlib behavior, or unreleased CPython
changes.

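The required fields and the 15/10/15/5/5 distribution could be checked mechanically when the corpus is loaded. The following is a minimal sketch under assumptions: the `Question` record, the category keys in `QUOTAS`, and `validate_corpus` are illustrative names, not the schema the harness will use.

```python
from collections import Counter
from dataclasses import dataclass

# Category quotas from the distribution in this section; the keys are illustrative.
QUOTAS = {
    "exact_symbol": 15,
    "concept_or_usage": 10,
    "cross_version": 15,
    "pep_adjacent": 5,
    "applied": 5,
}


@dataclass(frozen=True)
class Question:
    question_id: str
    category: str
    python_versions: tuple[str, ...]      # one version, or a version pair
    prompt: str
    answer_key: str
    required_citations: tuple[str, ...]
    expected_properties: tuple[str, ...]
    ambiguity_notes: str = ""


def validate_corpus(questions: list[Question]) -> list[str]:
    """Return a list of problems; an empty list means the corpus fits the quotas."""
    problems = []
    ids = [q.question_id for q in questions]
    if len(set(ids)) != len(ids):
        problems.append("duplicate question IDs")
    counts = Counter(q.category for q in questions)
    for category, expected in QUOTAS.items():
        actual = counts.get(category, 0)
        if actual != expected:
            problems.append(f"{category}: expected {expected}, got {actual}")
    unknown = set(counts) - set(QUOTAS)
    if unknown:
        problems.append(f"unknown categories: {sorted(unknown)}")
    return problems
```

Because the quotas sum to 50, a corpus that passes this check also satisfies the "exactly 50 questions" rule.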
## Source Of Truth

Correctness is scored against official Python documentation generated from
CPython source at pinned tags. When a question concerns version behavior, the
answer key must cite the exact relevant version or version pair.

Allowed truth sources:

- CPython documentation source at pinned commit or tag.
- Generated official docs for the same Python version.
- Official "What's New" pages for PEP-adjacent behavior.

Disallowed truth sources:

- Blog posts.
- Search snippets.
- LLM-generated explanations.
- Third-party mirrors unless used only as a convenience link and verified
  against CPython source.

## Prompting Rules

Every system under test receives the same user question. The only allowed
difference is the documentation context supplied by that system.

The model prompt must require:

- A concise answer.
- Version-specific wording when the question names a version.
- No unsupported claims.
- A short citation to the retrieved section when the system provides one.

The prompt must not reveal the answer key, rubric, or expected winning system.

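A prompt builder that keeps everything fixed except the documentation context might look like the sketch below. The instruction wording in `INSTRUCTIONS` and the `build_prompt` name are assumptions for illustration; the methodology does not specify the exact prompt text.

```python
from typing import Optional

# Illustrative instruction text; the real wording is not fixed by this document.
INSTRUCTIONS = (
    "Answer concisely. Use version-specific wording when the question names a "
    "Python version. Make no unsupported claims. If documentation context is "
    "provided, cite the retrieved section briefly."
)


def build_prompt(question: str, docs_context: Optional[str] = None) -> str:
    """Build the model prompt; only docs_context may vary between systems."""
    parts = [INSTRUCTIONS]
    if docs_context:
        # The no-MCP baseline passes no context, so this part is omitted for it.
        parts.append("Documentation context:\n" + docs_context)
    parts.append("Question:\n" + question)
    return "\n\n".join(parts)
```

Routing every system through one function like this makes per-competitor prompt tuning hard to introduce by accident, and the function never receives the answer key or rubric.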
## Correctness Scoring

Each answer receives one score:

- `1.0`: Correct, version-aware, and includes all required answer properties.
- `0.5`: Partially correct, but misses a required nuance, version condition, or
  citation.
- `0.0`: Incorrect, unsupported, materially incomplete, or answers the wrong
  version.

For public reporting, include both:

- Mean correctness score.
- Per-category correctness score.

Any answer that appears correct but lacks evidence from the supplied docs is
marked in the raw results and discussed separately. The benchmark should reward
grounded answers, not confident autocomplete.

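The two reported aggregates can be computed from raw scoring records as in this sketch. The `(category, score)` record shape and the `summarize` name are assumptions; the per-category value is taken here to be the mean score within each category.

```python
from collections import defaultdict
from statistics import mean

ALLOWED_SCORES = {0.0, 0.5, 1.0}


def summarize(records):
    """Aggregate (category, score) pairs for one system under test."""
    by_category = defaultdict(list)
    for category, score in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"invalid score: {score}")
        by_category[category].append(score)
    all_scores = [s for scores in by_category.values() for s in scores]
    return {
        "mean": mean(all_scores),  # raises StatisticsError if there are no records
        "per_category": {c: mean(s) for c, s in by_category.items()},
    }
```

Rejecting any value outside `0.0`, `0.5`, and `1.0` keeps manual adjudication from introducing ad hoc partial scores.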
## Token Measurement

Token methodology follows roadmap decision 5.8 and ADR-006:

- Use Claude token counting as the primary metric.
- Measure after client-side rewrap, not only raw MCP payload bytes.
- Record raw payload tokens separately as diagnostic data.
- Report serialization latency alongside token counts.

Primary token count:

1. Capture the MCP tool response or baseline prompt context.
2. Pass it through the same client-side wrapping path used by the benchmark
   client.
3. Count the resulting message envelope with Claude token counting.

If a client cannot expose its exact wrapped message envelope, the report must
say so and mark that result as an approximation. Approximate counts must not be
used for headline claims.

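The three-step primary count and the approximation rule can be sketched as follows. This is an illustration, not the harness: `TokenRecord`, `measure_tokens`, and `headline_eligible` are hypothetical names, and the token counter is injected as a plain callable so the real harness could back it with Claude token counting while tests use a stand-in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TokenRecord:
    wrapped_tokens: int   # primary metric: counted after client-side rewrap
    raw_tokens: int       # diagnostic only: raw payload
    approximate: bool     # True when the exact envelope was not available


def measure_tokens(
    raw_payload: str,
    rewrap: Optional[Callable[[str], str]],
    count_tokens: Callable[[str], int],
) -> TokenRecord:
    """Count tokens on the wrapped envelope, keeping the raw count as a diagnostic."""
    raw = count_tokens(raw_payload)
    if rewrap is None:
        # The client cannot expose its envelope: fall back to the raw count and flag it.
        return TokenRecord(wrapped_tokens=raw, raw_tokens=raw, approximate=True)
    wrapped = count_tokens(rewrap(raw_payload))
    return TokenRecord(wrapped_tokens=wrapped, raw_tokens=raw, approximate=False)


def headline_eligible(record: TokenRecord) -> bool:
    """Approximate counts must not feed headline claims."""
    return not record.approximate
```

Carrying the `approximate` flag on every record lets the report generator exclude flagged results from headline numbers without a separate lookup.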
## Latency Measurement

Latency is wall-clock time measured per question from request dispatch to final
answer receipt.

Report:

- Median.
- p95.
- Minimum and maximum.
- Cold-run and warm-run separation where the system has a cache or index.

Index build time is not part of per-query latency. It may be reported as setup
cost in a separate section.

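A small sketch of per-question timing and the summary statistics follows. The helper names are hypothetical, and the nearest-rank definition of p95 is an assumption, since this methodology does not say which percentile method to use.

```python
import statistics
import time


def timed(call):
    """Run call() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def summarize_latency(samples):
    """Summarize per-question latencies for one system and one cache state."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no latency samples")
    # Nearest-rank p95 (assumed method): ceil(0.95 * n), in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "min": ordered[0],
        "max": ordered[-1],
    }
```

Cold and warm runs would each get their own call to `summarize_latency`, and index build time would be timed outside the per-question loop.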
## Reproducibility

The public harness must run from a clean clone with one command after dependency
installation. It must write:

- Competitor manifest with pinned versions.
- Corpus file.
- Raw transcripts.
- Raw scoring records.
- Token-count records.
- Latency records.
- Summary report.

Result files must include enough metadata to rerun or audit them:

- Repository commit.
- Python version.
- Operating system.
- Model name and provider.
- MCP client or adapter version.
- Competitor versions.
- Timestamp.

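The audit metadata listed above could be gathered once per run and written into every result file. The sketch below is illustrative: `run_metadata` and the key names are assumptions, and values that cannot be detected locally (model, provider, client and competitor versions) are supplied by the caller.

```python
import platform
import subprocess
from datetime import datetime, timezone


def run_metadata(model, provider, client_version, competitor_versions):
    """Collect audit metadata for a benchmark run (hypothetical key names)."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Not a git checkout, or git is not installed.
        commit = "unknown"
    return {
        "repository_commit": commit,
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "model": model,
        "provider": provider,
        "mcp_client_version": client_version,
        "competitor_versions": dict(competitor_versions),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Recording the timestamp in UTC avoids ambiguity when results from different machines are compared.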
## Honesty Rules

- No comparative claim enters README, PyPI, launch copy, or social posts before
  public results exist.
- Do not drop failed systems silently. If a competitor cannot run, report the
  failure reason and exclude it from scored comparisons.
- Do not change the corpus after seeing results unless the change is documented
  and the whole benchmark is rerun.
- Do not tune prompts per competitor.
- Do not report approximate token counts as exact.

## Harness Work Packages

Once this methodology is accepted, the implementation can be split into smaller
agent-ready issues:

1. Corpus schema and fixture loader.
2. Baseline runner and transcript format.
3. `python-docs-mcp-server` runner.
4. Competitor manifest and adapter skeletons.
5. Correctness scorer with manual-adjudication hooks.
6. Claude token-count integration after client rewrap.
7. Latency recorder and report generator.

Each work package should reference this methodology and use `Refs #63`, not
`Closes #63`, until the full benchmark has produced public data.

uv.lock

Lines changed: 3 additions & 3 deletions
