Skip to content

Commit 71d682a

Browse files
committed
Merge branch 'main' into mrtr-tests
2 parents a8fa736 + acc70de commit 71d682a

65 files changed

Lines changed: 4166 additions & 1681 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.claude/skills/new-sep/SKILL.md

Lines changed: 201 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,201 @@
1+
---
2+
name: new-sep
3+
description: >-
4+
Scaffold a sep-NNNN.yaml requirement-traceability file for the MCP
5+
conformance repo from a SEP PR's spec diff. Runs the new-sep CLI, then
6+
parses the modelcontextprotocol/modelcontextprotocol spec diff to populate
7+
`requirements[]` with the RFC 2119 sentences and proposed check IDs.
8+
argument-hint: '<sep-number>'
9+
---
10+
11+
# new-sep: SEP traceability YAML scaffolding
12+
13+
You are bootstrapping a `sep-NNNN.yaml` file for a new SEP in the MCP conformance repo. The output is the requirement-traceability file specified by SEP-2484: a YAML that maps each normative sentence from the SEP's spec diff to a `check:` ID (testable) or an `excluded:` reason (not testable). The CLI gets the skeleton; you fill in the rows by reading the spec diff.
14+
15+
## Step 0: Pre-flight checks
16+
17+
Before doing anything else, verify GitHub CLI authentication:
18+
19+
```bash
20+
gh auth status 2>&1
21+
```
22+
23+
If this fails, stop immediately and tell the user:
24+
25+
> GitHub authentication is required for this skill. Please run `gh auth login` first, then re-run.
26+
27+
Verify you're running inside the conformance repo:
28+
29+
```bash
30+
test -f package.json && jq -r '.name' package.json
31+
```
32+
33+
The name should be `@modelcontextprotocol/conformance`. If not, stop and ask the user to `cd` into the conformance repo first.
34+
35+
## Step 1: Parse arguments
36+
37+
Extract from the user's input:
38+
39+
- **sep-number** (required): the SEP number, e.g. `2164`. This is also the PR number in `modelcontextprotocol/modelcontextprotocol` by convention.
40+
41+
## Step 2: Generate the skeleton
42+
43+
Run the CLI:
44+
45+
```bash
46+
npm run --silent build
47+
node dist/index.js new-sep <NNNN>
48+
```
49+
50+
(For development against a non-built source tree: `npx tsx src/index.ts new-sep ...`.)
51+
52+
The CLI writes `src/seps/sep-<NNNN>.yaml` with `sep`, `spec_url`, and two TODO `requirements[]` rows. Capture the output path from the CLI's `Wrote …` line and remember it as `$YAML`.
53+
54+
If the CLI errors with "does not change any docs/specification/draft/\*.mdx", the SEP's spec changes landed in a separate PR — ask the user for the spec file path and rerun with `--spec-path docs/specification/draft/<path>`. Do not guess.
55+
56+
## Step 3: Fetch the spec diff
57+
58+
`AGENTS.md` (lines 64–72) is explicit that severity must come from the spec text itself, not the SEP markdown or the conformance PR description:
59+
60+
```bash
61+
gh api "repos/modelcontextprotocol/modelcontextprotocol/pulls/<NNNN>/files" \
62+
--jq '.[] | select(.filename | test("^docs/specification/draft/.*\\.mdx$")) | {filename, patch}'
63+
```
64+
65+
For each file, pull the added (`+`-prefixed) lines from `patch`. If `patch` is truncated for a large file, fall back to fetching the whole file at the PR's head ref:
66+
67+
```bash
68+
gh api "repos/modelcontextprotocol/modelcontextprotocol/contents/<path>?ref=<sep-branch>" \
69+
--jq '.content' | base64 -d
70+
```
71+
72+
## Step 4: Extract RFC 2119 requirements
73+
74+
Walk the added lines and identify sentences containing the keywords: **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **MAY**, **OPTIONAL**.
75+
76+
**Quote the whole sentence**, not just the matched line. The matched word may sit inside a bullet point whose lead-in sentence supplies the keyword by inheritance — e.g.:
77+
78+
> Servers SHOULD return standard JSON-RPC errors for common failure cases:
79+
>
80+
> - Resource not found: -32602 (Invalid Params)
81+
82+
The bullet inherits `SHOULD`. The yaml row should quote the _combined_ obligation: `'Servers SHOULD return standard JSON-RPC errors for common failure cases: Resource not found: -32602 (Invalid Params)'` — see `src/seps/sep-2164.yaml` for the canonical example.
83+
84+
**Regex alone is insufficient** (this is called out in Issue #243). Read for context: pronouns, "the server", and "such cases" all refer back to the lead-in.
85+
86+
## Step 5: Map severity → check vs. excluded
87+
88+
From `AGENTS.md:50-56`:
89+
90+
| Keyword | Severity | YAML field |
91+
| ---------------------------------------------- | -------- | -------------------------- |
92+
| MUST / MUST NOT / SHALL / SHALL NOT / REQUIRED | FAILURE | `check: sep-<NNNN>-<slug>` |
93+
| SHOULD / SHOULD NOT | WARNING | `check: sep-<NNNN>-<slug>` |
94+
| MAY / OPTIONAL || _no row — skip entirely_ |
95+
96+
MAY / OPTIONAL sentences are noted in Step 4 only so you consciously skip them — they never produce a yaml row.
97+
98+
A row is `excluded:` when a MUST/SHOULD requirement can't be protocol-observed by the harness. Do **not** write any `excluded:` row on your own authority — every exclusion goes through Step 6.
99+
100+
While classifying, sort each MUST/SHOULD row into one of three buckets:
101+
102+
- **`check:`** — observably testable on the wire.
103+
- **clearly-excluded** — you're confident it can't be observed (e.g. "clients SHOULD also accept -32002" when the harness only drives servers).
104+
- **borderline** — you'd default to `check:` but observability is questionable. Markers:
105+
- _Internal state_ — verbs like _record_, _store_, _associate_, _track_, _cache_. The harness sees wire traffic, not memory; usually only observable via a downstream row already in your list.
106+
- _UI / human-facing__display_, _show_, _render_, _prompt the user_.
107+
- _Precondition phrasing_ — "Before doing X, the implementation MUST Y" where X is itself another row.
108+
109+
Slug convention: lowercase-kebab, derived from the verb phrase. Examples from `sep-2164.yaml`: `no-empty-contents`, `error-code`. Same `id` is used for SUCCESS and FAILURE (`AGENTS.md:52`).
110+
111+
## Step 6: Confirm exclusions with the user
112+
113+
Nothing becomes `excluded:` without sign-off. Two rounds:
114+
115+
**Round 1 — clearly-excluded, single batch question.** One `AskUserQuestion` listing all clearly-excluded rows in the question body (slug + one-line reason each). Options:
116+
117+
- `Exclude all as listed (Recommended)`
118+
- `Flip all to check:`
119+
- `Let me adjust per-row` — if chosen, append these rows to round 2.
120+
121+
Skip this round if the bucket is empty.
122+
123+
**Round 2 — borderline, one question per row.** One `AskUserQuestion` call with a question per borderline row (loop in batches of 4 if needed). For each:
124+
125+
- header: the proposed slug
126+
- question: quote the requirement sentence + your one-line observability concern
127+
- options (list `check:` first — it's the default for borderline):
128+
- `check:` — keep as a testable check
129+
- `excluded: <reason>` — drop to excluded with your stated reason
130+
- `merge into <other-slug>` — offer when the row is a precondition for another row already in the list
131+
132+
Apply the answers before writing. For any `excluded:` outcome, write the reason verbatim into the yaml and add an `issue:` URL if the user supplies one. A `merge` outcome means: drop this row, and append its `text:` to the surviving row's `text:` separated by `/` so the traceability isn't lost.
133+
134+
## Step 7: Rewrite the YAML
135+
136+
Replace the two TODO rows the CLI generated with one row per extracted requirement. Preserve the CLI's quoting style (single quotes, two-space indent — see `src/seps/sep-2164.yaml`).
137+
138+
**Key order within each row** — for `check:` rows the **`check:` key comes first**, then `text:`, then any optional `url:`. Scanning the left margin should reveal every check ID without reading the quoted sentences. For `excluded:` rows the order is **`text:` first**, then `excluded:`, then optional `issue:` — there's no ID to scan for, so lead with the requirement.
139+
140+
**Row order in the file** — all `check:` rows first (in spec-diff order), then **all `excluded:` rows grouped at the bottom**, separated from the checks by **one blank line**. Do not interleave.
141+
142+
```yaml
143+
requirements:
144+
- check: sep-NNNN-first-slug
145+
text: '...'
146+
- check: sep-NNNN-second-slug
147+
text: '...'
148+
149+
- text: '...'
150+
excluded: 'reason'
151+
issue: https://github.com/modelcontextprotocol/conformance/issues/<NNNN>
152+
```
153+
154+
If a requirement is ambiguous or you're not confident, leave it as a `TODO:` row rather than guessing — humans review this yaml before scenarios get written.
155+
156+
Also fix the `spec_url`: the CLI emits the page URL with no anchor. If the requirements you extracted live under a specific spec subsection (e.g. `#error-handling`), append it.
157+
158+
If a requirement comes from a **different spec page** than `spec_url` (the SEP touched multiple `.mdx` files — the CLI prints these as "PR also changes N other spec file(s)"), give that row a full `url:` override:
159+
160+
```yaml
161+
- check: sep-NNNN-slug
162+
text: '...'
163+
url: https://modelcontextprotocol.io/specification/draft/other/page#anchor
164+
```
165+
166+
A row's effective spec reference is `row.url ?? file.spec_url`.
167+
168+
Write the result back to `$YAML`.
169+
170+
## Step 8: Suggest a host scenario
171+
172+
`AGENTS.md` prefers **fewer scenarios with more checks** over one-scenario-per-check. Before telling the user to write a new scenario, look for an existing one the new checks could be folded into.
173+
174+
Determine the suite directory from the requirement subjects ("MCP clients MUST…" → `client/`, "Servers MUST…" → `server/`, "authorization servers MUST…" → `authorization-server/`; a SEP may map to more than one). Then search that directory for scenarios touching the same spec area:
175+
176+
```bash
177+
rg -l -i '<domain-term>|<domain-term-2>' src/scenarios/<suite>/ --type ts
178+
```
179+
180+
Pick 2–3 domain terms from the SEP's subject matter (for a discovery SEP: `metadata`, `well-known`; for an auth-response SEP: `redirect`, `callback`, `pkce`). For each hit, pull the scenario's `name`/`description` to confirm relevance:
181+
182+
```bash
183+
rg -A1 'name:|description:' <hit.ts>
184+
```
185+
186+
If you find a plausible host, recommend it by path. If nothing fits, say so explicitly — a new scenario file is then the right call.
187+
188+
## Step 9: Hand-off
189+
190+
Report to the user, in this order:
191+
192+
1. Path to the generated yaml.
193+
2. Row counts: "`N check:` rows, `M excluded:` rows" — and note which exclusions the user signed off in Step 6.
194+
3. Any requirements you left as `TODO:` and why.
195+
4. **Host-scenario recommendation** from Step 8 — either "consider adding these checks to `src/scenarios/<suite>/<file>.ts` (it already exercises _X_)" or "no existing scenario covers this area; a new file is appropriate".
196+
5. Remaining next steps the user owns:
197+
- add the checks to the host scenario (or create one) under `src/scenarios/{client,server,authorization-server}/`,
198+
- register any new scenario in `src/scenarios/index.ts` (`AGENTS.md:48`),
199+
- add a passing example to the everything-client/server and a negative test, per `AGENTS.md:74-81`.
200+
201+
Do **not** generate or edit scenario `.ts` files or touch `src/scenarios/index.ts`. The skill's scope ends at the yaml plus the recommendation.

.github/dependabot.yml

Lines changed: 0 additions & 29 deletions
This file was deleted.

AGENTS.md

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
# AGENTS.md
2+
3+
Guidance for AI agents (and humans) contributing to the MCP conformance test framework.
4+
5+
## What this repo is
6+
7+
A test harness that exercises MCP SDK implementations against the protocol spec. The coverage number that matters here is **spec coverage** — how much of the protocol the scenarios test.
8+
9+
Uses **npm** (not pnpm/yarn). Don't commit `pnpm-lock.yaml` or `yarn.lock`.
10+
11+
## Where to start
12+
13+
**Open an issue first** — whether you've hit a bug in the harness or want to propose a new scenario. For scenarios, sketch which part of the spec you want to cover and roughly how; for bugs, include the command you ran and the output. Either way, a short discussion up front beats review churn on a PR that overlaps existing work or heads in a direction we're not going.
14+
15+
**Don't point an agent at the repo and ask it to "find bugs."** Generic bug-hunting on a test harness produces low-signal PRs (typo fixes, unused-variable cleanups, speculative refactors). If you want to contribute via an agent, give it a concrete target:
16+
17+
- Pick a specific MUST or SHOULD from the [MCP spec](https://modelcontextprotocol.io/specification/) that has no scenario yet, and ask the agent to draft one.
18+
- Pick an [open issue](https://github.com/modelcontextprotocol/conformance/issues) and work on that.
19+
20+
The valuable contribution here is **spec coverage**, not harness polish.
21+
22+
## Scenario design: fewer scenarios, more checks
23+
24+
**The strongest rule in this repo:** prefer one scenario with many checks over many scenarios with one check each.
25+
26+
Why:
27+
28+
- Each scenario often spins up its own HTTP server. These suites run in CI on every push for every SDK, so per-scenario overhead multiplies fast.
29+
- Less code to maintain and update when the spec shifts.
30+
- Progress on making an SDK better shows up as "pass 7/10 checks" rather than "pass 1 test, fail another" — finer-grained signal from the same run.
31+
32+
### Granularity heuristic
33+
34+
Ask: **"Would it make sense for someone to implement a server/client that does just this scenario?"**
35+
36+
If two scenarios would always be implemented together, merge them. Examples:
37+
38+
- `tools/list` + a simple `tools/call` → one scenario
39+
- All content-type variants (image, audio, mixed, resource) → one scenario
40+
- Full OAuth flow with token refresh → one scenario, not separate "basic" + "refresh" scenarios. A client that passes "basic" but not "refresh" just shows up as passing N−2 checks.
41+
42+
Keep scenarios separate when they're genuinely independent features or when they're mutually exclusive (e.g., an SDK should support writing a server that _doesn't_ implement certain stateful features).
43+
44+
### When a PR adds scenarios
45+
46+
- Start with **one end-to-end scenario** covering the happy path with many checks along the way.
47+
- Don't add "step 1 only" and "step 1+2" as separate scenarios — the second subsumes the first.
48+
- Register the scenario in the appropriate suite list in `src/scenarios/index.ts` (`core`, `extensions`, `backcompat`, etc.).
49+
50+
## Check conventions
51+
52+
- **Same `id` for SUCCESS and FAIL.** A check should use one slug and flip `status` + `errorMessage`, not branch into `foo-success` vs `foo-failure` slugs.
53+
- **Optimize for Ctrl+F on the slug.** Repetitive check blocks are fine — easier to find the failing one than to unwind a clever helper.
54+
- Reuse `ConformanceCheck` and other types from `src/types.ts` rather than defining parallel shapes.
55+
- Include `specReferences` pointing to the relevant spec section.
56+
- **Severity follows the spec keyword:** MUST / MUST NOT → `FAILURE`; SHOULD / SHOULD NOT → `WARNING`. (CI treats WARNING as a failure, so Tier-1 SDKs still need to satisfy SHOULDs — see #245.)
57+
58+
## Descriptions and wording
59+
60+
Be precise about what's **required** vs **optional**. A scenario description that tests optional behavior should make that clear — e.g. "Tests that a client _that wants a refresh token_ handles offline_access scope…" not "Tests that a client handles offline_access scope…". Don't accidentally promote a MAY/SHOULD to a MUST in the prose.
61+
62+
When in doubt about spec details (OAuth parameters, audiences, grant types), check the actual spec in `modelcontextprotocol` rather than guessing.
63+
64+
## Reviewing PRs
65+
66+
### SEP scenarios
67+
68+
Verify requirement levels against the SEP's **spec diff** — the change to `docs/specification/draft/` in the SEP's PR — not the SEP markdown summary or the conformance PR's description. The keyword that governs check severity is the one in the spec text; a bullet under a "Servers SHOULD…" sentence is SHOULD-level even if the SEP's title says "standardize."
69+
70+
```sh
71+
gh api "repos/modelcontextprotocol/modelcontextprotocol/contents/docs/specification/draft/<path>?ref=<sep-branch>" --jq '.content' | base64 -d
72+
```
73+
74+
### Adding a new SEP
75+
76+
Scaffold the requirement-traceability YAML with:
77+
78+
```sh
79+
npx @modelcontextprotocol/conformance new-sep <NNNN>
80+
```
81+
82+
The command looks up PR #`<NNNN>` in `modelcontextprotocol/modelcontextprotocol` (SEP numbers are PR numbers), derives `spec_url` from the `docs/specification/draft/*.mdx` file it changes, and writes `src/seps/sep-<NNNN>.yaml` with TODO `requirements[]` rows. Use `--spec-path` or `--spec-url` to skip the lookup. The `new-sep` Claude Code skill drives the same flow end-to-end, parses the spec diff, and fills in the requirement rows.
83+
84+
## Examples: prove it passes and fails
85+
86+
A new scenario should come with:
87+
88+
1. **A passing example** — usually by extending `examples/clients/typescript/everything-client.ts` or the everything-server, not a new file.
89+
2. **A negative test** — a deliberately-broken implementation in `examples/{clients,servers}/typescript/` plus a vitest case asserting the check emits `FAILURE`/`WARNING` against it. See `src/scenarios/client/auth/index.test.ts` and `src/scenarios/server/negative.test.ts` for the pattern. A passing run against the everything-server proves the check doesn't false-positive, but not that it catches anything.
90+
91+
Delete unused example scenarios. If a scenario key in the everything-client has no corresponding test, remove it.
92+
93+
## Don't add new ways to run tests
94+
95+
Use the existing CLI runner (`npx @modelcontextprotocol/conformance client|server ...`). If you need a feature the runner doesn't have, add it to the runner rather than building a parallel entry point.
96+
97+
## Before opening a PR
98+
99+
- `npm run build` passes
100+
- `npm test` passes
101+
- For non-trivial scenario changes, run against at least one real SDK (typescript-sdk or python-sdk) to see actual output. For changes to shared infrastructure (runner, tier-check), test against go-sdk or csharp-sdk too.
102+
- Scenario is registered in the right suite in `src/scenarios/index.ts`

0 commit comments

Comments
 (0)