modelcontextprotocol
diff --git a/‎.claude/skills/new-sep/SKILL.md‎
Lines changed: 201 additions & 0 deletions b/‎.claude/skills/new-sep/SKILL.md‎
Lines changed: 201 additions & 0 deletions
diff --git a/‎.github/dependabot.yml‎
Lines changed: 0 additions & 29 deletions b/‎.github/dependabot.yml‎
Lines changed: 0 additions & 29 deletions
diff --git a/‎AGENTS.md‎
Lines changed: 102 additions & 0 deletions b/‎AGENTS.md‎
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,201 @@
+---
+name: new-sep
+description: >-
+  Scaffold a sep-NNNN.yaml requirement-traceability file for the MCP
+  conformance repo from a SEP PR's spec diff. Runs the new-sep CLI, then
+  parses the modelcontextprotocol/modelcontextprotocol spec diff to populate
+  `requirements[]` with the RFC 2119 sentences and proposed check IDs.
+argument-hint: '<sep-number>'
+---
+
+# new-sep: SEP traceability YAML scaffolding
+
+You are bootstrapping a `sep-NNNN.yaml` file for a new SEP in the MCP conformance repo. The output is the requirement-traceability file specified by SEP-2484: a YAML that maps each normative sentence from the SEP's spec diff to a `check:` ID (testable) or an `excluded:` reason (not testable). The CLI gets the skeleton; you fill in the rows by reading the spec diff.
+
+## Step 0: Pre-flight checks
+
+Before doing anything else, verify GitHub CLI authentication:
+
+```bash
+gh auth status 2>&1
+```
+
+If this fails, stop immediately and tell the user:
+
+> GitHub authentication is required for this skill. Please run `gh auth login` first, then re-run.
+
+Verify you're running inside the conformance repo:
+
+```bash
+test -f package.json && jq -r '.name' package.json
+```
+
+The name should be `@modelcontextprotocol/conformance`. If not, stop and ask the user to `cd` into the conformance repo first.
+
+## Step 1: Parse arguments
+
+Extract from the user's input:
+
+- **sep-number** (required): the SEP number, e.g. `2164`. This is also the PR number in `modelcontextprotocol/modelcontextprotocol` by convention.
+
+## Step 2: Generate the skeleton
+
+Run the CLI:
+
+```bash
+npm run --silent build
+node dist/index.js new-sep <NNNN>
+```
+
+(For development against a non-built source tree: `npx tsx src/index.ts new-sep ...`.)
+
+The CLI writes `src/seps/sep-<NNNN>.yaml` with `sep`, `spec_url`, and two TODO `requirements[]` rows. Capture the output path from the CLI's `Wrote …` line and remember it as `$YAML`.
+
+If the CLI errors with "does not change any docs/specification/draft/\*.mdx", the SEP's spec changes landed in a separate PR — ask the user for the spec file path and rerun with `--spec-path docs/specification/draft/<path>`. Do not guess.
+
+## Step 3: Fetch the spec diff
+
+`AGENTS.md` (lines 64–72) is explicit that severity must come from the spec text itself, not the SEP markdown or the conformance PR description:
+
+```bash
+gh api "repos/modelcontextprotocol/modelcontextprotocol/pulls/<NNNN>/files" \
+  --jq '.[] | select(.filename | test("^docs/specification/draft/.*\\.mdx$")) | {filename, patch}'
+```
+
+For each file, pull the added (`+`-prefixed) lines from `patch`. If `patch` is truncated for a large file, fall back to fetching the whole file at the PR's head ref:
+
+```bash
+gh api "repos/modelcontextprotocol/modelcontextprotocol/contents/<path>?ref=<sep-branch>" \
+  --jq '.content' | base64 -d
+```
+
+## Step 4: Extract RFC 2119 requirements
+
+Walk the added lines and identify sentences containing the keywords: **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **MAY**, **OPTIONAL**.
+
+**Quote the whole sentence**, not just the matched line. The matched word may sit inside a bullet point whose lead-in sentence supplies the keyword by inheritance — e.g.:
+
+> Servers SHOULD return standard JSON-RPC errors for common failure cases:
+>
+> - Resource not found: -32602 (Invalid Params)
+
+The bullet inherits `SHOULD`. The yaml row should quote the _combined_ obligation: `'Servers SHOULD return standard JSON-RPC errors for common failure cases: Resource not found: -32602 (Invalid Params)'` — see `src/seps/sep-2164.yaml` for the canonical example.
+
+**Regex alone is insufficient** (this is called out in Issue #243). Read for context: pronouns, "the server", and "such cases" all refer back to the lead-in.
+
+## Step 5: Map severity → check vs. excluded
+
+From `AGENTS.md:50-56`:
+
+| Keyword                                        | Severity | YAML field                 |
+| ---------------------------------------------- | -------- | -------------------------- |
+| MUST / MUST NOT / SHALL / SHALL NOT / REQUIRED | FAILURE  | `check: sep-<NNNN>-<slug>` |
+| SHOULD / SHOULD NOT                            | WARNING  | `check: sep-<NNNN>-<slug>` |
+| MAY / OPTIONAL                                 | —        | _no row — skip entirely_   |
+
+MAY / OPTIONAL sentences are noted in Step 4 only so you consciously skip them — they never produce a yaml row.
+
+A row is `excluded:` when a MUST/SHOULD requirement can't be protocol-observed by the harness. Do **not** write any `excluded:` row on your own authority — every exclusion goes through Step 6.
+
+While classifying, sort each MUST/SHOULD row into one of three buckets:
+
+- **`check:`** — observably testable on the wire.
+- **clearly-excluded** — you're confident it can't be observed (e.g. "clients SHOULD also accept -32002" when the harness only drives servers).
+- **borderline** — you'd default to `check:` but observability is questionable. Markers:
+  - _Internal state_ — verbs like _record_, _store_, _associate_, _track_, _cache_. The harness sees wire traffic, not memory; usually only observable via a downstream row already in your list.
+  - _UI / human-facing_ — _display_, _show_, _render_, _prompt the user_.
+  - _Precondition phrasing_ — "Before doing X, the implementation MUST Y" where X is itself another row.
+
+Slug convention: lowercase-kebab, derived from the verb phrase. Examples from `sep-2164.yaml`: `no-empty-contents`, `error-code`. Same `id` is used for SUCCESS and FAILURE (`AGENTS.md:52`).
+
+## Step 6: Confirm exclusions with the user
+
+Nothing becomes `excluded:` without sign-off. Two rounds:
+
+**Round 1 — clearly-excluded, single batch question.** One `AskUserQuestion` listing all clearly-excluded rows in the question body (slug + one-line reason each). Options:
+
+- `Exclude all as listed (Recommended)`
+- `Flip all to check:`
+- `Let me adjust per-row` — if chosen, append these rows to round 2.
+
+Skip this round if the bucket is empty.
+
+**Round 2 — borderline, one question per row.** One `AskUserQuestion` call with a question per borderline row (loop in batches of 4 if needed). For each:
+
+- header: the proposed slug
+- question: quote the requirement sentence + your one-line observability concern
+- options (list `check:` first — it's the default for borderline):
+  - `check:` — keep as a testable check
+  - `excluded: <reason>` — drop to excluded with your stated reason
+  - `merge into <other-slug>` — offer when the row is a precondition for another row already in the list
+
+Apply the answers before writing. For any `excluded:` outcome, write the reason verbatim into the yaml and add an `issue:` URL if the user supplies one. A `merge` outcome means: drop this row, and append its `text:` to the surviving row's `text:` separated by `/` so the traceability isn't lost.
+
+## Step 7: Rewrite the YAML
+
+Replace the two TODO rows the CLI generated with one row per extracted requirement. Preserve the CLI's quoting style (single quotes, two-space indent — see `src/seps/sep-2164.yaml`).
+
+**Key order within each row** — for `check:` rows the **`check:` key comes first**, then `text:`, then any optional `url:`. Scanning the left margin should reveal every check ID without reading the quoted sentences. For `excluded:` rows the order is **`text:` first**, then `excluded:`, then optional `issue:` — there's no ID to scan for, so lead with the requirement.
+
+**Row order in the file** — all `check:` rows first (in spec-diff order), then **all `excluded:` rows grouped at the bottom**, separated from the checks by **one blank line**. Do not interleave.
+
+```yaml
+requirements:
+  - check: sep-NNNN-first-slug
+    text: '...'
+  - check: sep-NNNN-second-slug
+    text: '...'
+
+  - text: '...'
+    excluded: 'reason'
+    issue: https://github.com/modelcontextprotocol/conformance/issues/<NNNN>
+```
+
+If a requirement is ambiguous or you're not confident, leave it as a `TODO:` row rather than guessing — humans review this yaml before scenarios get written.
+
+Also fix the `spec_url`: the CLI emits the page URL with no anchor. If the requirements you extracted live under a specific spec subsection (e.g. `#error-handling`), append it.
+
+If a requirement comes from a **different spec page** than `spec_url` (the SEP touched multiple `.mdx` files — the CLI prints these as "PR also changes N other spec file(s)"), give that row a full `url:` override:
+
+```yaml
+- check: sep-NNNN-slug
+  text: '...'
+  url: https://modelcontextprotocol.io/specification/draft/other/page#anchor
+```
+
+A row's effective spec reference is `row.url ?? file.spec_url`.
+
+Write the result back to `$YAML`.
+
+## Step 8: Suggest a host scenario
+
+`AGENTS.md` prefers **fewer scenarios with more checks** over one-scenario-per-check. Before telling the user to write a new scenario, look for an existing one the new checks could be folded into.
+
+Determine the suite directory from the requirement subjects ("MCP clients MUST…" → `client/`, "Servers MUST…" → `server/`, "authorization servers MUST…" → `authorization-server/`; a SEP may map to more than one). Then search that directory for scenarios touching the same spec area:
+
+```bash
+rg -l -i '<domain-term>|<domain-term-2>' src/scenarios/<suite>/ --type ts
+```
+
+Pick 2–3 domain terms from the SEP's subject matter (for a discovery SEP: `metadata`, `well-known`; for an auth-response SEP: `redirect`, `callback`, `pkce`). For each hit, pull the scenario's `name`/`description` to confirm relevance:
+
+```bash
+rg -A1 'name:|description:' <hit.ts>
+```
+
+If you find a plausible host, recommend it by path. If nothing fits, say so explicitly — a new scenario file is then the right call.
+
+## Step 9: Hand-off
+
+Report to the user, in this order:
+
+1. Path to the generated yaml.
+2. Row counts: "`N check:` rows, `M excluded:` rows" — and note which exclusions the user signed off in Step 6.
+3. Any requirements you left as `TODO:` and why.
+4. **Host-scenario recommendation** from Step 8 — either "consider adding these checks to `src/scenarios/<suite>/<file>.ts` (it already exercises _X_)" or "no existing scenario covers this area; a new file is appropriate".
+5. Remaining next steps the user owns:
+   - add the checks to the host scenario (or create one) under `src/scenarios/{client,server,authorization-server}/`,
+   - register any new scenario in `src/scenarios/index.ts` (`AGENTS.md:48`),
+   - add a passing example to the everything-client/server and a negative test, per `AGENTS.md:74-81`.
+
+Do **not** generate or edit scenario `.ts` files or touch `src/scenarios/index.ts`. The skill's scope ends at the yaml plus the recommendation.
@@ -0,0 +1,102 @@
+# AGENTS.md
+
+Guidance for AI agents (and humans) contributing to the MCP conformance test framework.
+
+## What this repo is
+
+A test harness that exercises MCP SDK implementations against the protocol spec. The coverage number that matters here is **spec coverage** — how much of the protocol the scenarios test.
+
+Uses **npm** (not pnpm/yarn). Don't commit `pnpm-lock.yaml` or `yarn.lock`.
+
+## Where to start
+
+**Open an issue first** — whether you've hit a bug in the harness or want to propose a new scenario. For scenarios, sketch which part of the spec you want to cover and roughly how; for bugs, include the command you ran and the output. Either way, a short discussion up front beats review churn on a PR that overlaps existing work or heads in a direction we're not going.
+
+**Don't point an agent at the repo and ask it to "find bugs."** Generic bug-hunting on a test harness produces low-signal PRs (typo fixes, unused-variable cleanups, speculative refactors). If you want to contribute via an agent, give it a concrete target:
+
+- Pick a specific MUST or SHOULD from the [MCP spec](https://modelcontextprotocol.io/specification/) that has no scenario yet, and ask the agent to draft one.
+- Pick an [open issue](https://github.com/modelcontextprotocol/conformance/issues) and work on that.
+
+The valuable contribution here is **spec coverage**, not harness polish.
+
+## Scenario design: fewer scenarios, more checks
+
+**The strongest rule in this repo:** prefer one scenario with many checks over many scenarios with one check each.
+
+Why:
+
+- Each scenario often spins up its own HTTP server. These suites run in CI on every push for every SDK, so per-scenario overhead multiplies fast.
+- Less code to maintain and update when the spec shifts.
+- Progress on making an SDK better shows up as "pass 7/10 checks" rather than "pass 1 test, fail another" — finer-grained signal from the same run.
+
+### Granularity heuristic
+
+Ask: **"Would it make sense for someone to implement a server/client that does just this scenario?"**
+
+If two scenarios would always be implemented together, merge them. Examples:
+
+- `tools/list` + a simple `tools/call` → one scenario
+- All content-type variants (image, audio, mixed, resource) → one scenario
+- Full OAuth flow with token refresh → one scenario, not separate "basic" + "refresh" scenarios. A client that passes "basic" but not "refresh" just shows up as passing N−2 checks.
+
+Keep scenarios separate when they're genuinely independent features or when they're mutually exclusive (e.g., an SDK should support writing a server that _doesn't_ implement certain stateful features).
+
+### When a PR adds scenarios
+
+- Start with **one end-to-end scenario** covering the happy path with many checks along the way.
+- Don't add "step 1 only" and "step 1+2" as separate scenarios — the second subsumes the first.
+- Register the scenario in the appropriate suite list in `src/scenarios/index.ts` (`core`, `extensions`, `backcompat`, etc.).
+
+## Check conventions
+
+- **Same `id` for SUCCESS and FAIL.** A check should use one slug and flip `status` + `errorMessage`, not branch into `foo-success` vs `foo-failure` slugs.
+- **Optimize for Ctrl+F on the slug.** Repetitive check blocks are fine — easier to find the failing one than to unwind a clever helper.
+- Reuse `ConformanceCheck` and other types from `src/types.ts` rather than defining parallel shapes.
+- Include `specReferences` pointing to the relevant spec section.
+- **Severity follows the spec keyword:** MUST / MUST NOT → `FAILURE`; SHOULD / SHOULD NOT → `WARNING`. (CI treats WARNING as a failure, so Tier-1 SDKs still need to satisfy SHOULDs — see #245.)
+
+## Descriptions and wording
+
+Be precise about what's **required** vs **optional**. A scenario description that tests optional behavior should make that clear — e.g. "Tests that a client _that wants a refresh token_ handles offline_access scope…" not "Tests that a client handles offline_access scope…". Don't accidentally promote a MAY/SHOULD to a MUST in the prose.
+
+When in doubt about spec details (OAuth parameters, audiences, grant types), check the actual spec in `modelcontextprotocol` rather than guessing.
+
+## Reviewing PRs
+
+### SEP scenarios
+
+Verify requirement levels against the SEP's **spec diff** — the change to `docs/specification/draft/` in the SEP's PR — not the SEP markdown summary or the conformance PR's description. The keyword that governs check severity is the one in the spec text; a bullet under a "Servers SHOULD…" sentence is SHOULD-level even if the SEP's title says "standardize."
+
+```sh
+gh api "repos/modelcontextprotocol/modelcontextprotocol/contents/docs/specification/draft/<path>?ref=<sep-branch>" --jq '.content' | base64 -d
+```
+
+### Adding a new SEP
+
+Scaffold the requirement-traceability YAML with:
+
+```sh
+npx @modelcontextprotocol/conformance new-sep <NNNN>
+```
+
+The command looks up PR #`<NNNN>` in `modelcontextprotocol/modelcontextprotocol` (SEP numbers are PR numbers), derives `spec_url` from the `docs/specification/draft/*.mdx` file it changes, and writes `src/seps/sep-<NNNN>.yaml` with TODO `requirements[]` rows. Use `--spec-path` or `--spec-url` to skip the lookup. The `new-sep` Claude Code skill drives the same flow end-to-end, parses the spec diff, and fills in the requirement rows.
+
+## Examples: prove it passes and fails
+
+A new scenario should come with:
+
+1. **A passing example** — usually by extending `examples/clients/typescript/everything-client.ts` or the everything-server, not a new file.
+2. **A negative test** — a deliberately-broken implementation in `examples/{clients,servers}/typescript/` plus a vitest case asserting the check emits `FAILURE`/`WARNING` against it. See `src/scenarios/client/auth/index.test.ts` and `src/scenarios/server/negative.test.ts` for the pattern. A passing run against the everything-server proves the check doesn't false-positive, but not that it catches anything.
+
+Delete unused example scenarios. If a scenario key in the everything-client has no corresponding test, remove it.
+
+## Don't add new ways to run tests
+
+Use the existing CLI runner (`npx @modelcontextprotocol/conformance client|server ...`). If you need a feature the runner doesn't have, add it to the runner rather than building a parallel entry point.
+
+## Before opening a PR
+
+- `npm run build` passes
+- `npm test` passes
+- For non-trivial scenario changes, run against at least one real SDK (typescript-sdk or python-sdk) to see actual output. For changes to shared infrastructure (runner, tier-check), test against go-sdk or csharp-sdk too.
+- Scenario is registered in the right suite in `src/scenarios/index.ts`