Commit 2c4ce77

ci(docs): make doc e2e fleet execute the journey, not just read it
- Rename the subagent doc-e2e-reviewer -> doc-e2e-runner: it now follows the docs and actually runs each step in a throwaway environment (install, daemon, emit, CLI, dashboard), proving the journey works and reporting steps that fail as written. Adds Write/Bash tools; keeps source-verification of factual claims.
- personas.md: reframe for execution; runner uses its real OS and flags missing OS coverage. Rename persona liam-python -> theo-python.
- workflow: provide Go/Node/pnpm/uv toolchains so runners can install+run; update the orchestration prompt to the execute-and-prove framing.
1 parent 0c726d5 commit 2c4ce77

4 files changed

Lines changed: 160 additions & 110 deletions

.claude/agents/doc-e2e-reviewer.md

Lines changed: 0 additions & 92 deletions
This file was deleted.

.claude/agents/doc-e2e-runner.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+---
+name: doc-e2e-runner
+description: Runs one adopter persona's end-to-end journey using ONLY the published docs as the guide — and actually executes every step in a throwaway environment to prove it works. Reports where the docs are unclear, wrong, incomplete, or simply do not work when run. Invoke once per persona; it does not modify the repo, commit, or open issues.
+tools: Read, Grep, Glob, Write, Bash
+---
+
+You are the adopter **persona** handed to you in the prompt. Your job is not to
+read the docs and nod — it is to **make the documented journey actually work**,
+end to end, in a clean throwaway environment, using only what the docs tell you.
+Then report every place the docs let you down.
+
+## The core rule
+
+**Follow the docs literally, and run what they say.** Install what the page tells
+you to install, run the commands as written, copy the code snippets verbatim, and
+check the results. Use only knowledge the docs give you — if a step needs
+something the docs never mention, that gap *is* a finding. The test is not "do
+the docs read well" but "can a new user get this working from the docs alone".
+
+## Environment
+
+- Work in a fresh scratch directory: `WORK=$(mktemp -d)` and stay inside it.
+  Point per-user state there too (e.g. `export XDG_DATA_HOME="$WORK/share"`) so
+  you never touch the real machine's `~/.local/share/agent-receipts`.
+- You run on whatever OS the runner gives you (Linux in CI). Follow the docs'
+  instructions **for this OS**. If a step only documents another OS (e.g. only
+  `brew`, with no source/Linux path), that is a finding — then use the closest
+  documented alternative (e.g. the "from source" instructions) to keep going.
+- **Never** modify the repository, never `git commit`, never open issues, never
+  install global state you can't clean up. Run the daemon and any servers as
+  background processes and **kill them** before you finish; remove `$WORK`.
+- Keys: only the ephemeral keys the documented `--init` step generates, inside
+  `$WORK`. Never generate or commit production keys.
+
+## Procedure
+
+1. **Plan** — read the persona's journey pages (under
+   `site/src/content/docs/<path>.mdx`, plus any package `README.md` they link to)
+   and list the concrete steps.
+2. **Execute each step** exactly as documented: install the SDK/daemon/proxy/hook,
+   run `--init`, start the daemon in the background, write the example snippet to
+   a file *verbatim*, run it, then run the inspection commands
+   (`agent-receipts list` / `show` / `verify`), and — where the persona wants it —
+   start the dashboard and confirm it serves (e.g. `curl -fsS localhost:8080`).
+3. **Record deviations.** If you had to change a documented command or snippet to
+   make it work (a wrong flag, a missing import, a path that doesn't exist, a step
+   the docs omit), that is a finding: the docs did not work as written.
+4. **Prove the goal.** Reach the persona's success criteria and show the real
+   output (e.g. `agent-receipts verify` printing `VALID`, the dashboard returning
+   `200`). "It probably works" is not a pass — paste the command and its output.
+5. **Separate doc bugs from environment limits.** A genuinely unavailable thing
+   (no network, the package isn't published yet, the OS can't run a step) is an
+   *environment limitation* — note it, but do not score it as a documentation
+   defect. A step that fails because the docs are wrong or incomplete *is* a doc
+   defect.
+6. **Verify suspected factual errors against source.** Before labelling a
+   signature/default/version/flag "factually wrong", confirm it against
+   `sdk/<lang>/src/`, `daemon/`, `mcp-proxy/`, or `hook/` and cite `file:line`.
+
+## Severity
+
+- **High** — the persona cannot reach their goal from the docs: a step errors as
+  written, a required step is missing, a snippet doesn't run, a flag/signature is
+  wrong, a critical-path link is dead.
+- **Medium** — real friction: a stub page, a missing "next step", an example that
+  shows the wrong pattern first, a deviation needed but recoverable.
+- **Low** — polish: wording, ordering, a non-blocking inconsistency.
+
+## Output
+
+Return all three, and nothing that edits the repo:
+
+1. A one-line **verdict**: `worked` / `worked with deviations` /
+   `blocked at <step>` / `environment-limited at <step>`.
+
+2. A short **transcript**: the ordered steps you actually ran and the key result
+   of each (the command and a snippet of its real output), so a human can see the
+   journey was exercised, not imagined.
+
+3. A JSON array of findings (≤10, most severe first):
+
+   ```json
+   {
+     "persona": "<persona id>",
+     "severity": "High|Medium|Low",
+     "kind": "execution|factual|unclear|missing|broken-link|inconsistency|snippet",
+     "file": "site/src/content/docs/...",
+     "line": 123,
+     "summary": "one sentence: what failed or is wrong",
+     "evidence": "the doc text and/or the actual command + error output; for factual findings, the source file:line that proves it",
+     "suggested_fix": "one sentence"
+   }
+   ```
+
+A clean run (goal reached, no deviations) is a valid result — return the verdict,
+the transcript, and `[]`. Do not invent findings to fill space, and do not hide a
+real one because it seems minor — log it as Low.
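
The Environment rules in the file above (scratch directory, per-user state redirected into it, background processes killed, `$WORK` removed) might be enforced by a small harness along these lines. This is a minimal Python sketch and not part of the commit: `run_in_scratch` is a hypothetical helper, and the two command lists are stand-ins for whatever daemon and CLI commands the docs under test actually give.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_in_scratch(background_cmd, step_cmd):
    """Run step_cmd while background_cmd is alive, all inside a throwaway dir.

    Returns (scratch_path, step_exit_code, step_stdout). The scratch directory
    is already deleted by the time this function returns.
    """
    work = tempfile.mkdtemp(prefix="doc-e2e-")
    env = dict(os.environ)
    # Keep per-user state inside the scratch dir, as the Environment rules ask.
    env["XDG_DATA_HOME"] = os.path.join(work, "share")
    background = subprocess.Popen(background_cmd, cwd=work, env=env)
    try:
        result = subprocess.run(
            step_cmd, cwd=work, env=env, capture_output=True, text=True
        )
        return work, result.returncode, result.stdout
    finally:
        # Always stop the background process and remove the scratch dir,
        # even if the step raised.
        background.terminate()
        try:
            background.wait(timeout=5)
        except subprocess.TimeoutExpired:
            background.kill()
            background.wait()
        shutil.rmtree(work, ignore_errors=True)


if __name__ == "__main__":
    # Stand-ins (assumptions): a sleeping process plays the daemon, and a
    # one-liner that prints XDG_DATA_HOME plays the documented CLI step.
    fake_daemon = [sys.executable, "-c", "import time; time.sleep(60)"]
    fake_step = [sys.executable, "-c", "import os; print(os.environ['XDG_DATA_HOME'])"]
    work, code, out = run_in_scratch(fake_daemon, fake_step)
    print(code, out.strip(), os.path.exists(work))
```

Putting the cleanup in `finally` means a failing step still leaves no daemon running and no scratch directory behind, which is the property the "kill them before you finish; remove `$WORK`" rule asks for.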

.claude/doc-e2e/personas.md

Lines changed: 17 additions & 9 deletions
@@ -1,19 +1,27 @@
 # Doc e2e personas

 The adopter journeys the documentation fleet walks. Each persona is run by the
-`doc-e2e-reviewer` subagent, one invocation per persona, reading **only the
-docs**. To add coverage, add a persona block below — the orchestrator runs every
-persona in this file.
+`doc-e2e-runner` subagent, one invocation per persona. The runner does not just
+read the docs — it **executes** the journey in a throwaway environment using only
+what the docs say, and reports where they are unclear, wrong, incomplete, or
+simply do not work when run. To add coverage, add a persona block below — the
+orchestrator runs every persona in this file.

-Each block gives the reviewer: who the user is, the goal that defines success,
-the platform, and the ordered journey of doc pages to read (mapped to
-`site/src/content/docs/<path>.mdx`). The reviewer follows the journey but should
-also follow any "next step" links the pages themselves surface.
+Each block gives the runner: who the user is, the goal that defines success, a
+platform preference, and the ordered journey of doc pages (mapped to
+`site/src/content/docs/<path>.mdx`). The runner follows the journey and any
+"next step" links the pages surface.
+
+**On platform:** the persona's platform is the user's context, but the runner
+executes in its *actual* OS (Linux in CI). It follows the documented instructions
+for that OS — and if the docs only cover another OS for a step (e.g. only
+Homebrew), that missing coverage is itself a finding, after which it falls back
+to the closest documented path (e.g. "from source") to keep the journey going.

 ---

-## liam-python
-- **Who:** Liam, building his own agent harness; reaches for the Python SDK.
+## theo-python
+- **Who:** Theo, building his own agent harness; reaches for the Python SDK.
 - **Platform:** macOS.
 - **Goal:** instrument his locally-running harness so each tool call emits a
   receipt, then *see what was emitted* — tries the CLI first, then the dashboard.
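
Since the orchestrator runs every persona block in this file, something has to split the file into blocks. The sketch below shows one way that could look in Python, assuming only the layout visible in the hunk above: a `## <persona-id>` heading followed by `- **Key:** value` bullets whose text may wrap onto following lines. `parse_personas` is a hypothetical helper; the commit itself leaves this parsing to the orchestrating prompt.

```python
import re

HEADING = re.compile(r"^## (?P<id>\S+)\s*$")
FIELD = re.compile(r"^- \*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.*)$")


def parse_personas(markdown):
    """Return one dict per '## <id>' block, with lower-cased field names."""
    personas = []
    current = None   # the persona dict being filled, if any
    last_key = None  # the field a wrapped line would continue
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = {"id": heading.group("id")}
            personas.append(current)
            last_key = None
            continue
        if current is None:
            continue  # still in the file preamble
        field = FIELD.match(line)
        if field:
            last_key = field.group("key").strip().lower()
            current[last_key] = field.group("value").strip()
        elif line.strip() and last_key:
            # A wrapped bullet: append it to the field it continues.
            current[last_key] += " " + line.strip()
        else:
            last_key = None  # a blank line ends the current field
    return personas
```

Each returned dict could then be rendered back into the prompt for one `doc-e2e-runner` invocation.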

.github/workflows/doc-e2e.yml

Lines changed: 46 additions & 9 deletions
@@ -1,11 +1,15 @@
 name: "Docs: e2e drift audit"

 # Scheduled documentation end-to-end audit. A fleet of persona "new users"
-# (defined in .claude/doc-e2e/personas.md) walks the published docs end to end
-# using ONLY the documentation — install -> use -> inspect — and logs anything
-# unclear, missing, broken, or factually wrong. Findings are recorded in a single
-# GitHub tracking issue, so doc drift surfaces without a human re-running the
-# walkthrough by hand.
+# (defined in .claude/doc-e2e/personas.md) follows the published docs end to end —
+# install -> use -> inspect — and ACTUALLY EXECUTES each step in a throwaway
+# environment, using only what the docs say. It logs anything unclear, missing,
+# broken, factually wrong, or that simply does not work when run. Findings are
+# recorded in a single GitHub tracking issue, so doc drift surfaces without a
+# human re-running the walkthrough by hand.
+#
+# Because the runners execute the journeys, the job provides the language
+# toolchains (Go, Node, Python/uv); each runner installs and runs per the docs.
 #
 # This is the repository's first workflow that runs Claude in CI. Before it can
 # do anything it requires HUMAN REVIEW of:
@@ -51,6 +55,36 @@ jobs:
         if: steps.guard.outputs.enabled == 'true'
         uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

+      # Toolchains the runners need to install + run the documented journeys.
+      - name: Set up Go
+        if: steps.guard.outputs.enabled == 'true'
+        uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6
+        with:
+          go-version: "1.26"
+      - name: Set up Node
+        if: steps.guard.outputs.enabled == 'true'
+        uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6
+        with:
+          node-version: "24"
+      - name: Enable pnpm
+        if: steps.guard.outputs.enabled == 'true'
+        run: corepack enable && corepack prepare pnpm@10.33.0 --activate
+      - name: Set up uv (Python)
+        if: steps.guard.outputs.enabled == 'true'
+        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
+
+      # The daemon's Linux default socket lives under $XDG_RUNTIME_DIR; on a bare
+      # CI runner that variable is unset, so the daemon would fall back to /run
+      # (not writable for the job user). Provide a writable runtime dir inside the
+      # safe set so the documented socket path "just works" for the runners.
+      - name: Provide a writable XDG_RUNTIME_DIR for the daemon socket
+        if: steps.guard.outputs.enabled == 'true'
+        run: |
+          runtime="$RUNNER_TEMP/xdg-runtime"
+          mkdir -p "$runtime"
+          chmod 700 "$runtime"
+          echo "XDG_RUNTIME_DIR=$runtime" >> "$GITHUB_ENV"
+
       # TODO(review): pin to a full commit SHA before enabling, per repo convention.
       - name: Run the docs e2e persona fleet
         if: steps.guard.outputs.enabled == 'true'
@@ -62,10 +96,13 @@ jobs:
             Run the documentation end-to-end audit fleet for this repository.

             For EACH persona defined in `.claude/doc-e2e/personas.md`, launch the
-            `doc-e2e-reviewer` subagent (via the Agent tool) with that persona's
-            full block as its prompt. Run the personas concurrently where possible.
-            Each reviewer reads ONLY the published documentation as that new user
-            and returns a verdict plus a JSON array of findings.
+            `doc-e2e-runner` subagent (via the Agent tool) with that persona's full
+            block as its prompt. Run the personas concurrently where possible. Each
+            runner follows the documented journey and ACTUALLY EXECUTES every step
+            in a throwaway environment (install, run the daemon, emit, inspect with
+            the CLI, start the dashboard) using only what the docs say — then
+            returns a verdict, a transcript of what it ran, and a JSON array of
+            findings for anything unclear, wrong, missing, or that did not work.

             Then consolidate every persona's findings into one report and record
             it in a single GitHub tracking issue:
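
The consolidation step in the prompt above is left to the orchestrating model. If it were done in code instead, it might look like the following Python sketch, which assumes each runner's findings arrive as a JSON array string using the field names from the `doc-e2e-runner.md` output schema. `consolidate` and its validation rules are illustrative assumptions, not something this commit adds.

```python
import json

# Sort rank for the three severities the runner spec defines.
SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}

# Field names taken from the runner's documented finding object.
REQUIRED_KEYS = {
    "persona", "severity", "kind", "file",
    "line", "summary", "evidence", "suggested_fix",
}


def consolidate(reports):
    """Merge per-persona finding arrays into one list, most severe first.

    `reports` is an iterable of JSON strings, one array per persona.
    A clean run contributes "[]" and therefore nothing to the result.
    """
    merged = []
    for raw in reports:
        for finding in json.loads(raw):
            missing = REQUIRED_KEYS - finding.keys()
            if missing:
                raise ValueError(f"finding is missing keys: {sorted(missing)}")
            if finding["severity"] not in SEVERITY_ORDER:
                raise ValueError(f"unknown severity: {finding['severity']!r}")
            merged.append(finding)
    # list.sort is stable, so findings of equal severity keep the order
    # in which the runners reported them.
    merged.sort(key=lambda f: SEVERITY_ORDER[f["severity"]])
    return merged
```

Rejecting malformed findings up front keeps a single runner's bad output from silently distorting the tracking issue.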
