This explains how to measure how a Claude Code agent uses the codegraph MCP
tools on a real repo — which tools it calls (does it lead with
codegraph_explore?), how many follow-up Read/Greps it does, and the token
cost. Use it when changing tool guidance (server-instructions.ts,
instructions-template.ts, tool descriptions) or retrieval, to verify the
change actually shifts agent behavior.
Scripts live in scripts/agent-eval/.
Interactive (itrun.sh) |
Headless (run-agent.sh) |
|
|---|---|---|
| Drives | the real TUI via tmux | claude -p print mode |
| Subagent it picks | Explore (matches real UX) | general-purpose (diverges) |
| Metrics | tool breakdown (from session logs) + Done(…) token summary |
exact per-tool calls + tokens/cost (stream-json) |
| Cost | Claude Max subscription | API $ (total_cost_usd) |
Headless claude -p does NOT reproduce what users see — it silently picks
the general-purpose subagent, while interactive sessions delegate to the
read-first Explore subagent. So for "what does my session actually do," use
the interactive harness. For a clean per-tool/token breakdown in one shot, use
headless (and ask for the Explore subagent in the prompt if you want that path).
- tmux 3.0+
- A logged-in
claudeCLI (Claude Max or API). - codegraph configured as an MCP server (
claude mcp listshowscodegraph). The interactive harness uses your global config, so it runs whatevercodegraphresolves to — point that at your dev build (npm link/ the symlinked global) to test local changes. - A target repo, cloned and indexed:
Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600), OkHttp (~640), VS Code (~10k).
git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp cd /tmp/corpus/okhttp && codegraph init -i
scripts/agent-eval/itrun.sh <repo-path> <label> "<question>"Example:
scripts/agent-eval/itrun.sh /tmp/corpus/vscode vscode \
"How does the extension host communicate with the main process?"It opens claude in a tmux session, types the question, waits for the agent to
finish, then prints:
- the
Done (N tool uses · Xk tokens · Ym)subagent summary (from the pane), - the
Context Xk/1.0Mmain-session size, - a tool breakdown parsed from the session logs (main + subagents), ending
in a
VERDICT: codegraph_explore used Nx | Read N | Grep/Bash Nline.
Two things bite an unattended driver before the prompt even runs:
- The
❯glyph is drawn ~6s before the input accepts keystrokes. Waiting for❯is necessary but not sufficient. The harness sends the prompt, then verifies a chunk of it actually landed in the input box, retrying until it does — so it can't type into a not-yet-live input and submit nothing. - First time claude opens a repo it shows "Is this a project you trust?"
(which also contains
❯). The harness detects that dialog and presses Enter to accept it before typing.
If the prompt never lands or work never starts, the harness now fails loudly (non-zero exit) instead of capturing an empty pane and reporting a bogus run.
Claude's TUI redraws in place, so you can't just wait for output to stop. The
harness polls tmux capture-pane and treats the pane as busy when it shows
the spinner's elapsed-time-in-parens — (8s · …) / (1m 3s · …), matched by
\(([0-9]+m )?[0-9]+s ·. That's the universal working signal: it shows during
the pre-stream thinking phase ((8s · thinking with max effort), which has
no token arrow yet) and during streaming. The ↓ N/↑ N token arrow,
esc to interrupt, and Initializing… are OR'd in as belt-and-braces (some TUI
versions show one but not the others). It declares idle when the ❯ prompt
is present and not busy for 10 consecutive polls (~5s, long enough to ride out
mid-conversation thinking gaps that briefly drop the spinner). (Technique
adapted from devpit's WaitForIdle.)
parse-session.mjs reads the newest session log under
~/.claude/projects/<escaped-cwd>/<session>.jsonl and its subagent transcripts
under <session>/subagents/*.jsonl. The subagent file is where the real
tool calls are — the main log only shows the Agent delegation. You can run it
standalone:
node scripts/agent-eval/parse-session.mjs /tmp/corpus/vscodescripts/agent-eval/run-agent.sh <repo-path> <label> "<question>"Writes stream-json and prints the tool sequence + exact tokens/cost. To
reproduce the Explore-subagent path headlessly, ask for it:
"Use an Explore subagent to investigate, then answer: …".
Single runs vary a lot (the VS Code question has ranged 26–37 tool uses / 88–105k tokens across runs). For a real signal, run N≥3 and take the median:
for i in 1 2 3; do
scripts/agent-eval/itrun.sh /tmp/corpus/vscode "vscode-$i" "<question>"
doneAfter the explore-first guidance (PR #191), an understanding question should
show the agent leading with codegraph_explore and using search/node
to fill gaps — not a wall of Read/Grep. Example faithful run:
VERDICT: codegraph_explore used 3x | Read 8 | Grep/Bash 1. If explore is 0
and Read/Grep dominate, the guidance regressed.
Transcripts and logs go to $AGENT_EVAL_OUT (default /tmp/agent-eval/):
itrun-<label>.txt (pane capture), run-<label>.jsonl (headless stream-json).