colbymchenry · dragonnite1221-lgtm · May 19, 2026 · May 20, 2026
diff --git a/.cursor/rules/codegraph.mdc b/.cursor/rules/codegraph.mdc
@@ -13,22 +13,22 @@ Use codegraph for **structural** questions — what calls what, what would break
 
 | Question | Tool |
 |---|---|
+| "How does X work? / trace X / explain a system / architecture" | `codegraph_explore` (seed with symbol names) |
 | "Where is X defined?" / "Find symbol named X" | `codegraph_search` |
 | "What calls function Y?" | `codegraph_callers` |
 | "What does Y call?" | `codegraph_callees` |
 | "What would break if I changed Z?" | `codegraph_impact` |
 | "Show me Y's signature / source / docstring" | `codegraph_node` |
 | "Give me focused context for a task/area" | `codegraph_context` |
-| "Survey an unfamiliar module/topic" | `codegraph_explore` |
 | "What files exist under path/" | `codegraph_files` |
 | "Is the index healthy?" | `codegraph_status` |
 
 ### Rules of thumb
 
+- **`codegraph_explore` is the workhorse for understanding questions** ("how does X work", "trace…", "explain the Y system"). Feed it the key symbol/file names and read its output (line-numbered source from many files in one call). If the question names nothing concrete, do one quick `codegraph_search`/`codegraph_context` to surface the names, then explore with them. Fill gaps with `codegraph_node`/Read — don't grep-and-read your way through; that's the loop explore replaces.
+- **Delegating exploration to a subagent?** Tell it to call `codegraph_explore` first and trust the result. A generic "explore"-style agent defaults to grep+Read and treats codegraph as just a search index, throwing away the token savings.
 - **Trust codegraph results.** They come from a full AST parse. Do NOT re-verify them with grep — that's slower, less accurate, and wastes context.
 - **Don't grep first** when looking up a symbol by name. `codegraph_search` is faster and returns kind + location + signature in one call.
-- **Don't chain `codegraph_search` + `codegraph_node`** when you just want context — `codegraph_context` is one call.
-- **`codegraph_explore` is the heavy hitter** for unfamiliar areas — it returns full source from all relevant files in one call, but is token-heavy. If your harness supports parallel subagents (e.g., Claude Code's Task tool), spawn one for explore-class questions to keep main session context clean.
 - **Index lag**: the file watcher debounces ~500ms behind writes; don't re-query immediately after editing a file in the same turn.
 
 ### If `.codegraph/` doesn't exist

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,7 +7,7 @@ a [GitHub Release](https://github.com/colbymchenry/codegraph/releases) tagged
 This project follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
 and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [Unreleased]
+## [0.7.12] - 2026-05-20
 
 ### Added
 - **MCP / explore**: `codegraph_explore` source sections now carry line
@@ -44,13 +44,35 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
   VS Code ~12%. Agent-trust floor still holds — the Relationships section,
   scored cluster selection, and structured-source output are all retained.
   Thanks to [@essopsp](https://github.com/essopsp) for the repro.
+- **MCP / tool guidance**: the tool descriptions and installed instructions
+  now steer agents to treat `codegraph_explore` as the workhorse for
+  understanding/architecture/"how does X work" questions — seed it with the
+  key symbol names (a quick `codegraph_search`/`codegraph_context` first if
+  the question names nothing concrete) and read its output, rather than
+  searching and then Reading each file. Diagnosed from a benchmark run where
+  Claude Code's Explore agent used `codegraph_search` + Read + grep (37 tool
+  calls, ~90k tokens) and never called `codegraph_explore`, vs a
+  general-purpose agent that led with explore (13 calls, ~55k tokens) for the
+  same VS Code question. Updated in lockstep across `server-instructions.ts`,
+  `instructions-template.ts`, and `.cursor/rules/codegraph.mdc`.
 
 ### Fixed
 - **MCP**: source-omission markers in `codegraph_explore` and
   `codegraph_context` output are now language-neutral (`... (gap) ...`,
   `... (trimmed) ...`, `... (truncated) ...`) instead of C-style `//`
   comments, which were misleading inside Python, Ruby, and other non-C
   fenced source blocks.
+- **Search/explore ranking**: test-file detection now recognizes Kotlin
+  (`*Test.kt`, `jvmTest/`/`commonTest/`/`androidTest/` source sets), Swift
+  (`*Tests.swift`), and other camelCase test conventions, so test code is
+  properly deprioritized in `codegraph_explore` / `codegraph_context`
+  results. Previously only Java/JS/Python conventions were known, which let
+  test files dominate exploration of Kotlin/Swift codebases (e.g. an OkHttp
+  "trace a request" query returned 8/9 test files; now it surfaces
+  `Call.kt`, `OkHttpClient.kt`, `Request.kt`, `Response.kt`). Capital-led
+  matching keeps production files like `latest.kt` / `manifest.kt` unflagged.
+
+[0.7.12]: https://github.com/colbymchenry/codegraph/releases/tag/v0.7.12
 
 ## [0.7.10] - 2026-05-19
 

diff --git a/README.md b/README.md
@@ -492,6 +492,16 @@ The `.codegraph/config.json` file controls indexing:
 
 **Missing symbols** — The MCP server auto-syncs on save (wait a couple seconds). Run `codegraph sync` manually if needed. Check that the file's language is supported and isn't excluded by config patterns.
 
+## Star History
+
+<a href="https://www.star-history.com/?repos=colbymchenry%2Fcodegraph&type=date&legend=top-left">
+ <picture>
+   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=colbymchenry/codegraph&type=date&theme=dark&legend=top-left" />
+   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=colbymchenry/codegraph&type=date&legend=top-left" />
+   <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=colbymchenry/codegraph&type=date&legend=top-left" />
+ </picture>
+</a>
+
 ## License
 
 MIT

diff --git a/__tests__/is-test-file.test.ts b/__tests__/is-test-file.test.ts
@@ -0,0 +1,53 @@
+/**
+ * isTestFile heuristic — test-file detection used to deprioritize test code in
+ * search/explore ranking.
+ *
+ * Regression coverage for the cold-query fix: the heuristic previously only
+ * knew Java/JS/Python conventions, so Kotlin (`*Test.kt`, `jvmTest/`), Swift
+ * (`*Tests.swift`), and camelCase test source-set dirs slipped through — which
+ * let OkHttp's tests flood `codegraph_explore` results on a plain-language
+ * query. The false-positive guards matter just as much: `latest.kt` /
+ * `manifest.kt` / a `RealCall.kt` production file must NOT be flagged.
+ */
+import { describe, it, expect } from 'vitest';
+import { isTestFile } from '../src/search/query-utils';
+
+describe('isTestFile', () => {
+  it('flags Kotlin test files and source sets', () => {
+    expect(isTestFile('okhttp/src/jvmTest/kotlin/okhttp3/CallTest.kt')).toBe(true);
+    expect(isTestFile('okhttp/src/commonTest/kotlin/okhttp3/CompressionInterceptorTest.kt')).toBe(true);
+    expect(isTestFile('app/src/androidTest/java/com/example/FooTest.kt')).toBe(true);
+    expect(isTestFile('module/src/integrationTest/kotlin/BarSpec.kt')).toBe(true);
+  });
+
+  it('flags Swift test files', () => {
+    expect(isTestFile('Tests/SessionTests.swift')).toBe(true);
+    expect(isTestFile('Sources/FooTest.swift')).toBe(true);
+  });
+
+  it('still flags the previously-supported conventions', () => {
+    expect(isTestFile('foo/test_bar.py')).toBe(true);
+    expect(isTestFile('pkg/bar_test.go')).toBe(true);
+    expect(isTestFile('src/foo.test.ts')).toBe(true);
+    expect(isTestFile('src/foo.spec.ts')).toBe(true);
+    expect(isTestFile('com/example/FooTest.java')).toBe(true);
+    expect(isTestFile('com/example/FooTestCase.java')).toBe(true);
+    expect(isTestFile('project/__tests__/foo.ts')).toBe(true);
+    expect(isTestFile('project/tests/foo.rb')).toBe(true);
+  });
+
+  it('does NOT flag production files that merely contain lowercase "test"', () => {
+    // The fix is capital-led so camelCase boundaries distinguish these.
+    expect(isTestFile('src/latest/loader.kt')).toBe(false);
+    expect(isTestFile('lib/manifest.kt')).toBe(false);
+    expect(isTestFile('okhttp/src/jvmMain/kotlin/okhttp3/internal/connection/RealCall.kt')).toBe(false);
+    expect(isTestFile('src/contestEntry.ts')).toBe(false);
+    expect(isTestFile('pkg/greatest.go')).toBe(false);
+  });
+
+  it('does NOT flag ordinary production source', () => {
+    expect(isTestFile('src/flask/app.py')).toBe(false);
+    expect(isTestFile('src/vs/workbench/api/common/extensionHostMain.ts')).toBe(false);
+    expect(isTestFile('okhttp/src/commonJvmAndroid/kotlin/okhttp3/OkHttpClient.kt')).toBe(false);
+  });
+});
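
The cases above pin down a "capital-led" heuristic fairly tightly. A minimal sketch of one function that would satisfy them is shown here. It is an illustration only, not the actual `isTestFile` from `src/search/query-utils`; the name `isTestFileSketch`, the `TEST_DIR_NAMES` set, and the exact regexes are assumptions made for this example.

```typescript
// Hypothetical sketch; the real isTestFile in src/search/query-utils may differ.
// Directory names that mark a test tree regardless of case (assumed list).
const TEST_DIR_NAMES = new Set(['test', 'tests', '__tests__']);

function isTestFileSketch(filePath: string): boolean {
  const segments = filePath.replace(/\\/g, '/').split('/');
  const fileName = segments.pop() || '';

  for (const dir of segments) {
    // tests/, Tests/, __tests__/
    if (TEST_DIR_NAMES.has(dir.toLowerCase())) return true;
    // camelCase source sets: jvmTest, commonTest, androidTest, integrationTest.
    // The capital "T" is required, so a directory named "latest" does not match.
    if (/[a-z0-9]Tests?$/.test(dir)) return true;
  }

  // Strip only the final extension: "foo.test.ts" becomes "foo.test".
  const stem = fileName.replace(/\.[^.]+$/, '');

  // JS/TS style: foo.test.ts, foo.spec.ts
  if (/\.(test|spec)$/.test(stem)) return true;
  // Python/Go style: test_bar.py, bar_test.go
  if (/^test_/.test(stem) || /_test$/.test(stem)) return true;
  // Capital-led suffixes: CallTest, SessionTests, FooTestCase.
  // Lowercase "test" inside a word (latest, manifest, greatest) never matches.
  if (/(Tests?|TestCase)$/.test(stem)) return true;

  return false;
}
```

Requiring the capital `T` is what keeps `manifest.kt` and `contestEntry.ts` unflagged while still catching `CallTest.kt`; a case-insensitive match on `test` would flag both.
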
diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@colbymchenry/codegraph",
-  "version": "0.7.11",
+  "version": "0.7.12",
   "description": "Supercharge Claude Code with semantic code intelligence. 94% fewer tool calls • 77% faster exploration • 100% local.",
   "main": "dist/index.js",
   "types": "dist/index.d.ts",
@@ -38,7 +38,7 @@
     "fast-wrap-ansi": "^0.2.0",
     "jsonc-parser": "^3.3.1",
     "node-sqlite3-wasm": "^0.8.30",
-    "picomatch": "^4.0.3",
+    "picomatch": "^4.0.4",
     "sisteransi": "^1.0.5",
     "tree-sitter-wasms": "^0.1.11",
     "web-tree-sitter": "^0.25.3"

diff --git a/run-interactive-test.md b/run-interactive-test.md
@@ -0,0 +1,131 @@
+# Running the agent-behavior test (how agents actually use codegraph)
+
+This explains how to measure **how a Claude Code agent uses the codegraph MCP
+tools** on a real repo — which tools it calls (does it lead with
+`codegraph_explore`?), how many follow-up `Read`/`Grep`s it does, and the token
+cost. Use it when changing tool guidance (`server-instructions.ts`,
+`instructions-template.ts`, tool descriptions) or retrieval, to verify the
+change actually shifts agent behavior.
+
+Scripts live in `scripts/agent-eval/`.
+
+## Why two harnesses (read this first)
+
+| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
+|---|---|---|
+| Drives | the real TUI via tmux | `claude -p` print mode |
+| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
+| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
+| Cost | Claude Max subscription | API $ (`total_cost_usd`) |
+
+**Headless `claude -p` does NOT reproduce what users see** — it silently picks
+the general-purpose subagent, while interactive sessions delegate to the
+read-first **Explore** subagent. So for "what does my session actually do," use
+the interactive harness. For a clean per-tool/token breakdown in one shot, use
+headless (and ask for the Explore subagent in the prompt if you want that path).
+
+## Prerequisites
+
+- **tmux 3.0+**
+- A logged-in `claude` CLI (Claude Max or API).
+- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`).
+  The interactive harness uses your global config, so it runs whatever
+  `codegraph` resolves to — point that at your dev build (`npm link` / the
+  symlinked global) to test local changes.
+- A target repo, cloned and indexed:
+  ```bash
+  git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
+  cd /tmp/corpus/okhttp && codegraph init -i
+  ```
+  Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600),
+  OkHttp (~640), VS Code (~10k).
+
+## Interactive test (the faithful one)
+
+```bash
+scripts/agent-eval/itrun.sh <repo-path> <label> "<question>"
+```
+
+Example:
+```bash
+scripts/agent-eval/itrun.sh /tmp/corpus/vscode vscode \
+  "How does the extension host communicate with the main process?"
+```
+
+It opens `claude` in a tmux session, types the question, waits for the agent to
+finish, then prints:
+- the `Done (N tool uses · Xk tokens · Ym)` subagent summary (from the pane),
+- the `Context Xk/1.0M` main-session size,
+- a **tool breakdown** parsed from the session logs (main + subagents), ending
+  in a `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` line.
+
+### Startup robustness (so unattended runs don't silently no-op)
+
+Two things bite an unattended driver before the prompt even runs:
+- **The `❯` glyph is drawn ~6s before the input accepts keystrokes.** Waiting
+  for `❯` is necessary but not sufficient. The harness sends the prompt, then
+  **verifies a chunk of it actually landed in the input box**, retrying until it
+  does — so it can't type into a not-yet-live input and submit nothing.
+- **The first time `claude` opens a repo, it shows "Is this a project you trust?"**
+  (which also contains `❯`). The harness detects that dialog and presses Enter
+  to accept it before typing.
+
+If the prompt never lands or work never starts, the harness now **fails loudly**
+(non-zero exit) instead of capturing an empty pane and reporting a bogus run.
+
+### How completion is detected (the tricky part)
+
+Claude's TUI redraws in place, so you can't just wait for output to stop. The
+harness polls `tmux capture-pane` and treats the pane as **busy** when it shows
+the spinner's elapsed-time-in-parens — `(8s · …)` / `(1m 3s · …)`, matched by
+`\(([0-9]+m )?[0-9]+s ·`. That's the *universal* working signal: it shows during
+the pre-stream **thinking** phase (`(8s · thinking with max effort)`, which has
+no token arrow yet) *and* during streaming. The `↓ N`/`↑ N` token arrow,
+`esc to interrupt`, and `Initializing…` are OR'd in as belt-and-braces (some TUI
+versions show one but not the others). It declares **idle** once the `❯` prompt
+is present and the pane has shown no busy signal for 10 consecutive polls (~5s,
+long enough to ride out mid-conversation thinking gaps that briefly drop the
+spinner). (Technique adapted from devpit's `WaitForIdle`.)
+
+### Where the breakdown comes from
+
+`parse-session.mjs` reads the newest session log under
+`~/.claude/projects/<escaped-cwd>/<session>.jsonl` and its subagent transcripts
+under `<session>/subagents/*.jsonl`. The **subagent** file is where the real
+tool calls are — the main log only shows the `Agent` delegation. You can run it
+standalone:
+```bash
+node scripts/agent-eval/parse-session.mjs /tmp/corpus/vscode
+```
+
+## Headless test (clean tokens, forceable Explore path)
+
+```bash
+scripts/agent-eval/run-agent.sh <repo-path> <label> "<question>"
+```
+Writes stream-json and prints the tool sequence + exact tokens/cost. To
+reproduce the Explore-subagent path headlessly, ask for it:
+`"Use an Explore subagent to investigate, then answer: …"`.
+
+## Running a sweep
+
+Single runs vary a lot (the VS Code question has ranged 26–37 tool uses /
+88–105k tokens across runs). For a real signal, run N≥3 and take the median:
+```bash
+for i in 1 2 3; do
+  scripts/agent-eval/itrun.sh /tmp/corpus/vscode "vscode-$i" "<question>"
+done
+```
+
+## What "good" looks like
+
+After the explore-first guidance (PR #191), an understanding question should
+show the agent **leading with `codegraph_explore`** and using `search`/`node`
+to fill gaps — not a wall of `Read`/`Grep`. Example faithful run:
+`VERDICT: codegraph_explore used 3x | Read 8 | Grep/Bash 1`. If `explore` is 0
+and `Read`/`Grep` dominate, the guidance regressed.
+
+## Output artifacts
+
+Transcripts and logs go to `$AGENT_EVAL_OUT` (default `/tmp/agent-eval/`):
+`itrun-<label>.txt` (pane capture), `run-<label>.jsonl` (headless stream-json).
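
The busy/idle rule described under "How completion is detected" can be sketched as two small pure functions. This is a hedged illustration in TypeScript, not the harness itself (the real driver is `itrun.sh`). The names `isBusy` and `makeIdleDetector` are hypothetical, and the token-arrow pattern is an assumed approximation of `↓ N`/`↑ N`; only the spinner regex is taken verbatim from the text.

```typescript
// Hypothetical sketch of the polling logic; the real harness is itrun.sh.
// Spinner elapsed-time-in-parens, e.g. "(8s · ...)" or "(1m 3s · ...)".
const SPINNER_ELAPSED = /\(([0-9]+m )?[0-9]+s ·/;

// Extra signals OR'd in as belt-and-braces. The arrow pattern is an assumption.
const OTHER_BUSY_SIGNALS = [/[↓↑] [0-9]/, /esc to interrupt/, /Initializing…/];

// True when a captured pane shows any "still working" signal.
function isBusy(pane: string): boolean {
  return SPINNER_ELAPSED.test(pane) || OTHER_BUSY_SIGNALS.some((re) => re.test(pane));
}

// Returns a function to call once per poll with the latest pane capture.
// It reports idle only after `requiredQuietPolls` consecutive polls in which
// the prompt glyph is present and no busy signal is visible.
function makeIdleDetector(requiredQuietPolls = 10): (pane: string) => boolean {
  let quietPolls = 0;
  return (pane) => {
    if (pane.includes('❯') && !isBusy(pane)) {
      quietPolls += 1;
    } else {
      quietPolls = 0; // any busy signal or missing prompt resets the streak
    }
    return quietPolls >= requiredQuietPolls;
  };
}
```

Resetting the counter on every busy poll is what lets the detector ride out brief gaps where the spinner disappears mid-conversation: a single quiet capture is never enough to declare the run finished.
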