6 changes: 3 additions & 3 deletions .cursor/rules/codegraph.mdc
@@ -13,22 +13,22 @@ Use codegraph for **structural** questions — what calls what, what would break

| Question | Tool |
|---|---|
| "How does X work? / trace X / explain a system / architecture" | `codegraph_explore` (seed with symbol names) |
| "Where is X defined?" / "Find symbol named X" | `codegraph_search` |
| "What calls function Y?" | `codegraph_callers` |
| "What does Y call?" | `codegraph_callees` |
| "What would break if I changed Z?" | `codegraph_impact` |
| "Show me Y's signature / source / docstring" | `codegraph_node` |
| "Give me focused context for a task/area" | `codegraph_context` |
| "Survey an unfamiliar module/topic" | `codegraph_explore` |
| "What files exist under path/" | `codegraph_files` |
| "Is the index healthy?" | `codegraph_status` |

### Rules of thumb

- **`codegraph_explore` is the workhorse for understanding questions** ("how does X work", "trace…", "explain the Y system"). Feed it the key symbol/file names and read its output (line-numbered source from many files in one call). If the question names nothing concrete, do one quick `codegraph_search`/`codegraph_context` to surface the names, then explore with them. Fill gaps with `codegraph_node`/Read — don't grep-and-read your way through; that's the loop explore replaces.
- **Delegating exploration to a subagent?** Tell it to call `codegraph_explore` first and trust the result. A generic "explore"-style agent defaults to grep+Read and treats codegraph as just a search index, throwing away the token savings.
- **Trust codegraph results.** They come from a full AST parse. Do NOT re-verify them with grep — that's slower, less accurate, and wastes context.
- **Don't grep first** when looking up a symbol by name. `codegraph_search` is faster and returns kind + location + signature in one call.
- **Don't chain `codegraph_search` + `codegraph_node`** when you just want context — `codegraph_context` is one call.
- **`codegraph_explore` is the heavy hitter** for unfamiliar areas — it returns full source from all relevant files in one call, but is token-heavy. If your harness supports parallel subagents (e.g., Claude Code's Task tool), spawn one for explore-class questions to keep main session context clean.
- **Index lag**: the file watcher debounces ~500ms behind writes; don't re-query immediately after editing a file in the same turn.

### If `.codegraph/` doesn't exist
24 changes: 23 additions & 1 deletion CHANGELOG.md
@@ -7,7 +7,7 @@ a [GitHub Release](https://github.com/colbymchenry/codegraph/releases) tagged
This project follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [Unreleased]
+## [0.7.12] - 2026-05-20

### Added
- **MCP / explore**: `codegraph_explore` source sections now carry line
@@ -44,13 +44,35 @@ and adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
VS Code ~12%. Agent-trust floor still holds — the Relationships section,
scored cluster selection, and structured-source output are all retained.
Thanks to [@essopsp](https://github.com/essopsp) for the repro.
- **MCP / tool guidance**: the tool descriptions and installed instructions
now steer agents to treat `codegraph_explore` as the workhorse for
understanding/architecture/"how does X work" questions — seed it with the
key symbol names (a quick `codegraph_search`/`codegraph_context` first if
the question names nothing concrete) and read its output, rather than
searching and then Reading each file. Diagnosed from a benchmark run where
Claude Code's Explore agent used `codegraph_search` + Read + grep (37 tool
calls, ~90k tokens) and never called `codegraph_explore`, vs a
general-purpose agent that led with explore (13 calls, ~55k tokens) for the
same VS Code question. Updated in lockstep across `server-instructions.ts`,
`instructions-template.ts`, and `.cursor/rules/codegraph.mdc`.

### Fixed
- **MCP**: source-omission markers in `codegraph_explore` and
`codegraph_context` output are now language-neutral (`... (gap) ...`,
`... (trimmed) ...`, `... (truncated) ...`) instead of C-style `//`
comments, which were misleading inside Python, Ruby, and other non-C
fenced source blocks.
- **Search/explore ranking**: test-file detection now recognizes Kotlin
(`*Test.kt`, `jvmTest/`/`commonTest/`/`androidTest/` source sets), Swift
(`*Tests.swift`), and other camelCase test conventions, so test code is
properly deprioritized in `codegraph_explore` / `codegraph_context`
results. Previously only Java/JS/Python conventions were known, which let
test files dominate exploration of Kotlin/Swift codebases (e.g. an OkHttp
"trace a request" query returned 8/9 test files; now it surfaces
`Call.kt`, `OkHttpClient.kt`, `Request.kt`, `Response.kt`). Capital-led
matching keeps production files like `latest.kt` / `manifest.kt` unflagged.

[0.7.12]: https://github.com/colbymchenry/codegraph/releases/tag/v0.7.12

## [0.7.10] - 2026-05-19

10 changes: 10 additions & 0 deletions README.md
@@ -492,6 +492,16 @@ The `.codegraph/config.json` file controls indexing:

**Missing symbols** — The MCP server auto-syncs on save (wait a couple seconds). Run `codegraph sync` manually if needed. Check that the file's language is supported and isn't excluded by config patterns.

## Star History

<a href="https://www.star-history.com/?repos=colbymchenry%2Fcodegraph&type=date&legend=top-left">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=colbymchenry/codegraph&type=date&theme=dark&legend=top-left" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=colbymchenry/codegraph&type=date&legend=top-left" />
<img alt="Star History Chart" src="https://api.star-history.com/chart?repos=colbymchenry/codegraph&type=date&legend=top-left" />
</picture>
</a>

## License

MIT
53 changes: 53 additions & 0 deletions __tests__/is-test-file.test.ts
@@ -0,0 +1,53 @@
/**
* isTestFile heuristic — test-file detection used to deprioritize test code in
* search/explore ranking.
*
* Regression coverage for the cold-query fix: the heuristic previously only
* knew Java/JS/Python conventions, so Kotlin (`*Test.kt`, `jvmTest/`), Swift
* (`*Tests.swift`), and camelCase test source-set dirs slipped through — which
* let OkHttp's tests flood `codegraph_explore` results on a plain-language
* query. The false-positive guards matter just as much: `latest.kt` /
* `manifest.kt` / a `RealCall.kt` production file must NOT be flagged.
*/
import { describe, it, expect } from 'vitest';
import { isTestFile } from '../src/search/query-utils';

describe('isTestFile', () => {
  it('flags Kotlin test files and source sets', () => {
    expect(isTestFile('okhttp/src/jvmTest/kotlin/okhttp3/CallTest.kt')).toBe(true);
    expect(isTestFile('okhttp/src/commonTest/kotlin/okhttp3/CompressionInterceptorTest.kt')).toBe(true);
    expect(isTestFile('app/src/androidTest/java/com/example/FooTest.kt')).toBe(true);
    expect(isTestFile('module/src/integrationTest/kotlin/BarSpec.kt')).toBe(true);
  });

  it('flags Swift test files', () => {
    expect(isTestFile('Tests/SessionTests.swift')).toBe(true);
    expect(isTestFile('Sources/FooTest.swift')).toBe(true);
  });

  it('still flags the previously-supported conventions', () => {
    expect(isTestFile('foo/test_bar.py')).toBe(true);
    expect(isTestFile('pkg/bar_test.go')).toBe(true);
    expect(isTestFile('src/foo.test.ts')).toBe(true);
    expect(isTestFile('src/foo.spec.ts')).toBe(true);
    expect(isTestFile('com/example/FooTest.java')).toBe(true);
    expect(isTestFile('com/example/FooTestCase.java')).toBe(true);
    expect(isTestFile('project/__tests__/foo.ts')).toBe(true);
    expect(isTestFile('project/tests/foo.rb')).toBe(true);
  });

  it('does NOT flag production files that merely contain "test" lowercase', () => {
    // The fix is capital-led so camelCase boundaries distinguish these.
    expect(isTestFile('src/latest/loader.kt')).toBe(false);
    expect(isTestFile('lib/manifest.kt')).toBe(false);
    expect(isTestFile('okhttp/src/jvmMain/kotlin/okhttp3/internal/connection/RealCall.kt')).toBe(false);
    expect(isTestFile('src/contestEntry.ts')).toBe(false);
    expect(isTestFile('pkg/greatest.go')).toBe(false);
  });

  it('does NOT flag ordinary production source', () => {
    expect(isTestFile('src/flask/app.py')).toBe(false);
    expect(isTestFile('src/vs/workbench/api/common/extensionHostMain.ts')).toBe(false);
    expect(isTestFile('okhttp/src/commonJvmAndroid/kotlin/okhttp3/OkHttpClient.kt')).toBe(false);
  });
});
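
The capital-led matching these tests describe could be sketched roughly as follows. This is a hypothetical illustration, not the implementation in `src/search/query-utils`: the name `isTestFileSketch` and all four patterns are assumptions, chosen only so that the cases listed in this test file come out as expected.

```typescript
// Hypothetical sketch only: names and patterns are assumptions, not the project's code.

// Directory names that are test roots as a whole segment: test, tests, __tests__.
const WHOLE_TEST_DIR = /^(tests?|__tests__)$/i;
// camelCase source sets ending in a capital-led "Test": jvmTest, commonTest, integrationTest.
const CAMEL_TEST_DIR = /^[a-z][A-Za-z0-9]*Test$/;
// Capital-led file suffixes: FooTest, SessionTests, FooTestCase, BarSpec.
const CAPITAL_SUFFIX = /(Tests?|TestCase|Spec)$/;
// Lowercase conventions that require a separator: test_bar, bar_test, foo.test, foo.spec.
const SEPARATED_LOWER = /^test_|_test$|\.(test|spec)$/;

function isTestFileSketch(path: string): boolean {
  const segments = path.split('/');
  const fileName = segments.pop() ?? '';
  const stem = fileName.replace(/\.[^.]+$/, ''); // drop only the final extension
  const inTestDir = segments.some(
    (dir) => WHOLE_TEST_DIR.test(dir) || CAMEL_TEST_DIR.test(dir),
  );
  return inTestDir || CAPITAL_SUFFIX.test(stem) || SEPARATED_LOWER.test(stem);
}
```

Because the suffix and source-set patterns are case-sensitive, names such as `latest` and `greatest` fall through: they end in lowercase `test` with no separator in front of it.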
12 changes: 6 additions & 6 deletions package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions package.json
@@ -1,6 +1,6 @@
 {
   "name": "@colbymchenry/codegraph",
-  "version": "0.7.11",
+  "version": "0.7.12",
   "description": "Supercharge Claude Code with semantic code intelligence. 94% fewer tool calls • 77% faster exploration • 100% local.",
   "main": "dist/index.js",
   "types": "dist/index.d.ts",
@@ -38,7 +38,7 @@
     "fast-wrap-ansi": "^0.2.0",
     "jsonc-parser": "^3.3.1",
     "node-sqlite3-wasm": "^0.8.30",
-    "picomatch": "^4.0.3",
+    "picomatch": "^4.0.4",
     "sisteransi": "^1.0.5",
     "tree-sitter-wasms": "^0.1.11",
     "web-tree-sitter": "^0.25.3"
131 changes: 131 additions & 0 deletions run-interactive-test.md
@@ -0,0 +1,131 @@
# Running the agent-behavior test (how agents actually use codegraph)

This explains how to measure **how a Claude Code agent uses the codegraph MCP
tools** on a real repo — which tools it calls (does it lead with
`codegraph_explore`?), how many follow-up `Read`/`Grep`s it does, and the token
cost. Use it when changing tool guidance (`server-instructions.ts`,
`instructions-template.ts`, tool descriptions) or retrieval, to verify the
change actually shifts agent behavior.

Scripts live in `scripts/agent-eval/`.

## Why two harnesses (read this first)

| | Interactive (`itrun.sh`) | Headless (`run-agent.sh`) |
|---|---|---|
| Drives | the real TUI via tmux | `claude -p` print mode |
| Subagent it picks | **Explore** (matches real UX) | general-purpose (diverges) |
| Metrics | tool breakdown (from session logs) + `Done(…)` token summary | exact per-tool calls + tokens/cost (stream-json) |
| Cost | Claude Max subscription | API $ (`total_cost_usd`) |

**Headless `claude -p` does NOT reproduce what users see** — it silently picks
the general-purpose subagent, while interactive sessions delegate to the
read-first **Explore** subagent. So for "what does my session actually do," use
the interactive harness. For a clean per-tool/token breakdown in one shot, use
headless (and ask for the Explore subagent in the prompt if you want that path).

## Prerequisites

- **tmux 3.0+**
- A logged-in `claude` CLI (Claude Max or API).
- codegraph configured as an MCP server (`claude mcp list` shows `codegraph`).
The interactive harness uses your global config, so it runs whatever
`codegraph` resolves to — point that at your dev build (`npm link` / the
symlinked global) to test local changes.
- A target repo, cloned and indexed:
```bash
git clone --depth 1 https://github.com/square/okhttp /tmp/corpus/okhttp
cd /tmp/corpus/okhttp && codegraph init -i
```
Good scale spread for a sweep: Alamofire (~100 files), Excalidraw (~600),
OkHttp (~640), VS Code (~10k).

## Interactive test (the faithful one)

```bash
scripts/agent-eval/itrun.sh <repo-path> <label> "<question>"
```

Example:
```bash
scripts/agent-eval/itrun.sh /tmp/corpus/vscode vscode \
"How does the extension host communicate with the main process?"
```

It opens `claude` in a tmux session, types the question, waits for the agent to
finish, then prints:
- the `Done (N tool uses · Xk tokens · Ym)` subagent summary (from the pane),
- the `Context Xk/1.0M` main-session size,
- a **tool breakdown** parsed from the session logs (main + subagents), ending
in a `VERDICT: codegraph_explore used Nx | Read N | Grep/Bash N` line.

### Startup robustness (so unattended runs don't silently no-op)

Two things bite an unattended driver before the prompt even runs:
- **The `❯` glyph is drawn ~6s before the input accepts keystrokes.** Waiting
for `❯` is necessary but not sufficient. The harness sends the prompt, then
**verifies a chunk of it actually landed in the input box**, retrying until it
does — so it can't type into a not-yet-live input and submit nothing.
- **The first time `claude` opens a repo, it shows "Is this a project you trust?"**
(a dialog that also contains `❯`). The harness detects that dialog and presses
Enter to accept it before typing.

If the prompt never lands or work never starts, the harness now **fails loudly**
(non-zero exit) instead of capturing an empty pane and reporting a bogus run.
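
The two startup guards can be pictured as a small retry loop. The sketch below is an illustration under stated assumptions, not `itrun.sh` itself: the `PaneDriver` interface, the 20-character probe, and the attempt limit are all hypothetical, and a real driver would wrap `tmux send-keys` and `tmux capture-pane` and pause between attempts.

```typescript
// Hypothetical interface: a real driver would shell out to tmux.
interface PaneDriver {
  typeText(text: string): void; // type into the input box without submitting
  pressEnter(): void;
  capture(): string; // current pane contents
}

const TRUST_DIALOG = 'Is this a project you trust?';

// Returns true once a chunk of the prompt is visible in the pane,
// false if it never lands within maxAttempts (the caller should then fail loudly).
function typePromptUntilLanded(
  driver: PaneDriver,
  prompt: string,
  maxAttempts = 5,
): boolean {
  const probe = prompt.slice(0, 20); // assumed probe size
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (driver.capture().includes(TRUST_DIALOG)) {
      driver.pressEnter(); // accept the dialog, then re-check on the next attempt
      continue;
    }
    driver.typeText(prompt);
    if (driver.capture().includes(probe)) return true;
  }
  return false;
}
```

A caller would treat a `false` return as the non-zero-exit case rather than going on to capture an empty pane.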

### How completion is detected (the tricky part)

Claude's TUI redraws in place, so you can't just wait for output to stop. The
harness polls `tmux capture-pane` and treats the pane as **busy** when it shows
the spinner's elapsed-time-in-parens — `(8s · …)` / `(1m 3s · …)`, matched by
`\(([0-9]+m )?[0-9]+s ·`. That's the *universal* working signal: it shows during
the pre-stream **thinking** phase (`(8s · thinking with max effort)`, which has
no token arrow yet) *and* during streaming. The `↓ N`/`↑ N` token arrow,
`esc to interrupt`, and `Initializing…` are OR'd in as belt-and-braces (some TUI
versions show one but not the others). It declares **idle** when the `❯` prompt
is present and not busy for 10 consecutive polls (~5s, long enough to ride out
mid-conversation thinking gaps that briefly drop the spinner). (Technique
adapted from devpit's `WaitForIdle`.)
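
The busy/idle rule above can be sketched as a small state machine fed one pane capture per poll. The spinner regex is the one quoted in the text; the token-arrow pattern, the class name `IdleDetector`, and its API are assumptions made for illustration and are not taken from the harness.

```typescript
// Primary busy signal quoted above: the spinner's elapsed time in parens.
const SPINNER = /\(([0-9]+m )?[0-9]+s ·/;
// Secondary signal; the exact arrow formatting is an assumption.
const TOKEN_ARROW = /[↓↑] [0-9]/;

function isBusy(pane: string): boolean {
  return (
    SPINNER.test(pane) ||
    TOKEN_ARROW.test(pane) ||
    pane.includes('esc to interrupt') ||
    pane.includes('Initializing')
  );
}

class IdleDetector {
  private quietPolls = 0;
  private readonly required: number;

  constructor(required = 10) {
    this.required = required;
  }

  // Feed one pane capture per poll; returns true once the pane counts as idle.
  observe(pane: string): boolean {
    const idleNow = pane.includes('❯') && !isBusy(pane);
    // Any busy poll (or a missing prompt) resets the streak.
    this.quietPolls = idleNow ? this.quietPolls + 1 : 0;
    return this.quietPolls >= this.required;
  }
}
```

Resetting the counter on any busy poll is what lets the detector ride out short gaps where the spinner disappears mid-conversation.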

### Where the breakdown comes from

`parse-session.mjs` reads the newest session log under
`~/.claude/projects/<escaped-cwd>/<session>.jsonl` and its subagent transcripts
under `<session>/subagents/*.jsonl`. The **subagent** file is where the real
tool calls are — the main log only shows the `Agent` delegation. You can run it
standalone:
```bash
node scripts/agent-eval/parse-session.mjs /tmp/corpus/vscode
```
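
A per-tool tally along these lines might look like the sketch below. The entry shape (a `message.content` array holding `tool_use` blocks with a `name`) is an assumption about the session-log format, and `tallyToolUses` is a hypothetical name, not a function from `parse-session.mjs`.

```typescript
type ToolCounts = Record<string, number>;

// Assumed shape of one JSONL entry; anything that does not match is skipped.
interface AssumedEntry {
  message?: { content?: unknown };
}

function tallyToolUses(jsonl: string): ToolCounts {
  const counts: ToolCounts = {};
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue;
    let entry: AssumedEntry | null;
    try {
      entry = JSON.parse(line) as AssumedEntry | null;
    } catch {
      continue; // tolerate a malformed or partially written line
    }
    const content = entry?.message?.content;
    if (!Array.isArray(content)) continue;
    for (const block of content as Array<{ type?: unknown; name?: unknown } | null>) {
      const name = block?.type === 'tool_use' ? block.name : undefined;
      if (typeof name === 'string') {
        counts[name] = (counts[name] ?? 0) + 1;
      }
    }
  }
  return counts;
}
```

Running the same function over the main log and over each subagent transcript, then summing the results, would give the combined breakdown.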

## Headless test (clean tokens, forceable Explore path)

```bash
scripts/agent-eval/run-agent.sh <repo-path> <label> "<question>"
```
Writes stream-json and prints the tool sequence + exact tokens/cost. To
reproduce the Explore-subagent path headlessly, ask for it:
`"Use an Explore subagent to investigate, then answer: …"`.

## Running a sweep

Single runs vary a lot (the VS Code question has ranged 26–37 tool uses /
88–105k tokens across runs). For a real signal, run N≥3 and take the median:
```bash
for i in 1 2 3; do
scripts/agent-eval/itrun.sh /tmp/corpus/vscode "vscode-$i" "<question>"
done
```
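
Taking the median of the per-run numbers is a one-liner once they are collected. The helper below is a generic sketch; `median` is a hypothetical name and the sample inputs are illustrative values, not measured results.

```typescript
function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of empty list');
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, input left untouched
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Illustrative tool-use counts from three hypothetical runs.
console.log(median([37, 26, 31])); // prints 31
```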

## What "good" looks like

After the explore-first guidance (PR #191), an understanding question should
show the agent **leading with `codegraph_explore`** and using `search`/`node`
to fill gaps — not a wall of `Read`/`Grep`. Example faithful run:
`VERDICT: codegraph_explore used 3x | Read 8 | Grep/Bash 1`. If `explore` is 0
and `Read`/`Grep` dominate, the guidance regressed.
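
For automated sweeps, the `VERDICT` line can be checked mechanically. The sketch below parses the format shown above; the names `parseVerdict` and `looksRegressed` are hypothetical, and the regression rule is one simple reading of "explore is 0 and Read/Grep dominate".

```typescript
interface Verdict {
  explore: number;
  read: number;
  grepBash: number;
}

const VERDICT_LINE =
  /VERDICT: codegraph_explore used (\d+)x \| Read (\d+) \| Grep\/Bash (\d+)/;

// Returns null when the output contains no VERDICT line.
function parseVerdict(output: string): Verdict | null {
  const match = VERDICT_LINE.exec(output);
  if (!match) return null;
  return {
    explore: Number(match[1]),
    read: Number(match[2]),
    grepBash: Number(match[3]),
  };
}

// Assumed rule: no explore calls at all while Read/Grep/Bash did the work.
function looksRegressed(verdict: Verdict): boolean {
  return verdict.explore === 0 && verdict.read + verdict.grepBash > 0;
}
```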

## Output artifacts

Transcripts and logs go to `$AGENT_EVAL_OUT` (default `/tmp/agent-eval/`):
`itrun-<label>.txt` (pane capture), `run-<label>.jsonl` (headless stream-json).