Commit c678733

docs: expand CLAUDE.md with diffctx architecture and design rationale
1 parent 0e741ea commit c678733

1 file changed

Lines changed: 247 additions & 57 deletions

CLAUDE.md

@@ -161,63 +161,253 @@ pre-commit run --all-files
161161

162162
## Testing
163163

164-
Integration tests only - test against real filesystem. No mocking.
165-
166-
## Diff Context Mode
167-
168-
Smart context selection for git diffs using personalized PageRank:
169-
170-
```bash
171-
treemapper . --diff HEAD~1..HEAD
172-
```
173-
174-
Output format:
175-
176-
```yaml
177-
name: myproject
178-
type: diff_context
179-
fragment_count: 5
180-
fragments:
181-
- path: src/main.py
182-
lines: "10-25"
183-
kind: function
184-
symbol: process_data
185-
content: |
186-
def process_data(items):
187-
...
188-
```
189-
190-
Options:
191-
192-
- `--budget N` — token budget (default: 50000)
193-
- `--alpha F` — PPR damping factor (default: 0.60)
194-
- `--tau F` — stopping threshold (default: 0.08)
195-
- `--full` — skip smart selection, include all changed code
196-
197-
## Architecture
198-
199-
```text
200-
src/treemapper/
201-
├── cli.py # argument parsing
202-
├── clipboard.py # clipboard copy support
203-
├── ignore.py # gitignore/treemapperignore handling
204-
├── tokens.py # token counting (tiktoken)
205-
├── tree.py # directory traversal
206-
├── writer.py # YAML/JSON/text/Markdown output
207-
├── treemapper.py # main entry point
208-
└── diffctx/ # diff context mode
209-
├── __init__.py # entry point, run_diff_context()
210-
├── fragments.py # file fragmenters (Python, Markdown, etc.)
211-
├── git.py # git diff parsing
212-
├── graph.py # dependency graph building
213-
├── ppr.py # personalized PageRank
214-
├── python_semantics.py # Python import/call analysis
215-
├── render.py # output formatting
216-
├── select.py # greedy budget selection
217-
├── stopwords.py # identifier filtering
218-
├── types.py # Fragment, DiffHunk types
219-
└── utility.py # submodular utility functions
220-
```
164+
Integration tests only — test against real filesystem and real git
165+
repos. No mocking.
166+
167+
The diff context tests use a **YAML-based declarative framework**:
168+
each test case defines initial files, changed files, and expected
169+
output assertions. A dedicated test runner creates a real git repo
170+
per test, commits the files, runs the full diffctx pipeline, and
171+
verifies results.
172+
173+
**Negative testing via garbage injection**: every test case
174+
automatically includes ~10 unrelated "garbage" files with
175+
distinctive markers. Tests verify the algorithm excludes this
176+
noise, catching regressions in relevance filtering. Each garbage
177+
file uses unique prefixed identifiers (e.g. `GARBAGE_*`) so leaks
178+
are unambiguously detectable.
179+
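A leak check of this kind can be sketched in a few lines. This is a minimal illustration, not the project's test runner: the function name `find_garbage_leaks` is hypothetical, and the fragment shape (dicts with `path` and `content` keys) is assumed from the diff context output format.

```python
import re

# Assumption: garbage identifiers all share the GARBAGE_ prefix.
GARBAGE_RE = re.compile(r"\bGARBAGE_\w+")


def find_garbage_leaks(fragments):
    """Return (path, marker) pairs for every garbage marker found in the output.

    fragments: list of dicts with 'path' and 'content' keys (assumed shape).
    """
    leaks = []
    for frag in fragments:
        for marker in GARBAGE_RE.findall(frag.get("content", "")):
            leaks.append((frag.get("path"), marker))
    return leaks
```

A test would then assert that the returned list is empty for the selected fragments.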
180+
## Two Modes of Operation
181+
182+
TreeMapper operates in two fundamentally different modes that
183+
share output formatting, token counting, and file reading
184+
infrastructure:
185+
186+
**Tree Mapping Mode** (`treemapper .`) — Filesystem-focused.
187+
Walks the directory tree respecting hierarchical ignore patterns,
188+
reads file contents with binary/encoding detection, and serializes
189+
to YAML/JSON/text/Markdown. Deterministic, side-effect-free.
190+
191+
**Diff Context Mode** (`treemapper . --diff`) — Semantics-focused.
192+
Analyzes a git diff to intelligently select the minimal set of
193+
code fragments needed to understand a change. This is the core
194+
intellectual property of the project — a graph-based relevance
195+
engine described in detail below.
196+
197+
---
198+
199+
## Diff Context: Architecture & Design
200+
201+
### The Problem
202+
203+
When reviewing a code change, the diff alone is rarely sufficient.
204+
A developer needs surrounding context: the function being called,
205+
the interface being implemented, the config driving deployment.
206+
But naively including "everything related" explodes the context
207+
window. The challenge is selecting the **minimal, sufficient
208+
context** within a token budget.
209+
210+
### The Approach: Graph-Based Relevance Propagation
211+
212+
The diffctx engine models a codebase as a **weighted directed
213+
graph** where nodes are semantic code fragments and edges represent
214+
dependencies between them. Changed code seeds the graph, relevance
215+
propagates through edges via Personalized PageRank, and a
216+
budget-aware greedy algorithm selects the best fragments.
217+
218+
This approach was chosen over simpler alternatives (call-graph
219+
depth, grep-based expansion, file-level inclusion) because:
220+
221+
- **Transitive importance decays naturally** — a function calling
222+
a modified function is relevant; a function calling *that*
223+
function is less so. PPR captures this without manual depth
224+
limits.
225+
- **Heterogeneous relationships combine gracefully** — imports,
226+
type references, config links, test patterns, and lexical
227+
similarity all contribute edges with different weights. No
228+
single signal captures all dependencies.
229+
- **Budget optimization is principled** — submodular utility
230+
maximization with lazy greedy selection gives near-optimal
231+
coverage per token spent.
232+
233+
### Pipeline Stages
234+
235+
The engine operates as a 7-stage pipeline:
236+
237+
1. **Diff Parsing** — Extract changed file paths and exact line
238+
ranges from git diff output.
239+
240+
2. **Core Fragment Identification** — Break changed files into
241+
semantic units (functions, classes, config blocks, doc sections)
242+
using language-aware parsers, then identify which fragments
243+
cover the actual changed lines.
244+
245+
3. **Concept Extraction** — Extract identifiers from added/removed
246+
diff lines. These "diff concepts" represent the vocabulary of
247+
the change and drive relevance scoring.
248+
249+
4. **Universe Expansion** — Discover related files beyond those
250+
directly changed. Edge builders scan for imports, config
251+
references, naming patterns. Rare identifiers (appearing in
252+
≤3 files) trigger targeted file discovery.
253+
254+
5. **Graph Construction** — Build fragment-level dependency graph.
255+
26 edge builders contribute weighted edges across 6 categories
256+
(see below). Edges are aggregated via max — if any builder
257+
thinks two fragments are related, the strongest signal wins.
258+
Hub suppression downweights over-connected nodes (e.g. common
259+
utilities) to prevent them from dominating the graph.
260+
261+
6. **Relevance Scoring (PPR)** — Run Personalized PageRank seeded
262+
from core (changed) fragments. The damping factor α=0.60
263+
controls propagation depth: 60% chance of following an edge,
264+
40% chance of teleporting back to changed code. Convergence
265+
produces a relevance score per fragment.
266+
267+
7. **Budget-Aware Selection** — A lazy greedy algorithm selects
268+
fragments maximizing density (marginal utility per token). Core
269+
fragments are selected first, then expansion candidates ordered
270+
by a max-heap. A τ-based stopping threshold (relative to
271+
baseline density median) prevents noise accumulation.
272+
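Stage 1 can be illustrated with a small parser for unified diff hunk headers. This is a sketch under stated assumptions, not the code in `git.py`: the function name `parse_changed_ranges` is hypothetical, and it only reads the new-side ranges from `@@ -a,b +c,d @@` headers as produced with zero context lines.

```python
import re
from collections import defaultdict

# Hunk header: @@ -old_start[,old_len] +new_start[,new_len] @@
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def parse_changed_ranges(diff_text):
    """Map each changed file to a list of (start, end) new-side line ranges."""
    ranges = defaultdict(list)
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            target = line[4:].strip()
            if target == "/dev/null":
                current = None  # deleted file: no new-side lines exist
            else:
                current = target[2:] if target.startswith("b/") else target
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            start = int(match.group(1))
            # An omitted length means the hunk covers exactly one line.
            length = int(match.group(2)) if match.group(2) is not None else 1
            if length > 0:  # length 0 is a pure deletion on the new side
                ranges[current].append((start, start + length - 1))
    return dict(ranges)
```

The sketch ignores edge cases such as renamed files, quoted paths, or an added line whose own text begins with `++ `, which a production parser would need to handle.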
273+
### Edge Taxonomy: Six Perspectives on Code Relationships
274+
275+
The system intentionally models relationships from multiple
276+
independent perspectives. Each catches blind spots the others miss.
277+
278+
**Semantic Edges** — Language-aware code dependencies.
279+
Import/export resolution, function calls, type references, symbol
280+
usage. 11 language-specific builders (Python, JavaScript/TypeScript,
281+
Go, Rust, Java/Kotlin/Scala, C/C++, C#/.NET, Ruby, PHP, Swift,
282+
Shell). Weights reflect type-system reliability: Rust symbol refs
283+
(0.95) are trusted more than Python calls (0.55) because static
284+
analysis is more reliable in strict type systems. All semantic
285+
edges are asymmetric — "A imports B" is a stronger signal than
286+
"B is imported by A" — modeled via reverse weight factors
287+
(0.4–0.7).
288+
289+
**Configuration Edges** — Infrastructure-to-code dependencies that
290+
don't appear in source. Docker COPY/FROM to source files,
291+
Kubernetes manifests to application code, Terraform modules to
292+
infrastructure scripts, CI/CD workflows to tested code, Helm
293+
templates to services, build system configs to compiled sources,
294+
generic config keys to code referencing them. 7 specialized
295+
builders covering the DevOps ecosystem.
296+
297+
**Structural Edges** — Filesystem and organizational proximity.
298+
Containment (parent-child directory nesting), test-code
299+
associations (naming heuristics like `test_foo.py` to `foo.py`),
300+
sibling files in the same directory. These are weak signals
301+
(0.05–0.60) that prevent blind spots in code without explicit
302+
imports.
303+
304+
**Document Edges** — Non-code content relationships.
305+
Section-to-section flow within Markdown, anchor link references,
306+
cross-document citations. Enable following documentation
307+
dependencies when docs change alongside code.
308+
309+
**Similarity Edges** — Content-based relationships via TF-IDF
310+
lexical matching. Finds code with similar vocabulary/structure
311+
even without explicit references. Weight bounds are
312+
language-specific: wider for dynamic languages (Python 0.20–0.35),
313+
narrower for typed (Rust 0.10–0.15) where semantic edges are more
314+
reliable.
315+
316+
**History Edges** — Temporal co-change patterns from git log.
317+
Files repeatedly committed together have implicit coupling.
318+
Capped at 500 recent commits with noise filtering (ignoring large
319+
commits with >30 files).
320+
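The co-change idea can be sketched as pair counting over recent commits. The function name `cochange_counts` and the input shape (a list of file-path lists, most recent first) are assumptions for illustration; how counts are turned into edge weights is not specified here and is left out.

```python
from collections import Counter
from itertools import combinations


def cochange_counts(commits, max_commits=500, max_files=30):
    """Count how often each pair of files appears in the same commit.

    commits: list of file-path lists, most recent first (assumed shape).
    """
    counts = Counter()
    for files in commits[:max_commits]:
        unique = sorted(set(files))
        if len(unique) > max_files:
            continue  # large commits are treated as noise and skipped
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts
```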
321+
### Selection: Submodular Utility Maximization
322+
323+
The greedy selector optimizes a submodular utility function under
324+
a token budget constraint:
325+
326+
**Concept coverage** — Each diff concept (identifier from the
327+
change) has a "best coverage score" across selected fragments.
328+
Adding a fragment that covers new concepts yields high marginal
329+
gain; covering already-covered concepts yields diminishing returns
330+
(modeled via square-root scaling).
331+
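One way to read the concept coverage rule is as a sum of square roots of the best per-concept scores. The sketch below assumes that form; the names `marginal_gain` and `add_fragment` and the score dictionaries are hypothetical, and the real utility function may differ.

```python
import math


def marginal_gain(best, fragment_cov):
    """Utility gained by adding a fragment, given best scores so far.

    best: {concept: best coverage score among selected fragments}
    fragment_cov: {concept: coverage score offered by the candidate}
    """
    gain = 0.0
    for concept, score in fragment_cov.items():
        old = best.get(concept, 0.0)
        if score > old:
            # Square-root scaling: improving an already-covered concept pays less.
            gain += math.sqrt(score) - math.sqrt(old)
    return gain


def add_fragment(best, fragment_cov):
    """Return updated best scores after selecting the fragment."""
    updated = dict(best)
    for concept, score in fragment_cov.items():
        updated[concept] = max(score, updated.get(concept, 0.0))
    return updated
```

Under this form, a fragment that only repeats concepts already covered at the same score contributes zero gain.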
332+
**Relatedness bonus** — High-PPR fragments receive minimum
333+
guaranteed utility even without concept overlap, ensuring
334+
structurally related code is included.
335+
336+
**Density ordering** — Candidates are ranked by utility-per-token
337+
(density), not raw utility. A 10-token fragment covering 2
338+
concepts beats a 500-token fragment covering 3. Lazy heap
339+
evaluation avoids recomputing stale density values until a
340+
candidate is popped.
341+
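Density ordering with lazy heap evaluation can be sketched as follows. This is an illustrative version, not the code in `select.py`: the function name `lazy_greedy`, the input shape, and the square-root gain are assumptions, and token counts are assumed to be positive.

```python
import heapq
import math


def lazy_greedy(fragments, budget):
    """Select fragments by utility-per-token under a token budget.

    fragments: {name: (tokens, {concept: score})} (assumed shape).
    Returns the selected names in selection order.
    """
    best = {}  # best coverage score per concept so far
    selected = []
    remaining = budget

    def gain(cov):
        return sum(
            math.sqrt(s) - math.sqrt(best.get(c, 0.0))
            for c, s in cov.items()
            if s > best.get(c, 0.0)
        )

    # Max-heap via negated density; stored values may go stale over time.
    heap = [(-gain(cov) / tokens, name) for name, (tokens, cov) in fragments.items()]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        tokens, cov = fragments[name]
        if tokens > remaining:
            continue  # can never fit again, since the budget only shrinks
        density = gain(cov) / tokens  # refresh only when popped
        if density <= 0:
            continue
        if heap and density < -heap[0][0]:
            heapq.heappush(heap, (-density, name))  # stale: re-queue
            continue
        selected.append(name)
        remaining -= tokens
        for c, s in cov.items():
            best[c] = max(s, best.get(c, 0.0))
    return selected
```

Because gains can only shrink as more fragments are selected, a stored density is an upper bound, so a refreshed candidate that still beats the heap top is safe to take without re-evaluating the rest.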
342+
**τ-stopping** — After establishing a baseline from the first 5
343+
selected fragments, stop when density drops below
344+
τ × median(baseline). This relative threshold adapts to the
345+
codebase: dense code triggers earlier stopping, sparse code allows
346+
broader inclusion.
347+
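The stopping rule reduces to one comparison against a median. A minimal sketch, with the hypothetical helper name `should_stop` and the assumption that densities are recorded in selection order:

```python
from statistics import median


def should_stop(selected_densities, candidate_density, tau=0.08, baseline_size=5):
    """Return True when the candidate's density falls below tau * median(baseline)."""
    if len(selected_densities) < baseline_size:
        return False  # baseline not yet established
    baseline = median(selected_densities[:baseline_size])
    return candidate_density < tau * baseline
```

With baseline densities 0.5, 0.4, 0.3, 0.2, 0.1 the median is 0.3, so at τ = 0.08 selection stops once a candidate's density drops below 0.024.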
348+
### Fragment Granularity
349+
350+
Files are decomposed into semantic fragments using a
351+
priority-ordered parser pipeline. Language-specific parsers
352+
(tree-sitter for 13+ languages, Python AST, Mistune for Markdown)
353+
produce function/class/section-level fragments. Fallback parsers
354+
handle config files (key-value boundaries), text (sentence-aware
355+
splitting), and generic content (line-count limits). The
356+
granularity choice means PPR reasons at the right level — a
357+
changed line in a function selects that function as a unit, not
358+
the whole file.
359+
360+
### Key Design Decisions
361+
362+
**Why Personalized PageRank over call-graph BFS?** BFS requires
363+
arbitrary depth limits and treats all edges equally. PPR provides
364+
natural exponential decay, respects edge weights, and converges
365+
to a principled relevance distribution.
366+
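The decay behaviour is easy to see in a small power-iteration sketch. This is not the code in `ppr.py`; the function name, the adjacency-dict input, and the choice to return dangling-node mass to the seeds are assumptions made for illustration.

```python
def personalized_pagerank(edges, seeds, alpha=0.60, tol=1e-9, max_iter=200):
    """Power iteration for PPR.

    edges: {src: {dst: weight}}; seeds: non-empty iterable of seed nodes.
    alpha is the probability of following an edge; 1 - alpha teleports to seeds.
    """
    seeds = set(seeds)
    if not seeds:
        raise ValueError("at least one seed is required")
    nodes = set(edges) | {d for outs in edges.values() for d in outs} | seeds
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(teleport)
    for _ in range(max_iter):
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for src in nodes:
            outs = edges.get(src, {})
            total = sum(outs.values())
            if total == 0:
                # Dangling node: hand its mass back to the seeds (assumption).
                for n in seeds:
                    nxt[n] += alpha * scores[src] * teleport[n]
                continue
            for dst, weight in outs.items():
                nxt[dst] += alpha * scores[src] * weight / total
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores
```

On a chain `a -> b -> c` seeded at `a`, each hop multiplies the score by α, which is the exponential decay that removes the need for a depth limit.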
367+
**Why max-aggregation for edge combination?** Multiple edge types
368+
often agree on the same relationship. Taking the max avoids
369+
inflating weights through redundant signals while preserving the
370+
strongest evidence from any perspective.
371+
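Max-aggregation amounts to a dictionary merge that keeps the larger weight. A minimal sketch, where `aggregate_edges` and the `(src, dst, weight)` triple format are assumptions rather than the project's actual interface:

```python
def aggregate_edges(builder_outputs):
    """Combine edges from several builders, keeping the strongest weight per pair.

    builder_outputs: iterable of lists of (src, dst, weight) triples.
    """
    combined = {}
    for triples in builder_outputs:
        for src, dst, weight in triples:
            key = (src, dst)
            combined[key] = max(weight, combined.get(key, 0.0))
    return combined
```

If a semantic builder reports `a -> b` at 0.55 and a structural builder reports the same pair at 0.30, the merged weight is 0.55 rather than their sum.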
372+
**Why submodular greedy over knapsack?** For monotone submodular
373+
utilities, greedy achieves (1 - 1/e) ≈ 63% of optimal under a cardinality
374+
constraint; density ordering extends this heuristically to a token budget.
375+
Lazy evaluation keeps the run time near-linear in practice.
376+
377+
**Why asymmetric edge weights?** Code dependencies are
378+
directional. "A imports B" means A needs B for context; B doesn't
379+
necessarily need A. Reverse factors (0.4–0.7 of forward weight)
380+
enable bidirectional graph search while respecting this asymmetry.
381+
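A reverse factor can be applied at edge insertion time. In this sketch the helper name `add_directed_edge` is hypothetical, and the default of 0.5 is just a value inside the 0.4–0.7 range quoted above.

```python
def add_directed_edge(edges, src, dst, weight, reverse_factor=0.5):
    """Record src -> dst at full weight and dst -> src at a reduced weight.

    edges: {(src, dst): weight}; existing stronger weights are kept.
    """
    forward = (src, dst)
    backward = (dst, src)
    edges[forward] = max(weight, edges.get(forward, 0.0))
    edges[backward] = max(weight * reverse_factor, edges.get(backward, 0.0))
    return edges
```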
382+
**Why hub suppression?** Common utility modules (logging, helpers,
383+
config) receive edges from everywhere. Without dampening, they
384+
dominate PPR scores and pull in unrelated code. Log-scaled
385+
in-degree suppression at the 95th percentile keeps them accessible
386+
without letting them dominate.
387+
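The exact suppression formula is not given here, so the following is only one plausible reading: nodes whose in-degree exceeds the 95th percentile have their incoming weights scaled by a ratio of logarithms. The function name `suppress_hubs` and the nearest-rank percentile are assumptions.

```python
import math


def suppress_hubs(edges, percentile=0.95):
    """Downweight edges pointing at nodes with unusually high in-degree.

    edges: {(src, dst): weight}. Returns a new dict; the input is not modified.
    """
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    if not in_degree:
        return {}
    ranked = sorted(in_degree.values())
    # Nearest-rank percentile over the observed in-degrees.
    threshold = ranked[max(0, math.ceil(percentile * len(ranked)) - 1)]
    result = {}
    for (src, dst), weight in edges.items():
        degree = in_degree[dst]
        if degree > threshold:
            # Log scaling: the penalty grows slowly with in-degree (assumed form).
            weight *= math.log1p(threshold) / math.log1p(degree)
        result[(src, dst)] = weight
    return result
```

Edges into ordinary nodes pass through unchanged, so a hub stays reachable but contributes less per incoming edge.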
388+
### Tunable Parameters
389+
390+
| Parameter | Default | Controls |
391+
|------------|---------|-----------------------------------|
392+
| `--budget` | 50000 | Maximum output tokens |
393+
| `--alpha` | 0.60 | PPR damping; higher = broader propagation |
394+
| `--tau` | 0.08 | Stopping threshold; higher = stricter, less noise |
395+
| `--full` | false | Bypass smart selection |
396+
397+
---
398+
399+
## Technology Choices
400+
401+
| Decision | Choice | Rationale |
402+
|-------------|-------------------|------------------------------|
403+
| Output | YAML | LLM-readable, literal blocks |
404+
| Tokens | tiktoken o200k_base | GPT-4o standard, exact BPE |
405+
| Ignores | pathspec | gitignore-compatible |
406+
| Parsing | tree-sitter | 13+ languages, AST-level |
407+
| Ranking | PPR | Relevance with natural decay |
408+
| Selection | Lazy greedy | Near-optimal, near-linear time |
409+
| Git | subprocess UTF-8 | Platform-safe, non-ASCII |
410+
| Diff | git diff --unified=0 | Exact line ranges |
221411

222412
## License
223413
