@@ -161,63 +161,253 @@ pre-commit run --all-files
161 161
162 162 ## Testing
163 163
164- Integration tests only - test against real filesystem. No mocking.
165-
166- ## Diff Context Mode
167-
168- Smart context selection for git diffs using personalized PageRank:
169-
170- ``` bash
171- treemapper . --diff HEAD~1..HEAD
172- ```
173-
174- Output format:
175-
176- ``` yaml
177- name: myproject
178- type: diff_context
179- fragment_count: 5
180- fragments:
181-   - path: src/main.py
182-     lines: "10-25"
183-     kind: function
184-     symbol: process_data
185-     content: |
186-       def process_data(items):
187-           ...
188- ```
189-
190- Options:
191-
192- - `--budget N`: token budget (default: 50000)
193- - `--alpha F`: PPR damping factor (default: 0.60)
194- - `--tau F`: stopping threshold (default: 0.08)
195- - `--full`: skip smart selection, include all changed code
196-
197- ## Architecture
198-
199- ``` text
200- src/treemapper/
201- ├── cli.py              # argument parsing
202- ├── clipboard.py        # clipboard copy support
203- ├── ignore.py           # gitignore/treemapperignore handling
204- ├── tokens.py           # token counting (tiktoken)
205- ├── tree.py             # directory traversal
206- ├── writer.py           # YAML/JSON/text/Markdown output
207- ├── treemapper.py       # main entry point
208- └── diffctx/            # diff context mode
209-     ├── __init__.py         # entry point, run_diff_context()
210-     ├── fragments.py        # file fragmenters (Python, Markdown, etc.)
211-     ├── git.py              # git diff parsing
212-     ├── graph.py            # dependency graph building
213-     ├── ppr.py              # personalized PageRank
214-     ├── python_semantics.py # Python import/call analysis
215-     ├── render.py           # output formatting
216-     ├── select.py           # greedy budget selection
217-     ├── stopwords.py        # identifier filtering
218-     ├── types.py            # Fragment, DiffHunk types
219-     └── utility.py          # submodular utility functions
220- ```
164+ Integration tests only — test against real filesystem and real git
165+ repos. No mocking.
166+
167+ The diff context tests use a **YAML-based declarative framework**:
168+ each test case defines initial files, changed files, and expected
169+ output assertions. A dedicated test runner creates a real git repo
170+ per test, commits the files, runs the full diffctx pipeline, and
171+ verifies results.
172+
173+ **Negative testing via garbage injection**: every test case
174+ automatically includes ~10 unrelated "garbage" files with
175+ distinctive markers. Tests verify the algorithm excludes this
176+ noise, catching regressions in relevance filtering. Each garbage
177+ file uses unique prefixed identifiers (e.g. `GARBAGE_*`) so leaks
178+ are unambiguously detectable.
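A leak check of this kind could look roughly like the sketch below. It assumes selected fragments are dicts with `path` and `content` keys and that markers match `GARBAGE_` followed by word characters; the helper name `find_garbage_leaks` and the sample data are hypothetical, not taken from the test suite.

```python
import re

# Assumed marker shape: the document only says identifiers look like GARBAGE_*.
GARBAGE_PATTERN = re.compile(r"\bGARBAGE_\w+")


def find_garbage_leaks(fragments):
    """Return (path, marker) pairs for each garbage marker found in the output."""
    leaks = []
    for fragment in fragments:
        for marker in GARBAGE_PATTERN.findall(fragment.get("content", "")):
            leaks.append((fragment["path"], marker))
    return leaks


# Hypothetical selection result: one legitimate fragment, one leaked garbage file.
selected = [
    {"path": "src/main.py", "content": "def process_data(items):\n    return items\n"},
    {"path": "junk/noise_03.py", "content": "GARBAGE_03_TOKEN = 1\n"},
]
print(find_garbage_leaks(selected))
```

A test would then assert that the returned list is empty for the real pipeline output.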
179+
180+ ## Two Modes of Operation
181+
182+ TreeMapper operates in two fundamentally different modes that
183+ share output formatting, token counting, and file reading
184+ infrastructure:
185+
186+ **Tree Mapping Mode** (`treemapper .`): Filesystem-focused.
187+ Walks the directory tree respecting hierarchical ignore patterns,
188+ reads file contents with binary/encoding detection, and serializes
189+ to YAML/JSON/text/Markdown. Deterministic, side-effect-free.
190+
191+ **Diff Context Mode** (`treemapper . --diff`): Semantics-focused.
192+ Analyzes a git diff to intelligently select the minimal set of
193+ code fragments needed to understand a change. This is the core
194+ intellectual property of the project — a graph-based relevance
195+ engine described in detail below.
196+
197+ ---
198+
199+ ## Diff Context: Architecture & Design
200+
201+ ### The Problem
202+
203+ When reviewing a code change, the diff alone is rarely sufficient.
204+ A developer needs surrounding context: the function being called,
205+ the interface being implemented, the config driving deployment.
206+ But naively including "everything related" explodes the context
207+ window. The challenge is selecting the **minimal, sufficient
208+ context** within a token budget.
209+
210+ ### The Approach: Graph-Based Relevance Propagation
211+
212+ The diffctx engine models a codebase as a **weighted directed
213+ graph** where nodes are semantic code fragments and edges represent
214+ dependencies between them. Changed code seeds the graph, relevance
215+ propagates through edges via Personalized PageRank, and a
216+ budget-aware greedy algorithm selects the best fragments.
217+
218+ This approach was chosen over simpler alternatives (call-graph
219+ depth, grep-based expansion, file-level inclusion) because:
220+
221+ - **Transitive importance decays naturally**: a function calling
222+ a modified function is relevant; a function calling *that*
223+ function is less so. PPR captures this without manual depth
224+ limits.
225+ - **Heterogeneous relationships combine gracefully**: imports,
226+ type references, config links, test patterns, and lexical
227+ similarity all contribute edges with different weights. No
228+ single signal captures all dependencies.
229+ - **Budget optimization is principled**: submodular utility
230+ maximization with lazy greedy selection gives near-optimal
231+ coverage per token spent.
232+
233+ ### Pipeline Stages
234+
235+ The engine operates as a 7-stage pipeline:
236+
237+ 1. **Diff Parsing**: Extract changed file paths and exact line
238+ ranges from git diff output.
239+
240+ 2. **Core Fragment Identification**: Break changed files into
241+ semantic units (functions, classes, config blocks, doc sections)
242+ using language-aware parsers, then identify which fragments
243+ cover the actual changed lines.
244+
245+ 3. **Concept Extraction**: Extract identifiers from added/removed
246+ diff lines. These "diff concepts" represent the vocabulary of
247+ the change and drive relevance scoring.
248+
249+ 4. **Universe Expansion**: Discover related files beyond those
250+ directly changed. Edge builders scan for imports, config
251+ references, naming patterns. Rare identifiers (appearing in
252+ ≤3 files) trigger targeted file discovery.
253+
254+ 5. **Graph Construction**: Build fragment-level dependency graph.
255+ 26 edge builders contribute weighted edges across 6 categories
256+ (see below). Edges are aggregated via max — if any builder
257+ thinks two fragments are related, the strongest signal wins.
258+ Hub suppression downweights over-connected nodes (e.g. common
259+ utilities) to prevent them from dominating the graph.
260+
261+ 6. **Relevance Scoring (PPR)**: Run Personalized PageRank seeded
262+ from core (changed) fragments. The damping factor α=0.60
263+ controls propagation depth: 60% chance of following an edge,
264+ 40% chance of teleporting back to changed code. Convergence
265+ produces a relevance score per fragment.
266+
267+ 7. **Budget-Aware Selection**: A lazy greedy algorithm selects
268+ fragments maximizing density (marginal utility per token). Core
269+ fragments are selected first, then expansion candidates ordered
270+ by a max-heap. A τ-based stopping threshold (relative to
271+ baseline density median) prevents noise accumulation.
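Stage 6 can be illustrated with a plain power-iteration sketch. This is not the project's `ppr.py`; the function name, the graph layout (`{src: {dst: weight}}`), the tolerance, and the handling of dangling nodes (their mass returns to the seeds) are all assumptions made for the example. Only the α=0.60 default comes from the text above.

```python
def personalized_pagerank(edges, seeds, alpha=0.60, tol=1e-9, max_iter=200):
    """Power iteration for PPR.

    edges: {src: {dst: weight}}; seeds: non-empty set of core (changed) nodes.
    With probability alpha the walk follows a weighted out-edge; otherwise it
    teleports back to a seed.
    """
    nodes = set(edges) | {d for outs in edges.values() for d in outs} | set(seeds)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(max_iter):
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for src in nodes:
            outs = edges.get(src, {})
            total = sum(outs.values())
            if total == 0:
                # Dangling node (assumption): return its walk mass to the seeds.
                for n in nodes:
                    nxt[n] += alpha * scores[src] * restart[n]
                continue
            for dst, weight in outs.items():
                nxt[dst] += alpha * scores[src] * weight / total
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores


# Hypothetical three-fragment chain: changed -> helper -> util.
graph = {"changed": {"helper": 1.0}, "helper": {"util": 1.0}}
for name, score in sorted(personalized_pagerank(graph, {"changed"}).items()):
    print(f"{name}: {score:.3f}")
```

In this chain each hop away from the changed fragment scores lower than the previous one, which is the decay behaviour the pipeline relies on.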
272+
273+ ### Edge Taxonomy: Six Perspectives on Code Relationships
274+
275+ The system intentionally models relationships from multiple
276+ independent perspectives. Each catches blind spots the others miss.
277+
278+ **Semantic Edges**: Language-aware code dependencies.
279+ Import/export resolution, function calls, type references, symbol
280+ usage. 11 language-specific builders (Python, JavaScript/TypeScript,
281+ Go, Rust, Java/Kotlin/Scala, C/C++, C#/.NET, Ruby, PHP, Swift,
282+ Shell). Weights reflect type-system reliability: Rust symbol refs
283+ (0.95) are trusted more than Python calls (0.55) because static
284+ analysis is more reliable in strict type systems. All semantic
285+ edges are asymmetric — "A imports B" is a stronger signal than
286+ "B is imported by A" — modeled via reverse weight factors
287+ (0.4–0.7).
288+
289+ **Configuration Edges**: Infrastructure-to-code dependencies that
290+ don't appear in source. Docker COPY/FROM to source files,
291+ Kubernetes manifests to application code, Terraform modules to
292+ infrastructure scripts, CI/CD workflows to tested code, Helm
293+ templates to services, build system configs to compiled sources,
294+ generic config keys to code referencing them. 7 specialized
295+ builders covering the DevOps ecosystem.
296+
297+ **Structural Edges**: Filesystem and organizational proximity.
298+ Containment (parent-child directory nesting), test-code
299+ associations (naming heuristics like `test_foo.py` to `foo.py`),
300+ sibling files in the same directory. These are weak signals
301+ (0.05–0.60) that prevent blind spots in code without explicit
302+ imports.
303+
304+ **Document Edges**: Non-code content relationships.
305+ Section-to-section flow within Markdown, anchor link references,
306+ cross-document citations. Enable following documentation
307+ dependencies when docs change alongside code.
308+
309+ **Similarity Edges**: Content-based relationships via TF-IDF
310+ lexical matching. Finds code with similar vocabulary/structure
311+ even without explicit references. Weight bounds are
312+ language-specific: wider for dynamic languages (Python 0.20–0.35),
313+ narrower for typed (Rust 0.10–0.15) where semantic edges are more
314+ reliable.
315+
316+ **History Edges**: Temporal co-change patterns from git log.
317+ Files repeatedly committed together have implicit coupling.
318+ Capped at 500 recent commits with noise filtering (ignoring large
319+ commits with >30 files).
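The co-change signal can be sketched as pair counting over recent commits. The example assumes the commit history has already been parsed into lists of file paths, newest first; the function name is hypothetical, and how counts are turned into edge weights is not shown because the text does not specify it.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits, max_commits=500, max_files=30):
    """Count how often each pair of files appears in the same commit.

    commits: list of file-path lists, assumed newest first.
    Only the first max_commits entries are read, and commits touching more
    than max_files files are skipped as noise.
    """
    counts = Counter()
    for files in commits[:max_commits]:
        unique = sorted(set(files))
        if len(unique) > max_files:
            continue
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical history: two small commits and one bulk commit that is ignored.
history = [
    ["a.py", "b.py"],
    ["a.py", "b.py", "c.py"],
    [f"f{i}.py" for i in range(31)],
]
print(co_change_counts(history).most_common(3))
```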
320+
321+ ### Selection: Submodular Utility Maximization
322+
323+ The greedy selector optimizes a submodular utility function under
324+ a token budget constraint:
325+
326+ **Concept coverage**: Each diff concept (identifier from the
327+ change) has a "best coverage score" across selected fragments.
328+ Adding a fragment that covers new concepts yields high marginal
329+ gain; covering already-covered concepts yields diminishing returns
330+ (modeled via square-root scaling).
331+
332+ **Relatedness bonus**: High-PPR fragments receive minimum
333+ guaranteed utility even without concept overlap, ensuring
334+ structurally related code is included.
335+
336+ **Density ordering**: Candidates are ranked by utility-per-token
337+ (density), not raw utility. A 10-token fragment covering 2
338+ concepts beats a 500-token fragment covering 3. Lazy heap
339+ evaluation avoids recomputing stale density values until a
340+ candidate is popped.
341+
342+ **τ-stopping**: After establishing a baseline from the first 5
343+ selected fragments, stop when density drops below
344+ τ × median(baseline). This relative threshold adapts to the
345+ codebase: dense code triggers earlier stopping, sparse code allows
346+ broader inclusion.
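The selection loop described above can be sketched as follows. This is a simplified illustration, not the project's `select.py`: the utility is assumed to be the sum over concepts of the square root of the best coverage score, the relatedness bonus and the core-fragments-first step are omitted, and all names and the candidate layout are hypothetical.

```python
import heapq
import math
from statistics import median


def select_fragments(candidates, budget, tau=0.08, baseline_size=5):
    """Pick fragment names by utility-per-token under a token budget.

    candidates: {name: (token_count, {concept: coverage_score})}
    """
    best = {}  # concept -> best coverage score among selected fragments

    def marginal_gain(name):
        # Square-root scaling: improving an already-covered concept pays less.
        concepts = candidates[name][1]
        return sum(
            math.sqrt(max(best.get(c, 0.0), s)) - math.sqrt(best.get(c, 0.0))
            for c, s in concepts.items()
        )

    # Max-heap via negated density; stored values may be stale (too high).
    heap = [(-marginal_gain(n) / candidates[n][0], n) for n in candidates]
    heapq.heapify(heap)

    selected, used, baseline = [], 0, []
    while heap:
        _, name = heapq.heappop(heap)
        tokens, concepts = candidates[name]
        if used + tokens > budget:
            continue  # does not fit; keep trying other candidates
        density = marginal_gain(name) / tokens  # refreshed only when popped
        if heap and density < -heap[0][0]:
            heapq.heappush(heap, (-density, name))  # stale entry: re-queue
            continue
        if density <= 0:
            break  # nothing left adds utility
        if len(baseline) >= baseline_size and density < tau * median(baseline):
            break  # tau-stopping relative to the baseline median
        selected.append(name)
        used += tokens
        if len(baseline) < baseline_size:
            baseline.append(density)
        for c, s in concepts.items():
            best[c] = max(best.get(c, 0.0), s)
    return selected


# The 10-token fragment covering 2 concepts is picked before the 500-token one.
demo = {
    "small": (10, {"a": 1.0, "b": 1.0}),
    "large": (500, {"a": 1.0, "b": 1.0, "c": 1.0}),
}
print(select_fragments(demo, budget=1000))
```

Re-queuing a popped candidate only when its refreshed density falls below the next heap entry is what keeps the evaluation lazy: because gains can only shrink as fragments are added, a stored value is always an upper bound.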
347+
348+ ### Fragment Granularity
349+
350+ Files are decomposed into semantic fragments using a
351+ priority-ordered parser pipeline. Language-specific parsers
352+ (tree-sitter for 13+ languages, Python AST, Mistune for Markdown)
353+ produce function/class/section-level fragments. Fallback parsers
354+ handle config files (key-value boundaries), text (sentence-aware
355+ splitting), and generic content (line-count limits). The
356+ granularity choice means PPR reasons at the right level — a
357+ changed line in a function selects that function as a unit, not
358+ the whole file.
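For Python sources, mapping a changed line to its enclosing unit can be sketched with the standard `ast` module. The helper names and the fragment dict layout are hypothetical, and only top-level functions and classes are handled here; the real fragmenters cover more cases.

```python
import ast


def python_fragments(source):
    """Split a module into top-level function/class fragments with line ranges."""
    fragments = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            fragments.append(
                {"kind": kind, "symbol": node.name,
                 "start": node.lineno, "end": node.end_lineno}
            )
    return fragments


def fragments_covering(fragments, changed_lines):
    """Keep fragments whose line range contains at least one changed line."""
    return [
        f for f in fragments
        if any(f["start"] <= line <= f["end"] for line in changed_lines)
    ]


SOURCE = """def load(path):
    return open(path).read()

def process_data(items):
    total = 0
    for item in items:
        total += item
    return total
"""

# A change on line 6 selects the whole process_data function, not the file.
print(fragments_covering(python_fragments(SOURCE), [6]))
```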
359+
360+ ### Key Design Decisions
361+
362+ **Why Personalized PageRank over call-graph BFS?** BFS requires
363+ arbitrary depth limits and treats all edges equally. PPR provides
364+ natural exponential decay, respects edge weights, and converges
365+ to a principled relevance distribution.
366+
367+ **Why max-aggregation for edge combination?** Multiple edge types
368+ often agree on the same relationship. Taking the max avoids
369+ inflating weights through redundant signals while preserving the
370+ strongest evidence from any perspective.
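A minimal sketch of max-aggregation, assuming each builder returns a dict keyed by `(src, dst)` fragment pairs; the function name and the sample fragment identifiers are hypothetical.

```python
def aggregate_edges(builder_outputs):
    """Combine per-builder edge dicts, keeping the strongest weight per pair."""
    combined = {}
    for edges in builder_outputs:
        for pair, weight in edges.items():
            if weight > combined.get(pair, 0.0):
                combined[pair] = weight
    return combined


# Two builders agree on the same relationship; the weights are not summed.
semantic = {("api.py:handler", "db.py:query"): 0.55}
similarity = {("api.py:handler", "db.py:query"): 0.30,
              ("api.py:handler", "util.py:fmt"): 0.20}
print(aggregate_edges([semantic, similarity]))
```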
371+
372+ **Why submodular greedy over knapsack?** For monotone submodular
373+ utilities, greedy carries a classic (1 - 1/e) ≈ 63% guarantee under a
374+ cardinality constraint; density ordering is its budgeted analogue. With
375+ lazy evaluation, it runs in near-linear time with strong coverage.
376+
377+ **Why asymmetric edge weights?** Code dependencies are
378+ directional. "A imports B" means A needs B for context; B doesn't
379+ necessarily need A. Reverse factors (0.4–0.7 of forward weight)
380+ enable bidirectional graph search while respecting this asymmetry.
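The reverse-factor idea can be sketched as below, assuming an adjacency-dict graph; the function name, the file names, and the choice of 0.5 (within the stated 0.4–0.7 range) are illustrative assumptions.

```python
def add_dependency(graph, src, dst, weight, reverse_factor=0.5):
    """Record src -> dst at full weight and dst -> src at a reduced weight."""
    forward = graph.setdefault(src, {})
    forward[dst] = max(forward.get(dst, 0.0), weight)
    backward = graph.setdefault(dst, {})
    backward[src] = max(backward.get(src, 0.0), weight * reverse_factor)
    return graph


# "handler.rs uses a symbol from models.rs": strong forward, weaker reverse edge.
graph = add_dependency({}, "handler.rs", "models.rs", weight=0.95)
print(graph)
```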
381+
382+ **Why hub suppression?** Common utility modules (logging, helpers,
383+ config) receive edges from everywhere. Without dampening, they
384+ dominate PPR scores and pull in unrelated code. Log-scaled
385+ in-degree suppression at the 95th percentile keeps them accessible
386+ without letting them dominate.
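One way such suppression could work is sketched below. The exact formula is an assumption: in-edges of any node whose in-degree exceeds the 95th-percentile in-degree (nearest-rank) are scaled by `log(1 + threshold) / log(1 + in_degree)`. The function name and the sample graph are hypothetical.

```python
import math


def suppress_hubs(edges, percentile=0.95):
    """Downweight edges pointing at unusually high in-degree nodes.

    edges: {(src, dst): weight}. Returns a new dict; the input is not modified.
    """
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    if not in_degree:
        return {}
    ranked = sorted(in_degree.values())
    # Nearest-rank percentile of the in-degree distribution.
    threshold = ranked[max(0, math.ceil(percentile * len(ranked)) - 1)]
    result = {}
    for (src, dst), weight in edges.items():
        degree = in_degree[dst]
        if degree > threshold:
            # Assumed log-scaled damping factor, always below 1 for hubs.
            weight *= math.log1p(threshold) / math.log1p(degree)
        result[(src, dst)] = weight
    return result


# Hypothetical graph: 19 modules each import "logging" and one private dependency.
edges = {}
for i in range(19):
    edges[(f"mod{i}", "logging")] = 0.5
    edges[(f"mod{i}", f"dep{i}")] = 0.5
damped = suppress_hubs(edges)
print(damped[("mod0", "logging")], damped[("mod0", "dep0")])
```

Edges into the shared `logging` node are reduced while the private dependencies keep their original weight, so the hub stays reachable without dominating the scores.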
387+
388+ ### Tunable Parameters
389+
390+ | Parameter  | Default | Controls                                      |
391+ |------------|---------|-----------------------------------------------|
392+ | `--budget` | 50000   | Maximum output tokens                         |
393+ | `--alpha`  | 0.60    | PPR damping: higher = broader propagation     |
394+ | `--tau`    | 0.08    | Stopping threshold: higher = earlier stop     |
395+ | `--full`   | false   | Bypass smart selection                        |
396+
397+ ---
398+
399+ ## Technology Choices
400+
401+ | Decision | Choice | Rationale |
402+ | -------------| -------------------| ------------------------------|
403+ | Output | YAML | LLM-readable, literal blocks |
404+ | Tokens | tiktoken o200k | GPT-4o standard, exact BPE |
405+ | Ignores | pathspec | gitignore-compatible |
406+ | Parsing | tree-sitter | 13+ languages, AST-level |
407+ | Ranking | PPR | Relevance with natural decay |
408+ | Selection | Lazy greedy | Near-optimal, near-linear time |
409+ | Git | subprocess UTF-8 | Platform-safe, non-ASCII |
410+ | Diff | git diff unified=0| Exact line ranges |
221 411
222 412 ## License
223 413