| 1 | +# `internal/intelligence/repomap/` |
| 2 | + |
| 3 | +> Deep code-analysis engine for hawk: language-aware symbol extraction, |
| 4 | +> static analysis, search, quality signals, API scanning, and incremental |
| 5 | +> indexing. Distinct from `internal/context/repomap`, which is the narrow |
| 6 | +> prompt-injection shim used by the context layer. |
| 7 | + |
| 8 | +## What it does |
| 9 | + |
| 10 | +`Generate(dir, opts)` walks a directory, dispatches each supported source |
| 11 | +file to a language-aware parser, and returns a `RepoMap`: a token-budgeted |
| 12 | +summary of files and their top-level symbols suitable for injection into |
| 13 | +LLM prompts. Around that core the package accumulates a large set of |
| 14 | +specialised analyses - call graph, import graph, type hierarchy, code |
| 15 | +ownership, BM25 search, cyclomatic complexity, code smells, dead-code |
| 16 | +detection, health score, doc linter, migration detector, HTTP route |
| 17 | +scanner, and an incremental file-hash cache - that all share the same |
| 18 | +parsed-symbol substrate. |
| 19 | + |
| 20 | +The package is stdlib-only at its core (`go/parser`, `go/ast`, `go/token`, |
| 21 | +`encoding/*`). The only third-party dependency is |
| 22 | +`github.com/fsnotify/fsnotify` for file watching; hawk's `internal/scoring` and |
| 23 | +`internal/ui/icons` are pulled in where they are used. Tree-sitter is |
| 24 | +deliberately not required: Go is parsed with `go/ast` and other languages |
| 25 | +are handled by an enhanced regex extractor with scope tracking. |
| 26 | + |
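As a rough illustration of the `go/ast` side of that split, the sketch below lists the top-level declarations of one Go source file using only the standard library. The `Symbol` struct and `extractSymbols` function are hypothetical names chosen for this example; the package's real types and parser functions may look different.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Symbol is a hypothetical record for one top-level declaration.
type Symbol struct {
	Kind string // "func", "method", "type", "var", or "const"
	Name string
	Line int
}

// extractSymbols parses Go source text and lists its top-level declarations.
func extractSymbols(filename, src string) ([]Symbol, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	var syms []Symbol
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			kind := "func"
			if d.Recv != nil {
				kind = "method"
			}
			syms = append(syms, Symbol{kind, d.Name.Name, fset.Position(d.Pos()).Line})
		case *ast.GenDecl:
			for _, spec := range d.Specs {
				switch s := spec.(type) {
				case *ast.TypeSpec:
					syms = append(syms, Symbol{"type", s.Name.Name, fset.Position(s.Pos()).Line})
				case *ast.ValueSpec:
					kind := "var"
					if d.Tok == token.CONST {
						kind = "const"
					}
					for _, n := range s.Names {
						syms = append(syms, Symbol{kind, n.Name, fset.Position(n.Pos()).Line})
					}
				}
			}
		}
	}
	return syms, nil
}

func main() {
	src := "package demo\n\ntype Server struct{}\n\nfunc (s *Server) Run() {}\n\nfunc New() *Server { return nil }\n\nconst Version = \"1\"\n"
	syms, err := extractSymbols("demo.go", src)
	if err != nil {
		panic(err)
	}
	for _, s := range syms {
		fmt.Printf("%s %s (line %d)\n", s.Kind, s.Name, s.Line)
	}
}
```

Import declarations are skipped on purpose here, since the sketch only cares about symbols a prompt summary would list.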
| 27 | +## Architecture |
| 28 | + |
| 29 | +```mermaid |
| 30 | +flowchart TB |
| 31 | + subgraph Entry["Entry point"] |
| 32 | + REPOMAP[repomap.go<br/>Generate, RepoMap, Options] |
| 33 | + end |
| 34 | + |
| 35 | + subgraph Core["Core"] |
| 36 | + CACHE[cache.go<br/>in-process LRU] |
| 37 | + WATCHER[watcher.go<br/>fsnotify wrapper] |
| 38 | + GITIGNORE[gitignore.go<br/>composed rules] |
| 39 | + PATTERNS[patterns.go<br/>include/exclude loader] |
| 40 | + end |
| 41 | + |
| 42 | + subgraph Symbols["Symbols / parsing"] |
| 43 | + PARSER[parser.go<br/>regex-Go] |
| 44 | + ENHANCED[parser_enhanced.go<br/>AST-Go] |
| 45 | + LANGS[parser_langs.go<br/>regex non-Go] |
| 46 | + TS[treesitter.go<br/>scope-aware] |
| 47 | + end |
| 48 | + |
| 49 | + subgraph Static["Static analysis"] |
| 50 | + CALL[callgraph.go] |
| 51 | + DEP[depgraph.go] |
| 52 | + IMP[imports.go] |
| 53 | + HIER[hierarchy.go] |
| 54 | + IFACE[interface_extract.go] |
| 55 | + COCHG[cochange.go] |
| 56 | + CHG[changeset.go] |
| 57 | + OWN[ownership.go] |
| 58 | + SHAP[shapley.go] |
| 59 | + end |
| 60 | + |
| 61 | + subgraph Search["Search / navigation"] |
| 62 | + NAV[navigation.go] |
| 63 | + SEM[semantic.go] |
| 64 | + SEMSRCH[semantic_search.go] |
| 65 | + RERANK[rerank.go] |
| 66 | + PR[pagerank.go] |
| 67 | + PRED[predict.go] |
| 68 | + end |
| 69 | + |
| 70 | + subgraph Quality["Quality signals"] |
| 71 | + CPLX[complexity.go] |
| 72 | + SMELL[smells.go] |
| 73 | + HEALTH[health_score.go] |
| 74 | + DOC[doclint.go] |
| 75 | + DEAD[dead_code.go] |
| 76 | + MIG[migration_detector.go] |
| 77 | + end |
| 78 | + |
| 79 | + subgraph API["API surface"] |
| 80 | + SCAN[api_scanner.go] |
| 81 | + end |
| 82 | + |
| 83 | + subgraph Incr["Incremental"] |
| 84 | + INCR[incremental.go] |
| 85 | + INCRM[incremental_map.go] |
| 86 | + end |
| 87 | + |
| 88 | + subgraph Group["Grouping"] |
| 89 | + GROUPER[file_grouper.go] |
| 90 | + SUMMARY[summary.go] |
| 91 | + end |
| 92 | + |
| 93 | + REPOMAP --> CACHE |
| 94 | + REPOMAP --> WATCHER |
| 95 | + REPOMAP --> GITIGNORE |
| 96 | + REPOMAP --> PATTERNS |
| 97 | + REPOMAP --> PARSER |
| 98 | + REPOMAP --> ENHANCED |
| 99 | + REPOMAP --> LANGS |
| 100 | + REPOMAP --> TS |
| 101 | + Static --> IMP |
| 102 | + Static --> CALL |
| 103 | + Search --> PR |
| 104 | + Search --> SEM |
| 105 | + Search --> SEMSRCH |
| 106 | + Quality --> CPLX |
| 107 | + Quality --> SMELL |
| 108 | + Quality --> HEALTH |
| 109 | + Quality --> DOC |
| 110 | + Quality --> DEAD |
| 111 | + Quality --> MIG |
| 112 | + API --> SCAN |
| 113 | + Incr --> INCR |
| 114 | + Incr --> INCRM |
| 115 | + Group --> GROUPER |
| 116 | + Group --> SUMMARY |
| 117 | +``` |
| 118 | + |
| 119 | +## File groups |
| 120 | + |
| 121 | +| Group | Files | Purpose | |
| 122 | +|------------------------|--------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------| |
| 123 | +| **Core** | `repomap.go`, `cache.go`, `watcher.go`, `gitignore.go`, `patterns.go` | Entry point, file scanning, ignore and include/exclude rules, file watching, in-process symbol cache |
| 124 | +| **Symbols / parsing** | `parser.go`, `parser_enhanced.go`, `parser_langs.go`, `treesitter.go` | Language-aware symbol extraction: regex Go, AST Go, regex non-Go, scope-aware (tree-sitter-like) | |
| 125 | +| **Static analysis** | `callgraph.go`, `depgraph.go`, `imports.go`, `hierarchy.go`, `interface_extract.go`, `cochange.go`, `changeset.go`, `ownership.go`, `shapley.go` | Callers/callees, package-level deps, import graph, type hierarchy, exported surface, git-history co-change, change-set context, ownership, Shapley value ranker | |
| 126 | +| **Search / navigation**| `navigation.go`, `semantic.go`, `semantic_search.go`, `rerank.go`, `pagerank.go`, `predict.go` | LSP-free navigation index, BM25 search, PageRank ranking, reranking, relevance prediction | |
| 127 | +| **Quality signals** | `complexity.go`, `smells.go`, `health_score.go`, `doclint.go`, `dead_code.go`, `migration_detector.go` | Cyclomatic complexity, code smells, health rollup, doc linter, dead code, deprecated-API detection | |
| 128 | +| **API surface** | `api_scanner.go` | HTTP route scanners (Chi, net/http, Gin, Echo, Gorilla, Fiber) + OpenAPI export | |
| 129 | +| **Incremental** | `incremental.go`, `incremental_map.go` | `CodeIndexer` interface and reindex loop, persistent on-disk symbol cache | |
| 130 | +| **Grouping** | `file_grouper.go`, `summary.go` | File grouping, codebase summary suitable for prompt injection | |
| 131 | + |
| 132 | +## Entry points |
| 133 | + |
| 134 | +The full public surface is best read from the source. The headline entry points |
| 135 | +are: |
| 136 | + |
| 137 | +- **`Generate(dir string, opts Options) (*RepoMap, error)`** in `repomap.go` - |
| 138 | + the canonical entry point. Walks `dir`, parses every supported file, |
| 139 | + and returns a `RepoMap` with `Files` and a `TokenEst`. |
| 140 | +- **`(*RepoMap).Format(maxTokens int) string`** in `repomap.go` - renders |
| 141 | + the map as text, truncating to fit `maxTokens`. |
| 142 | +- **`BuildCallGraph(root string) (*CallGraph, error)`** in `callgraph.go` - |
| 143 | + Go-only caller/callee graph from `go/ast`. |
| 144 | +- **`BuildDepGraph` / `NewDepGraph` / `BuildFromGoMod` / `BuildFromPackageJSON`** |
| 145 | + in `depgraph.go` - Go and JS/TS dependency graphs. |
| 146 | +- **`BuildImportGraph(root string) (*ImportGraph, error)`** in `imports.go` - |
| 147 | + file-level import graph. |
| 148 | +- **`NewNavIndex` / `(*NavIndex).BuildIndex` / `(*NavIndex).GoToDefinition` / |
| 149 | + `(*NavIndex).FindReferences` / `(*NavIndex).FindImplementations`** in |
| 150 | + `navigation.go` - the LSP-free navigation API. |
| 151 | +- **`BuildSemanticIndex` / `(*SemanticIndex).Search` / |
| 152 | + `NewSemanticSearchIndex`** in `semantic.go` / `semantic_search.go` - |
| 153 | + chunked TF-IDF and BM25 search. |
| 154 | +- **`BuildSymbolGraph` / `(*SymbolGraph).TopSymbols`** in `pagerank.go` - |
| 155 | + symbol-level PageRank. |
| 156 | +- **`NewComplexityAnalyzer` / `(*ComplexityAnalyzer).FindHotspots`** in |
| 157 | + `complexity.go` - complexity hotspots. |
| 158 | +- **`NewSmellDetector` / `(*SmellDetector).ScanDirectory`** in |
| 159 | + `smells.go` - code smell detection. |
| 160 | +- **`NewHealthScorer` / `(*HealthScorer).Score` / `FormatScore` / |
| 161 | + `CompareScores`** in `health_score.go` - health score rollup and |
| 162 | + before/after diff. |
| 163 | +- **`NewDeadCodeDetector` / `(*DeadCodeDetector).Detect`** in |
| 164 | + `dead_code.go` - dead-code detection. |
| 165 | +- **`NewMigrationDetector` / `(*MigrationDetector).Scan` / |
| 166 | + `FormatOpportunities` / `AutoFix`** in `migration_detector.go` - |
| 167 | + deprecated-API migration. |
| 168 | +- **`NewAPIScanner` / `(*APIScanner).Scan` / `FormatAPIMap` / |
| 169 | + `GenerateOpenAPI`** in `api_scanner.go` - HTTP route scanner. |
| 170 | +- **`NewIncrementalMap(cacheDir string) (*IncrementalMap, error)`** in |
| 171 | + `incremental_map.go` - persistent on-disk symbol cache. |
| 172 | +- **`IncrementalReindex(dir string, ignore []string, indexer CodeIndexer) |
| 173 | + (added, skipped, removed int, err error)`** in `incremental.go` - the |
| 174 | + diff-and-reindex loop for the `CodeIndexer` interface. |
| 175 | +- **`NewFileWatcher(root string, onChange func(path string)) |
| 176 | + (*FileWatcher, error)`** in `watcher.go` - fsnotify wrapper. |
| 177 | +- **`NewSummaryGenerator(projectDir string, maxTokens int) |
| 178 | + *SummaryGenerator` / `RenderForPrompt` / `RenderCompact`** in |
| 179 | + `summary.go` - the prompt-injectable codebase summary. |
| 180 | +- **`BuildHierarchy(root string) (*HierarchicalSummary, error)`** in |
| 181 | + `hierarchy.go` - 3-level project summary. |
| 182 | +- **`PredictRelevantFiles` / `NewRecentEditTracker`** in `predict.go` - |
| 183 | + relevance prediction from prompt, recent edits, import graph, and |
| 184 | + symbol map. |
| 185 | +- **`BuildCoChangeAnalysis(root string, commitLimit int) |
| 186 | + (*CoChangeAnalysis, error)`** in `cochange.go` - git-history co-change. |
| 187 | +- **`FromGitDiff` / `FromGitDiffRange`** in `changeset.go` - change-set |
| 188 | + context. |
| 189 | +- **`NewOwnershipMap` / `(*OwnershipMap).Compute`** in `ownership.go` - |
| 190 | + per-file ownership. |
| 191 | +- **`NewShapleyRanker(chunks []CodeChunk) *ShapleyRanker`** in |
| 192 | + `shapley.go` - Shapley-value chunk ranking. |
| 194 | + |
| 195 | +## Storage model |
| 196 | + |
| 197 | +- **In-process symbol cache** (`cache.go`): an LRU keyed by |
| 198 | + `(path, modtime)` and capped at `defaultMaxSymbolCacheEntries` (5000) |
| 199 | + entries. `parseFileSymbols` in `repomap.go` consults it. It lives only |
| 200 | + in memory and is lost when the process exits. |
| 201 | +- **Persistent incremental cache** (`incremental_map.go`): JSON file at |
| 202 | + `<cacheDir>/repomap-cache.json` (typically `.hawk/repomap-cache.json`) |
| 203 | + keyed by SHA-256 of file content. `IncrementalReindex` diffs the |
| 204 | + project tree against the cached hash set and re-parses only changed |
| 205 | + files. |
| 206 | +- **Watch protocol** (`watcher.go`): `NewFileWatcher(root, onChange)` walks |
| 207 | + the tree, registers every non-hidden, non-vendor directory with |
| 208 | + `fsnotify.Watcher`, and invokes `onChange(path)` on |
| 209 | + `Write`/`Create`/`Remove` events for supported source files. `Start` |
| 210 | + launches the event loop goroutine; `Stop` terminates it. |
| 211 | + |
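The hash-and-diff step described for the persistent cache can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `hashContent`, `diffHashes`, and the path-to-hash map layout are hypothetical, and the real cache format in `repomap-cache.json` may be organised differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashContent returns the hex-encoded SHA-256 digest of a file's bytes.
func hashContent(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// diffHashes compares cached hashes with the current tree (both keyed by
// path, an assumption of this sketch) and reports which paths must be
// parsed again, which can be skipped, and which were removed.
func diffHashes(cached, current map[string]string) (changed, unchanged, removed []string) {
	for path, h := range current {
		if old, ok := cached[path]; ok && old == h {
			unchanged = append(unchanged, path)
		} else {
			changed = append(changed, path)
		}
	}
	for path := range cached {
		if _, ok := current[path]; !ok {
			removed = append(removed, path)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(changed)
	sort.Strings(unchanged)
	sort.Strings(removed)
	return changed, unchanged, removed
}

func main() {
	cached := map[string]string{
		"a.go":   hashContent([]byte("package a\n")),
		"b.go":   hashContent([]byte("package b\n")),
		"old.go": hashContent([]byte("package old\n")),
	}
	current := map[string]string{
		"a.go":   hashContent([]byte("package a\n")),
		"b.go":   hashContent([]byte("package b\n\nfunc B() {}\n")),
		"new.go": hashContent([]byte("package n\n")),
	}
	changed, unchanged, removed := diffHashes(cached, current)
	fmt.Println("re-parse:", changed)
	fmt.Println("skip:", unchanged)
	fmt.Println("drop:", removed)
}
```

Hashing content rather than comparing modification times means a file that is touched but not edited is still skipped, at the cost of reading every file once per run.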
| 212 | +## Extension points |
| 213 | + |
| 214 | +### Add a new language parser |
| 215 | + |
| 216 | +1. Add the extension to `isSupportedExt` in `repomap.go` so the walker |
| 217 | + picks up the new files. |
| 218 | +2. Add a new case to `parseFileSymbols` in `repomap.go` that dispatches |
| 219 | + to a `parseX` function. |
| 220 | +3. Add the `parseX(src string) []Symbol` function. For most languages |
| 221 | + you can copy the `jsSpec` / `cSpec` patterns in `parser_langs.go`. |
| 222 | +4. If the language needs scope-aware extraction, add a new |
| 223 | + `TreeSitterParser` method in `treesitter.go` instead. |
| 224 | +5. Optionally wire the same extension into `detectLang` in |
| 225 | + `internal/context/repomap/scan.go` if the prompt-injection shim |
| 226 | + should also pick it up. |
| 227 | + |
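Step 3 above might look roughly like the sketch below for a line-oriented regex extractor. Everything here is illustrative: Lua is picked only as an example language, `parseLua` and `luaFunc` are hypothetical names, and the `Symbol` struct is a stand-in whose fields may not match the package's own type or the `jsSpec` / `cSpec` structure.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Symbol is a hypothetical stand-in for the package's symbol type.
type Symbol struct {
	Kind string
	Name string
	Line int
}

// luaFunc matches "function name(" and "local function name(" at the start
// of a line, allowing dotted or colon-qualified names such as M.run.
var luaFunc = regexp.MustCompile(`^\s*(?:local\s+)?function\s+([A-Za-z_][\w.:]*)\s*\(`)

// parseLua scans the source line by line and records each function header.
func parseLua(src string) []Symbol {
	var syms []Symbol
	for i, line := range strings.Split(src, "\n") {
		if m := luaFunc.FindStringSubmatch(line); m != nil {
			syms = append(syms, Symbol{Kind: "func", Name: m[1], Line: i + 1})
		}
	}
	return syms
}

func main() {
	src := "local function helper(x)\n  return x\nend\n\nfunction M.run()\nend\n-- function commented()\n"
	for _, s := range parseLua(src) {
		fmt.Printf("%s %s (line %d)\n", s.Kind, s.Name, s.Line)
	}
}
```

Anchoring the pattern at the start of the line keeps the commented-out header from matching, which is about as far as a regex extractor can go without the scope tracking mentioned in step 4.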
| 228 | +### Add a new code smell |
| 229 | + |
| 230 | +1. Add a `Detector func(...) []CodeSmell` field on `SmellDetector` in |
| 231 | + `smells.go`. |
| 232 | +2. Wire the new field into `NewSmellDetector` and `ScanDirectory`. |
| 233 | +3. Tune `SmellThresholds` defaults if the new smell has tunable limits. |
| 234 | + |
| 235 | +### Add a new HTTP framework scanner |
| 236 | + |
| 237 | +1. Add a `ScanX(content, file string) []APIEndpoint` function in |
| 238 | + `api_scanner.go` using one of the existing scanners (e.g. `ScanChi`) |
| 239 | + as a template. |
| 240 | +2. Add the corresponding case to `DetectFramework` so the dispatcher |
| 241 | + knows to use the new scanner. |
| 242 | +3. If the new framework uses a different routing style, update |
| 243 | + `FormatAPIMap` and `GenerateOpenAPI` to handle the new metadata. |
| 244 | + |
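A scanner following the `ScanX(content, file string) []APIEndpoint` shape from step 1 could be sketched like this. The `APIEndpoint` fields, the `scanExample` name, and the method-call routing style it matches are assumptions made for the example; the existing scanners in `api_scanner.go` are the authoritative template.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// APIEndpoint is a hypothetical stand-in for the package's endpoint type.
type APIEndpoint struct {
	Method string
	Path   string
	File   string
	Line   int
}

// routeCall matches calls such as r.Get("/users", handler) and captures the
// HTTP method name and the literal route path.
var routeCall = regexp.MustCompile(`\.(Get|Post|Put|Patch|Delete)\(\s*"([^"]+)"`)

// scanExample extracts endpoints from one file's content, line by line.
func scanExample(content, file string) []APIEndpoint {
	var eps []APIEndpoint
	for i, line := range strings.Split(content, "\n") {
		for _, m := range routeCall.FindAllStringSubmatch(line, -1) {
			eps = append(eps, APIEndpoint{
				Method: strings.ToUpper(m[1]),
				Path:   m[2],
				File:   file,
				Line:   i + 1,
			})
		}
	}
	return eps
}

func main() {
	content := "r.Get(\"/users\", listUsers)\nr.Post(\"/users\", createUser)\n// no routes on this line\n"
	for _, ep := range scanExample(content, "routes.go") {
		fmt.Printf("%s %s (%s:%d)\n", ep.Method, ep.Path, ep.File, ep.Line)
	}
}
```

A regex of this kind only sees literal path strings; routes built from constants or nested groups would need extra handling.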
| 245 | +## Performance and scaling |
| 246 | + |
| 247 | +- **`Generate`** is O(N) in the number of files, capped by |
| 248 | + `Options.MaxFiles` (default 500). Both the directory walk and the |
| 249 | + per-file parsing are single-threaded, but the work per file is |
| 250 | + bounded. |
| 251 | +- **Symbol cache** (`cache.go`) keeps hot files in memory; it is |
| 252 | + process-local and does not survive a restart. |
| 253 | +- **IncrementalMap** (`incremental_map.go`) persists hashes and symbol |
| 254 | + lists on disk. `IncrementalReindex` only re-parses files whose SHA-256 |
| 255 | + has changed. For very large repositories (tens of thousands of files) |
| 256 | + prefer the incremental path over `Generate`. |
| 257 | +- **Static-analysis passes** (`callgraph`, `depgraph`, `pagerank`, |
| 258 | + `shapley`) are O(V + E) per iteration over the symbol graph and scale |
| 259 | + linearly with the number of declarations, not the number of lines. |
| 260 | +- **BM25 search** (`semantic_search.go`) is O(Q * D) per query, where Q |
| 261 | + is the number of query terms and D is the number of indexed documents. |
| 262 | + IDF and average document length are precomputed and cached. |
| 263 | +- **Health score** (`health_score.go`) is O(F) per dimension (F = file |
| 264 | + count) and runs all dimensions in sequence; for very large projects |
| 265 | + the per-file scans are the bottleneck. |
| 266 | +- **Tree-sitter** itself is not used. The "tree-sitter-style" scope-aware |
| 267 | + extractor in `treesitter.go` is a pure-Go regex implementation that |
| 268 | + avoids the CGO and binary dependencies of the real library. |
| 269 | + |
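To make the O(Q * D) cost of the BM25 bullet concrete, here is a small self-contained scorer. It uses the common BM25 formula with `k1 = 1.2` and `b = 0.75` and a whitespace tokenizer; these parameters, the `bm25Index` type, and its methods are assumptions of this sketch and are not taken from `semantic_search.go`.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Common BM25 defaults; the package's actual values are not known here.
const (
	bm25K1 = 1.2
	bm25B  = 0.75
)

// bm25Index holds per-document term frequencies plus the precomputed
// IDF table and average document length.
type bm25Index struct {
	docs   []map[string]int
	lens   []int
	avgLen float64
	idf    map[string]float64
}

func newBM25Index(texts []string) *bm25Index {
	idx := &bm25Index{idf: map[string]float64{}}
	df := map[string]int{} // number of documents containing each term
	total := 0
	for _, t := range texts {
		words := strings.Fields(strings.ToLower(t))
		tf := map[string]int{}
		for _, w := range words {
			tf[w]++
		}
		for w := range tf {
			df[w]++
		}
		idx.docs = append(idx.docs, tf)
		idx.lens = append(idx.lens, len(words))
		total += len(words)
	}
	n := float64(len(texts))
	if n > 0 {
		idx.avgLen = float64(total) / n
	}
	for w, d := range df {
		idx.idf[w] = math.Log(1 + (n-float64(d)+0.5)/(float64(d)+0.5))
	}
	return idx
}

// score computes the BM25 score of one document for a query: one pass
// over the Q query terms, so ranking D documents costs O(Q * D).
func (idx *bm25Index) score(query string, doc int) float64 {
	var s float64
	for _, q := range strings.Fields(strings.ToLower(query)) {
		tf := float64(idx.docs[doc][q])
		if tf == 0 {
			continue
		}
		norm := 1 - bm25B + bm25B*float64(idx.lens[doc])/idx.avgLen
		s += idx.idf[q] * tf * (bm25K1 + 1) / (tf + bm25K1*norm)
	}
	return s
}

func main() {
	idx := newBM25Index([]string{
		"parse go source files",
		"render http routes",
		"parse http routes and parse handlers",
	})
	for doc := range idx.docs {
		fmt.Printf("doc %d: %.3f\n", doc, idx.score("http routes", doc))
	}
}
```

Because IDF and the average length are computed once at index time, each query only pays for the term-frequency lookups.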
| 270 | +## Relationship to `internal/context/repomap` |
| 271 | + |
| 272 | +`internal/context/repomap` is a much narrower package - essentially just |
| 273 | +`RepoMap(root, budget) (string, error)`. It is the prompt-injection shim |
| 274 | +that hawk's context layer calls when it needs a budgeted overview for the |
| 275 | +system prompt. It does its own AST parsing, PageRank pass, and rendering, |
| 276 | +and shares nothing with this package except the name. |
| 277 | + |
| 278 | +Callers that need more than a budgeted text block (symbol-level |
| 279 | +navigation, BM25 search, dead-code detection, OpenAPI export, etc.) |
| 280 | +should import `internal/intelligence/repomap` (this package) directly. |
| 281 | +See the comment at the top of `internal/context/repomap/repomap.go` for |
| 282 | +the other side of the boundary. |
| 283 | + |
| 284 | +## See also |
| 285 | + |
| 286 | +- `doc.go` - go-doc compatible package overview. |
| 287 | +- `doc_test.go` - worked example (calls `Generate` + `Format`). |
| 288 | +- `internal/context/repomap/doc.go` - the shim's perspective. |