|
5 | 5 | ## Context |
6 | 6 |
|
7 | 7 | Phase 1 replaced the storage layer (LanceDB → Antfly) but kept the old indexing |
8 | | -flow intact. That flow was designed around LanceDB constraints: local file storage, |
9 | | -manual embedding pipeline, batch sizing tuned for ONNX model memory, state files |
10 | | -for incremental updates. |
| 8 | +flow intact. That flow was overengineered for its original constraints: local file |
| 9 | +storage, manual embedding pipeline, state files tracking file hashes and document IDs. |
11 | 10 |
|
12 | | -With Antfly as the backend, many of these constraints no longer exist. Rather than |
13 | | -patching the old flow, we should redesign it around what Antfly enables and what |
14 | | -developers actually need. |
| 11 | +Research (see [research.md](./research.md)) found two production-grade tools that |
| 12 | +eliminate most of our custom plumbing: |
15 | 13 |
|
16 | | -See [user-stories.md](./user-stories.md) for the full set of user stories driving |
17 | | -this redesign. |
| 14 | +1. **`@parcel/watcher`** — native file watcher with `getEventsSince()` that tracks |
| 15 | + changes even when our process isn't running (used by VS Code) |
| 16 | +2. **Antfly Linear Merge** — server-side content hashing, dedup, and deletion in |
| 17 | + one API call. Replaces our state file, hash tracking, and upsert logic. |
18 | 18 |
|
19 | | -## Current flow (what exists) |
| 19 | +See [user-stories.md](./user-stories.md) for the 16 user stories driving this redesign. |
| 20 | + |
| 21 | +## Current flow (what we're replacing) |
20 | 22 |
|
21 | 23 | ``` |
22 | | -dev setup → start Antfly (one-time) |
23 | | -dev index . → scan all files → batch insert into Antfly → save state file |
24 | | - ├─ Phase 1: Scan → ts-morph/tree-sitter/remark → Document[] |
25 | | - ├─ Phase 2: Store → batch HTTP inserts (32 docs × CONCURRENCY parallel) |
26 | | - ├─ Phase 3: Git → extract commits → separate table |
27 | | - ├─ Phase 4: GitHub → fetch issues/PRs via gh CLI → separate table |
28 | | - └─ Save state → indexer-state.json (file hashes for incremental) |
29 | | -dev search "query" → hybrid search via Antfly |
| 24 | +dev index . |
| 25 | + ├─ Scan ALL files (glob + parse) |
| 26 | + ├─ Prepare EmbeddingDocument[] from scan results |
| 27 | + ├─ Batch insert (32 docs × CONCURRENCY parallel HTTP calls) |
| 28 | + ├─ Track state: file hashes, document IDs, timestamps → indexer-state.json |
| 29 | + ├─ Git: extract commits → separate table |
| 30 | + ├─ GitHub: fetch issues/PRs → separate table |
| 31 | + └─ Emit events, close |
| 32 | +
|
| 33 | +Problems: |
| 34 | + - Manual trigger required (US-4: changes should be automatic) |
| 35 | + - State file tracks what Antfly already knows (redundant) |
| 36 | + - Batch size 32 when Antfly handles 500 (roughly 15x more HTTP calls than needed)
| 37 | + - No way to know what changed while MCP server was off |
| 38 | + - Git/GitHub coupled to code indexing |
30 | 39 | ``` |
31 | 40 |
|
32 | | -### Problems with current flow |
33 | | - |
34 | | -1. **Manual trigger required** — developer must remember to run `dev index .` after |
35 | | - code changes. AI tools get stale context. (violates US-4) |
36 | | - |
37 | | -2. **State file complexity** — tracks file hashes, document IDs per file, timestamps. |
38 | | - But Antfly does upsert natively — inserting an existing key overwrites. Do we need |
39 | | - the state file at all? |
40 | | - |
41 | | -3. **Embedding delay invisible** — Antfly embeds asynchronously (~2s). `dev index .` |
42 | | - completes before embeddings are ready. Immediate search may return nothing. (violates US-3) |
43 | | - |
44 | | -4. **Three separate VectorStorage instances** — created because LanceDB needed separate |
45 | | - directories. With Antfly, these are just three tables. But the code creates three |
46 | | - separate VectorStorage objects with separate connections. |
47 | | - |
48 | | -5. **Batch sizing is wrong** — indexer uses batch=32 (tuned for ONNX). Antfly can handle |
49 | | - 500 per request. We're making 15x more HTTP calls than needed. |
50 | | - |
51 | | -6. **Git and GitHub coupled to index command** — `dev index .` does code + git + GitHub |
52 | | - in one big command. These are different data sources with different update patterns. |
53 | | - |
54 | 41 | ## Proposed flow |
55 | 42 |
|
56 | | -### The big idea: file watcher + on-demand indexing |
| 43 | +### Architecture |
57 | 44 |
|
58 | 45 | ``` |
59 | | -dev setup → start Antfly + start file watcher (background) |
60 | | - watcher detects file changes → re-indexes changed files automatically |
61 | | -
|
62 | | -dev index . → full scan (first time or explicit refresh) |
63 | | -dev index . --force → clear + full scan |
64 | | -
|
65 | | -# These become separate, optional commands: |
66 | | -dev git index → index git history (already exists) |
67 | | -dev github index → index GitHub issues/PRs (already exists) |
| 46 | +┌─────────────────────────────────────────────────────────────┐ |
| 47 | +│ MCP Server (always running) │ |
| 48 | +│ │ |
| 49 | +│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │ |
| 50 | +│ │ @parcel/ │────▶│ Scanner │────▶│ Antfly │ │ |
| 51 | +│ │ watcher │ │ (ts-morph, │ │ Linear │ │ |
| 52 | +│ │ │ │ tree-sitter) │ │ Merge │ │ |
| 53 | +│ │ getEventsSince│ └──────────────┘ └─────────────┘ │ |
| 54 | +│ └──────────────┘ │ |
| 55 | +│ │ │ |
| 56 | +│ │ on file change │ |
| 57 | +│ ▼ │ |
| 58 | +│ ┌──────────────┐ │ |
| 59 | +│ │ Debounce │ (batch changes, wait 500ms of quiet) │ |
| 60 | +│ │ + Filter │ (ignore node_modules, dist, .git) │ |
| 61 | +│ └──────────────┘ │ |
| 62 | +└─────────────────────────────────────────────────────────────┘ |
68 | 63 | ``` |
69 | 64 |
|
70 | | -**US-4 solved:** The file watcher keeps the index fresh without manual intervention. |
71 | | -Developer saves a file, the watcher re-indexes it within seconds. |
72 | | - |
73 | | -### Alternative: no watcher, just fast incremental |
74 | | - |
75 | | -If a file watcher is too complex for Phase 2, the simpler approach: |
| 65 | +### The flow |
76 | 66 |
|
| 67 | +**First time (`dev index .`):** |
77 | 68 | ``` |
78 | | -dev index . → fast incremental (only changed files, <5s for small changes) |
79 | | - runs automatically on MCP server startup |
80 | | - runs automatically before search if stale (>5 min since last update) |
| 69 | +1. Scan all files → parse → extract code components |
| 70 | +2. Antfly Linear Merge: send all documents |
| 71 | + → Antfly hashes content, stores new docs, skips unchanged |
| 72 | + → Returns: { upserted: 2525, skipped: 0, deleted: 0 } |
| 73 | +3. Save watcher snapshot (for getEventsSince on restart) |
| 74 | +4. Start watching for changes |
81 | 75 | ``` |
82 | 76 |
|
83 | | -### Simplifications enabled by Antfly |
84 | | - |
85 | | -| Old complexity | New simplification | |
86 | | -|---------------|-------------------| |
87 | | -| State file (file hashes, doc IDs) | Antfly upsert by key — just re-insert, it overwrites | |
88 | | -| Three VectorStorage instances | One AntflyClient, three table names | |
89 | | -| Batch size 32 + CONCURRENCY | Single batch size 500, let Antfly handle parallelism | |
90 | | -| Manual embedding step | Antfly auto-embeds on insert | |
91 | | -| Wait for embedding completion | BM25 search works immediately; vector search ready in ~2s | |
92 | | - |
93 | | -### State file: keep or drop? |
94 | | - |
95 | | -**Keep a minimal version.** We still need to know: |
96 | | -- Which files have been indexed (to detect deleted files → remove from Antfly) |
97 | | -- Last index timestamp (to detect staleness) |
| 77 | +**Ongoing (automatic, no user command):** |
| 78 | +``` |
| 79 | +1. @parcel/watcher fires: files A, B, C changed |
| 80 | +2. Debounce (wait 500ms of quiet) |
| 81 | +3. Parse only changed files → extract components |
| 82 | +4. Antfly Linear Merge: send only changed documents |
| 83 | + → Returns: { upserted: 3, skipped: 0, deleted: 1 } |
| 84 | +5. MCP tools immediately have fresh data |
| 85 | +``` |
98 | 86 |
|
99 | | -**Drop:** |
100 | | -- File hashes (just re-insert everything that changed based on mtime) |
101 | | -- Document IDs per file (Antfly handles dedup by key) |
102 | | -- Embedding metadata (Antfly owns this) |
| 87 | +**MCP server restart:** |
| 88 | +``` |
| 89 | +1. @parcel/watcher.getEventsSince(lastSnapshot) |
| 90 | + → "files X, Y, Z changed while you were off" |
| 91 | +2. Parse only those files → extract → merge |
| 92 | +3. Resume watching |
| 93 | +``` |
103 | 94 |
|
104 | | -## Parts |
| 95 | +**Force re-index (`dev index . --force`):** |
| 96 | +``` |
| 97 | +1. Antfly: drop tables, recreate |
| 98 | +2. Full scan + merge (same as first time) |
| 99 | +``` |
105 | 100 |
|
106 | | -| Part | Description | User stories | |
107 | | -|------|-------------|-------------| |
108 | | -| 2.1 | Simplify indexer: drop state complexity, use Antfly upsert | US-3, US-5 | |
109 | | -| 2.2 | Increase batch size, single AntflyClient | US-6 | |
110 | | -| 2.3 | Wait for embedding completion (or BM25 fallback) | US-3 | |
111 | | -| 2.4 | Decouple git/github from `dev index .` | US-10, US-11 | |
112 | | -| 2.5 | Auto-index on MCP server startup | US-4, US-12 | |
113 | | -| 2.6 | File watcher for continuous indexing (stretch) | US-4 | |
114 | | -| 2.7 | `dev status` rework — show Antfly table stats | US-13 | |
| 101 | +### What we drop |
115 | 102 |
|
116 | | -## Decisions to make |
| 103 | +| Old complexity | Replaced by | |
| 104 | +|---------------|-------------| |
| 105 | +| `indexer-state.json` (file hashes, doc IDs) | `@parcel/watcher` snapshots + Antfly Linear Merge | |
| 106 | +| Manual `dev index .` after every change | Automatic via file watcher | |
| 107 | +| Batch size 32 + CONCURRENCY parallelism | Single Linear Merge call per change batch | |
| 108 | +| Three separate VectorStorage instances | One AntflyClient, three table names | |
| 109 | +| `TransformersEmbedder` pipeline | Antfly auto-embeds via Termite | |
| 110 | +| Hash comparison in RepositoryIndexer | Antfly server-side content hashing | |
117 | 111 |
|
118 | | -1. **File watcher or fast incremental?** Watcher is better UX but more complexity. |
119 | | - Fast incremental (<5s) on MCP startup might be enough. |
| 112 | +### What we keep |
120 | 113 |
|
121 | | -2. **State file: minimal or none?** We need *something* to detect deleted files. |
122 | | - Could query Antfly for existing keys and diff, but that's O(n) on every run. |
| 114 | +- **Scanner pipeline** — ts-morph, tree-sitter, remark (proven, well-tested) |
| 115 | +- **Document preparation** — `prepareDocumentsForEmbedding()` (pure transform) |
| 116 | +- **Git indexing** — as a separate command (`dev git index`) |
| 117 | +- **GitHub indexing** — as a separate command (`dev github index`) |
| 118 | +- **MCP adapter layer** — unchanged, consumes search results |
123 | 119 |
|
124 | | -3. **Git/GitHub: part of `dev index .` or separate?** Currently bundled. |
125 | | - Separating them makes `dev index .` faster and each concern independent. |
| 120 | +## Decisions |
126 | 121 |
|
127 | | -4. **Embedding completion: wait or don't?** Antfly's BM25 index is immediate. |
128 | | - Vector search has ~2s delay. Should we wait, or document the tradeoff? |
| 122 | +| Decision | Rationale | Alternatives | |
| 123 | +|----------|-----------|-------------| |
| 124 | +| Use `@parcel/watcher` | Native, `getEventsSince()` survives restarts, VS Code uses it | chokidar (no historical queries), watchman (requires daemon) | |
| 125 | +| Use Antfly Linear Merge | Server-side content hashing eliminates state file entirely | Keep state file + manual upsert (more code, same result) | |
| 126 | +| Watch from MCP server process | MCP server is the long-running process; watcher lives there | Separate daemon (more complexity), CLI-only (no auto-update) | |
| 127 | +| Decouple git/github from `dev index .` | Different update patterns, different data sources | Keep bundled (slower `dev index .`, coupled concerns) | |
| 128 | +| Debounce file changes (500ms) | Avoid re-indexing mid-save; batch rapid changes | Per-file immediate (too many API calls), longer debounce (stale data) | |
| 129 | +| Drop indexer-state.json | Antfly + watcher replace all its functions | Keep for backward compat (dead code) | |
129 | 130 |
|
130 | | -## Open questions |
| 131 | +## Parts |
131 | 132 |
|
132 | | -- What does the MCP server startup look like? Does it auto-index? |
133 | | -- How does Cursor's workspace detection interact with auto-indexing? |
134 | | -- Should `dev index .` be a command users run, or should it be invisible? |
135 | | -- What's the right granularity for file watching? (per-file? per-save? debounced?) |
| 133 | +| Part | Description | User stories | Risk | |
| 134 | +|------|-------------|-------------|------| |
| 135 | +| 2.1 | Replace batch insert with Antfly Linear Merge | US-3, US-5, US-6 | Low | |
| 136 | +| 2.2 | Add `@parcel/watcher` to MCP server | US-4, US-12 | Medium | |
| 137 | +| 2.3 | Debounce + incremental re-index on file change | US-4 | Medium | |
| 138 | +| 2.4 | `getEventsSince` on MCP server startup | US-5, US-12 | Low | |
| 139 | +| 2.5 | Decouple git/github from `dev index .` | US-10, US-11 | Low | |
| 140 | +| 2.6 | Drop indexer-state.json, simplify RepositoryIndexer | US-3, US-6 | Medium | |
| 141 | +| 2.7 | `dev status` rework — Antfly table stats + watcher status | US-13 | Low | |
| 142 | +| 2.8 | E2E tests: index real repo, search, verify results | US-3, US-8, US-9 | Low | |
| 143 | + |
| 144 | +## Risk register |
| 145 | + |
| 146 | +| Risk | Likelihood | Impact | Mitigation | |
| 147 | +|------|-----------|--------|------------| |
| 148 | +| `@parcel/watcher` native addon install issues | Medium | Medium | Fall back to chokidar; or bundle prebuilt binaries | |
| 149 | +| Antfly Linear Merge API may not exist in the SDK yet | Medium | High | Verify in spike; use raw REST if the SDK lacks it |
| 150 | +| File watcher misses changes (edge cases) | Low | Medium | `dev index .` always available as manual fallback | |
| 151 | +| Large repos overwhelm watcher (10k+ files) | Low | Medium | Filter aggressively (ignore node_modules, dist, etc.) | |
| 152 | +| Debounce window too long/short | Low | Low | Make configurable; 500ms default is standard | |
| 153 | + |
| 154 | +## Verification checklist |
| 155 | + |
| 156 | +- [ ] `dev index .` works end-to-end with Linear Merge |
| 157 | +- [ ] File watcher detects changes and auto-re-indexes |
| 158 | +- [ ] MCP server restart catches up via `getEventsSince` |
| 159 | +- [ ] `dev_search "validateUser"` returns exact match (BM25) |
| 160 | +- [ ] `dev_search "authentication middleware"` returns semantic matches (vector) |
| 161 | +- [ ] `dev index . --force` clears and rebuilds |
| 162 | +- [ ] `dev git index` works independently |
| 163 | +- [ ] `dev github index` works independently |
| 164 | +- [ ] `dev status` shows fresh Antfly stats + watcher status |
| 165 | +- [ ] No `indexer-state.json` written or read |
| 166 | +- [ ] Works on this repo (dev-agent) end-to-end |
136 | 167 |
|
137 | 168 | ## Dependencies |
138 | 169 |
|
139 | 170 | - Phase 1 (Antfly migration) — merged |
140 | | -- Antfly server running |
141 | | -- Understanding of MCP server lifecycle (how/when it starts) |
| 171 | +- Antfly Linear Merge API — verify in spike (Part 2.1) |
| 172 | +- `@parcel/watcher` — npm install |
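
The "Debounce + Filter" step in the proposed flow (Parts 2.2 and 2.3) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation. The names `WatchEvent`, `ChangeBatch`, `coalesce`, and `ChangeDebouncer` are hypothetical, and the event shape is an assumption modeled on the `{ type, path }` objects that `@parcel/watcher` is expected to deliver. Sending the resulting batch to Antfly is left out because the Linear Merge API still has to be verified in the spike.

```typescript
// Assumed event shape (hypothetical type, modeled on @parcel/watcher events;
// verify against the installed version before relying on it).
type WatchEvent = { type: "create" | "update" | "delete"; path: string };

// One batch of work for the scanner and the merge step (hypothetical shape).
interface ChangeBatch {
  toParse: string[];  // files to re-scan and re-send
  toDelete: string[]; // files whose documents should be removed
}

// Directories the design says to ignore.
const IGNORED_DIRS = new Set(["node_modules", "dist", ".git"]);

// True if any path segment is an ignored directory name.
function isIgnored(path: string): boolean {
  return path.split(/[\\/]/).some((segment) => IGNORED_DIRS.has(segment));
}

// Collapse a burst of events into one batch. The last event per path wins,
// so "update then delete" becomes a single delete.
function coalesce(events: WatchEvent[]): ChangeBatch {
  const latest = new Map<string, WatchEvent["type"]>();
  for (const event of events) {
    if (!isIgnored(event.path)) latest.set(event.path, event.type);
  }
  const batch: ChangeBatch = { toParse: [], toDelete: [] };
  latest.forEach((type, path) => {
    if (type === "delete") batch.toDelete.push(path);
    else batch.toParse.push(path);
  });
  return batch;
}

// Buffers events and flushes once no new event has arrived for `quietMs`.
class ChangeDebouncer {
  private pending: WatchEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private onFlush: (batch: ChangeBatch) => void;
  private quietMs: number;

  constructor(onFlush: (batch: ChangeBatch) => void, quietMs: number = 500) {
    this.onFlush = onFlush;
    this.quietMs = quietMs;
  }

  // Record new events and restart the quiet-period timer.
  push(events: WatchEvent[]): void {
    this.pending.push(...events);
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.quietMs);
  }

  // Flush immediately (also useful on shutdown and in tests).
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.length === 0) return;
    const batch = coalesce(this.pending);
    this.pending = [];
    this.onFlush(batch);
  }
}
```

In this sketch the watcher callback would call `push(events)`, and `onFlush` would parse `toParse`, then hand both lists to whatever merge call the spike confirms. Keeping `coalesce` a pure function makes the filtering and last-event-wins behavior easy to unit test without a real watcher.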