@@ -16,7 +16,9 @@ eliminate most of our custom plumbing:
 2. **Antfly Linear Merge** — server-side content hashing, dedup, and deletion in
    one API call. Replaces our state file, hash tracking, and upsert logic.
 
-See [user-stories.md](./user-stories.md) for the 16 user stories driving this redesign.
+See [user-stories.md](./user-stories.md) for the user stories driving this redesign.
+
+---
 
 ## Current flow (what we're replacing)
 
@@ -38,6 +40,8 @@ Problems:
 - Git/GitHub coupled to code indexing
 ```
 
+---
+
 ## Proposed flow
 
 ### Architecture
@@ -67,117 +71,249 @@ Problems:
 **First time (`dev index .`):**
 ```
 1. Scan all files → parse → extract code components
-2. Antfly Linear Merge: send all documents
-   → Antfly hashes content, stores new docs, skips unchanged
+2. Antfly Linear Merge (delete_missing: true): send all documents
+   → Antfly hashes content, stores new docs, skips unchanged, removes stale
    → Returns: { upserted: 2525, skipped: 0, deleted: 0 }
-3. Save watcher snapshot (for getEventsSince on restart)
+3. Save @parcel/watcher snapshot to ~/.dev-agent/indexes/{hash}/watcher-snapshot
 4. Start watching for changes
 ```
 
 **Ongoing (automatic, no user command):**
 ```
-1. @parcel/watcher fires: files A, B, C changed
+1. @parcel/watcher fires: files A, B, C changed; file D deleted
 2. Debounce (wait 500ms of quiet)
 3. Parse only changed files → extract components
-4. Antfly Linear Merge: send only changed documents
-   → Returns: { upserted: 3, skipped: 0, deleted: 1 }
-5. MCP tools immediately have fresh data
+4. For changed files: Antfly Linear Merge (delete_missing: false) — upsert only
+5. For deleted files: explicitly delete doc IDs that belonged to those files
+6. MCP tools immediately have fresh data
 ```
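
The debounce and the changed/deleted split in the ongoing flow could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `ChangeBatcher`, `Batch`, and `Scheduler` are hypothetical names, and the event shape (`type` of `create`, `update`, or `delete` plus `path`) is assumed to match what `@parcel/watcher` delivers. The timer is injectable only so the quiet-period logic can be tested without a real clock.

```typescript
// Hypothetical sketch: none of these names come from the dev-agent codebase.
type FileEvent = { type: "create" | "update" | "delete"; path: string };

interface Batch {
  changed: string[]; // re-parse, then merge with delete_missing: false
  deleted: string[]; // delete the doc IDs that belonged to these files
}

// Injectable timer so tests can drive the quiet period manually.
interface Scheduler {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realScheduler: Scheduler = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class ChangeBatcher {
  // Last event type seen per path; a later event overrides an earlier one.
  private readonly latest = new Map<string, FileEvent["type"]>();
  private timer: unknown = undefined;
  private readonly onFlush: (batch: Batch) => void;
  private readonly quietMs: number;
  private readonly scheduler: Scheduler;

  constructor(
    onFlush: (batch: Batch) => void,
    quietMs: number = 500,
    scheduler: Scheduler = realScheduler,
  ) {
    this.onFlush = onFlush;
    this.quietMs = quietMs;
    this.scheduler = scheduler;
  }

  // Called from the watcher callback; every call restarts the quiet window.
  push(events: FileEvent[]): void {
    for (const event of events) this.latest.set(event.path, event.type);
    if (this.timer !== undefined) this.scheduler.clear(this.timer);
    this.timer = this.scheduler.set(() => this.flush(), this.quietMs);
  }

  private flush(): void {
    const batch: Batch = { changed: [], deleted: [] };
    for (const [path, type] of this.latest) {
      (type === "delete" ? batch.deleted : batch.changed).push(path);
    }
    this.latest.clear();
    this.timer = undefined;
    this.onFlush(batch);
  }
}
```

Keeping only the last event per path means a file that is saved several times inside one window is parsed once, and a file that is created and then deleted inside the window ends up in `deleted`, which is harmless if no documents exist for it.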
 
 **MCP server restart:**
 ```
-1. @parcel/watcher.getEventsSince(lastSnapshot)
+1. @parcel/watcher.getEventsSince(snapshotPath)
    → "files X, Y, Z changed while you were off"
-2. Parse only those files → extract → merge
-3. Resume watching
+2. If snapshot missing: fall back to full index (same as first time)
+3. If snapshot exists: parse only changed files → merge (delete_missing: false)
+4. Save new snapshot, resume watching
 ```
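
The restart decision above can be expressed as a small pure function. This is a sketch under assumptions: `planCatchUp` and `CatchUpPlan` are hypothetical names, `null` stands for "snapshot missing or unreadable", and the events are assumed to be what `@parcel/watcher`'s `getEventsSince(dir, snapshotPath)` returns. Deleted files are split out here on the assumption that restart catch-up handles them the same way as the watcher flow (explicit deletes).

```typescript
// Hypothetical sketch of the restart catch-up decision.
type FileEvent = { type: "create" | "update" | "delete"; path: string };

type CatchUpPlan =
  | { kind: "full-index"; deleteMissing: true }
  | { kind: "incremental"; deleteMissing: false; changed: string[]; deleted: string[] };

// eventsSinceSnapshot === null means the snapshot was missing or could not be read.
function planCatchUp(eventsSinceSnapshot: FileEvent[] | null): CatchUpPlan {
  if (eventsSinceSnapshot === null) {
    // Same path as the first run: full scan, stale documents removed.
    return { kind: "full-index", deleteMissing: true };
  }
  const changed: string[] = [];
  const deleted: string[] = [];
  for (const event of eventsSinceSnapshot) {
    (event.type === "delete" ? deleted : changed).push(event.path);
  }
  // Incremental catch-up never asks the server to delete missing documents.
  return { kind: "incremental", deleteMissing: false, changed, deleted };
}
```

In the real server the caller would presumably wrap `getEventsSince` in a `try`/`catch`, pass `null` on failure, and call `writeSnapshot(dir, snapshotPath)` once the plan has been executed.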
 
 **Force re-index (`dev index . --force`):**
 ```
-1. Antfly: drop tables, recreate
+1. Antfly: drop table, recreate
 2. Full scan + merge (same as first time)
 ```
 
+### Critical: `delete_missing` scoping
+
+| Operation | `delete_missing` | Why |
+|-----------|-----------------|-----|
+| `dev index .` (full) | `true` | Clean slate — remove docs for deleted files |
+| `dev index . --force` | N/A — drops table | Complete rebuild |
+| Watcher incremental | `false` | Only upsert changed; delete removed files explicitly |
+| MCP restart catchup | `false` | Only process changes since snapshot |
+
+**Safety rule:** Incremental paths NEVER use `delete_missing: true`. Only full index does.
+Unit test enforces this.
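
One way to make the safety rule hard to violate is to derive the flag from the operation in a single place, so no call site sets it by hand. The sketch below assumes hypothetical names (`IndexOperation`, `mergeOptionsFor`, `assertSafeMerge`) and models only the three operations in the table that actually call Linear Merge with a flag; `--force` drops the table first and is left out.

```typescript
// Hypothetical sketch: the only place where delete_missing is decided.
type IndexOperation = "full-index" | "watcher-incremental" | "restart-catchup";

interface MergeOptions {
  delete_missing: boolean;
}

function mergeOptionsFor(operation: IndexOperation): MergeOptions {
  // Only a full index has seen every document, so only it may delete the rest.
  return { delete_missing: operation === "full-index" };
}

// Defensive check for call sites that build options some other way.
function assertSafeMerge(operation: IndexOperation, options: MergeOptions): void {
  if (operation !== "full-index" && options.delete_missing) {
    throw new Error(`delete_missing: true is not allowed for ${operation}`);
  }
}
```

A test along the lines of `linear-merge-scoping.test.ts` could then iterate over every `IndexOperation` value and assert the flag, rather than inspecting individual call sites.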
+
 ### What we drop
 
 | Old complexity | Replaced by |
 |---------------|-------------|
 | `indexer-state.json` (file hashes, doc IDs) | `@parcel/watcher` snapshots + Antfly Linear Merge |
 | Manual `dev index .` after every change | Automatic via file watcher |
 | Batch size 32 + CONCURRENCY parallelism | Single Linear Merge call per change batch |
-| Three separate VectorStorage instances | One AntflyClient, three table names |
+| Three separate VectorStorage instances | One AntflyClient, one table |
 | `TransformersEmbedder` pipeline | Antfly auto-embeds via Termite |
 | Hash comparison in RepositoryIndexer | Antfly server-side content hashing |
 
 ### What we keep
 
 - **Scanner pipeline** — ts-morph, tree-sitter, remark (proven, well-tested)
 - **Document preparation** — `prepareDocumentsForEmbedding()` (pure transform)
+- **VectorStorage facade** — thin wrapper over AntflyVectorStore (Phase 1 established this)
 - **MCP adapter layer** — unchanged, consumes search results
+- **`LocalGitExtractor`** — used by `dev_map` for change frequency (shells out to git directly)
 
 ### What we deprecate
 
 - **Git history indexing** (`dev_history`, `dev git index`) — `git log`, `git blame`,
-  and AI tools can run git commands directly. Semantic commit search is a nice-to-have
-  but not worth the indexing cost.
+  and AI tools can run git commands directly.
 - **GitHub indexing** (`dev_gh`, `dev github index`) — GitHub's own MCP server handles
-  issues, PRs, and repo context natively. `gh` CLI is excellent. No reason to maintain
-  a separate index of the same data.
+  this. Not everyone uses GitHub — teams use Linear, Jira, Notion, Shortcut.
 - **`dev_plan`** context assembly — was valuable when it bundled issue + code + commits.
-  With git/github dropped, this becomes just a code search wrapper. Can revisit if needed.
+  With git/github dropped, revisit if needed.
+
+This reduces from 3 Antfly tables to 1, 9 MCP tools to 6, and removes 2 indexing phases.
+
+---
+
+## Plan B: If Linear Merge doesn't exist
+
+If the spike (Part 2.1) reveals that Antfly does not have a Linear Merge API or it
+lacks content hashing:
 
-This reduces from 3 Antfly tables to 1 (code only), and removes 2 indexing phases.
+**Fallback:** Client-side content hashing with existing `batchOp`.
+
+```typescript
+// Lightweight hash file: ~/.dev-agent/indexes/{hash}/doc-hashes.json
+// Format: { "doc-id": "sha256-of-text" }
+
+// On index:
+for (const doc of documents) {
+  const hash = sha256(doc.text);
+  newHashes[doc.id] = hash; // Record every doc so unchanged hashes carry over
+  if (existingHashes[doc.id] === hash) continue; // Skip unchanged
+  inserts[doc.id] = { text: doc.text, metadata: ... };
+}
+await batchOp({ inserts });
+```
+
+This is worse than server-side hashing (local state file, more code) but works
+with the existing API. The watcher flow stays the same — only the merge step changes.
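
A self-contained version of the Plan B diffing step might look like the sketch below. Everything here is an assumption for illustration: the `Doc` shape, the `diffByHash` name, and the `staleIds` output (IDs present in the old hash file but absent from the current document set, which would only be meaningful for a full index). The call to `batchOp` is left out because its exact signature is not confirmed here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the real document type lives in the indexer.
interface Doc {
  id: string;
  text: string;
  metadata?: Record<string, unknown>;
}
type HashMap = Record<string, string>;

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

function diffByHash(documents: Doc[], existingHashes: HashMap) {
  const inserts: Record<string, { text: string; metadata?: Record<string, unknown> }> = {};
  const newHashes: HashMap = {};
  for (const doc of documents) {
    const hash = sha256(doc.text);
    newHashes[doc.id] = hash; // keep hashes for unchanged docs too
    if (existingHashes[doc.id] === hash) continue; // unchanged, skip upload
    inserts[doc.id] = { text: doc.text, metadata: doc.metadata };
  }
  // Assumption: only a full index should act on staleIds, mirroring delete_missing: true.
  const staleIds = Object.keys(existingHashes).filter((id) => !(id in newHashes));
  return { inserts, newHashes, staleIds };
}
```

The caller would send `inserts` through the existing batch API and then write `newHashes` to `doc-hashes.json`. For incremental runs, where `documents` covers only the changed files, the new hashes would need to be merged into the existing map instead of replacing it.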
+
+**Decision point:** The spike resolves this. If Linear Merge exists, use it. If not,
+use Plan B. The rest of the plan (watcher, debounce, git/gh removal) is unaffected.
+
+---
 
 ## Decisions
 
 | Decision | Rationale | Alternatives |
 |----------|-----------|-------------|
 | Use `@parcel/watcher` | Native, `getEventsSince()` survives restarts, VS Code uses it | chokidar (no historical queries), watchman (requires daemon) |
-| Use Antfly Linear Merge | Server-side content hashing eliminates state file entirely | Keep state file + manual upsert (more code, same result) |
+| Use Antfly Linear Merge (or Plan B) | Server-side content hashing eliminates state file. Plan B if unavailable. | Keep full state file (Phase 1 approach, more code) |
 | Watch from MCP server process | MCP server is the long-running process; watcher lives there | Separate daemon (more complexity), CLI-only (no auto-update) |
-| Drop git/github indexing entirely | GitHub has its own MCP server; `gh` and `git` CLIs are excellent; AI tools call them directly. Not everyone uses GitHub — teams use Linear, Jira, Notion, Shortcut. By not coupling to GH, we stay tool-agnostic. Focus on code search — our unique value. | Keep as optional plugins (future, if demand) |
+| Drop git/github indexing | GitHub has its own MCP server; git CLI is excellent; not everyone uses GH. Focus on code search — our unique value. | Keep as optional plugins (future, if demand) |
 | Debounce file changes (500ms) | Avoid re-indexing mid-save; batch rapid changes | Per-file immediate (too many API calls), longer debounce (stale data) |
-| Drop indexer-state.json | Antfly + watcher replace all its functions | Keep for backward compat (dead code) |
+| Drop indexer-state.json | Antfly + watcher replace all its functions | Keep for Plan B (lightweight hash file only) |
+| Watcher snapshot at `~/.dev-agent/indexes/{hash}/watcher-snapshot` | Colocated with project index data, survives process restarts | In repo dir (pollutes project), in memory (lost on restart) |
+| Concurrent MCP instances are safe | Antfly Linear Merge is idempotent (content-hashed). Two watchers writing same data = redundant but harmless. | File-based advisory lock (complexity for rare case) |
+
+---
 
 ## Parts
 
 | Part | Description | User stories | Risk |
 |------|-------------|-------------|------|
 | 2.1 | Spike: verify Antfly Linear Merge API + `@parcel/watcher` | — | Low |
-| 2.2 | Replace batch insert with Antfly Linear Merge | US-3, US-5, US-6 | Low |
+| 2.2 | Replace batch insert with Antfly Linear Merge (or Plan B) | US-3, US-5, US-6 | Low |
 | 2.3 | Simplify RepositoryIndexer, drop state file | US-3, US-6 | Medium |
 | 2.4 | Add `@parcel/watcher` + debounced auto-index to MCP server | US-4, US-12 | Medium |
-| 2.5 | `getEventsSince` on MCP server startup | US-5, US-12 | Low |
-| 2.6 | Deprecate git/github indexing, remove adapters | US-12 | Low |
+| 2.5 | `getEventsSince` on MCP server startup | US-4b, US-5, US-12 | Low |
+| 2.6a | Remove MCP adapters (history, github, plan) + CLI commands (git, github) | US-12 | Medium |
+| 2.6b | Remove core services, subagent github module, types, update exports | US-12 | Medium |
 | 2.7 | `dev status` rework — Antfly table stats + watcher status | US-13 | Low |
 | 2.8 | E2E tests: index this repo, search, verify results | US-3, US-8, US-9 | Low |
 
+---
+
+## Migration (Phase 1 → Phase 2 upgrade)
+
+For users running Phase 1 (Antfly migration already merged):
+
+- **`indexer-state.json` exists** → log info "Migrating to new indexing system",
+  delete the file. No user action needed.
+- **Old git/github vector tables in Antfly** → left in place (harmless).
+  `dev clean` removes them if user wants.
+- **No watcher snapshot exists** → first run does a full index (same as fresh install).
+  No `--force` required.
+- **Removed CLI commands (`dev git`, `dev github`)** → if user runs them, they get
+  "Unknown command" error. Release notes document the deprecation.
+
+---
+
 ## Risk register
 
 | Risk | Likelihood | Impact | Mitigation |
 |------|-----------|--------|------------|
-| `@parcel/watcher` native addon install issues | Medium | Medium | Fall back to chokidar; or bundle prebuilt binaries |
-| Antfly Linear Merge API doesn't exist yet in SDK | Medium | High | Verify in spike; use raw REST if SDK missing |
+| Antfly Linear Merge API doesn't exist | Medium | High | Spike verifies; Plan B (client-side hashing) documented above |
+| `@parcel/watcher` native addon install issues | Medium | Medium | Fall back to chokidar; bundle prebuilt binaries |
+| Incremental merge accidentally deletes docs | Low | Critical | `delete_missing` scoping rules above; unit test enforces |
 | File watcher misses changes (edge cases) | Low | Medium | `dev index .` always available as manual fallback |
-| Large repos overwhelm watcher (10k+ files) | Low | Medium | Filter aggressively (ignore node_modules, dist, etc.) |
-| Debounce window too long/short | Low | Low | Make configurable; 500ms default is standard |
+| Git branch switch creates hundreds of changes | Medium | Low | Debounce handles; watcher batches all changes in 500ms window |
+| Watcher snapshot corrupted or missing | Low | Low | Fall back to full index (same as first run) |
+| Two MCP instances on same repo | Medium | Low | Antfly merge is idempotent; redundant but safe |
+| Large repos overwhelm watcher (10k+ files) | Low | Medium | Filter aggressively (node_modules, dist, .git, etc.) |
+| `dev_map` breaks after LocalGitExtractor changes | Low | Medium | Keep LocalGitExtractor for now; shells out to git directly |
+| Git/github removal ripple effects (38 files) | Medium | Medium | Split into 2.6a/2.6b; `pnpm typecheck` after each deletion |
+
+---
+
+## Test strategy
+
+### Unit tests (P0)
+
+| Test | What it verifies |
+|------|-----------------|
+| `debounce.test.ts` | Debounce batches rapid changes; fires after 500ms quiet; cancels on new event |
+| `watcher-filter.test.ts` | Excludes node_modules, dist, .git, dotfiles; includes .ts, .js, .go, .md |
+| `linear-merge-scoping.test.ts` | Full index uses `delete_missing: true`; incremental uses `false`; NEVER true for incremental |
+| `derive-table-name.test.ts` | Edge cases: special chars, long names, unexpected path structures |
+| `document-preparation.test.ts` | Existing tests — verify unchanged after refactor |
+
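
The filter that `watcher-filter.test.ts` is meant to cover could take a shape like the one below. It is a sketch built only from the include and exclude lists in the table; `shouldIndex` is a hypothetical name, and treating every dot-prefixed path segment (for example `.github/`) as hidden is an assumption about what "dotfiles" means here.

```typescript
// Hypothetical sketch of the watcher path filter.
const IGNORED_DIRS = new Set(["node_modules", "dist", ".git"]);
const INDEXED_EXTENSIONS = new Set([".ts", ".js", ".go", ".md"]);

function shouldIndex(relativePath: string): boolean {
  // Accept both "/" and "\" separators; drop empty and "." segments.
  const segments = relativePath.split(/[\\/]/).filter((s) => s !== "" && s !== ".");
  if (segments.length === 0) return false;
  // Assumption: any segment starting with "." counts as a dotfile or dot-directory.
  if (segments.some((s) => IGNORED_DIRS.has(s) || s.startsWith("."))) return false;
  const fileName = segments[segments.length - 1] ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension
  return INDEXED_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}
```

In practice the ignored directories would probably also be passed to the watcher as ignore patterns so that events under `node_modules` are never delivered at all; the function above is then a second line of defence for extension filtering.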
+### Integration tests (P0)
+
+| Test | What it verifies |
+|------|-----------------|
+| `linear-merge.integration.test.ts` | Insert → update → verify dedup. Content hash skips unchanged. Delete missing removes stale. |
+| `watcher-pipeline.integration.test.ts` | Create file → watcher fires → scanner parses → merge upserts → searchable |
+| `get-events-since.integration.test.ts` | Write snapshot → change files offline → `getEventsSince` returns correct diff |
+| `mcp-tools-regression.test.ts` | All 6 remaining tools (search, refs, map, inspect, status, health) work after adapter removal |
+
+### Error handling tests (P1)
+
+| Test | What it verifies |
+|------|-----------------|
+| `antfly-down.test.ts` | Index/search fails gracefully with clear error; MCP tools return error not crash |
+| `watcher-failure.test.ts` | Watcher error → log warning, continue serving stale data |
+| `snapshot-missing.test.ts` | No snapshot → full re-index (same as first run), no crash |
+| `snapshot-corrupted.test.ts` | Invalid snapshot → fall back to full re-index |
+
+### E2E tests (P1)
+
+| Test | What it verifies |
+|------|-----------------|
+| `e2e-index-dev-agent.test.ts` | Index this repo → search for known code → verify results |
+| `e2e-index-graphweave.test.ts` | Index graphweave repo → search → verify (dogfooding) |
+| `e2e-incremental.test.ts` | Edit a file → watcher detects → re-indexes → new content searchable |
+| `e2e-force-reindex.test.ts` | `dev index . --force` → table dropped → full rebuild → search works |
+
+### Performance tests (P2)
+
+| Test | Target | Measured on |
+|------|--------|------------|
+| Initial index | < 60s for 1k files, < 5 min for 10k files | dev-agent (~400 files), graphweave (~200 files) |
+| Incremental (watcher) | < 3s for 10 changed files | Edit 10 files, measure time to searchable |
+| MCP restart catchup | < 10s for 50 changed files | Simulate restart with `getEventsSince` |
+| Search latency | < 500ms per query | Hybrid search on 2k+ indexed documents |
+
+---
 
 ## Verification checklist
 
-- [ ] `dev index .` works end-to-end with Linear Merge
+- [ ] `dev index .` works end-to-end (Linear Merge or Plan B)
 - [ ] File watcher detects changes and auto-re-indexes
 - [ ] MCP server restart catches up via `getEventsSince`
+- [ ] Snapshot missing → falls back to full index, no crash
 - [ ] `dev_search "validateUser"` returns exact match (BM25)
 - [ ] `dev_search "authentication middleware"` returns semantic matches (vector)
 - [ ] `dev index . --force` clears and rebuilds
+- [ ] Incremental NEVER uses `delete_missing: true`
 - [ ] `dev status` shows fresh Antfly stats + watcher status
 - [ ] No `indexer-state.json` written or read
+- [ ] Old `indexer-state.json` detected → deleted with info message
 - [ ] Git/GitHub adapters removed (dev_history, dev_gh, dev_plan)
 - [ ] MCP tools reduced from 9 to 6 (search, refs, map, inspect, status, health)
+- [ ] Two MCP instances on same repo don't conflict
 - [ ] Works on this repo (dev-agent) end-to-end
+- [ ] Initial index < 60s on dev-agent repo
+- [ ] Incremental update < 3s for 10 files
+
+---
 
 ## Dependencies
 
 - Phase 1 (Antfly migration) — merged
-- Antfly Linear Merge API — verify in spike (Part 2.1)
-- `@parcel/watcher` — npm install
+- Antfly Linear Merge API — verify in spike (Part 2.1); Plan B if absent
+- `@parcel/watcher` — npm install in mcp-server package
+- `@parcel/watcher` snapshot path added to `getStorageFilePaths()`