Commit 82add8d

prosdev and claude committed
docs(plans): address all plan-reviewer findings for Phase 2
Blockers resolved:
- Added Plan B fallback if Linear Merge doesn't exist (client-side hashing)
- Added full test strategy: unit (P0), integration (P0), error handling (P1), E2E (P1), performance (P2); 17 named tests with targets

Warnings resolved:
- Clarified delete_missing scoping (full=true, incremental=false, enforced by test)
- Upgraded Part 2.6 to Medium risk, split into 2.6a/2.6b (38 files across 5 pkgs)
- Addressed US-17 (concurrent instances safe; Antfly merge is idempotent)
- Added migration section (Phase 1 → Phase 2 upgrade path)
- Specified watcher snapshot location (~/.dev-agent/indexes/{hash}/watcher-snapshot)
- Added performance acceptance criteria (initial <60s, incremental <3s, search <500ms)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent cc81305 commit 82add8d

1 file changed

Lines changed: 168 additions & 32 deletions

.claude/da-plans/core/phase-2-indexing-rethink/overview.md

@@ -16,7 +16,9 @@ eliminate most of our custom plumbing:
1616
2. **Antfly Linear Merge** — server-side content hashing, dedup, and deletion in
1717
one API call. Replaces our state file, hash tracking, and upsert logic.
1818

19-
See [user-stories.md](./user-stories.md) for the 16 user stories driving this redesign.
19+
See [user-stories.md](./user-stories.md) for the user stories driving this redesign.
20+
21+
---
2022

2123
## Current flow (what we're replacing)
2224

@@ -38,6 +40,8 @@ Problems:
3840
- Git/GitHub coupled to code indexing
3941
```
4042

43+
---
44+
4145
## Proposed flow
4246

4347
### Architecture
@@ -67,117 +71,249 @@ Problems:
6771
**First time (`dev index .`):**
6872
```
6973
1. Scan all files → parse → extract code components
70-
2. Antfly Linear Merge: send all documents
71-
→ Antfly hashes content, stores new docs, skips unchanged
74+
2. Antfly Linear Merge (delete_missing: true): send all documents
75+
→ Antfly hashes content, stores new docs, skips unchanged, removes stale
7276
→ Returns: { upserted: 2525, skipped: 0, deleted: 0 }
73-
3. Save watcher snapshot (for getEventsSince on restart)
77+
3. Save @parcel/watcher snapshot to ~/.dev-agent/indexes/{hash}/watcher-snapshot
7478
4. Start watching for changes
7579
```
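
The merge semantics described in this flow (hash content, skip unchanged, optionally remove stale docs) can be illustrated with a small in-memory model. This is only a sketch of the behaviour the plan expects, not Antfly's API: the `InMemoryMergeModel` class, its `merge` method, and the `Doc` shape are hypothetical names introduced for illustration, and the real request and response format is what the spike in Part 2.1 has to confirm.

```typescript
import { createHash } from "node:crypto";

// Hypothetical document shape; the real one comes from the scanner pipeline.
type Doc = { id: string; text: string };
type MergeResult = { upserted: number; skipped: number; deleted: number };

// In-memory stand-in for a content-hashed merge. NOT the Antfly API.
class InMemoryMergeModel {
  private hashes = new Map<string, string>();

  merge(docs: Doc[], opts: { delete_missing: boolean }): MergeResult {
    const result: MergeResult = { upserted: 0, skipped: 0, deleted: 0 };
    const seen = new Set<string>();

    for (const doc of docs) {
      seen.add(doc.id);
      const hash = createHash("sha256").update(doc.text, "utf8").digest("hex");
      if (this.hashes.get(doc.id) === hash) {
        result.skipped++; // same content as last time: nothing to write
        continue;
      }
      this.hashes.set(doc.id, hash);
      result.upserted++;
    }

    if (opts.delete_missing) {
      // Anything stored but absent from this request is treated as stale.
      for (const id of Array.from(this.hashes.keys())) {
        if (!seen.has(id)) {
          this.hashes.delete(id);
          result.deleted++;
        }
      }
    }
    return result;
  }

  size(): number {
    return this.hashes.size;
  }
}
```

Sending the same batch twice leaves the model unchanged, which is the idempotency property the plan relies on for concurrent MCP instances. It also shows why `delete_missing: true` is only safe when the request contains every document.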
7680

7781
**Ongoing (automatic, no user command):**
7882
```
79-
1. @parcel/watcher fires: files A, B, C changed
83+
1. @parcel/watcher fires: files A, B, C changed; file D deleted
8084
2. Debounce (wait 500ms of quiet)
8185
3. Parse only changed files → extract components
82-
4. Antfly Linear Merge: send only changed documents
83-
→ Returns: { upserted: 3, skipped: 0, deleted: 1 }
84-
5. MCP tools immediately have fresh data
86+
4. For changed files: Antfly Linear Merge (delete_missing: false) — upsert only
87+
5. For deleted files: explicitly delete doc IDs that belonged to those files
88+
6. MCP tools immediately have fresh data
8589
```
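
The debounce step above could be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `createDebouncedBatcher` and the injectable `TimerApi` are hypothetical names, and `FileEvent` mirrors the `{ path, type }` event shape that `@parcel/watcher` reports. Keeping only the latest event per path is a design choice of this sketch.

```typescript
// Mirrors the { path, type } event shape reported by @parcel/watcher.
type FileEvent = { path: string; type: "create" | "update" | "delete" };

// Hypothetical seam: timers are injectable so tests can drive time manually.
type TimerApi = {
  set: (fn: () => void, ms: number) => unknown;
  clear: (handle: unknown) => void;
};

const realTimers: TimerApi = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

function createDebouncedBatcher(
  onFlush: (batch: FileEvent[]) => void,
  quietMs: number = 500,
  timers: TimerApi = realTimers,
) {
  const pending = new Map<string, FileEvent>(); // latest event per path
  let handle: unknown;
  let armed = false;

  const flush = (): void => {
    armed = false;
    const batch = Array.from(pending.values());
    pending.clear();
    if (batch.length > 0) onFlush(batch);
  };

  return {
    push(event: FileEvent): void {
      pending.set(event.path, event);
      if (armed) timers.clear(handle); // a new event restarts the quiet window
      handle = timers.set(flush, quietMs);
      armed = true;
    },
  };
}
```

In the flow above, `onFlush` would split the batch into changed files (upsert with `delete_missing: false`) and deleted files (explicit delete by doc ID).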
8690

8791
**MCP server restart:**
8892
```
89-
1. @parcel/watcher.getEventsSince(lastSnapshot)
93+
1. @parcel/watcher.getEventsSince(snapshotPath)
9094
→ "files X, Y, Z changed while you were off"
91-
2. Parse only those files → extract → merge
92-
3. Resume watching
95+
2. If snapshot missing: fall back to full index (same as first time)
96+
3. If snapshot exists: parse only changed files → merge (delete_missing: false)
97+
4. Save new snapshot, resume watching
9398
```
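
A rough sketch of the restart catch-up logic, assuming all I/O sits behind an injected interface. `CatchUpDeps`, `catchUpOnRestart`, and every method on the interface are hypothetical names; in a real implementation `getEventsSince` and `writeSnapshot` would presumably delegate to the `@parcel/watcher` functions of the same name, and `upsertFiles` would run the merge with `delete_missing: false`.

```typescript
// Mirrors the { path, type } event shape reported by @parcel/watcher.
type FileEvent = { path: string; type: "create" | "update" | "delete" };

// Hypothetical dependency seam so the control flow can be tested without I/O.
interface CatchUpDeps {
  snapshotExists(snapshotPath: string): Promise<boolean>;
  getEventsSince(dir: string, snapshotPath: string): Promise<FileEvent[]>;
  writeSnapshot(dir: string, snapshotPath: string): Promise<void>;
  fullIndex(dir: string): Promise<void>; // same path as the first-time index
  upsertFiles(paths: string[]): Promise<void>; // merge with delete_missing: false
  deleteFiles(paths: string[]): Promise<void>; // explicit delete by doc ID
}

async function catchUpOnRestart(
  dir: string,
  snapshotPath: string,
  deps: CatchUpDeps,
): Promise<"full" | "incremental"> {
  // Step 2: no snapshot means we cannot know what changed, so index everything.
  if (!(await deps.snapshotExists(snapshotPath))) {
    await deps.fullIndex(dir);
    await deps.writeSnapshot(dir, snapshotPath);
    return "full";
  }

  // Step 3: process only what changed while the server was off.
  const events = await deps.getEventsSince(dir, snapshotPath);
  const changed = events.filter((e) => e.type !== "delete").map((e) => e.path);
  const deleted = events.filter((e) => e.type === "delete").map((e) => e.path);

  if (changed.length > 0) await deps.upsertFiles(changed);
  if (deleted.length > 0) await deps.deleteFiles(deleted);

  // Step 4: save a fresh snapshot so the next restart starts from here.
  await deps.writeSnapshot(dir, snapshotPath);
  return "incremental";
}
```

A corrupted snapshot could be handled the same way as a missing one by catching the error from `getEventsSince` and falling through to `fullIndex`; that branch is left out here for brevity.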
9499

95100
**Force re-index (`dev index . --force`):**
96101
```
97-
1. Antfly: drop tables, recreate
102+
1. Antfly: drop table, recreate
98103
2. Full scan + merge (same as first time)
99104
```
100105

106+
### Critical: `delete_missing` scoping
107+
108+
| Operation | `delete_missing` | Why |
109+
|-----------|-----------------|-----|
110+
| `dev index .` (full) | `true` | Clean slate — remove docs for deleted files |
111+
| `dev index . --force` | N/A — drops table | Complete rebuild |
112+
| Watcher incremental | `false` | Only upsert changed; delete removed files explicitly |
113+
| MCP restart catchup | `false` | Only process changes since snapshot |
114+
115+
**Safety rule:** Incremental paths NEVER use `delete_missing: true`. Only full index does.
116+
Unit test enforces this.
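
One way to make the safety rule hard to violate is to derive the flag from the operation type in a single place, so no call site sets it by hand. The names below (`IndexOperation`, `buildMergeOptions`) are hypothetical and only sketch the idea; `--force` is not listed separately because it drops the table and then runs the full path.

```typescript
// Hypothetical operation names, matching the rows of the scoping table.
type IndexOperation = "full" | "watcher-incremental" | "restart-catchup";

type MergeOptions = { delete_missing: boolean };

// Single source of truth: only a full index may delete missing documents.
function buildMergeOptions(operation: IndexOperation): MergeOptions {
  return { delete_missing: operation === "full" };
}
```

A unit test such as `linear-merge-scoping.test.ts` could then assert on this function alone rather than inspecting every merge call.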
117+
101118
### What we drop
102119

103120
| Old complexity | Replaced by |
104121
|---------------|-------------|
105122
| `indexer-state.json` (file hashes, doc IDs) | `@parcel/watcher` snapshots + Antfly Linear Merge |
106123
| Manual `dev index .` after every change | Automatic via file watcher |
107124
| Batch size 32 + CONCURRENCY parallelism | Single Linear Merge call per change batch |
108-
| Three separate VectorStorage instances | One AntflyClient, three table names |
125+
| Three separate VectorStorage instances | One AntflyClient, one table |
109126
| `TransformersEmbedder` pipeline | Antfly auto-embeds via Termite |
110127
| Hash comparison in RepositoryIndexer | Antfly server-side content hashing |
111128

112129
### What we keep
113130

114131
- **Scanner pipeline** — ts-morph, tree-sitter, remark (proven, well-tested)
115132
- **Document preparation**: `prepareDocumentsForEmbedding()` (pure transform)
133+
- **VectorStorage facade** — thin wrapper over AntflyVectorStore (Phase 1 established this)
116134
- **MCP adapter layer** — unchanged, consumes search results
135+
- **`LocalGitExtractor`** — used by `dev_map` for change frequency (shells out to git directly)
117136

118137
### What we deprecate
119138

120139
- **Git history indexing** (`dev_history`, `dev git index`) — `git log`, `git blame`,
121-
and AI tools can run git commands directly. Semantic commit search is a nice-to-have
122-
but not worth the indexing cost.
140+
and AI tools can run git commands directly.
123141
- **GitHub indexing** (`dev_gh`, `dev github index`) — GitHub's own MCP server handles
124-
issues, PRs, and repo context natively. `gh` CLI is excellent. No reason to maintain
125-
a separate index of the same data.
142+
this. Not everyone uses GitHub — teams use Linear, Jira, Notion, Shortcut.
126143
- **`dev_plan`** context assembly — was valuable when it bundled issue + code + commits.
127-
With git/github dropped, this becomes just a code search wrapper. Can revisit if needed.
144+
With git/github dropped, revisit if needed.
145+
146+
This reduces the design from 3 Antfly tables to 1 and from 9 MCP tools to 6, and removes 2 indexing phases.
147+
148+
---
149+
150+
## Plan B: If Linear Merge doesn't exist
151+
152+
If the spike (Part 2.1) reveals that Antfly does not have a Linear Merge API or it
153+
lacks content hashing:
128154

129-
This reduces from 3 Antfly tables to 1 (code only), and removes 2 indexing phases.
155+
**Fallback:** Client-side content hashing with existing `batchOp`.
156+
157+
```typescript
158+
// Lightweight hash file: ~/.dev-agent/indexes/{hash}/doc-hashes.json
159+
// Format: { "doc-id": "sha256-of-text" }
160+
161+
// On index:
162+
for (const doc of documents) {
163+
const hash = sha256(doc.text);
164+
if (existingHashes[doc.id] === hash) continue; // Skip unchanged
165+
inserts[doc.id] = { text: doc.text, metadata: ... };
166+
newHashes[doc.id] = hash;
167+
}
168+
await batchOp({ inserts });
169+
```
170+
171+
This is worse than server-side hashing (local state file, more code) but works
172+
with the existing API. The watcher flow stays the same — only the merge step changes.
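
A slightly fuller sketch of the Plan B planning step, kept separate from the `batchOp` call so it can be tested as a pure function. The names `planUpserts`, `Doc`, and `HashMap` are hypothetical, and metadata is left out because the plan does not specify it. Unlike the fragment above, this version records a hash for every document it sees, including unchanged ones, on the assumption that the returned map may be written back to `doc-hashes.json`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration only.
type Doc = { id: string; text: string };
type HashMap = Record<string, string>; // { "doc-id": "sha256-of-text" }

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Decide which documents need to be sent, without touching disk or network.
function planUpserts(
  documents: Doc[],
  existingHashes: HashMap,
): { toInsert: Doc[]; newHashes: HashMap } {
  const toInsert: Doc[] = [];
  const newHashes: HashMap = {};

  for (const doc of documents) {
    const hash = sha256(doc.text);
    newHashes[doc.id] = hash; // record every doc seen, changed or not
    if (existingHashes[doc.id] !== hash) {
      toInsert.push(doc); // new or changed content
    }
  }
  return { toInsert, newHashes };
}
```

For a full index, `newHashes` can replace the stored file outright. For an incremental run it would need to be merged into the existing map (for example `{ ...existingHashes, ...newHashes }`), since only a subset of documents is seen.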
173+
174+
**Decision point:** The spike resolves this. If Linear Merge exists, use it. If not,
175+
use Plan B. The rest of the plan (watcher, debounce, git/gh removal) is unaffected.
176+
177+
---
130178

131179
## Decisions
132180

133181
| Decision | Rationale | Alternatives |
134182
|----------|-----------|-------------|
135183
| Use `@parcel/watcher` | Native, `getEventsSince()` survives restarts, VS Code uses it | chokidar (no historical queries), watchman (requires daemon) |
136-
| Use Antfly Linear Merge | Server-side content hashing eliminates state file entirely | Keep state file + manual upsert (more code, same result) |
184+
| Use Antfly Linear Merge (or Plan B) | Server-side content hashing eliminates state file. Plan B if unavailable. | Keep full state file (Phase 1 approach, more code) |
137185
| Watch from MCP server process | MCP server is the long-running process; watcher lives there | Separate daemon (more complexity), CLI-only (no auto-update) |
138-
| Drop git/github indexing entirely | GitHub has its own MCP server; `gh` and `git` CLIs are excellent; AI tools call them directly. Not everyone uses GitHub — teams use Linear, Jira, Notion, Shortcut. By not coupling to GH, we stay tool-agnostic. Focus on code search — our unique value. | Keep as optional plugins (future, if demand) |
186+
| Drop git/github indexing | GitHub has its own MCP server; git CLI is excellent; not everyone uses GH. Focus on code search — our unique value. | Keep as optional plugins (future, if demand) |
139187
| Debounce file changes (500ms) | Avoid re-indexing mid-save; batch rapid changes | Per-file immediate (too many API calls), longer debounce (stale data) |
140-
| Drop indexer-state.json | Antfly + watcher replace all its functions | Keep for backward compat (dead code) |
188+
| Drop indexer-state.json | Antfly + watcher replace all its functions | Keep for Plan B (lightweight hash file only) |
189+
| Watcher snapshot at `~/.dev-agent/indexes/{hash}/watcher-snapshot` | Colocated with project index data, survives process restarts | In repo dir (pollutes project), in memory (lost on restart) |
190+
| Concurrent MCP instances are safe | Antfly Linear Merge is idempotent (content-hashed). Two watchers writing same data = redundant but harmless. | File-based advisory lock (complexity for rare case) |
191+
192+
---
141193

142194
## Parts
143195

144196
| Part | Description | User stories | Risk |
145197
|------|-------------|-------------|------|
146198
| 2.1 | Spike: verify Antfly Linear Merge API + `@parcel/watcher` | (none) | Low |
147-
| 2.2 | Replace batch insert with Antfly Linear Merge | US-3, US-5, US-6 | Low |
199+
| 2.2 | Replace batch insert with Antfly Linear Merge (or Plan B) | US-3, US-5, US-6 | Low |
148200
| 2.3 | Simplify RepositoryIndexer, drop state file | US-3, US-6 | Medium |
149201
| 2.4 | Add `@parcel/watcher` + debounced auto-index to MCP server | US-4, US-12 | Medium |
150-
| 2.5 | `getEventsSince` on MCP server startup | US-5, US-12 | Low |
151-
| 2.6 | Deprecate git/github indexing, remove adapters | US-12 | Low |
202+
| 2.5 | `getEventsSince` on MCP server startup | US-4b, US-5, US-12 | Low |
203+
| 2.6a | Remove MCP adapters (history, github, plan) + CLI commands (git, github) | US-12 | Medium |
204+
| 2.6b | Remove core services, subagent github module, types, update exports | US-12 | Medium |
152205
| 2.7 | `dev status` rework — Antfly table stats + watcher status | US-13 | Low |
153206
| 2.8 | E2E tests: index this repo, search, verify results | US-3, US-8, US-9 | Low |
154207

208+
---
209+
210+
## Migration (Phase 1 → Phase 2 upgrade)
211+
212+
For users running Phase 1 (Antfly migration already merged):
213+
214+
- **`indexer-state.json` exists** → log info "Migrating to new indexing system",
215+
delete the file. No user action needed.
216+
- **Old git/github vector tables in Antfly** → left in place (harmless).
217+
`dev clean` removes them if user wants.
218+
- **No watcher snapshot exists** → first run does a full index (same as fresh install).
219+
No `--force` required.
220+
- **Removed CLI commands (`dev git`, `dev github`)** → if user runs them, they get
221+
"Unknown command" error. Release notes document the deprecation.
222+
223+
---
224+
155225
## Risk register
156226

157227
| Risk | Likelihood | Impact | Mitigation |
158228
|------|-----------|--------|------------|
159-
| `@parcel/watcher` native addon install issues | Medium | Medium | Fall back to chokidar; or bundle prebuilt binaries |
160-
| Antfly Linear Merge API doesn't exist yet in SDK | Medium | High | Verify in spike; use raw REST if SDK missing |
229+
| Antfly Linear Merge API doesn't exist | Medium | High | Spike verifies; Plan B (client-side hashing) documented above |
230+
| `@parcel/watcher` native addon install issues | Medium | Medium | Fall back to chokidar; bundle prebuilt binaries |
231+
| Incremental merge accidentally deletes docs | Low | Critical | `delete_missing` scoping rules above; unit test enforces |
161232
| File watcher misses changes (edge cases) | Low | Medium | `dev index .` always available as manual fallback |
162-
| Large repos overwhelm watcher (10k+ files) | Low | Medium | Filter aggressively (ignore node_modules, dist, etc.) |
163-
| Debounce window too long/short | Low | Low | Make configurable; 500ms default is standard |
233+
| Git branch switch creates hundreds of changes | Medium | Low | Debounce handles; watcher batches all changes in 500ms window |
234+
| Watcher snapshot corrupted or missing | Low | Low | Fall back to full index (same as first run) |
235+
| Two MCP instances on same repo | Medium | Low | Antfly merge is idempotent; redundant but safe |
236+
| Large repos overwhelm watcher (10k+ files) | Low | Medium | Filter aggressively (node_modules, dist, .git, etc.) |
237+
| `dev_map` breaks after LocalGitExtractor changes | Low | Medium | Keep LocalGitExtractor for now; shells out to git directly |
238+
| Git/github removal ripple effects (38 files) | Medium | Medium | Split into 2.6a/2.6b; `pnpm typecheck` after each deletion |
239+
240+
---
241+
242+
## Test strategy
243+
244+
### Unit tests (P0)
245+
246+
| Test | What it verifies |
247+
|------|-----------------|
248+
| `debounce.test.ts` | Debounce batches rapid changes; fires after 500ms quiet; cancels on new event |
249+
| `watcher-filter.test.ts` | Excludes node_modules, dist, .git, dotfiles; includes .ts, .js, .go, .md |
250+
| `linear-merge-scoping.test.ts` | Full index uses `delete_missing: true`; incremental uses `false`; NEVER true for incremental |
251+
| `derive-table-name.test.ts` | Edge cases: special chars, long names, unexpected path structures |
252+
| `document-preparation.test.ts` | Existing tests — verify unchanged after refactor |
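
The filter rule that `watcher-filter.test.ts` is meant to verify could look roughly like this. It is a sketch under assumptions: `shouldIndex` is a hypothetical name, the directory and extension lists are taken only from the table row above, and the real filter may be configurable or defer to the watcher's own ignore options.

```typescript
// Lists taken from the test description; the real set may be longer.
const IGNORED_DIRS = new Set(["node_modules", "dist", ".git"]);
const INCLUDED_EXTENSIONS = new Set([".ts", ".js", ".go", ".md"]);

// Decide whether a path (relative to the repo root) should be indexed.
function shouldIndex(relativePath: string): boolean {
  // Accept both POSIX and Windows separators; drop empty and "." segments.
  const segments = relativePath
    .split(/[\\/]/)
    .filter((s) => s !== "" && s !== ".");
  if (segments.length === 0) return false;

  // Reject ignored directories and anything hidden (dotfiles, dot-directories).
  if (segments.some((s) => IGNORED_DIRS.has(s) || s.startsWith("."))) {
    return false;
  }

  const fileName = segments[segments.length - 1];
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension

  return INCLUDED_EXTENSIONS.has(fileName.slice(dot).toLowerCase());
}
```

Filtering before events reach the debounce step keeps large batches, such as a dependency install, from triggering any parsing at all.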
253+
254+
### Integration tests (P0)
255+
256+
| Test | What it verifies |
257+
|------|-----------------|
258+
| `linear-merge.integration.test.ts` | Insert → update → verify dedup. Content hash skips unchanged. Delete missing removes stale. |
259+
| `watcher-pipeline.integration.test.ts` | Create file → watcher fires → scanner parses → merge upserts → searchable |
260+
| `get-events-since.integration.test.ts` | Write snapshot → change files offline → `getEventsSince` returns correct diff |
261+
| `mcp-tools-regression.test.ts` | All 6 remaining tools (search, refs, map, inspect, status, health) work after adapter removal |
262+
263+
### Error handling tests (P1)
264+
265+
| Test | What it verifies |
266+
|------|-----------------|
267+
| `antfly-down.test.ts` | Index/search fails gracefully with clear error; MCP tools return error not crash |
268+
| `watcher-failure.test.ts` | Watcher error → log warning, continue serving stale data |
269+
| `snapshot-missing.test.ts` | No snapshot → full re-index (same as first run), no crash |
270+
| `snapshot-corrupted.test.ts` | Invalid snapshot → fall back to full re-index |
271+
272+
### E2E tests (P1)
273+
274+
| Test | What it verifies |
275+
|------|-----------------|
276+
| `e2e-index-dev-agent.test.ts` | Index this repo → search for known code → verify results |
277+
| `e2e-index-graphweave.test.ts` | Index graphweave repo → search → verify (dogfooding) |
278+
| `e2e-incremental.test.ts` | Edit a file → watcher detects → re-indexes → new content searchable |
279+
| `e2e-force-reindex.test.ts` | `dev index . --force` → table dropped → full rebuild → search works |
280+
281+
### Performance tests (P2)
282+
283+
| Test | Target | Measured on |
284+
|------|--------|------------|
285+
| Initial index | < 60s for 1k files, < 5 min for 10k files | dev-agent (~400 files), graphweave (~200 files) |
286+
| Incremental (watcher) | < 3s for 10 changed files | Edit 10 files, measure time to searchable |
287+
| MCP restart catchup | < 10s for 50 changed files | Simulate restart with `getEventsSince` |
288+
| Search latency | < 500ms per query | Hybrid search on 2k+ indexed documents |
289+
290+
---
164291

165292
## Verification checklist
166293

167-
- [ ] `dev index .` works end-to-end with Linear Merge
294+
- [ ] `dev index .` works end-to-end (Linear Merge or Plan B)
168295
- [ ] File watcher detects changes and auto-re-indexes
169296
- [ ] MCP server restart catches up via `getEventsSince`
297+
- [ ] Snapshot missing → falls back to full index, no crash
170298
- [ ] `dev_search "validateUser"` returns exact match (BM25)
171299
- [ ] `dev_search "authentication middleware"` returns semantic matches (vector)
172300
- [ ] `dev index . --force` clears and rebuilds
301+
- [ ] Incremental NEVER uses `delete_missing: true`
173302
- [ ] `dev status` shows fresh Antfly stats + watcher status
174303
- [ ] No `indexer-state.json` written or read
304+
- [ ] Old `indexer-state.json` detected → deleted with info message
175305
- [ ] Git/GitHub adapters removed (dev_history, dev_gh, dev_plan)
176306
- [ ] MCP tools reduced from 9 to 6 (search, refs, map, inspect, status, health)
307+
- [ ] Two MCP instances on same repo don't conflict
177308
- [ ] Works on this repo (dev-agent) end-to-end
309+
- [ ] Initial index < 60s on dev-agent repo
310+
- [ ] Incremental update < 3s for 10 files
311+
312+
---
178313

179314
## Dependencies
180315

181316
- Phase 1 (Antfly migration) — merged
182-
- Antfly Linear Merge API — verify in spike (Part 2.1)
183-
- `@parcel/watcher` — npm install
317+
- Antfly Linear Merge API — verify in spike (Part 2.1); Plan B if absent
318+
- `@parcel/watcher` — npm install in mcp-server package
319+
- `@parcel/watcher` snapshot path added to `getStorageFilePaths()`
