
Commit c7cc88d

prosdev and claude committed
docs(plans): rewrite Phase 2 with @parcel/watcher + Antfly Linear Merge
Research-driven redesign of indexing flow:
- @parcel/watcher for file watching (getEventsSince survives restarts)
- Antfly Linear Merge for server-side content hashing + dedup
- Drops indexer-state.json entirely
- Auto-index on file change from MCP server process
- Decouples git/github from dev index .
- 8 implementation parts, 16 user stories
- Research doc with industry patterns (Zoekt, GitHub, Cursor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 492e403 commit c7cc88d

2 files changed

Lines changed: 205 additions & 103 deletions


.claude/da-plans/core/phase-2-indexing-rethink/overview.md

Lines changed: 134 additions & 103 deletions
@@ -5,137 +5,168 @@
55
## Context
66

77
Phase 1 replaced the storage layer (LanceDB → Antfly) but kept the old indexing
8-
flow intact. That flow was designed around LanceDB constraints: local file storage,
9-
manual embedding pipeline, batch sizing tuned for ONNX model memory, state files
10-
for incremental updates.
8+
flow intact. That flow was overengineered for its original constraints: local file
9+
storage, manual embedding pipeline, state files tracking file hashes and document IDs.
1110

12-
With Antfly as the backend, many of these constraints no longer exist. Rather than
13-
patching the old flow, we should redesign it around what Antfly enables and what
14-
developers actually need.
11+
Research (see [research.md](./research.md)) found two production-grade tools that
12+
eliminate most of our custom plumbing:
1513

16-
See [user-stories.md](./user-stories.md) for the full set of user stories driving
17-
this redesign.
14+
1. **`@parcel/watcher`** — native file watcher with `getEventsSince()` that tracks
15+
changes even when our process isn't running (used by VS Code)
16+
2. **Antfly Linear Merge** — server-side content hashing, dedup, and deletion in
17+
one API call. Replaces our state file, hash tracking, and upsert logic.
1818

19-
## Current flow (what exists)
19+
See [user-stories.md](./user-stories.md) for the 16 user stories driving this redesign.
20+
21+
## Current flow (what we're replacing)
2022

2123
```
22-
dev setup → start Antfly (one-time)
23-
dev index . → scan all files → batch insert into Antfly → save state file
24-
├─ Phase 1: Scan → ts-morph/tree-sitter/remark → Document[]
25-
├─ Phase 2: Store → batch HTTP inserts (32 docs × CONCURRENCY parallel)
26-
├─ Phase 3: Git → extract commits → separate table
27-
├─ Phase 4: GitHub → fetch issues/PRs via gh CLI → separate table
28-
└─ Save state → indexer-state.json (file hashes for incremental)
29-
dev search "query" → hybrid search via Antfly
24+
dev index .
25+
├─ Scan ALL files (glob + parse)
26+
├─ Prepare EmbeddingDocument[] from scan results
27+
├─ Batch insert (32 docs × CONCURRENCY parallel HTTP calls)
28+
├─ Track state: file hashes, document IDs, timestamps → indexer-state.json
29+
├─ Git: extract commits → separate table
30+
├─ GitHub: fetch issues/PRs → separate table
31+
└─ Emit events, close
32+
33+
Problems:
34+
- Manual trigger required (US-4: changes should be automatic)
35+
- State file tracks what Antfly already knows (redundant)
36+
- Batch size 32 when Antfly handles 500 (~15x more HTTP calls than needed)
37+
- No way to know what changed while MCP server was off
38+
- Git/GitHub coupled to code indexing
3039
```
3140

32-
### Problems with current flow
33-
34-
1. **Manual trigger required** — developer must remember to run `dev index .` after
35-
code changes. AI tools get stale context. (violates US-4)
36-
37-
2. **State file complexity** — tracks file hashes, document IDs per file, timestamps.
38-
But Antfly does upsert natively — inserting an existing key overwrites. Do we need
39-
the state file at all?
40-
41-
3. **Embedding delay invisible** — Antfly embeds asynchronously (~2s). `dev index .`
42-
completes before embeddings are ready. Immediate search may return nothing. (violates US-3)
43-
44-
4. **Three separate VectorStorage instances** — created because LanceDB needed separate
45-
directories. With Antfly, these are just three tables. But the code creates three
46-
separate VectorStorage objects with separate connections.
47-
48-
5. **Batch sizing is wrong** — indexer uses batch=32 (tuned for ONNX). Antfly can handle
49-
500 per request. We're making 15x more HTTP calls than needed.
50-
51-
6. **Git and GitHub coupled to index command**: `dev index .` does code + git + GitHub
52-
in one big command. These are different data sources with different update patterns.
53-
5441
## Proposed flow
5542

56-
### The big idea: file watcher + on-demand indexing
43+
### Architecture
5744

5845
```
59-
dev setup → start Antfly + start file watcher (background)
60-
watcher detects file changes → re-indexes changed files automatically
61-
62-
dev index . → full scan (first time or explicit refresh)
63-
dev index . --force → clear + full scan
64-
65-
# These become separate, optional commands:
66-
dev git index → index git history (already exists)
67-
dev github index → index GitHub issues/PRs (already exists)
46+
┌─────────────────────────────────────────────────────────────┐
47+
│ MCP Server (always running) │
48+
│ │
49+
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
50+
│ │ @parcel/ │────▶│ Scanner │────▶│ Antfly │ │
51+
│ │ watcher │ │ (ts-morph, │ │ Linear │ │
52+
│ │ │ │ tree-sitter) │ │ Merge │ │
53+
│ │ getEventsSince│ └──────────────┘ └─────────────┘ │
54+
│ └──────────────┘ │
55+
│ │ │
56+
│ │ on file change │
57+
│ ▼ │
58+
│ ┌──────────────┐ │
59+
│ │ Debounce │ (batch changes, wait 500ms of quiet) │
60+
│ │ + Filter │ (ignore node_modules, dist, .git) │
61+
│ └──────────────┘ │
62+
└─────────────────────────────────────────────────────────────┘
6863
```
6964

70-
**US-4 solved:** The file watcher keeps the index fresh without manual intervention.
71-
Developer saves a file, the watcher re-indexes it within seconds.
72-
73-
### Alternative: no watcher, just fast incremental
74-
75-
If a file watcher is too complex for Phase 2, the simpler approach:
65+
### The flow
7666

67+
**First time (`dev index .`):**
7768
```
78-
dev index . → fast incremental (only changed files, <5s for small changes)
79-
runs automatically on MCP server startup
80-
runs automatically before search if stale (>5 min since last update)
69+
1. Scan all files → parse → extract code components
70+
2. Antfly Linear Merge: send all documents
71+
→ Antfly hashes content, stores new docs, skips unchanged
72+
→ Returns: { upserted: 2525, skipped: 0, deleted: 0 }
73+
3. Save watcher snapshot (for getEventsSince on restart)
74+
4. Start watching for changes
8175
```
8276

83-
### Simplifications enabled by Antfly
84-
85-
| Old complexity | New simplification |
86-
|---------------|-------------------|
87-
| State file (file hashes, doc IDs) | Antfly upsert by key — just re-insert, it overwrites |
88-
| Three VectorStorage instances | One AntflyClient, three table names |
89-
| Batch size 32 + CONCURRENCY | Single batch size 500, let Antfly handle parallelism |
90-
| Manual embedding step | Antfly auto-embeds on insert |
91-
| Wait for embedding completion | BM25 search works immediately; vector search ready in ~2s |
92-
93-
### State file: keep or drop?
94-
95-
**Keep a minimal version.** We still need to know:
96-
- Which files have been indexed (to detect deleted files → remove from Antfly)
97-
- Last index timestamp (to detect staleness)
77+
**Ongoing (automatic, no user command):**
78+
```
79+
1. @parcel/watcher fires: files A, B, C changed
80+
2. Debounce (wait 500ms of quiet)
81+
3. Parse only changed files → extract components
82+
4. Antfly Linear Merge: send only changed documents
83+
→ Returns: { upserted: 3, skipped: 0, deleted: 1 }
84+
5. MCP tools immediately have fresh data
85+
```
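
The debounce-and-filter step in the ongoing flow could be sketched roughly as below. This is only an illustration under stated assumptions: `ChangeBatcher`, `isIgnored`, and the `FileEvent` shape are hypothetical names introduced here (the event shape mirrors the `{ path, type }` objects that `@parcel/watcher` reports), and the 500ms quiet window and ignore list come from the architecture diagram.

```typescript
// Hypothetical event shape, modeled on the { path, type } events from @parcel/watcher.
type FileEvent = { path: string; type: "create" | "update" | "delete" };

// Directories the plan says to ignore.
const IGNORED_SEGMENTS = ["node_modules", "dist", ".git"];

// True when any path segment is one of the ignored directory names.
function isIgnored(path: string): boolean {
  return path.split(/[\\/]/).some((segment) => IGNORED_SEGMENTS.includes(segment));
}

// Collects events and hands them over as one batch after `quietMs` without new events.
class ChangeBatcher {
  private pending = new Map<string, FileEvent>();
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly onFlush: (batch: FileEvent[]) => void;
  private readonly quietMs: number;

  constructor(onFlush: (batch: FileEvent[]) => void, quietMs = 500) {
    this.onFlush = onFlush;
    this.quietMs = quietMs;
  }

  // Record new events; the last event seen for a path wins.
  push(events: FileEvent[]): void {
    let accepted = false;
    for (const event of events) {
      if (isIgnored(event.path)) continue;
      this.pending.set(event.path, event);
      accepted = true;
    }
    if (!accepted) return;
    // Restart the quiet-period timer on every accepted change.
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.quietMs);
  }

  // Emit the pending batch immediately (also called by the timer).
  flush(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
    if (this.pending.size === 0) return;
    const batch = Array.from(this.pending.values());
    this.pending.clear();
    this.onFlush(batch);
  }
}
```

In this sketch the watcher callback would call `push(events)`, and `onFlush` would run the parse and merge steps for the batch. Keeping only the last event per path means a file saved several times within the window is parsed once.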
9886

99-
**Drop:**
100-
- File hashes (just re-insert everything that changed based on mtime)
101-
- Document IDs per file (Antfly handles dedup by key)
102-
- Embedding metadata (Antfly owns this)
87+
**MCP server restart:**
88+
```
89+
1. @parcel/watcher.getEventsSince(lastSnapshot)
90+
→ "files X, Y, Z changed while you were off"
91+
2. Parse only those files → extract → merge
92+
3. Resume watching
93+
```
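
A minimal sketch of the restart catch-up, assuming the watcher exposes `getEventsSince(dir, snapshotPath)` and `writeSnapshot(dir, snapshotPath)` as `@parcel/watcher` does. The `SnapshotWatcher` interface, `catchUp`, and the `reindex` callback are hypothetical names for illustration; the watcher is injected so the logic can be exercised without the native addon.

```typescript
// Hypothetical event shape, modeled on the { path, type } events from @parcel/watcher.
type FileEvent = { path: string; type: "create" | "update" | "delete" };

// The subset of the watcher API this sketch relies on.
interface SnapshotWatcher {
  getEventsSince(dir: string, snapshotPath: string): Promise<FileEvent[]>;
  writeSnapshot(dir: string, snapshotPath: string): Promise<unknown>;
}

interface CatchUpPlan {
  toParse: string[];  // created or updated files: parse and merge
  toDelete: string[]; // deleted files: remove their documents
}

// Replay changes made while the server was off, then advance the snapshot.
async function catchUp(
  watcher: SnapshotWatcher,
  rootDir: string,
  snapshotPath: string,
  reindex: (plan: CatchUpPlan) => Promise<void>,
): Promise<CatchUpPlan> {
  const events = await watcher.getEventsSince(rootDir, snapshotPath);

  const plan: CatchUpPlan = { toParse: [], toDelete: [] };
  for (const event of events) {
    if (event.type === "delete") plan.toDelete.push(event.path);
    else plan.toParse.push(event.path);
  }

  // Re-index first; only advance the snapshot once that succeeded,
  // so a failed merge is retried on the next start.
  await reindex(plan);
  await watcher.writeSnapshot(rootDir, snapshotPath);
  return plan;
}
```

Because the interface is structural, the `@parcel/watcher` module namespace could be passed in directly as `watcher`. What happens when no snapshot exists yet is left out here; the first-time flow above (a full `dev index .`) covers that case.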
10394

104-
## Parts
95+
**Force re-index (`dev index . --force`):**
96+
```
97+
1. Antfly: drop tables, recreate
98+
2. Full scan + merge (same as first time)
99+
```
105100

106-
| Part | Description | User stories |
107-
|------|-------------|-------------|
108-
| 2.1 | Simplify indexer: drop state complexity, use Antfly upsert | US-3, US-5 |
109-
| 2.2 | Increase batch size, single AntflyClient | US-6 |
110-
| 2.3 | Wait for embedding completion (or BM25 fallback) | US-3 |
111-
| 2.4 | Decouple git/github from `dev index .` | US-10, US-11 |
112-
| 2.5 | Auto-index on MCP server startup | US-4, US-12 |
113-
| 2.6 | File watcher for continuous indexing (stretch) | US-4 |
114-
| 2.7 | `dev status` rework — show Antfly table stats | US-13 |
101+
### What we drop
115102

116-
## Decisions to make
103+
| Old complexity | Replaced by |
104+
|---------------|-------------|
105+
| `indexer-state.json` (file hashes, doc IDs) | `@parcel/watcher` snapshots + Antfly Linear Merge |
106+
| Manual `dev index .` after every change | Automatic via file watcher |
107+
| Batch size 32 + CONCURRENCY parallelism | Single Linear Merge call per change batch |
108+
| Three separate VectorStorage instances | One AntflyClient, three table names |
109+
| `TransformersEmbedder` pipeline | Antfly auto-embeds via Termite |
110+
| Hash comparison in RepositoryIndexer | Antfly server-side content hashing |
117111

118-
1. **File watcher or fast incremental?** Watcher is better UX but more complexity.
119-
Fast incremental (<5s) on MCP startup might be enough.
112+
### What we keep
120113

121-
2. **State file: minimal or none?** We need *something* to detect deleted files.
122-
Could query Antfly for existing keys and diff, but that's O(n) on every run.
114+
- **Scanner pipeline** — ts-morph, tree-sitter, remark (proven, well-tested)
115+
- **Document preparation**: `prepareDocumentsForEmbedding()` (pure transform)
116+
- **Git indexing** — as a separate command (`dev git index`)
117+
- **GitHub indexing** — as a separate command (`dev github index`)
118+
- **MCP adapter layer** — unchanged, consumes search results
123119

124-
3. **Git/GitHub: part of `dev index .` or separate?** Currently bundled.
125-
Separating them makes `dev index .` faster and each concern independent.
120+
## Decisions
126121

127-
4. **Embedding completion: wait or don't?** Antfly's BM25 index is immediate.
128-
Vector search has ~2s delay. Should we wait, or document the tradeoff?
122+
| Decision | Rationale | Alternatives |
123+
|----------|-----------|-------------|
124+
| Use `@parcel/watcher` | Native, `getEventsSince()` survives restarts, VS Code uses it | chokidar (no historical queries), watchman (requires daemon) |
125+
| Use Antfly Linear Merge | Server-side content hashing eliminates state file entirely | Keep state file + manual upsert (more code, same result) |
126+
| Watch from MCP server process | MCP server is the long-running process; watcher lives there | Separate daemon (more complexity), CLI-only (no auto-update) |
127+
| Decouple git/github from `dev index .` | Different update patterns, different data sources | Keep bundled (slower `dev index .`, coupled concerns) |
128+
| Debounce file changes (500ms) | Avoid re-indexing mid-save; batch rapid changes | Per-file immediate (too many API calls), longer debounce (stale data) |
129+
| Drop indexer-state.json | Antfly + watcher replace all its functions | Keep for backward compat (dead code) |
129130

130-
## Open questions
131+
## Parts
131132

132-
- What does the MCP server startup look like? Does it auto-index?
133-
- How does Cursor's workspace detection interact with auto-indexing?
134-
- Should `dev index .` be a command users run, or should it be invisible?
135-
- What's the right granularity for file watching? (per-file? per-save? debounced?)
133+
| Part | Description | User stories | Risk |
134+
|------|-------------|-------------|------|
135+
| 2.1 | Replace batch insert with Antfly Linear Merge | US-3, US-5, US-6 | Low |
136+
| 2.2 | Add `@parcel/watcher` to MCP server | US-4, US-12 | Medium |
137+
| 2.3 | Debounce + incremental re-index on file change | US-4 | Medium |
138+
| 2.4 | `getEventsSince` on MCP server startup | US-5, US-12 | Low |
139+
| 2.5 | Decouple git/github from `dev index .` | US-10, US-11 | Low |
140+
| 2.6 | Drop indexer-state.json, simplify RepositoryIndexer | US-3, US-6 | Medium |
141+
| 2.7 | `dev status` rework — Antfly table stats + watcher status | US-13 | Low |
142+
| 2.8 | E2E tests: index real repo, search, verify results | US-3, US-8, US-9 | Low |
143+
144+
## Risk register
145+
146+
| Risk | Likelihood | Impact | Mitigation |
147+
|------|-----------|--------|------------|
148+
| `@parcel/watcher` native addon install issues | Medium | Medium | Fall back to chokidar; or bundle prebuilt binaries |
149+
| Antfly Linear Merge API doesn't exist yet in SDK | Medium | High | Verify in spike; use raw REST if SDK missing |
150+
| File watcher misses changes (edge cases) | Low | Medium | `dev index .` always available as manual fallback |
151+
| Large repos overwhelm watcher (10k+ files) | Low | Medium | Filter aggressively (ignore node_modules, dist, etc.) |
152+
| Debounce window too long/short | Low | Low | Make configurable; 500ms default is standard |
153+
154+
## Verification checklist
155+
156+
- [ ] `dev index .` works end-to-end with Linear Merge
157+
- [ ] File watcher detects changes and auto-re-indexes
158+
- [ ] MCP server restart catches up via `getEventsSince`
159+
- [ ] `dev_search "validateUser"` returns exact match (BM25)
160+
- [ ] `dev_search "authentication middleware"` returns semantic matches (vector)
161+
- [ ] `dev index . --force` clears and rebuilds
162+
- [ ] `dev git index` works independently
163+
- [ ] `dev github index` works independently
164+
- [ ] `dev status` shows fresh Antfly stats + watcher status
165+
- [ ] No `indexer-state.json` written or read
166+
- [ ] Works on this repo (dev-agent) end-to-end
136167

137168
## Dependencies
138169

139170
- Phase 1 (Antfly migration) — merged
140-
- Antfly server running
141-
- Understanding of MCP server lifecycle (how/when it starts)
171+
- Antfly Linear Merge API — verify in spike (Part 2.1)
172+
- `@parcel/watcher` — npm install
.claude/da-plans/core/phase-2-indexing-rethink/research.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
# Phase 2 Research: Indexing Libraries & Patterns
2+
3+
## File Watching
4+
5+
| Library | Downloads/wk | Historical queries | Native | Used by |
6+
|---------|-------------|-------------------|--------|---------|
7+
| `@parcel/watcher` | 12.6M | **Yes** (`getEventsSince()`) | C++ | VS Code, Tailwind, Nx, Nuxt |
8+
| `chokidar` | 115M | No | JS | Webpack, Vite, Brunch |
9+
| `fb-watchman` | 12M | Yes (clock-based) | Daemon | Jest, React Native |
10+
| `nsfw` | 200K | No | C++ | GitKraken |
11+
| `node:fs.watch` | built-in | No | N/A | |
12+
13+
**Winner: `@parcel/watcher`** — the `getEventsSince()` API solves the "MCP server restarts,
14+
what changed?" problem without a persistent daemon. Native C++ performance. VS Code uses it.
15+
16+
## Indexing patterns from industry
17+
18+
| Tool | Change detection | Incremental strategy |
19+
|------|-----------------|---------------------|
20+
| Zoekt (Sourcegraph) | Delta indexing vs stored state | Only processes changed files, merges shards |
21+
| GitHub Code Search | Content hash (blob SHA) | Unchanged blobs never re-indexed |
22+
| Cursor | Merkle trees, checks every 10 min | Hash mismatches → re-embed only changed files |
23+
| Livegrep | None | Full re-index every time (anti-pattern) |
24+
25+
**Key pattern:** Content hashing for change detection. All major tools use it.
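
As a rough illustration of that pattern (not how any of the tools above implement it), change detection by content hash can be sketched as follows. `contentHash` and `diffByHash` are hypothetical helpers, and SHA-256 is an arbitrary choice for the example.

```typescript
import { createHash } from "node:crypto";

// Hex SHA-256 digest of a file's content.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Compare stored hashes (path -> hash) against current contents (path -> content).
function diffByHash(
  previous: Map<string, string>,
  current: Map<string, string>,
): { changed: string[]; removed: string[] } {
  const changed: string[] = [];
  current.forEach((content, path) => {
    // New files have no stored hash, so they count as changed too.
    if (previous.get(path) !== contentHash(content)) changed.push(path);
  });
  // Paths that were stored before but no longer exist.
  const removed = Array.from(previous.keys()).filter((path) => !current.has(path));
  return { changed, removed };
}
```

Only the paths in `changed` need to be re-embedded; unchanged content is skipped even if its modification time moved.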
26+
27+
## Antfly Linear Merge API
28+
29+
Antfly has a built-in sync API designed for exactly this use case:
30+
31+
```bash
32+
POST /api/v1/tables/{table}/merge
33+
{
34+
"documents": {
35+
"doc-1": { "text": "...", "metadata": "..." },
36+
"doc-2": { "text": "...", "metadata": "..." }
37+
},
38+
"delete_missing": true // Remove docs not in this batch
39+
}
40+
```
41+
42+
**What it does:**
43+
- Content hashing server-side — unchanged documents are skipped (no re-embedding)
44+
- New/changed documents are upserted
45+
- With `delete_missing: true`, documents not in the payload are removed
46+
- Returns: `{ upserted: N, skipped: N, deleted: N }`
47+
48+
**This replaces:** state file, hash tracking, manual upsert logic, delete-then-insert
49+
for removed files. All handled by Antfly in one API call.
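
A hedged sketch of how a client might assemble this request, assuming the endpoint and body shape shown above are accurate (the risk register notes this still needs a spike). `PreparedDoc`, `buildMergeRequest`, and `mergeUrl` are hypothetical names; the sketch only builds the URL and JSON body and leaves the HTTP call to the caller.

```typescript
// Hypothetical shape of a prepared document; `metadata` is kept as a string
// because the request example above shows a string value.
interface PreparedDoc {
  id: string;
  text: string;
  metadata: string;
}

// Request body as described above (assumed, pending the API spike).
interface MergeRequest {
  documents: Record<string, { text: string; metadata: string }>;
  delete_missing: boolean;
}

// Build the merge body. Only a full scan sends the complete document set,
// so only then is it safe to ask the server to delete everything missing.
function buildMergeRequest(docs: PreparedDoc[], mode: "full" | "incremental"): MergeRequest {
  const documents: Record<string, { text: string; metadata: string }> = {};
  for (const doc of docs) {
    documents[doc.id] = { text: doc.text, metadata: doc.metadata };
  }
  return { documents, delete_missing: mode === "full" };
}

// Endpoint path taken from the request example; baseUrl is supplied by the caller.
function mergeUrl(baseUrl: string, table: string): string {
  return `${baseUrl.replace(/\/+$/, "")}/api/v1/tables/${encodeURIComponent(table)}/merge`;
}
```

The `mode` switch reflects the semantics stated above: with `delete_missing: true`, a batch containing only the changed files would remove every other document. How removals for individual deleted files are expressed in an incremental merge is not specified here and is left out of the sketch.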
50+
51+
## MCP server patterns
52+
53+
Most MCP servers are stateless (read on demand). Notable exceptions:
54+
- `mcp-file-context-server` — file watching + LRU cache + auto-invalidation
55+
- `context-mode` — event log + FTS5/BM25 in SQLite
56+
57+
No established pattern for live-indexed MCP servers. dev-agent would be first.
58+
59+
MCP spec supports `roots/list_changed` notification for workspace changes.
60+
61+
## What to build vs reuse
62+
63+
| Component | Action | Tool |
64+
|-----------|--------|------|
65+
| File watching | Reuse | `@parcel/watcher` |
66+
| Change detection | Reuse | Antfly Linear Merge (server-side content hashing) |
67+
| State file / hash tracking | **Drop entirely** | Antfly handles dedup + deletion |
68+
| Tree-sitter parsing | Already have | `web-tree-sitter` |
69+
| Embedding | Already have | Antfly Termite |
70+
| Batch insert | Simplify | Antfly Linear Merge (one call replaces batch loop) |
71+
| Orchestration glue | Build | Watcher → parser → merge (the only new code) |
