Commit 492e403

prosdev and claude committed
docs(plans): rethink indexing flow for Antfly backend
Replace investigation plan with user-story-driven redesign:
- 16 user stories covering setup, indexing, search, lifecycle, multi-project
- Identifies 6 problems with current flow (manual trigger, state complexity, embedding delay, batch sizing, coupled git/github)
- Proposes file watcher or fast incremental approaches
- Simplifications: Antfly upsert removes state file complexity, single client for three tables, BM25 immediate search

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7da0992 commit 492e403

4 files changed

Lines changed: 246 additions & 195 deletions

.claude/da-plans/README.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Implementation deviations are logged at the bottom of each plan file.
 
 | Track | Description | Status |
 |-------|-------------|--------|
-| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Draft |
+| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Draft (indexing rethink) |
 | [CLI](cli/) | Command-line interface | Not started |
 | [MCP Server](mcp-server/) | Model Context Protocol server + adapters | Phase 1: Draft (blocked on core/phase-1) |
 | [Subagents](subagents/) | Coordinator, explorer, planner, GitHub agents | Not started |

.claude/da-plans/core/phase-2-dev-index-investigation/overview.md

Lines changed: 0 additions & 194 deletions
This file was deleted.
Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
# Phase 2: Rethink Indexing & Search Flow

**Status:** Draft

## Context

Phase 1 replaced the storage layer (LanceDB → Antfly) but kept the old indexing
flow intact. That flow was designed around LanceDB constraints: local file storage,
manual embedding pipeline, batch sizing tuned for ONNX model memory, state files
for incremental updates.

With Antfly as the backend, many of these constraints no longer exist. Rather than
patching the old flow, we should redesign it around what Antfly enables and what
developers actually need.

See [user-stories.md](./user-stories.md) for the full set of user stories driving
this redesign.

## Current flow (what exists)

```
dev setup → start Antfly (one-time)
dev index . → scan all files → batch insert into Antfly → save state file
├─ Phase 1: Scan → ts-morph/tree-sitter/remark → Document[]
├─ Phase 2: Store → batch HTTP inserts (32 docs × CONCURRENCY parallel)
├─ Phase 3: Git → extract commits → separate table
├─ Phase 4: GitHub → fetch issues/PRs via gh CLI → separate table
└─ Save state → indexer-state.json (file hashes for incremental)
dev search "query" → hybrid search via Antfly
```

### Problems with current flow

1. **Manual trigger required:** the developer must remember to run `dev index .` after
   code changes, so AI tools get stale context. (violates US-4)

2. **State file complexity:** the state file tracks file hashes, document IDs per file,
   and timestamps. But Antfly upserts natively: inserting an existing key overwrites it.
   Do we need the state file at all?

3. **Embedding delay is invisible:** Antfly embeds asynchronously (~2s), so `dev index .`
   completes before embeddings are ready. An immediate search may return nothing. (violates US-3)

4. **Three separate VectorStorage instances:** these were created because LanceDB needed
   separate directories. With Antfly, they are just three tables, but the code still creates
   three separate VectorStorage objects with separate connections.

5. **Batch sizing is wrong:** the indexer uses batch=32 (tuned for ONNX). Antfly can handle
   500 per request, so we're making roughly 15x more HTTP calls than needed.

6. **Git and GitHub coupled to index command:** `dev index .` does code + git + GitHub
   in one big command. These are different data sources with different update patterns.
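
To make the arithmetic in problem 5 concrete, here is a minimal TypeScript sketch. `chunk` and `requestCount` are hypothetical helpers rather than functions from the indexer, and the 10,000-document corpus is an assumed example size; only the batch sizes 32 and 500 come from the plan.

```typescript
// Hypothetical helper: split items into batches of at most `size`.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Number of HTTP insert requests needed for `docCount` documents.
function requestCount(docCount: number, batchSize: number): number {
  return Math.ceil(docCount / batchSize);
}

// Assumed example corpus of 10,000 documents.
const docs = Array.from({ length: 10000 }, (_, i) => ({ id: `doc-${i}` }));

console.log(chunk(docs, 32).length);  // 313 requests at the current batch size
console.log(chunk(docs, 500).length); // 20 requests at the proposed batch size
```

For this corpus the ratio is 313 / 20, which is where the "roughly 15x" figure comes from.
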

## Proposed flow

### The big idea: file watcher + on-demand indexing

```
dev setup → start Antfly + start file watcher (background)
watcher detects file changes → re-indexes changed files automatically

dev index . → full scan (first time or explicit refresh)
dev index . --force → clear + full scan

# These become separate, optional commands:
dev git index → index git history (already exists)
dev github index → index GitHub issues/PRs (already exists)
```

**US-4 solved:** The file watcher keeps the index fresh without manual intervention.
When a developer saves a file, the watcher re-indexes it within seconds.
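
One possible shape for the watcher's change handling is a small batcher that coalesces rapid saves into a single re-index call. This is a sketch under assumptions: `ChangeBatcher`, its `onFlush` callback, and the 300 ms quiet period are all hypothetical, and the plan has not yet decided on watch granularity.

```typescript
// Hypothetical sketch: collect changed paths and deliver them once per quiet period.
class ChangeBatcher {
  private readonly pending = new Set<string>();
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly onFlush: (paths: string[]) => void;
  private readonly quietMs: number;

  // quietMs is an assumed debounce window, not a value from the plan.
  constructor(onFlush: (paths: string[]) => void, quietMs: number = 300) {
    this.onFlush = onFlush;
    this.quietMs = quietMs;
  }

  // Record a changed path and restart the quiet-period timer.
  notify(path: string): void {
    this.pending.add(path);
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.quietMs);
  }

  // Deliver all pending paths exactly once, then reset.
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.size === 0) return;
    const paths = Array.from(this.pending).sort();
    this.pending.clear();
    this.onFlush(paths);
  }
}

// Three events, two distinct files, one re-index call.
const reindexed: string[][] = [];
const batcher = new ChangeBatcher((paths) => {
  reindexed.push(paths);
});
batcher.notify("src/a.ts");
batcher.notify("src/b.ts");
batcher.notify("src/a.ts");
batcher.flush(); // force delivery instead of waiting for the timer
console.log(reindexed);
```

In a real watcher, `notify` could be called from the callback of Node's `fs.watch`, and `onFlush` would hand the paths to the scanner. Because Antfly overwrites on upsert, re-indexing a file twice by accident would be wasteful but not incorrect.
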

### Alternative: no watcher, just fast incremental

If a file watcher is too complex for Phase 2, the simpler approach:

```
dev index . → fast incremental (only changed files, <5s for small changes)
runs automatically on MCP server startup
runs automatically before search if stale (>5 min since last update)
```
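
The staleness rule in the last line above could be as small as the following sketch. `isStale` and `ensureFresh` are hypothetical names, and the `reindex` callback stands in for whatever runs the incremental index; the 5-minute threshold is the one stated in the plan.

```typescript
// Threshold from the plan: treat the index as stale after 5 minutes.
const STALE_AFTER_MS = 5 * 60 * 1000;

// True when no index exists yet or the last update is older than maxAgeMs.
function isStale(
  lastIndexedAtMs: number | undefined,
  nowMs: number,
  maxAgeMs: number = STALE_AFTER_MS
): boolean {
  if (lastIndexedAtMs === undefined) return true;
  return nowMs - lastIndexedAtMs > maxAgeMs;
}

// Runs `reindex` only when needed and returns the timestamp to persist.
function ensureFresh(
  lastIndexedAtMs: number | undefined,
  nowMs: number,
  reindex: () => void
): number {
  if (lastIndexedAtMs !== undefined && !isStale(lastIndexedAtMs, nowMs)) {
    return lastIndexedAtMs;
  }
  reindex();
  return nowMs;
}

const now = Date.now();
let runs = 0;
ensureFresh(now - 60 * 1000, now, () => { runs += 1; });     // 1 minute old: skipped
ensureFresh(now - 6 * 60 * 1000, now, () => { runs += 1; }); // 6 minutes old: re-indexed
console.log(runs); // 1
```

Passing `nowMs` in explicitly keeps the check deterministic, which makes the threshold easy to test.
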

### Simplifications enabled by Antfly

| Old complexity | New simplification |
|---------------|-------------------|
| State file (file hashes, doc IDs) | Antfly upsert by key: just re-insert, it overwrites |
| Three VectorStorage instances | One AntflyClient, three table names |
| Batch size 32 + CONCURRENCY | Single batch size 500, let Antfly handle parallelism |
| Manual embedding step | Antfly auto-embeds on insert |
| Wait for embedding completion | BM25 search works immediately; vector search ready in ~2s |

### State file: keep or drop?

**Keep a minimal version.** We still need to know:
- Which files have been indexed (to detect deleted files → remove from Antfly)
- Last index timestamp (to detect staleness)

**Drop:**
- File hashes (just re-insert everything that changed based on mtime)
- Document IDs per file (Antfly handles dedup by key)
- Embedding metadata (Antfly owns this)
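
A minimal state file along these lines might be used as in the sketch below. The `MinimalState` and `IndexPlan` shapes and the `planIncremental` function are hypothetical; the sketch assumes the scanner can report each file's mtime in milliseconds and that "changed" means modified after the last index timestamp.

```typescript
// Hypothetical minimal state: indexed paths plus the last index timestamp.
interface MinimalState {
  lastIndexedAtMs: number;
  files: string[];
}

interface IndexPlan {
  upsert: string[]; // new or modified files: re-insert and let Antfly overwrite by key
  remove: string[]; // files no longer on disk: delete from Antfly
}

// `scanned` maps each path currently on disk to its mtime in milliseconds.
function planIncremental(
  state: MinimalState,
  scanned: ReadonlyMap<string, number>
): IndexPlan {
  const known = new Set(state.files);
  const upsert: string[] = [];
  scanned.forEach((mtimeMs, path) => {
    if (!known.has(path) || mtimeMs > state.lastIndexedAtMs) upsert.push(path);
  });
  const remove = state.files.filter((path) => !scanned.has(path));
  return { upsert: upsert.sort(), remove: remove.sort() };
}

const state: MinimalState = {
  lastIndexedAtMs: 1000,
  files: ["a.ts", "b.ts", "c.ts"],
};
const scanned = new Map<string, number>([
  ["a.ts", 900],  // unchanged since last index
  ["b.ts", 1500], // modified after last index
  ["d.ts", 800],  // not indexed before
]);

const plan = planIncremental(state, scanned);
console.log(plan.upsert); // b.ts and d.ts
console.log(plan.remove); // c.ts
```

The indexed-file list is what makes deletions detectable without querying Antfly for every key, and the single timestamp replaces per-file hashes.
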

## Parts

| Part | Description | User stories |
|------|-------------|-------------|
| 2.1 | Simplify indexer: drop state complexity, use Antfly upsert | US-3, US-5 |
| 2.2 | Increase batch size, single AntflyClient | US-6 |
| 2.3 | Wait for embedding completion (or BM25 fallback) | US-3 |
| 2.4 | Decouple git/github from `dev index .` | US-10, US-11 |
| 2.5 | Auto-index on MCP server startup | US-4, US-12 |
| 2.6 | File watcher for continuous indexing (stretch) | US-4 |
| 2.7 | `dev status` rework: show Antfly table stats | US-13 |

## Decisions to make

1. **File watcher or fast incremental?** A watcher gives better UX but adds complexity.
   Fast incremental (<5s) on MCP startup might be enough.

2. **State file: minimal or none?** We need *something* to detect deleted files.
   We could query Antfly for existing keys and diff, but that's O(n) on every run.

3. **Git/GitHub: part of `dev index .` or separate?** Currently bundled.
   Separating them makes `dev index .` faster and each concern independent.

4. **Embedding completion: wait or don't?** Antfly's BM25 index is immediate.
   Vector search has ~2s delay. Should we wait, or document the tradeoff?

## Open questions

- What does the MCP server startup look like? Does it auto-index?
- How does Cursor's workspace detection interact with auto-indexing?
- Should `dev index .` be a command users run, or should it be invisible?
- What's the right granularity for file watching? (per-file? per-save? debounced?)

## Dependencies

- Phase 1 (Antfly migration): merged
- Antfly server running
- Understanding of MCP server lifecycle (how/when it starts)
