Commit 7da0992

prosdev and claude committed
docs(plans): add Phase 2 dev index investigation plan
Traces dev index . end-to-end, identifies 6 risk areas:
- Embedding availability timing (async, ~2s delay)
- Network dependency (HTTP vs local disk)
- Batch size mismatch (indexer 32 vs antfly 500)
- Incremental update + antfly dedup
- deriveTableName edge cases
- No end-to-end test (critical gap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 46f693d commit 7da0992

2 files changed

Lines changed: 195 additions & 1 deletion

.claude/da-plans/README.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Implementation deviations are logged at the bottom of each plan file.
 
 | Track | Description | Status |
 |-------|-------------|--------|
-| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Draft |
+| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Draft |
 | [CLI](cli/) | Command-line interface | Not started |
 | [MCP Server](mcp-server/) | Model Context Protocol server + adapters | Phase 1: Draft (blocked on core/phase-1) |
 | [Subagents](subagents/) | Coordinator, explorer, planner, GitHub agents | Not started |
Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
# Phase 2: `dev index .` Investigation & Hardening

**Status:** Draft. Investigate before implementing fixes.

## Context

`dev index .` is dev-agent's most important command. It scans a repository, extracts
code components, and stores them in Antfly for hybrid search. The antfly migration
(Phase 1) replaced the entire storage layer underneath it, but the indexing pipeline
itself wasn't tested end-to-end against a real repository.

This phase investigates what works, what's broken, and what needs hardening.

## What `dev index .` does (traced)

```
dev index .
├─ Check prerequisites (git repo, gh CLI)
├─ Load config, resolve storage paths
├─ Create RepositoryIndexer + VectorStorage
├─ indexer.initialize()
│ └─ VectorStorage → AntflyVectorStore → create table if not exists

├─ Phase 1: Scan repository
│ └─ scanRepository() → glob files → parse with ts-morph/tree-sitter/remark
│ └─ Returns: Document[] (functions, classes, types, etc.)

├─ Phase 2: Prepare + store documents
│ └─ prepareDocumentsForEmbedding() → EmbeddingDocument[]
│ └─ Batch insert into Antfly (BATCH_SIZE=500, parallelism via CONCURRENCY)
│ └─ Antfly auto-embeds via Termite (~2s delay)

├─ Phase 3: Git history (if enabled)
│ └─ Separate VectorStorage instance → vectors-git table
│ └─ Extract commits → batch insert

├─ Phase 4: GitHub issues/PRs (if enabled)
│ └─ Separate VectorStorage instance → vectors-github table
│ └─ Fetch via gh CLI → batch insert

└─ Save state, emit events, close
```

## What works well

- **Scanner pipeline**: ts-morph (TS/JS), tree-sitter (Go), remark (Markdown) are
  unchanged and well-tested (hundreds of existing tests)
- **Document preparation**: `prepareDocumentsForEmbedding()` is pure transformation,
  no storage dependency
- **State management**: indexer-state.json for incremental updates, file hash tracking
- **Three separate tables**: clean separation of code/git/github data
- **Error handling**: batch failures are caught and reported

## Known risks (from Phase 1 spike + migration)

### 1. Embedding availability timing (HIGH)

Antfly embeds documents asynchronously in the background (~2s delay per batch).
After `dev index .` completes, newly inserted documents may not be searchable yet.

**Question:** Does `dev index .` need to wait for all embeddings to complete before
declaring success? Currently it doesn't: it returns as soon as all HTTP inserts succeed.

**Impact:** A user who runs `dev index .` and then immediately calls `dev_search` gets no results.

**Options:**
- a. Poll antfly for embedding completion before returning
- b. Add a brief sleep after all inserts
- c. Return immediately, note "embeddings processing" in output
- d. Rely on Antfly's full-text index (BM25), which is immediate; only vector search is delayed

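Option (a) could be sketched roughly as below. This is a minimal TypeScript sketch under stated assumptions, not Antfly's API: `probe` is a hypothetical callback that reports how many documents already have embeddings (how that count would be obtained from Antfly is not established by this plan), and `waitForEmbeddings`, `WaitOptions`, and the injectable `sleep`/`now` hooks are names invented for illustration.

```typescript
// Hypothetical probe: resolves to the number of documents that already have embeddings.
type EmbeddedCountProbe = () => Promise<number>;

interface WaitOptions {
  expected: number;                       // documents inserted during this run
  timeoutMs: number;                      // give up after this long
  intervalMs: number;                     // pause between probes
  sleep?: (ms: number) => Promise<void>;  // injectable so tests need no real timers
  now?: () => number;                     // injectable clock
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the probe reports at least `expected` documents.
// Resolves to true on success and false if the deadline passes first.
async function waitForEmbeddings(
  probe: EmbeddedCountProbe,
  opts: WaitOptions,
): Promise<boolean> {
  const sleep = opts.sleep ?? defaultSleep;
  const now = opts.now ?? Date.now;
  const deadline = now() + opts.timeoutMs;
  for (;;) {
    if ((await probe()) >= opts.expected) return true;
    if (now() >= deadline) return false;
    await sleep(opts.intervalMs);
  }
}
```

Resolving to `false` on timeout instead of throwing would let the command fall back to option (c) and report that embeddings are still processing.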
### 2. Network dependency (MEDIUM)

All storage operations are now HTTP calls to Antfly. Previously they were local disk writes.

**What could go wrong:**
- Antfly server goes down mid-index → partial index, unclear state
- Network timeout on large batches → batch retry needed
- Port conflict → `ensureAntfly` fails silently

**Question:** What happens if antfly crashes during `dev index .`? Is the state file
consistent with what's actually in antfly?

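For the timeout case, one possible shape for batch retry is exponential backoff around the insert call. The sketch below is an illustration only: `withRetry` is a hypothetical helper, not something dev-agent or Antfly is known to provide, and `op` stands in for whatever function performs the HTTP insert. Retrying is only safe if a repeated insert is harmless, which this plan suggests is the case because Antfly upserts by key.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
// Waits baseDelayMs, then 2x, then 4x, ... between attempts.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Surface the last failure so the caller can report the batch as failed.
  throw lastError;
}
```

To keep the state file consistent with the store, one option is to record a file as indexed only after the wrapped insert has resolved, so a crash mid-index leaves the file marked as pending rather than done.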
### 3. Batch size mismatch (LOW)

The indexer uses `batchSize=32` for its internal batching (parallelized with CONCURRENCY).
AntflyVectorStore has its own `BATCH_SIZE=500` for HTTP requests. These are independent:
the indexer sends 32 docs to `addDocuments()`, which passes them straight through since
32 < 500.

**Question:** Is this efficient? Should we increase the indexer batch size to match
antfly's capacity? Or does the parallelism (multiple batches of 32 in flight) compensate?

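To put rough numbers on the question, the request count under the two batch sizes can be computed directly. The sketch assumes the layering described above (each indexer batch is handed to the store, which splits it again by its own limit); `httpRequestCount` is a hypothetical helper, and the 10,000-document repository is an invented example, not a measurement.

```typescript
// Hypothetical helper: number of HTTP requests needed when the indexer hands
// batches of `indexerBatch` docs to a store that splits each one into
// requests of at most `storeBatch` docs.
function httpRequestCount(
  totalDocs: number,
  indexerBatch: number,
  storeBatch: number,
): number {
  for (const n of [indexerBatch, storeBatch]) {
    if (!Number.isInteger(n) || n <= 0) {
      throw new RangeError("batch sizes must be positive integers");
    }
  }
  const fullBatches = Math.floor(totalDocs / indexerBatch);
  const remainder = totalDocs % indexerBatch;
  return (
    fullBatches * Math.ceil(indexerBatch / storeBatch) +
    Math.ceil(remainder / storeBatch)
  );
}

// Invented example: a repository that yields 10,000 documents.
const withCurrentSizes = httpRequestCount(10_000, 32, 500); // 313 requests
const withMatchedSizes = httpRequestCount(10_000, 500, 500); // 20 requests
console.log({ withCurrentSizes, withMatchedSizes });
```

Fewer requests is not automatically faster: with several batches of 32 in flight the per-request overhead may be hidden, which is what the measurement in this phase would need to show.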
### 4. Incremental update + antfly dedup (LOW)

Incremental updates detect changed files, delete old documents, and insert new ones.
Antfly deduplicates by key (upsert on insert). The delete step might be redundant.

**Question:** Can we simplify incremental updates by just re-inserting (antfly overwrites)?
Or do we need the explicit delete for documents that no longer exist (removed code)?

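If upsert-by-key behaves as described, the two halves of the question separate cleanly: keys that still exist are overwritten by re-inserting, and only keys that disappeared need an explicit delete. A minimal sketch of that set difference follows; `staleKeys` is a hypothetical helper and the `file:symbol` key format is invented for the example, not taken from dev-agent.

```typescript
// Hypothetical helper: keys stored for a file before the change that are no
// longer produced after re-scanning it. Only these need an explicit delete;
// everything else is covered by the upsert.
function staleKeys(
  previous: Iterable<string>,
  current: Iterable<string>,
): string[] {
  const keep = new Set(current);
  return Array.from(previous).filter((key) => !keep.has(key));
}

// Invented example: `bar` was removed and `baz` was added in a.ts.
const before = ["a.ts:foo", "a.ts:bar"];
const after = ["a.ts:foo", "a.ts:baz"];
console.log(staleKeys(before, after)); // ["a.ts:bar"]
```

Under this reading the delete step is not redundant, but it can shrink from "delete everything for the file" to "delete the difference".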
### 5. deriveTableName edge cases (LOW)

`deriveTableName()` converts storePath to an antfly table name. It handles the three
known patterns (vectors, vectors-git, vectors-github) but may break on edge cases:
- Paths with special characters
- Very long project directory names
- Paths that don't match expected structure

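One way to make the first two edge cases testable is a normalization step with explicit rules. The sketch below is not the actual `deriveTableName()` implementation, and the rules it encodes (lowercase alphanumerics and hyphens only, 63-character limit) are assumptions for illustration; Antfly's real table-name constraints would need to be confirmed. `sanitizeTableName` and `fnv1a` are hypothetical names.

```typescript
// 32-bit FNV-1a hash, rendered as 8 hex characters. Used only to keep
// truncated names distinct; not a security measure.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Hypothetical normalizer. ASSUMED rules: [a-z0-9-] only, at most maxLen chars.
function sanitizeTableName(raw: string, maxLen = 63): string {
  const cleaned = raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each run of other characters
    .replace(/^-+|-+$/g, "");    // no leading or trailing hyphen
  const base = cleaned.length > 0 ? cleaned : "unnamed";
  if (base.length <= maxLen) return base;
  // Truncate, then append a hash of the original input so two long paths
  // sharing a prefix do not collapse into the same table name.
  return `${base.slice(0, maxLen - 9)}-${fnv1a(raw)}`;
}

console.log(sanitizeTableName("My Project/vectors-git")); // "my-project-vectors-git"
```

Whatever the real rules turn out to be, writing them down as a pure function like this makes each bullet above a one-line unit test.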
### 6. No end-to-end test (CRITICAL)

We have 20 unit tests for AntflyVectorStore and hundreds of mock-based tests for the
indexer, but **no test that runs `dev index .` against a real repository with a real
antfly server**. This is the biggest gap.

## Investigation plan

### Step 1: Run `dev index .` on this repo

```bash
dev index .
```

Observe:
- Does it complete without errors?
- How long does it take?
- How many documents are indexed?
- Can we immediately search after?

### Step 2: Test search after indexing

```bash
dev search "authentication middleware"
dev search "VectorStorage"
dev search "handleError"
```

Observe:
- Do results come back?
- Are they relevant?
- Does hybrid search (exact + semantic) work?

### Step 3: Test incremental update

```bash
# Edit a file
echo "// test change" >> packages/core/src/vector/antfly-store.ts
dev update
# Revert
git checkout packages/core/src/vector/antfly-store.ts
```

Observe:
- Does incremental detect the change?
- Does it only re-index the changed file?
- Is the updated document searchable?

### Step 4: Test git history indexing

```bash
dev git search "antfly migration"
```

### Step 5: Test GitHub indexing

```bash
dev github search "hybrid search"
```

### Step 6: Test `--force` re-index

```bash
dev index . --force
```

Observe:
- Does it clear antfly tables and recreate?
- Is state file reset?
- Does it complete cleanly?

## Parts (if fixes are needed)

| Part | Description | Risk |
|------|-------------|------|
| 2.1 | E2E test: index a real repo, search, verify results | Low |
| 2.2 | Embedding completion: wait/poll after insert | Medium |
| 2.3 | Error recovery: handle antfly failures mid-index | Medium |
| 2.4 | Batch size optimization: tune for antfly throughput | Low |
| 2.5 | Incremental update simplification | Low |

## Dependencies

- Antfly server must be running
- Phase 1 (antfly migration) must be merged
