| 1 | +# Phase 2: `dev index .` Investigation & Hardening |
| 2 | + |
| 3 | +**Status:** Draft — investigate before implementing fixes. |
| 4 | + |
| 5 | +## Context |
| 6 | + |
| 7 | +`dev index .` is dev-agent's most important command. It scans a repository, extracts |
| 8 | +code components, and stores them in Antfly for hybrid search. The Antfly migration |
| 9 | +(Phase 1) replaced the entire storage layer underneath it, but the indexing pipeline |
| 10 | +itself wasn't tested end-to-end against a real repository. |
| 11 | + |
| 12 | +This phase investigates what works, what's broken, and what needs hardening. |
| 13 | + |
| 14 | +## What `dev index .` does (traced) |
| 15 | + |
| 16 | +``` |
| 17 | +dev index . |
| 18 | + ├─ Check prerequisites (git repo, gh CLI) |
| 19 | + ├─ Load config, resolve storage paths |
| 20 | + ├─ Create RepositoryIndexer + VectorStorage |
| 21 | + ├─ indexer.initialize() |
| 22 | + │ └─ VectorStorage → AntflyVectorStore → create table if not exists |
| 23 | + │ |
| 24 | + ├─ Phase 1: Scan repository |
| 25 | + │ └─ scanRepository() → glob files → parse with ts-morph/tree-sitter/remark |
| 26 | + │ └─ Returns: Document[] (functions, classes, types, etc.) |
| 27 | + │ |
| 28 | + ├─ Phase 2: Prepare + store documents |
| 29 | + │ └─ prepareDocumentsForEmbedding() → EmbeddingDocument[] |
| 30 | + │ └─ Batch insert into Antfly (BATCH_SIZE=500, parallelism via CONCURRENCY) |
| 31 | + │ └─ Antfly auto-embeds via Termite (~2s delay) |
| 32 | + │ |
| 33 | + ├─ Phase 3: Git history (if enabled) |
| 34 | + │ └─ Separate VectorStorage instance → vectors-git table |
| 35 | + │ └─ Extract commits → batch insert |
| 36 | + │ |
| 37 | + ├─ Phase 4: GitHub issues/PRs (if enabled) |
| 38 | + │ └─ Separate VectorStorage instance → vectors-github table |
| 39 | + │ └─ Fetch via gh CLI → batch insert |
| 40 | + │ |
| 41 | + └─ Save state, emit events, close |
| 42 | +``` |
| 43 | + |
| 44 | +## What works well |
| 45 | + |
| 46 | +- **Scanner pipeline** — ts-morph (TS/JS), tree-sitter (Go), remark (Markdown) are |
| 47 | + unchanged and well-tested (hundreds of existing tests) |
| 48 | +- **Document preparation** — `prepareDocumentsForEmbedding()` is pure transformation, |
| 49 | + no storage dependency |
| 50 | +- **State management** — indexer-state.json for incremental updates, file hash tracking |
| 51 | +- **Three separate tables** — clean separation of code/git/github data |
| 52 | +- **Error handling** — batch failures are caught and reported |
| 53 | + |
| 54 | +## Known risks (from Phase 1 spike + migration) |
| 55 | + |
| 56 | +### 1. Embedding availability timing (HIGH) |
| 57 | + |
| 58 | +Antfly embeds documents asynchronously in the background (~2s delay per batch). |
| 59 | +After `dev index .` completes, newly inserted documents may not be searchable yet. |
| 60 | + |
| 61 | +**Question:** Does `dev index .` need to wait for all embeddings to complete before |
| 62 | +declaring success? Currently it doesn't — it returns as soon as all HTTP inserts succeed. |
| 63 | + |
| 64 | +**Impact:** User runs `dev index .` then immediately `dev_search` — gets no results. |
| 65 | + |
| 66 | +**Options:** |
| 67 | +- a. Poll Antfly for embedding completion before returning |
| 68 | +- b. Add a brief sleep after all inserts |
| 69 | +- c. Return immediately, note "embeddings processing" in output |
| 70 | +- d. Rely on Antfly's full-text index (BM25), which is immediate; only vector search is delayed |
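
Option (a) could be sketched as a generic poll-with-timeout loop. This is a minimal sketch, not dev-agent code: `waitUntil` and the caller-supplied `probe` are hypothetical names, and the source does not say how Antfly reports embedding status, so the probe is left abstract (it might, for example, run a vector search for the last inserted key).

```typescript
// Hypothetical helper: nothing here is Antfly's actual API.
// `probe` is any caller-supplied readiness check that resolves to true once
// the embeddings are considered available.
async function waitUntil(
  probe: () => Promise<boolean>,
  opts: { timeoutMs: number; intervalMs: number },
): Promise<boolean> {
  const deadline = Date.now() + opts.timeoutMs;
  while (true) {
    if (await probe()) return true;
    // Give up if the next sleep would overshoot the deadline.
    if (Date.now() + opts.intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
  }
}
```

Returning `false` on timeout rather than throwing would let `dev index .` fall back to option (c) and print an "embeddings processing" note instead of failing the whole run.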
| 71 | + |
| 72 | +### 2. Network dependency (MEDIUM) |
| 73 | + |
| 74 | +All storage operations are now HTTP calls to Antfly. Previously they were local disk writes. |
| 75 | + |
| 76 | +**What could go wrong:** |
| 77 | +- Antfly server goes down mid-index → partial index, unclear state |
| 78 | +- Network timeout on large batches → batch retry needed |
| 79 | +- Port conflict → `ensureAntfly` fails silently |
| 80 | + |
| 81 | +**Question:** What happens if Antfly crashes during `dev index .`? Is the state file |
| 82 | +consistent with what's actually in Antfly? |
| 83 | + |
| 84 | +### 3. Batch size mismatch (LOW) |
| 85 | + |
| 86 | +The indexer uses `batchSize=32` for its internal batching (parallelized with CONCURRENCY). |
| 87 | +AntflyVectorStore has its own `BATCH_SIZE=500` for HTTP requests. These are independent — |
| 88 | +the indexer sends 32 docs to `addDocuments()`, which passes them straight through since |
| 89 | +32 < 500. |
| 90 | + |
| 91 | +**Question:** Is this efficient? Should we increase the indexer batch size to match |
| 92 | +Antfly's capacity? Or does the parallelism (multiple batches of 32 in flight) compensate? |
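
To make the trade-off concrete, the request count for each setting can be worked out with a small sketch. The `chunk` and `requestCount` helpers are hypothetical stand-ins for whatever splitting the indexer and AntflyVectorStore actually do; only the two batch sizes come from the text above.

```typescript
// Illustrative only: `chunk` is a hypothetical stand-in for the store's
// internal splitting by BATCH_SIZE.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

const INDEXER_BATCH = 32; // indexer batchSize, from the text above
const STORE_BATCH = 500;  // AntflyVectorStore BATCH_SIZE, from the text above

// Count HTTP requests for `total` documents when the indexer hands over
// batches of `indexerBatch` and the store re-splits each one at `storeBatch`.
function requestCount(total: number, indexerBatch: number, storeBatch: number): number {
  const docs = Array.from({ length: total }, (_, i) => i);
  return chunk(docs, indexerBatch)
    .map((batch) => chunk(batch, storeBatch).length)
    .reduce((a, b) => a + b, 0);
}
```

For 10,000 documents, `requestCount(10000, INDEXER_BATCH, STORE_BATCH)` gives 313 requests, while raising the indexer batch to 500 gives 20. Whether 313 small requests in flight in parallel beat 20 large ones is something the timing in Step 1 would have to show.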
| 93 | + |
| 94 | +### 4. Incremental update + Antfly dedup (LOW) |
| 95 | + |
| 96 | +Incremental updates detect changed files, delete old documents, and insert new ones. |
| 97 | +Antfly deduplicates by key (upsert on insert). The delete step might be redundant. |
| 98 | + |
| 99 | +**Question:** Can we simplify incremental updates by just re-inserting (Antfly overwrites)? |
| 100 | +Or do we need the explicit delete for documents that no longer exist (removed code)? |
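
The removed-code case can be illustrated with a set difference over document keys. This is a sketch under assumptions: `staleKeys` is a hypothetical helper, and the `path:symbol` key shape in the usage note is invented for illustration, so the real document IDs may differ.

```typescript
// Hypothetical helper: given the keys stored for a file before the edit and
// the keys produced by re-scanning it, return the keys an upsert would never
// touch. Those documents stay in the table unless they are deleted explicitly.
function staleKeys(previous: readonly string[], current: readonly string[]): string[] {
  const keep = new Set<string>(current);
  return previous.filter((key) => !keep.has(key));
}
```

If a file previously produced `src/a.ts:foo` and `src/a.ts:bar` and now produces `src/a.ts:foo` and `src/a.ts:baz`, re-inserting overwrites `foo` and adds `baz`, but `bar` survives as a stale document. Under this reading, the delete step could shrink to deleting only `staleKeys(...)` rather than disappear entirely.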
| 101 | + |
| 102 | +### 5. deriveTableName edge cases (LOW) |
| 103 | + |
| 104 | +`deriveTableName()` converts `storePath` to an Antfly table name. It handles the three |
| 105 | +known patterns (vectors, vectors-git, vectors-github) but may break on edge cases: |
| 106 | +- Paths with special characters |
| 107 | +- Very long project directory names |
| 108 | +- Paths that don't match expected structure |
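
One way to probe these edge cases is to compare `deriveTableName()` against a deliberately conservative sanitizer. The sketch below is not the project's implementation: `safeTableName` is hypothetical, and the allowed character set (`[a-z0-9-]`) and the 63-character limit are assumptions that have not been checked against Antfly.

```typescript
// Hypothetical sanitizer, NOT the project's deriveTableName().
// Assumed constraints (unverified for Antfly): [a-z0-9-] only, at most 63 chars.
const MAX_LEN = 63;

// FNV-1a-style 32-bit hash over UTF-16 code units, used only to keep
// truncated or empty names distinct. Returns 8 hex digits.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

function safeTableName(raw: string): string {
  const slug = raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // special characters and path separators become "-"
    .replace(/^-+|-+$/g, "");    // no leading or trailing separators
  if (slug.length === 0) return `t-${fnv1a(raw)}`; // nothing usable survived
  if (slug.length <= MAX_LEN) return slug;
  // Truncate, then append a hash of the full input so long names stay distinct.
  return `${slug.slice(0, MAX_LEN - 9)}-${fnv1a(raw)}`;
}
```

Feeding the same awkward paths (spaces, non-ASCII characters, very long directory names) to both functions would show where the real one diverges or produces a name the server rejects.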
| 109 | + |
| 110 | +### 6. No end-to-end test (CRITICAL) |
| 111 | + |
| 112 | +We have 20 unit tests for AntflyVectorStore and hundreds of mock-based tests for the |
| 113 | +indexer, but **no test that runs `dev index .` against a real repository with a real |
| 114 | +Antfly server**. This is the biggest gap. |
| 115 | + |
| 116 | +## Investigation plan |
| 117 | + |
| 118 | +### Step 1: Run `dev index .` on this repo |
| 119 | + |
| 120 | +```bash |
| 121 | +dev index . |
| 122 | +``` |
| 123 | + |
| 124 | +Observe: |
| 125 | +- Does it complete without errors? |
| 126 | +- How long does it take? |
| 127 | +- How many documents are indexed? |
| 128 | +- Can we search immediately afterward? |
| 129 | + |
| 130 | +### Step 2: Test search after indexing |
| 131 | + |
| 132 | +```bash |
| 133 | +dev search "authentication middleware" |
| 134 | +dev search "VectorStorage" |
| 135 | +dev search "handleError" |
| 136 | +``` |
| 137 | + |
| 138 | +Observe: |
| 139 | +- Do results come back? |
| 140 | +- Are they relevant? |
| 141 | +- Does hybrid search (exact + semantic) work? |
| 142 | + |
| 143 | +### Step 3: Test incremental update |
| 144 | + |
| 145 | +```bash |
| 146 | +# Edit a file |
| 147 | +echo "// test change" >> packages/core/src/vector/antfly-store.ts |
| 148 | +dev update |
| 149 | +# Revert |
| 150 | +git checkout packages/core/src/vector/antfly-store.ts |
| 151 | +``` |
| 152 | + |
| 153 | +Observe: |
| 154 | +- Does the incremental update detect the change? |
| 155 | +- Does it only re-index the changed file? |
| 156 | +- Is the updated document searchable? |
| 157 | + |
| 158 | +### Step 4: Test git history indexing |
| 159 | + |
| 160 | +```bash |
| 161 | +dev git search "antfly migration" |
| 162 | +``` |
| 163 | + |
| 164 | +### Step 5: Test GitHub indexing |
| 165 | + |
| 166 | +```bash |
| 167 | +dev github search "hybrid search" |
| 168 | +``` |
| 169 | + |
| 170 | +### Step 6: Test `--force` re-index |
| 171 | + |
| 172 | +```bash |
| 173 | +dev index . --force |
| 174 | +``` |
| 175 | + |
| 176 | +Observe: |
| 177 | +- Does it clear the Antfly tables and recreate them? |
| 178 | +- Is state file reset? |
| 179 | +- Does it complete cleanly? |
| 180 | + |
| 181 | +## Parts (if fixes are needed) |
| 182 | + |
| 183 | +| Part | Description | Risk | |
| 184 | +|------|-------------|------| |
| 185 | +| 2.1 | E2E test: index a real repo, search, verify results | Low | |
| 186 | +| 2.2 | Embedding completion: wait/poll after insert | Medium | |
| 187 | +| 2.3 | Error recovery: handle Antfly failures mid-index | Medium | |
| 188 | +| 2.4 | Batch size optimization: tune for Antfly throughput | Low | |
| 189 | +| 2.5 | Incremental update simplification | Low | |
| 190 | + |
| 191 | +## Dependencies |
| 192 | + |
| 193 | +- Antfly server must be running |
| 194 | +- Phase 1 (Antfly migration) must be merged |