| 1 | +# Phase 2: Rethink Indexing & Search Flow |
| 2 | + |
| 3 | +**Status:** Draft |
| 4 | + |
| 5 | +## Context |
| 6 | + |
| 7 | +Phase 1 replaced the storage layer (LanceDB → Antfly) but kept the old indexing |
| 8 | +flow intact. That flow was designed around LanceDB constraints: local file storage, |
| 9 | +manual embedding pipeline, batch sizing tuned for ONNX model memory, state files |
| 10 | +for incremental updates. |
| 11 | + |
| 12 | +With Antfly as the backend, many of these constraints no longer exist. Rather than |
| 13 | +patching the old flow, we should redesign it around what Antfly enables and what |
| 14 | +developers actually need. |
| 15 | + |
| 16 | +See [user-stories.md](./user-stories.md) for the full set of user stories driving |
| 17 | +this redesign. |
| 18 | + |
| 19 | +## Current flow (what exists) |
| 20 | + |
| 21 | +``` |
| 22 | +dev setup → start Antfly (one-time) |
| 23 | +dev index . → scan all files → batch insert into Antfly → save state file |
| 24 | + ├─ Phase 1: Scan → ts-morph/tree-sitter/remark → Document[] |
| 25 | + ├─ Phase 2: Store → batch HTTP inserts (32 docs × CONCURRENCY parallel) |
| 26 | + ├─ Phase 3: Git → extract commits → separate table |
| 27 | + ├─ Phase 4: GitHub → fetch issues/PRs via gh CLI → separate table |
| 28 | + └─ Save state → indexer-state.json (file hashes for incremental) |
| 29 | +dev search "query" → hybrid search via Antfly |
| 30 | +``` |
| 31 | + |
| 32 | +### Problems with current flow |
| 33 | + |
| 34 | +1. **Manual trigger required** — developer must remember to run `dev index .` after |
| 35 | + code changes. AI tools get stale context. (violates US-4) |
| 36 | + |
| 37 | +2. **State file complexity** — tracks file hashes, document IDs per file, timestamps. |
| 38 | + But Antfly does upsert natively — inserting an existing key overwrites. Do we need |
| 39 | + the state file at all? |
| 40 | + |
| 41 | +3. **Embedding delay invisible** — Antfly embeds asynchronously (~2s). `dev index .` |
| 42 | +   completes before embeddings are ready, so an immediate vector search may return nothing. (violates US-3) |
| 43 | + |
| 44 | +4. **Three separate VectorStorage instances** — created because LanceDB needed separate |
| 45 | +   directories. With Antfly, these are just three tables, yet the code still creates |
| 46 | +   three separate VectorStorage objects, each with its own connection. |
| 47 | + |
| 48 | +5. **Batch sizing is wrong** — indexer uses batch=32 (tuned for ONNX). Antfly can handle |
| 49 | +   500 per request, so we're making roughly 15x more HTTP calls than needed (500 / 32 ≈ 15.6). |
| 50 | + |
| 51 | +6. **Git and GitHub coupled to index command** — `dev index .` does code + git + GitHub |
| 52 | + in one big command. These are different data sources with different update patterns. |
| 53 | + |
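The arithmetic behind problem 5 can be checked with a small sketch. The document count below is a hypothetical repository size chosen for illustration; only the batch sizes 32 and 500 come from the text above.

```typescript
// Number of HTTP insert requests needed to store `docCount` documents
// when each request carries at most `batchSize` documents.
function requestCount(docCount: number, batchSize: number): number {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  return Math.ceil(docCount / batchSize);
}

// Hypothetical repository size, used only to make the ratio concrete.
const docCount = 10000;
const oldRequests = requestCount(docCount, 32);  // ceil(312.5) = 313
const newRequests = requestCount(docCount, 500); // 20

console.log(`batch=32: ${oldRequests} requests, batch=500: ${newRequests} requests`);
```

The ratio only approaches 500 / 32 for large document counts; for a handful of changed files both batch sizes need a single request.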
| 54 | +## Proposed flow |
| 55 | + |
| 56 | +### The big idea: file watcher + on-demand indexing |
| 57 | + |
| 58 | +``` |
| 59 | +dev setup → start Antfly + start file watcher (background) |
| 60 | + watcher detects file changes → re-indexes changed files automatically |
| 61 | + |
| 62 | +dev index . → full scan (first time or explicit refresh) |
| 63 | +dev index . --force → clear + full scan |
| 64 | + |
| 65 | +# These become separate, optional commands: |
| 66 | +dev git index → index git history (already exists) |
| 67 | +dev github index → index GitHub issues/PRs (already exists) |
| 68 | +``` |
| 69 | + |
| 70 | +**US-4 solved:** The file watcher keeps the index fresh without manual intervention. |
| 71 | +Developer saves a file, the watcher re-indexes it within seconds. |
| 72 | + |
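One way the watcher could avoid re-indexing on every single save is to debounce change events. The sketch below is an assumption about how that might look, not the planned implementation: `DebouncedChanges`, its delay, and the flush callback are hypothetical names, and the source of change events (for example Node's `fs.watch`) is left out.

```typescript
// Collects changed paths and emits them as one batch once no new change
// has arrived for `delayMs`. The callback is where re-indexing would go.
class DebouncedChanges {
  private pending = new Set<string>();
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly delayMs: number;
  private readonly onFlush: (paths: string[]) => void;

  constructor(delayMs: number, onFlush: (paths: string[]) => void) {
    this.delayMs = delayMs;
    this.onFlush = onFlush;
  }

  // Record a change and restart the quiet-period timer.
  add(path: string): void {
    this.pending.add(path);
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.delayMs);
  }

  // Emit the de-duplicated, sorted batch right away (useful on shutdown).
  flush(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
    if (this.pending.size === 0) return;
    const paths = Array.from(this.pending).sort();
    this.pending.clear();
    this.onFlush(paths);
  }
}
```

Because paths are held in a `Set`, an editor that writes the same file several times in quick succession produces one re-index request for that file rather than several.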
| 73 | +### Alternative: no watcher, just fast incremental |
| 74 | + |
| 75 | +If a file watcher is too complex for Phase 2, the simpler approach: |
| 76 | + |
| 77 | +``` |
| 78 | +dev index . → fast incremental (only changed files, <5s for small changes) |
| 79 | + runs automatically on MCP server startup |
| 80 | + runs automatically before search if stale (>5 min since last update) |
| 81 | +``` |
| 82 | + |
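The "stale" check in the flow above could be as small as the following sketch. The function name and the treatment of a never-indexed repository are assumptions; only the 5-minute threshold comes from the text.

```typescript
// ">5 min since last update", taken from the flow above.
const STALE_AFTER_MS = 5 * 60 * 1000;

// Returns true when a search should trigger an incremental index first.
// `lastIndexedAt` and `now` are epoch milliseconds; `undefined` means the
// repository has never been indexed, which is treated as stale here.
function isStale(
  lastIndexedAt: number | undefined,
  now: number,
  maxAgeMs: number = STALE_AFTER_MS,
): boolean {
  if (lastIndexedAt === undefined) return true;
  return now - lastIndexedAt > maxAgeMs;
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside the function, keeps the check deterministic in tests.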
| 83 | +### Simplifications enabled by Antfly |
| 84 | + |
| 85 | +| Old complexity | New simplification | |
| 86 | +|---------------|-------------------| |
| 87 | +| State file (file hashes, doc IDs) | Antfly upsert by key — just re-insert, it overwrites | |
| 88 | +| Three VectorStorage instances | One AntflyClient, three table names | |
| 89 | +| Batch size 32 + CONCURRENCY | Single batch size 500, let Antfly handle parallelism | |
| 90 | +| Manual embedding step | Antfly auto-embeds on insert | |
| 91 | +| Wait for embedding completion | BM25 search works immediately; vector search ready in ~2s | |
| 92 | + |
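The table above could translate into code roughly as follows. This is a sketch under stated assumptions: `UpsertClient` is a hypothetical interface, not the actual Antfly client API, and `InMemoryClient` is a stand-in that only mimics the overwrite-by-key behaviour described above.

```typescript
interface Doc {
  key: string;
  text: string;
}

// Hypothetical client shape; the real Antfly API may differ.
interface UpsertClient {
  upsert(table: string, docs: Doc[]): Promise<void>;
}

// In-memory stand-in: one client object, any number of named tables.
class InMemoryClient implements UpsertClient {
  readonly tables = new Map<string, Map<string, Doc>>();
  requests = 0;

  async upsert(table: string, docs: Doc[]): Promise<void> {
    this.requests += 1;
    let rows = this.tables.get(table);
    if (rows === undefined) {
      rows = new Map<string, Doc>();
      this.tables.set(table, rows);
    }
    // Writing an existing key overwrites the previous document.
    for (const doc of docs) rows.set(doc.key, doc);
  }
}

// Send documents in sequential batches of `batchSize` (500 per the table above).
async function upsertAll(
  client: UpsertClient,
  table: string,
  docs: Doc[],
  batchSize: number = 500,
): Promise<void> {
  for (let i = 0; i < docs.length; i += batchSize) {
    await client.upsert(table, docs.slice(i, i + batchSize));
  }
}
```

Batches are awaited one at a time here, which matches the idea of leaving parallelism to the server; whether that is fast enough in practice would need measuring.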
| 93 | +### State file: keep or drop? |
| 94 | + |
| 95 | +**Keep a minimal version.** We still need to know: |
| 96 | +- Which files have been indexed (to detect deleted files → remove from Antfly) |
| 97 | +- Last index timestamp (to detect staleness) |
| 98 | + |
| 99 | +**Drop:** |
| 100 | +- File hashes (just re-insert everything that changed based on mtime) |
| 101 | +- Document IDs per file (Antfly handles dedup by key) |
| 102 | +- Embedding metadata (Antfly owns this) |
| 103 | + |
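A minimal state file along these lines might only need two fields. The sketch below assumes hypothetical names (`IndexState`, `planIncremental`) and assumes file modification times are available as epoch milliseconds, as in Node's `fs.Stats.mtimeMs`.

```typescript
// Minimal state: which paths were indexed, and when the last run happened.
interface IndexState {
  indexedPaths: string[];
  lastIndexedAt: number; // epoch milliseconds
}

interface FileInfo {
  path: string;
  mtimeMs: number;
}

// Decide what to re-insert and what to remove, without any file hashes.
function planIncremental(
  state: IndexState,
  files: FileInfo[],
): { toUpsert: string[]; toDelete: string[] } {
  const current = new Set(files.map((f) => f.path));
  const known = new Set(state.indexedPaths);

  // New files, plus files modified since the last run, get re-inserted.
  const toUpsert = files
    .filter((f) => !known.has(f.path) || f.mtimeMs > state.lastIndexedAt)
    .map((f) => f.path);

  // Paths that were indexed before but no longer exist get deleted.
  const toDelete = state.indexedPaths.filter((p) => !current.has(p));

  return { toUpsert, toDelete };
}
```

One trade-off of relying on mtime alone: a file whose timestamp changes without a content change is re-inserted unnecessarily, which upsert makes harmless but not free.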
| 104 | +## Parts |
| 105 | + |
| 106 | +| Part | Description | User stories | |
| 107 | +|------|-------------|-------------| |
| 108 | +| 2.1 | Simplify indexer: drop state complexity, use Antfly upsert | US-3, US-5 | |
| 109 | +| 2.2 | Increase batch size, single AntflyClient | US-6 | |
| 110 | +| 2.3 | Wait for embedding completion (or BM25 fallback) | US-3 | |
| 111 | +| 2.4 | Decouple Git/GitHub from `dev index .` | US-10, US-11 | |
| 112 | +| 2.5 | Auto-index on MCP server startup | US-4, US-12 | |
| 113 | +| 2.6 | File watcher for continuous indexing (stretch) | US-4 | |
| 114 | +| 2.7 | `dev status` rework — show Antfly table stats | US-13 | |
| 115 | + |
| 116 | +## Decisions to make |
| 117 | + |
| 118 | +1. **File watcher or fast incremental?** Watcher is better UX but more complexity. |
| 119 | + Fast incremental (<5s) on MCP startup might be enough. |
| 120 | + |
| 121 | +2. **State file: minimal or none?** We need *something* to detect deleted files. |
| 122 | +   Could query Antfly for existing keys and diff, but that's O(n) in the number of indexed keys on every run. |
| 123 | + |
| 124 | +3. **Git/GitHub: part of `dev index .` or separate?** Currently bundled. |
| 125 | + Separating them makes `dev index .` faster and each concern independent. |
| 126 | + |
| 127 | +4. **Embedding completion: wait or don't?** Antfly's BM25 index is immediate. |
| 128 | + Vector search has ~2s delay. Should we wait, or document the tradeoff? |
| 129 | + |
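The two options in decision 4 can be compared with a small sketch. All names here are hypothetical, and the 2000 ms constant only restates the "~2s" estimate from the text; it is not a guarantee from Antfly.

```typescript
type SearchMode = "bm25" | "hybrid";

// "~2s" embedding delay from the text above; an estimate, not a guarantee.
const EMBEDDING_DELAY_MS = 2000;

// Option A (wait): how long to block before vector search is likely ready.
function embeddingWaitMs(
  lastInsertAt: number | undefined,
  now: number,
  delayMs: number = EMBEDDING_DELAY_MS,
): number {
  if (lastInsertAt === undefined) return 0;
  return Math.max(0, lastInsertAt + delayMs - now);
}

// Option B (don't wait): fall back to BM25 while embeddings may be pending.
function chooseSearchMode(
  lastInsertAt: number | undefined,
  now: number,
  delayMs: number = EMBEDDING_DELAY_MS,
): SearchMode {
  return embeddingWaitMs(lastInsertAt, now, delayMs) > 0 ? "bm25" : "hybrid";
}
```

Option B keeps searches non-blocking at the cost of briefly weaker ranking; option A keeps ranking consistent but adds up to the full delay to the first search after an insert.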
| 130 | +## Open questions |
| 131 | + |
| 132 | +- What does the MCP server startup look like? Does it auto-index? |
| 133 | +- How does Cursor's workspace detection interact with auto-indexing? |
| 134 | +- Should `dev index .` be a command users run, or should it be invisible? |
| 135 | +- What's the right granularity for file watching? (per-file? per-save? debounced?) |
| 136 | + |
| 137 | +## Dependencies |
| 138 | + |
| 139 | +- Phase 1 (Antfly migration) — merged |
| 140 | +- Antfly server running |
| 141 | +- Understanding of MCP server lifecycle (how/when it starts) |