|
| 1 | +# Part 1.2 — Implement AntflyVectorStore |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Create an `AntflyVectorStore` class that implements the `VectorStore` interface using `@antfly/sdk`. |
| 6 | +This is the core swap — everything else builds on it. |
| 7 | + |
| 8 | +## New file |
| 9 | + |
| 10 | +`packages/core/src/vector/antfly-store.ts` |
| 11 | + |
| 12 | +## Interface to implement |
| 13 | + |
| 14 | +From `types.ts`: |
| 15 | + |
| 16 | +```typescript |
| 17 | +interface VectorStore { |
| 18 | + readonly path: string; |
| 19 | + initialize(): Promise<void>; |
| 20 | + add(documents: EmbeddingDocument[], embeddings: number[][]): Promise<void>; |
| 21 | + search(queryEmbedding: number[], options?: SearchOptions): Promise<SearchResult[]>; |
| 22 | + get(id: string): Promise<EmbeddingDocument | null>; |
| 23 | + delete(ids: string[]): Promise<void>; |
| 24 | + count(): Promise<number>; |
| 25 | + optimize(): Promise<void>; |
| 26 | + close(): Promise<void>; |
| 27 | +} |
| 28 | +``` |
| 29 | + |
| 30 | +Plus the concrete-only methods on the current `LanceDBVectorStore`: |
| 31 | +- `getAll(): Promise<EmbeddingDocument[]>` |
| 32 | +- `searchByDocumentId(id: string, options?): Promise<SearchResult[]>` |
| 33 | +- `clear(): Promise<void>` |
| 34 | + |
| 35 | +## Design |
| 36 | + |
| 37 | +### Constructor |
| 38 | + |
| 39 | +```typescript |
| 40 | +interface AntflyConfig { |
| 41 | + baseUrl: string; // default: 'http://localhost:8080' |
| 42 | + table: string; // e.g., 'dev-agent-code', 'dev-agent-git', 'dev-agent-github' |
| 43 | + indexName: string; // e.g., 'content' |
| 44 | + template?: string; // Handlebars template for embedding, default: '{{text}}' |
| 45 | + model?: string; // Termite model — read from config, default: 'BAAI/bge-small-en-v1.5' |
| 46 | +} |
| 47 | +``` |
| 48 | + |
| 49 | +The `model` field comes from `~/.dev-agent/config.json`, set by `dev setup --model`. |
| 50 | +This flows into the antfly table creation (embedding index config). |
| 51 | + |
| 52 | +### Search interface design (BLOCKER resolution) |
| 53 | + |
| 54 | +The `VectorStore.search()` interface takes `queryEmbedding: number[]`, but antfly needs |
| 55 | +query text, not a pre-computed vector. |
| 56 | + |
| 57 | +**Decision:** Add `searchText()` to `AntflyVectorStore` as a concrete method (not on the |
| 58 | +`VectorStore` interface). The `VectorStorage` facade calls `searchText()` directly since |
| 59 | +it already receives the query as a string. |
| 60 | + |
| 61 | +The old `search(queryEmbedding: number[])` remains on the interface for type compatibility
| 62 | +but throws an error directing callers to `searchText()` if it is invoked.
| 63 | +Today only the `VectorStorage` facade calls `search()`, and Part 1.3 updates the facade to
| 64 | +call `searchText()` instead, so no caller should reach the throwing method.
| 65 | + |
| 66 | +```typescript |
| 67 | +class AntflyVectorStore implements VectorStore { |
| 68 | + // Interface method — kept for compatibility, not called in practice |
| 69 | + async search(queryEmbedding: number[], options?: SearchOptions): Promise<SearchResult[]> { |
| 70 | + throw new Error('Use searchText() — antfly handles embeddings internally'); |
| 71 | + } |
| 72 | + |
| 73 | + // The real search method — called by VectorStorage facade |
| 74 | + async searchText(query: string, options?: SearchOptions): Promise<SearchResult[]> { |
| 75 | + const results = await this.client.query({ |
| 76 | + table: this.config.table, |
| 77 | + semantic_search: query, |
| 78 | + indexes: [this.config.indexName], |
| 79 | + limit: options?.limit ?? 10, |
| 80 | + }); |
| 81 | + return this.mapHitsToSearchResults(results.hits); |
| 82 | + } |
| 83 | +} |
| 84 | +``` |
| 85 | + |
| 86 | +### searchByDocumentId behavioral change |
| 87 | + |
| 88 | +**Acknowledged tradeoff:** Currently, `searchByDocumentId` fetches the stored embedding |
| 89 | +vector and does a vector-space nearest-neighbor search. After migration, it becomes |
| 90 | +"lookup doc → search with its text." This may produce slightly different results because |
| 91 | +text-based search goes through antfly's tokenization + embedding pipeline rather than using |
| 92 | +the exact stored vector. |
| 93 | + |
| 94 | +In practice this should be **equivalent or better**: the text goes through the same
| 95 | +embedding model, and hybrid search (BM25 + vector), when enabled, adds keyword matching that
| 96 | +pure vector search lacked. The `dev_inspect` tool (primary consumer) finds similar code files,
| 97 | +where text-based similarity is a natural fit.
| 98 | + |
| 99 | +### Method implementations |
| 100 | + |
| 101 | +**`initialize()`** |
| 102 | +- Create table with embedding index if not exists |
| 103 | +- Handle "already exists" gracefully (idempotent) |
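
The idempotent creation described above could look roughly like the sketch below. `TableClient`, `createTable`, and the "already exists" message check are hypothetical stand-ins, not the real `@antfly/sdk` surface; the spike should confirm how the server actually reports an existing table.

```typescript
// Hypothetical minimal client surface for illustration; NOT the real @antfly/sdk API.
interface TableClient {
  createTable(name: string): Promise<void>;
}

// Assumption: an existing table is reported via an Error whose message
// contains "already exists". Replace with the real status/code check.
function isAlreadyExists(error: unknown): boolean {
  return error instanceof Error && /already exists/i.test(error.message);
}

// Create the table, treating "already exists" as success so the call is idempotent.
async function ensureTable(
  client: TableClient,
  table: string,
): Promise<'created' | 'existing'> {
  try {
    await client.createTable(table);
    return 'created';
  } catch (error) {
    if (isAlreadyExists(error)) return 'existing';
    throw error; // connection failures and other errors still surface
  }
}
```

Only the "already exists" case is swallowed, so a refused connection still produces an error the caller can report.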
| 104 | + |
| 105 | +**`add(documents, embeddings)`** |
| 106 | +- Ignore `embeddings` parameter — antfly auto-embeds |
| 107 | +- Convert `EmbeddingDocument[]` to antfly batch format: `{ [id]: { text, metadata } }` |
| 108 | +- Batch in chunks of 500 (antfly may have payload limits) |
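
A minimal sketch of the conversion and chunking steps above. The `EmbeddingDocument` shape shown here (`id`, `text`, `metadata`) is assumed from the batch format in the bullet list; the real type lives in `types.ts`. `toBatches` and `AntflyBatch` are hypothetical names.

```typescript
// Assumed shape for illustration; the real EmbeddingDocument is defined in types.ts.
interface EmbeddingDocument {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
}

// Keyed batch format from the plan: { [id]: { text, metadata } }
type AntflyBatch = Record<string, { text: string; metadata: Record<string, unknown> }>;

// Split documents into keyed batches of at most `chunkSize` entries each.
function toBatches(documents: EmbeddingDocument[], chunkSize = 500): AntflyBatch[] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const batches: AntflyBatch[] = [];
  for (let start = 0; start < documents.length; start += chunkSize) {
    const batch: AntflyBatch = {};
    for (const doc of documents.slice(start, start + chunkSize)) {
      // A repeated id within one chunk overwrites the earlier entry (last write wins).
      batch[doc.id] = { text: doc.text, metadata: doc.metadata };
    }
    batches.push(batch);
  }
  return batches;
}
```

Each returned batch would then be sent in its own request, so a single oversized `add()` call never exceeds whatever payload limit the spike uncovers.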
| 109 | + |
| 110 | +**`searchText(query, options)`** |
| 111 | +- Use `client.query()` with `semantic_search` (and optionally `full_text_search`) |
| 112 | +- Map antfly `hits` to `SearchResult[]` |
| 113 | + |
| 114 | +**`get(id)`** |
| 115 | +- Use `client.tables.lookup(table, id)` |
| 116 | +- Map to `EmbeddingDocument | null` |
| 117 | + |
| 118 | +**`delete(ids)`** |
| 119 | +- Use `client.tables.batch(table, { deletes: ids })` |
| 120 | + |
| 121 | +**`count()`** |
| 122 | +- Use `client.tables.get(table)` and extract doc count from stats |
| 123 | + |
| 124 | +**`getAll()`** |
| 125 | +- Use `client.tables.query(table, { limit: 10000 })` or paginate |
| 126 | +- If more than 10000 docs, paginate with offset (test this in spike) |
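
Offset pagination for `getAll()` might be sketched as follows. `fetchPage` is a hypothetical callback standing in for whichever query call the spike settles on; whether antfly supports an `offset` parameter at all is still an open question in this plan.

```typescript
// Hypothetical page fetcher: returns up to `limit` items starting at `offset`.
type FetchPage<T> = (offset: number, limit: number) => Promise<T[]>;

// Collect every item by requesting fixed-size pages until a short page arrives.
async function fetchAll<T>(fetchPage: FetchPage<T>, pageSize = 10000): Promise<T[]> {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer');
  }
  const all: T[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await fetchPage(offset, pageSize);
    for (const item of page) all.push(item);
    // A page shorter than pageSize (including an empty one) means nothing is left.
    if (page.length < pageSize) return all;
  }
}
```

Making `pageSize` a parameter also lets a test exercise the multi-page path with a few dozen documents instead of more than 10000.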
| 127 | + |
| 128 | +**`searchByDocumentId(id)`** |
| 129 | +- Lookup document by key → get its text → run `searchText()` with that text |
| 130 | +- Note: behavioral change from vector-based to text-based similarity (see above) |
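
The lookup-then-search flow can be sketched with the two operations injected as functions. `Doc`, `Hit`, and the function signatures are simplified placeholders; filtering out the source document itself is an assumption of this sketch, not something the plan specifies.

```typescript
// Simplified placeholder shapes; the real types are EmbeddingDocument and SearchResult.
interface Doc {
  id: string;
  text: string;
}
interface Hit {
  id: string;
  score: number;
}

// Look up a document by key, then search using its text as the query.
async function searchByDocumentId(
  id: string,
  lookup: (id: string) => Promise<Doc | null>,
  searchText: (query: string, limit: number) => Promise<Hit[]>,
  limit = 10,
): Promise<Hit[]> {
  const doc = await lookup(id);
  if (!doc) return []; // unknown id: nothing to compare against
  // Request one extra hit because the source document usually matches itself.
  const hits = await searchText(doc.text, limit + 1);
  // Assumption: callers want neighbours only, so drop the source document.
  return hits.filter((hit) => hit.id !== id).slice(0, limit);
}
```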
| 131 | + |
| 132 | +**`clear()`** |
| 133 | +- Drop and recreate the table |
| 134 | + |
| 135 | +**`optimize()`** — No-op (antfly manages compaction) |
| 136 | +**`close()`** — No-op (SDK is stateless HTTP) |
| 137 | + |
| 138 | +**`path` (readonly property)** |
| 139 | +- Return the antfly base URL + table name as identifier (e.g., `http://localhost:8080/dev-agent-code`) |
| 140 | +- Used for logging and stats, not for file I/O |
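
One way to build that identifier, as a small sketch. `storePath` is a hypothetical helper name; trimming trailing slashes and URL-encoding the table name are choices made here, not requirements from the plan.

```typescript
// Build a display identifier such as "http://localhost:8080/dev-agent-code".
function storePath(baseUrl: string, table: string): string {
  const trimmed = baseUrl.replace(/\/+$/, ''); // avoid "//" when baseUrl ends with "/"
  return `${trimmed}/${encodeURIComponent(table)}`;
}
```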
| 141 | + |
| 142 | +### Stats support |
| 143 | + |
| 144 | +`VectorStorage.getStats()` currently reads `dimension` and `modelName` from the embedder, |
| 145 | +and `storageSize` from the local LanceDB directory. After migration: |
| 146 | + |
| 147 | +- `dimension` — read from antfly config (known at table creation time from model) |
| 148 | +- `modelName` — read from antfly config (stored in `AntflyConfig.model`) |
| 149 | +- `storageSize` — antfly manages storage; report 0 or get from `client.tables.get()` if |
| 150 | + it exposes size stats. Spike will confirm. |
| 151 | +- `totalDocuments` — from `count()` |
| 152 | + |
| 153 | +Add a `getModelInfo()` method to `AntflyVectorStore`: |
| 154 | + |
| 155 | +```typescript |
| 156 | +getModelInfo(): { dimension: number; modelName: string } {
| 157 | + // `model` is optional, so resolve the default before indexing the lookup table.
| 158 | + const modelName = this.config.model ?? 'BAAI/bge-small-en-v1.5';
| 159 | + const dimension = MODEL_DIMENSIONS[modelName] ?? 384;
| 160 | + return { dimension, modelName };
| 161 | +}
| 162 | +``` |
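
The `MODEL_DIMENSIONS` lookup referenced above is not defined in this plan. One possible shape is sketched below; the table contents and the `resolveModelInfo` name are assumptions, and only models the project actually supports belong in the table.

```typescript
// Assumed lookup table mapping model name to embedding dimension.
const MODEL_DIMENSIONS: Record<string, number> = {
  'BAAI/bge-small-en-v1.5': 384,
  'BAAI/bge-base-en-v1.5': 768,
};

const DEFAULT_MODEL = 'BAAI/bge-small-en-v1.5';

// Resolve the model name first, then look up its dimension.
function resolveModelInfo(model?: string): { dimension: number; modelName: string } {
  const modelName = model ?? DEFAULT_MODEL;
  // Unknown models fall back to 384, mirroring the plan's default.
  return { dimension: MODEL_DIMENSIONS[modelName] ?? 384, modelName };
}
```

The 384 fallback can misreport the dimension for a model missing from the table, so it may be worth logging a warning in that case.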
| 163 | + |
| 164 | +## Tests |
| 165 | + |
| 166 | +New file: `packages/core/src/vector/__tests__/antfly-store.test.ts` |
| 167 | + |
| 168 | +Tests require a running antfly server. They are gated with `describe.runIf(process.env.ANTFLY_URL)`
| 169 | +so CI runs them in the docker-based job and they are skipped locally when `ANTFLY_URL` is unset.
| 170 | +
| 171 | +Use a dedicated test table (`test-antfly-{random}`) and clean up after each test.
| 172 | + |
| 173 | +| Test | Description | |
| 174 | +|------|-------------| |
| 175 | +| creates table on initialize | Idempotent table creation | |
| 176 | +| inserts and retrieves documents | batch insert → lookup by key | |
| 177 | +| upserts on duplicate key | insert key X, re-insert with different text → second version stored | |
| 178 | +| searches by semantic query | insert → searchText → verify relevance | |
| 179 | +| handles hybrid search | BM25 + vector returns combined results | |
| 180 | +| deletes documents | insert → delete → lookup returns null | |
| 181 | +| counts documents | insert N → count returns N | |
| 182 | +| gets all documents | insert → getAll → verify all returned | |
| 183 | +| paginates getAll for large sets | insert 100+ with a page size below that count → getAll returns all |
| 184 | +| searches by document ID | insert A,B → searchByDocumentId(A) → B appears if similar | |
| 185 | +| clears all data | insert → clear → count returns 0 | |
| 186 | +| returns model info | getModelInfo() returns dimension + model name | |
| 187 | +| handles empty table search | search on empty table → returns [] | |
| 188 | +| handles missing server gracefully | Connection refused → meaningful error | |
| 189 | +| search(embedding) throws | Direct call to search() with vector → throws with guidance | |
| 190 | + |
| 191 | +## Exit criteria |
| 192 | + |
| 193 | +- `AntflyVectorStore` passes all tests |
| 194 | +- `searchText()` is the primary search method, `search()` throws |
| 195 | +- `searchByDocumentId` uses text-based similarity (behavioral change documented) |
| 196 | +- `getModelInfo()` provides dimension + model for stats |
| 197 | +- No antfly-specific concepts leak above this layer |