| 1 | +# SUPERSEDED — Batched Vector Inserts for DiskANN |
| 2 | + |
| 3 | +> **Superseded by:** `_todo/20260211-insert-profiling.md` and `_todo/20260211-serial-batch-insert.md` |
| 4 | +> **Reason:** Parallel beam search via multiple WAL readers is not viable — all target DB drivers (better-sqlite3, @photostructure/sqlite, node:sqlite) are single-threaded with a single connection. The graduated serial optimization approach replaces this plan. See `_research/write-performance-analysis.md` for full options analysis. |
| 5 | + |
| 6 | +## Summary |
| 7 | + |
| 8 | +DiskANN inserts are 20-100x slower than sqlite-vec, making PhotoStructure imports unacceptably slow. Users import 100-1000 photos at a time, often from the same event (high visual similarity). Implement a `diskann_insert_batch()` API that parallelizes beam search and deduplicates neighbor updates to achieve a 10-50x speedup. Memory stays low (~5MB for a 500-vector batch) by buffering only search results, not the entire index. |
| 9 | + |
| 10 | +## Current Phase |
| 11 | + |
| 12 | +- [ ] Research & Planning |
| 13 | +- [ ] Test Design |
| 14 | +- [ ] Implementation Design |
| 15 | +- [ ] Test-First Development |
| 16 | +- [ ] Implementation |
| 17 | +- [ ] Integration |
| 18 | +- [ ] Cleanup & Documentation |
| 19 | +- [ ] Final Review |
| 20 | + |
| 21 | +## Required Reading |
| 22 | + |
| 23 | +- `CLAUDE.md` - Project conventions |
| 24 | +- `TDD.md` - Testing methodology |
| 25 | +- `DESIGN-PRINCIPLES.md` - C coding standards |
| 26 | +- `src/diskann_insert.c` - Current serial insert implementation (lines 141-377) |
| 27 | +- `src/diskann_search.c` - Beam search algorithm (diskann_search_internal, lines 211-550) |
| 28 | +- `_research/parallel-indexing.md` - Academic paper analysis and design rationale |
| 29 | +- `_done/20260210-diskann-recall-fix-investigation.md` - Block size fix validation (context) |
| 30 | + |
| 31 | +## Description |
| 32 | + |
| 33 | +**Problem:** PhotoStructure users import photos in batches (100-1000), but serial `diskann_insert()` is prohibitively slow: |
| 34 | + |
| 35 | +- 500 photos × 320ms = 160 seconds (~3 minutes) |
| 36 | +- sqlite-vec does the same in 2-4 seconds |
| 37 | +- 20-100x performance gap makes DiskANN impractical for production use |
| 38 | + |
| 39 | +**Root cause:** Each insert performs: |
| 40 | + |
| 41 | +1. Beam search (~100ms) - can be parallelized |
| 42 | +2. SAVEPOINT overhead (~2ms) - wasteful for batches |
| 43 | +3. Neighbor BLOB updates (~200ms) - massive duplication when vectors are similar |
| 44 | + |
| 45 | +**Key insight:** Photos from the same import have high visual similarity → their nearest neighbors overlap heavily. Neighbor A might appear in 40 different photos' neighbor lists, yet serial insert updates its BLOB 40 separate times. Batching allows deduplication: update A once with all 40 back-edges. |
| 46 | + |
| 47 | +**Constraints:** |
| 48 | + |
| 49 | +- **Incremental updates only** - Add to existing 100k-500k vector index, no rebuilds |
| 50 | +- **Low memory** - Users have 4GB+ RAM, but can't load entire index (100MB-1GB) |
| 51 | +- **Batch sizes: 100-1000** - Not 300k bulk loads |
| 52 | +- **PhotoStructure use case** - Photos from same import are visually similar (high neighbor overlap) |
| 53 | + |
| 54 | +**Success Criteria:** |
| 55 | + |
| 56 | +- 500 similar vectors: <10 seconds (vs ~160s) = **16-50x speedup** |
| 57 | +- 500 random vectors: <20 seconds (vs ~160s) = **8-15x speedup** |
| 58 | +- Memory: <10MB for batch processing |
| 59 | +- Recall@10 >= 75% (maintain graph quality) |
| 60 | +- Memory-safe and thread-safe (ASan clean, plus a ThreadSanitizer run, since ASan does not detect data races), no leaks (Valgrind clean) |
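The Recall@10 criterion can be checked with a small helper. The following is a minimal sketch, not code from this repository: `recall_at_k` is a hypothetical name, and it assumes the IDs in each array are unique and that the ground-truth list comes from an exact (brute-force) search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fraction of the k true nearest neighbors that also
 * appear among the k IDs returned by the index. Order does not matter.
 * Assumes IDs are unique within each array. */
double recall_at_k(const int64_t *found, const int64_t *truth, size_t k) {
    assert(k > 0);
    size_t hits = 0;
    for (size_t i = 0; i < k; i++) {
        for (size_t j = 0; j < k; j++) {
            if (truth[i] == found[j]) {
                hits++;
                break; /* count each true neighbor at most once */
            }
        }
    }
    return (double)hits / (double)k;
}
```

A benchmark would average this value over many queries and compare the mean against the 0.75 threshold.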
| 61 | + |
| 62 | +## Tribal Knowledge |
| 63 | + |
| 64 | +### Current Serial Insert Bottleneck (Validated 2026-02-11) |
| 65 | + |
| 66 | +Each `diskann_insert()` (lines 141-377 of diskann_insert.c): |
| 67 | + |
| 68 | +1. SELECT random start node |
| 69 | +2. **BEGIN SAVEPOINT** (lines 207-213: required because writable BLOBs need transaction context) |
| 70 | +3. `diskann_search_internal()` - beam search, ~200 BLOB reads (~100ms) |
| 71 | +4. INSERT shadow row |
| 72 | +5. Phase 1: Add edges to new node |
| 73 | +6. Phase 2: Update neighbor edges (~50 neighbors, BLOB flush per neighbor, ~200ms) |
| 74 | +7. RELEASE SAVEPOINT |
| 75 | + |
| 76 | +**Bottlenecks:** |
| 77 | + |
| 78 | +- 500 SAVEPOINT/RELEASE cycles (transaction overhead) |
| 79 | +- Neighbor overlap: Similar vectors share 60-80% of neighbors → the most-shared BLOBs are updated 20-50 times |
| 80 | +- Example: 500 photos × 50 neighbors = 25,000 BLOB updates, but only ~6,000 unique neighbors |
| 81 | + |
| 82 | +### Why Batching Works (Academic Support) |
| 83 | + |
| 84 | +**FreshDiskANN (2021):** |
| 85 | + |
| 86 | +- Designed for high-throughput streaming inserts |
| 87 | +- Lazy consolidation: buffer inserts, deduplicate neighbors, merge |
| 88 | +- Reports 10-100x speedup vs individual inserts |
| 89 | +- Exactly PhotoStructure's use case |
| 90 | + |
| 91 | +**DiskANN paper (NeurIPS 2019), Section 5.3:** |
| 92 | + |
| 93 | +- "Batching updates allows amortizing graph mutations across multiple inserts" |
| 94 | +- Neighbor overlap can be exploited for performance |
| 95 | + |
| 96 | +**ParlayANN (PPoPP 2024):** |
| 97 | + |
| 98 | +- Proves per-node locking has negligible contention (~1/n probability) |
| 99 | +- Updating a neighbor once with 40 back-edges produces the same graph as 40 individual updates, but is much faster |
| 100 | + |
| 101 | +### SQLite WAL Concurrency |
| 102 | + |
| 103 | +- WAL mode: unlimited concurrent readers + 1 writer |
| 104 | +- Each thread can open read-only connection for parallel search |
| 105 | +- Writes must be serialized through main connection |
| 106 | +- Current codebase already uses WAL mode |
| 107 | + |
| 108 | +### Memory Constraints (Validated) |
| 109 | + |
| 110 | +For 500 vectors @ 256D: |
| 111 | + |
| 112 | +- Batch input: 500 × 256 × 4 = 512 KB |
| 113 | +- Search results: 500 × 200 nodes × 12 bytes = 1.2 MB (just IDs, not vectors!) |
| 114 | +- Neighbor dedup map: ~6,000 neighbors × 40 bytes = 240 KB |
| 115 | +- **Total: ~2-5 MB** (not entire index - existing index stays on disk) |
| 116 | + |
| 117 | +1000 vectors @ 256D: ~5-10 MB |
| 118 | + |
| 119 | +**Critical:** We do NOT load the entire index into RAM. Existing vectors (100k-500k) stay in SQLite BLOBs and are read during beam search via BlobSpots, just like serial insert does. |
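The memory arithmetic above can be reproduced with a short helper. This is a sketch under the document's own per-item sizes (4 bytes per float32 component, 12 bytes per search-result entry, 40 bytes per dedup-map entry); the function name `batch_memory_bytes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimator for the in-memory footprint of one batch.
 * The per-item byte sizes are the figures quoted in this document,
 * not measured values. */
size_t batch_memory_bytes(size_t count, size_t dims,
                          size_t visited_per_search,
                          size_t unique_neighbors) {
    assert(dims > 0);
    size_t input   = count * dims * 4;                /* float32 vectors */
    size_t results = count * visited_per_search * 12; /* IDs + distances */
    size_t dedup   = unique_neighbors * 40;           /* neighbor dedup map */
    return input + results + dedup;
}
```

For 500 vectors at 256 dimensions, 200 visited nodes per search, and 6,000 unique neighbors, this returns 1,952,000 bytes, which matches the low end of the ~2-5 MB estimate.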
| 120 | + |
| 121 | +### Prior Investigation Context |
| 122 | + |
| 123 | +**Original TPP (20260210-parallel-graph-construction.md):** |
| 124 | + |
| 125 | +- Intern designed in-memory full rebuild (loading ALL vectors into RAM) |
| 126 | +- Dismissed incremental batch insert as "only 2-3x speedup, dead-end architecture" |
| 127 | +- **Analysis was wrong:** Missed neighbor deduplication benefit (3-5x) and underestimated parallelism (8-24x) |
| 128 | +- Approach was wrong for PhotoStructure (users don't need bulk rebuilds, they need fast incremental batches) |
| 129 | + |
| 130 | +**Validation findings (2026-02-11):** |
| 131 | + |
| 132 | +- Porting complexity verified: beam search 80% reusable, edge operations trivial to port |
| 133 | +- Test registration is manual (CRITICAL: tests must be registered in test_runner.c) |
| 134 | +- Pthread linking required: add `-lpthread` to Makefile |
| 135 | + |
| 136 | +### FreshDiskANN Comparison (2026-02-11) |
| 137 | + |
| 138 | +**Source examined:** `../DiskANN` repository (Rust rewrite, C++ on `cpp_main` branch) |
| 139 | + |
| 140 | +**FreshDiskANN architecture:** |
| 141 | + |
| 142 | +- **Streaming model:** `insert()`, `delete()`, `replace()`, `maintain()`, `needs_maintenance()` |
| 143 | +- **Lazy consolidation:** Inserts are buffered, graph updates deferred until `maintain()` is called |
| 144 | +- **Periodic maintenance:** When `needs_maintenance()` returns true, run consolidation pass |
| 145 | +- **Benefits:** Can handle continuous streaming inserts without blocking |
| 146 | +- **Complexity:** Background maintenance, non-deterministic latency, complex state management |
| 147 | + |
| 148 | +**Our approach (explicit batching):** |
| 149 | + |
| 150 | +- **Batch model:** `diskann_insert_batch(vectors[], count)` - process entire batch in one call |
| 151 | +- **Immediate processing:** Batch processed synchronously, no deferred maintenance |
| 152 | +- **Predictable latency:** User knows batch of 500 takes ~9 seconds, done |
| 153 | +- **Simpler code:** No background threads, no maintenance scheduling, no buffering state |
| 154 | +- **Better for PhotoStructure:** Photo imports are naturally batched (drag-and-drop folder), explicit batch call fits workflow |
| 155 | + |
| 156 | +**Decision:** Stick with explicit batching - simpler, more predictable, fits PhotoStructure's import workflow perfectly. FreshDiskANN's streaming model solves a different problem (continuous high-throughput inserts with no natural batching). |
| 157 | + |
| 158 | +## Solutions |
| 159 | + |
| 160 | +### SELECTED: Incremental Batch Insert with Parallel Search + Deduplication |
| 161 | + |
| 162 | +**Approach:** |
| 163 | + |
| 164 | +**Phase 1: Store Vectors (Serial, ~1s for 500)** |
| 165 | + |
| 166 | +- Single transaction: INSERT all N shadow rows |
| 167 | +- Write vectors to BLOBs with zero edges |
| 168 | +- Fast, no graph construction yet |
| 169 | + |
| 170 | +**Phase 2: Parallel Beam Search (Read-only, 8-24 threads)** |
| 171 | + |
| 172 | +- Each thread opens read-only SQLite connection (WAL allows unlimited readers) |
| 173 | +- Beam search for assigned slice of vectors |
| 174 | +- Existing index stays on disk, read via BLOBs (just like serial insert) |
| 175 | +- Store **results only** in memory (visited node IDs + distances, not vectors) |
| 176 | +- Time: ~3-4s for 500 vectors (vs 50s serial) |
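The "assigned slice of vectors" in Phase 2 could be computed as below. This is a minimal sketch of one possible partitioning (contiguous slices whose sizes differ by at most one); `slice_bounds` is a hypothetical name, and the sketch leaves out thread creation and SQLite connections entirely.

```c
#include <assert.h>

/* Hypothetical helper: compute the half-open range [*begin, *end) of batch
 * indices handled by worker t out of num_threads. The first
 * (count % num_threads) workers each take one extra vector. */
void slice_bounds(int count, int num_threads, int t, int *begin, int *end) {
    assert(count >= 0 && num_threads > 0 && t >= 0 && t < num_threads);
    int base  = count / num_threads;
    int extra = count % num_threads;
    *begin = t * base + (t < extra ? t : extra);
    *end   = *begin + base + (t < extra ? 1 : 0);
}
```

With 500 vectors and 8 workers, worker 0 gets indices 0 to 62 and worker 7 gets indices 438 to 499, so every vector is searched exactly once.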
| 177 | + |
| 178 | +**Phase 3: Deduplicated Edge Updates (Serial, one transaction)** |
| 179 | + |
| 180 | +- Build neighbor map: neighbor_id → list of back-edges from batch |
| 181 | +- **Deduplicate:** Neighbor A appears in 40 photos → update A's BLOB once with all 40 edges |
| 182 | +- Single transaction wraps all edge mutations |
| 183 | +- Time: ~4-6s for 500 vectors (vs 100s with duplication) |
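One array-based way to build the neighbor map is to collect all back-edges, sort them by neighbor ID, and walk the runs. The sketch below assumes that design. `BackEdge`, `NeighborUpdateFn`, and `for_each_neighbor` are hypothetical names, and the callback stands in for whatever code would rewrite a neighbor's BLOB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: "existing node neighbor_id should gain an edge to
 * the newly inserted node new_id". */
typedef struct BackEdge {
    int64_t neighbor_id;
    int64_t new_id;
} BackEdge;

/* Called once per distinct neighbor with all of its pending back-edges. */
typedef void (*NeighborUpdateFn)(int64_t neighbor_id, const BackEdge *edges,
                                 size_t n, void *ctx);

static int cmp_back_edge(const void *a, const void *b) {
    const BackEdge *x = a, *y = b;
    if (x->neighbor_id != y->neighbor_id)
        return x->neighbor_id < y->neighbor_id ? -1 : 1;
    if (x->new_id != y->new_id)
        return x->new_id < y->new_id ? -1 : 1;
    return 0;
}

/* Sorts edges in place, then invokes update (if non-NULL) once per distinct
 * neighbor_id. Returns the number of distinct neighbors, i.e. the number of
 * BLOB writes the batch would need. */
size_t for_each_neighbor(BackEdge *edges, size_t n,
                         NeighborUpdateFn update, void *ctx) {
    assert(edges != NULL || n == 0);
    if (n == 0) return 0;
    qsort(edges, n, sizeof *edges, cmp_back_edge);
    size_t groups = 0;
    for (size_t i = 0; i < n;) {
        size_t j = i + 1;
        while (j < n && edges[j].neighbor_id == edges[i].neighbor_id) j++;
        if (update) update(edges[i].neighbor_id, &edges[i], j - i, ctx);
        groups++;
        i = j;
    }
    return groups;
}
```

Sorting costs O(n log n) on roughly 25,000 small records and needs no hash table. A hashmap would also work and may be preferable if edges have to be grouped incrementally as searches finish.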
| 184 | + |
| 185 | +**Why this works:** |
| 186 | + |
| 187 | +- Parallelism: 8-24x speedup on beam search (CPU-bound) |
| 188 | +- Deduplication: 3-5x reduction in BLOB writes (neighbor overlap) |
| 189 | +- Transaction batching: 2x speedup (avoid 500 SAVEPOINTs) |
| 190 | +- **Combined: 16-64x speedup** depending on vector similarity |
| 191 | + |
| 192 | +**Memory:** Only batch results (~2-5 MB), not entire index |
| 193 | + |
| 194 | +**API:** |
| 195 | + |
| 196 | +```c |
| 197 | +typedef struct DiskAnnBatchConfig { |
| 198 | + int num_threads; // 0 = auto-detect |
| 199 | + void (*progress)(int, int, void*); // progress callback |
| 200 | + void *progress_user_data; |
| 201 | +} DiskAnnBatchConfig; |
| 202 | + |
| 203 | +int diskann_insert_batch(DiskAnnIndex *idx, |
| 204 | + const int64_t *ids, |
| 205 | + const float **vectors, |
| 206 | + int count, |
| 207 | + const DiskAnnBatchConfig *config); |
| 208 | +``` |
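A caller-side sketch of the config follows. It repeats the struct from the API above so the block compiles on its own. `resolve_num_threads` and `print_progress` are hypothetical helpers, and reading the callback arguments as `(done, total, user_data)` is an assumption, since the signature does not name them. `sysconf(_SC_NPROCESSORS_ONLN)` is available on Linux and macOS but is not guaranteed by POSIX.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

typedef struct DiskAnnBatchConfig {
    int num_threads;                    /* 0 = auto-detect */
    void (*progress)(int, int, void *); /* progress callback */
    void *progress_user_data;
} DiskAnnBatchConfig;

/* Hypothetical: turn "0 = auto-detect" into a concrete worker count. */
int resolve_num_threads(const DiskAnnBatchConfig *config) {
    if (config != NULL && config->num_threads > 0)
        return config->num_threads;
    long n = sysconf(_SC_NPROCESSORS_ONLN); /* online CPU count, or -1 */
    return n > 0 ? (int)n : 1;
}

/* Hypothetical callback; assumes the arguments are (done, total, user_data)
 * and that user_data is an optional label string. */
void print_progress(int done, int total, void *user_data) {
    const char *label = user_data;
    assert(total > 0);
    fprintf(stderr, "%s: %d/%d (%d%%)\n", label ? label : "batch",
            done, total, done * 100 / total);
}
```

A caller would fill in `DiskAnnBatchConfig cfg = {0, print_progress, "import"};` and pass `&cfg` as the last argument of `diskann_insert_batch()`.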
| 209 | + |
| 210 | +**Pros:** |
| 211 | + |
| 212 | +- ✅ Right use case (incremental batches, not bulk rebuilds) |
| 213 | +- ✅ Low memory (~5MB for 500 vectors, not entire index) |
| 214 | +- ✅ Significant speedup (10-50x) from parallelism + deduplication |
| 215 | +- ✅ Simpler than full rebuild (~600 lines vs ~1200 lines) |
| 216 | +- ✅ Closes gap to sqlite-vec (from 100x slower to 2-3x slower) |
| 217 | + |
| 218 | +**Cons:** |
| 219 | + |
| 220 | +- ⚠️ Edge updates still serial (SQLite single-writer limitation) |
| 221 | +- ⚠️ Requires WAL mode (already enabled) |
| 222 | + |
| 223 | +**Status:** Chosen approach |
| 224 | + |
| 225 | +--- |
| 226 | + |
| 227 | +### REJECTED: In-Memory Full Rebuild |
| 228 | + |
| 229 | +Original intern's approach: Load all vectors into RAM, build graph in memory, serialize back. |
| 230 | + |
| 231 | +**Why rejected:** |
| 232 | + |
| 233 | +- ❌ Wrong use case (bulk load, not incremental batches) |
| 234 | +- ❌ High memory (190MB+ for 300k vectors - PhotoStructure users don't have "gobs of RAM") |
| 235 | +- ❌ Wasteful for adding 500 vectors to existing 300k index (rebuilds everything) |
| 236 | +- ❌ More complex implementation (~1200 lines) |
| 237 | + |
| 238 | +**Status:** Rejected |
| 239 | + |
| 240 | +## Tasks |
| 241 | + |
| 242 | +### High-Level Phases (Detailed tasks in implementation plan) |
| 243 | + |
| 244 | +- [ ] **Phase 1:** Serial batch (no threading) - prove deduplication gives 2-4x speedup |
| 245 | +- [ ] **Phase 2:** Add parallel search - prove 8-24x speedup from parallelism |
| 246 | +- [ ] **Phase 3:** Integration, benchmarks, TypeScript wrappers |
| 247 | + |
| 248 | +### Key Implementation Files |
| 249 | + |
| 250 | +**To create:** |
| 251 | + |
| 252 | +- `src/diskann_batch.h` - API declarations, internal structs |
| 253 | +- `src/diskann_batch.c` - Three-phase implementation (~600 lines) |
| 254 | +- `tests/c/test_batch.c` - Batch insert tests (~400 lines) |
| 255 | + |
| 256 | +**To modify:** |
| 257 | + |
| 258 | +- `src/diskann.h` - Add batch API |
| 259 | +- `src/diskann_insert.c` - Make insert_shadow_row non-static |
| 260 | +- `Makefile` - Add diskann_batch.c, add -lpthread |
| 261 | +- `tests/c/test_runner.c` - Register batch tests (CRITICAL: manual registration required) |
| 262 | + |
| 263 | +**Verification:** |
| 264 | + |
| 265 | +```bash |
| 266 | +# Phase 1: Serial batch |
| 267 | +make test # +12 tests passing |
| 268 | +make valgrind # No leaks |
| 269 | + |
| 270 | +# Phase 2: Parallel search |
| 271 | +make asan # CRITICAL: memory errors under threading (ASan does not detect data races; that needs ThreadSanitizer) |
| 272 | +make valgrind # No leaks with threading |
| 273 | + |
| 274 | +# Phase 3: Final |
| 275 | +make test-stress # 10-50x speedup demonstrated |
| 276 | +npm run test:ts # TypeScript integration works |
| 277 | +``` |
| 278 | + |
| 279 | +## Notes |
| 280 | + |
| 281 | +### Handoff Notes (2026-02-11) |
| 282 | + |
| 283 | +**What was done this session:** |
| 284 | + |
| 285 | +1. **Validated original TPP (20260210-parallel-graph-construction.md):** |
| 286 | + - Intern's full rebuild approach was wrong for PhotoStructure use case |
| 287 | + - Missed neighbor deduplication (3-5x speedup) |
| 288 | + - Underestimated parallelism benefit (8-24x speedup) |
| 289 | + |
| 290 | +2. **Identified correct approach:** |
| 291 | + - Incremental batch insert (FreshDiskANN-style) |
| 292 | + - Parallel search + deduplicated edge updates |
| 293 | + - Low memory (only batch results, not entire index) |
| 294 | + |
| 295 | +3. **Created detailed implementation plan:** |
| 296 | + - Saved at `/home/mrm/.claude/plans/snuggly-imagining-puppy.md` |
| 297 | + - 3 phases: serial batch, parallel search, integration |
| 298 | + - ~700 lines implementation + ~400 lines tests |
| 299 | + |
| 300 | +4. **Updated research documentation:** |
| 301 | + - `_research/parallel-indexing.md` - explains approach for CS students |
| 302 | + - Includes academic paper analysis and PhotoStructure context |
| 303 | + |
| 304 | +**Key discoveries:** |
| 305 | + |
| 306 | +- Neighbor deduplication is the key win for similar vectors (PhotoStructure's use case) |
| 307 | +- SQLite WAL allows unlimited read-only connections (enables parallelism) |
| 308 | +- Memory is NOT a constraint for batch sizes (5-10 MB is nothing on 4GB machines) |
| 309 | + |
| 310 | +**Next session should:** |
| 311 | + |
| 312 | +- Start Research & Planning phase |
| 313 | +- Read required files to understand current implementation |
| 314 | +- Validate approach against actual PhotoStructure usage patterns |
| 315 | +- Design detailed data structures and algorithm flow |
| 316 | + |
| 317 | +--- |
| 318 | + |
| 319 | +### Session Update (2026-02-11 - Later) |
| 320 | + |
| 321 | +**Additional findings:** |
| 322 | + |
| 323 | +1. **FreshDiskANN source code examined** (`../DiskANN` repo): |
| 324 | + - Uses **lazy consolidation** model: `insert()` + `maintain()` + `needs_maintenance()` |
| 325 | + - Streaming approach: continuous inserts, periodic consolidation when threshold hit |
| 326 | + - Different from our explicit batching: we process entire batch immediately |
| 327 | + - **Our approach is simpler for PhotoStructure:** predictable latency, no background maintenance |
| 328 | + |
| 329 | +2. **Memory constraints validated:** |
| 330 | + - User confirmed: 4GB+ RAM typical, 1-10 MB for batch processing is "nothing" |
| 331 | + - 1000 × 256 × 4 = 1 MB for input vectors |
| 332 | + - Total ~5-10 MB including search results and dedup map |
| 333 | + - **No need to be conservative** - was overthinking memory |
| 334 | + |
| 335 | +3. **Problem statement finalized:** |
| 336 | + - PhotoStructure inserts are 20-100x slower than sqlite-vec (CRITICAL business problem) |
| 337 | + - Batch sizes: 100-1000 vectors from photo imports |
| 338 | + - High similarity expected (same event photos) → high neighbor overlap |
| 339 | + - Target: Close gap from 100x slower to 2-3x slower |
| 340 | + |
| 341 | +4. **Documentation updated:** |
| 342 | + - Old TPP (20260210-parallel-graph-construction.md) marked as SUPERSEDED |
| 343 | + - Research doc (`_research/parallel-indexing.md`) updated to reflect batch approach |
| 344 | + - Implementation plan at `/home/mrm/.claude/plans/snuggly-imagining-puppy.md` |
| 345 | + |
| 346 | +**Key decisions:** |
| 347 | + |
| 348 | +- **Explicit batching** (our approach) vs **lazy consolidation** (FreshDiskANN) |
| 349 | + - Pros: Simpler, predictable latency, no background threads |
| 350 | + - Cons: User must explicitly call batch insert (acceptable for PhotoStructure workflow) |
| 351 | + |
| 352 | +**Next session priorities:** |
| 353 | + |
| 354 | +1. Review FreshDiskANN paper details to ensure we're not missing critical optimizations |
| 355 | +2. Start Research & Planning: read current insert/search implementation |
| 356 | +3. Validate neighbor deduplication assumption with small-scale test |
| 357 | +4. Design NeighborUpdateMap data structure (hashmap or array-based?) |