|
| 1 | +# Serial Batch Insert Optimization |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +Optimize DiskANN insert throughput for batch workloads using single-connection serial techniques: transaction batching, persistent cache, lazy back-edges, neighbor deduplication, and intra-batch candidates. Target: 4-12x speedup over serial `diskann_insert()` for 500-vector batches. No threading — all target DB drivers are single-connection. |
| 6 | + |
| 7 | +## Current Phase |
| 8 | + |
| 9 | +- [x] Research & Planning |
| 10 | +- [x] Test Design |
| 11 | +- [x] Implementation Design |
| 12 | +- [x] Test-First Development |
| 13 | +- [x] Implementation |
| 14 | +- [x] Integration |
| 15 | +- [ ] Cleanup & Documentation — pending Phase 1b, 2, 3 |
| 16 | +- [ ] Final Review — pending benchmarks + remaining phases |
| 17 | + |
| 18 | +## Required Reading |
| 19 | + |
| 20 | +- `CLAUDE.md` — Project conventions |
| 21 | +- `TDD.md` — Testing methodology |
| 22 | +- `DESIGN-PRINCIPLES.md` — C coding standards |
| 23 | +- `src/diskann_insert.c` — Current insert (lines 177-377) |
| 24 | +- `src/diskann_search.c` — Beam search internals |
| 25 | +- `src/diskann_cache.c` — BlobCache implementation |
| 26 | +- `_research/write-performance-analysis.md` — Full options analysis |
| 27 | +- `_todo/20260211-insert-profiling.md` — Phase 0 profiling results (prerequisite) |
| 28 | + |
| 29 | +## Description |
| 30 | + |
| 31 | +**Problem:** Serial `diskann_insert()` is too slow for batch workloads. PhotoStructure imports 100-1000 photos at a time. |
| 32 | + |
| 33 | +**Constraint:** Single SQLite connection only. No parallel WAL readers. All optimization must be serial. |
| 34 | + |
| 35 | +**Success Criteria:** 500-vector batch into 10k+ index at >= 300 inserts/sec (vs current 189/sec) with recall >= 85%. |
| 36 | + |
| 37 | +**Prerequisite complete:** `_todo/20260211-insert-profiling.md` — profiling data collected at 10k scale, decision gates evaluated below. |
| 38 | + |
| 39 | +## Tribal Knowledge |
| 40 | + |
| 41 | +- SAVEPOINT per insert is required because writable BLOBs need transaction context |
| 42 | +- BlobCache is created/destroyed per insert (lines 231-236, 369-371) — huge waste for batches |
| 43 | +- `insert_shadow_row()` calls `sqlite3_mprintf` + `sqlite3_prepare_v2` every time — should be prepared once |
| 44 | +- Phase 2 (back-edges, lines 311-330) flushes each neighbor BLOB individually |
| 45 | +- FreshDiskANN paper: collect back-edges in delta structure, apply in single pass |
| 46 | +- Reference impl: intra-batch candidates use CPU-only distance computation (no I/O) |
| 47 | + |
| 48 | +### Profiling Results (10k scale, post random_start fix) |
| 49 | + |
| 50 | +| Phase | Avg (us) | % of Total | Notes | |
| 51 | +| ------------------- | --------- | ---------- | ------------------------------------- | |
| 52 | +| search | 2,970 | 56.1% | Beam search — inherent cost, dominant | |
| 53 | +| phase2 (back-edges) | 1,624 | 30.7% | Acceptance rate 13% at 10k | |
| 54 | +| phase1 (fwd edges) | 505 | 9.5% | Low | |
| 55 | +| random_start | 46 | 0.9% | **Fixed** (was 26% / 1.5ms) | |
| 56 | +| savepoint | ~1 | 0.02% | Negligible | |
| 57 | +| **Total** | **5,291** | **100%** | **189 inserts/sec** | |
| 58 | + |
| 59 | +- **Cache hits: 0%** — BlobCache created/destroyed per-insert, ~130 nodes visited per insert |
| 60 | +- **Similar vectors 15% faster** — search converges faster (977us vs 2,970us), but phase2 33% more expensive |
| 61 | +- Source: `_todo/20260211-insert-profiling.md` Session 2026-02-11 |
| 62 | + |
| 63 | +## Solutions |
| 64 | + |
| 65 | +### Phase 1: Transaction Batching + Persistent Cache |
| 66 | + |
| 67 | +Wrap N inserts in a single BEGIN/COMMIT. Persist BlobCache across all inserts in the batch. Cache the prepared INSERT statement. |
| 68 | + |
| 69 | +**API:** |
| 70 | + |
| 71 | +```c |
| 72 | +int diskann_begin_batch(DiskAnnIndex *idx); |
| 73 | +/* call diskann_insert() N times */ |
| 74 | +int diskann_end_batch(DiskAnnIndex *idx); |
| 75 | +``` |
| 76 | +
|
| 77 | +**Decision gate:** SAVEPOINT overhead = 0.02% → **SKIP transaction batching**. Cache hit rate = 0% → **DO persistent cache** (top priority). |
| 78 | +
|
| 79 | +### Phase 2: Lazy Back-Edges + Batch Repair |
| 80 | +
|
| 81 | +Skip Phase 2 (lines 311-330) during insert — store forward edges only. After all inserts, run a repair pass that scans modified neighbors and applies all missing back-edges in one sequential pass with deduplication. |
| 82 | +
|
| 83 | +**Decision gate:** Phase 2 = 31% of insert time → just above 30% threshold → **DO this** (second priority). Expected -30% total time but high complexity. |
| 84 | +
|
| 85 | +### Phase 3: Intra-Batch Candidates |
| 86 | +
|
| 87 | +For similar vectors (same event photos), batch members are likely neighbors of each other. Compute all-pairs distances within batch (CPU-only, no I/O), add close batch members as candidate neighbors. Reduces the number of nodes that must be found via expensive beam search. |
| 88 | +
|
| 89 | +**Decision gate:** Search = 56% of insert time → above 50% threshold → **CONSIDER this** (third priority). Only benefits similar-vector batches (PhotoStructure's typical case). Deferred until Phase 1 + 2 measured. |
| 90 | +
|
| 91 | +## Tasks |
| 92 | +
|
| 93 | +Based on profiling (10k scale, 189 inserts/sec baseline): |
| 94 | +
|
| 95 | +- [x] **Phase 1a: Persistent BlobCache across batch** — 0% cache hits on ~130 visited nodes/insert. Expected -10-20% total time. Medium effort. |
| 96 | +- [ ] Phase 1b: Prepared statement caching for `insert_shadow_row()` — minor optimization, low effort |
| 97 | +- [ ] ~~Phase 1c: Transaction batching~~ — SKIPPED, SAVEPOINT = 0.02% of insert time |
| 98 | +- [ ] **Phase 2: Lazy back-edges + batch repair** — 31% of insert time. Expected -30% total time. High effort/complexity. |
| 99 | +- [ ] Phase 3: Intra-batch candidates — deferred until Phase 1+2 measured |
| 100 | +- [ ] Benchmark: 500-vector batch into 10k index, compare serial vs batch |
| 101 | +- [ ] Validate recall >= 85% for batch inserts |
| 102 | +
|
| 103 | +**Verification:** |
| 104 | +
|
| 105 | +```bash |
| 106 | +make test # All existing tests pass + new batch tests |
| 107 | +make asan # No memory errors |
| 108 | +make clean && make valgrind # No leaks |
| 109 | +``` |
| 110 | + |
| 111 | +## Notes |
| 112 | + |
| 113 | +### 2026-02-11: Profiling Complete, Priorities Established |
| 114 | + |
| 115 | +Phase 0 profiling (`_todo/20260211-insert-profiling.md`) is complete. Key findings: |
| 116 | + |
| 117 | +- **random_start fixed** (commit `ad404e1`) — was 26% of insert time, now 0.9% |
| 118 | +- **Search dominates** at 56% — inherent beam search cost, hard to optimize |
| 119 | +- **Phase 2 = 31%** — borderline for lazy back-edges, but a real win for batch workloads |
| 120 | +- **Cache hits = 0%** — BlobCache per-insert is pure waste; persistent cache is the top optimization |
| 121 | +- **SAVEPOINT negligible** — 0.02%, no need for transaction batching |
| 122 | + |
| 123 | +**Recommended implementation order:** Phase 1a (persistent cache) first — medium effort, immediate 10-20% win. Phase 2 (lazy back-edges) second — high effort, 30% win but affects graph quality (needs careful recall testing). |
| 124 | + |
| 125 | +### 2026-02-11: Research & Planning for Phase 1a (Persistent BlobCache) |
| 126 | + |
| 127 | +**Ownership Model — The Core Problem:** |
| 128 | + |
| 129 | +Current flow: `diskann_node_free()` → `blob_spot_free()`. Cache holds borrowed pointers. For persistent cache, cache must own BlobSpots across inserts, but nodes still need valid pointers during Phase 1 + Phase 2. |
| 130 | + |
| 131 | +**Solution: `is_cached` flag on BlobSpot + `owns_blobs` flag on BlobCache** |
| 132 | + |
| 133 | +1. `BlobSpot.is_cached` (default 0): When a cache with `owns_blobs=1` stores a BlobSpot, it sets `is_cached=1`. |
| 134 | +2. `diskann_node_free()`: If `blob_spot->is_cached`, skip `blob_spot_free()`. Cache owns it. |
| 135 | +3. On cache eviction: Set `evicted->is_cached=0`. Node still references it, `diskann_node_free()` will free it. |
| 136 | +4. `blob_cache_deinit()` with `owns_blobs=1`: Iterates LRU chain, `blob_spot_free()` all remaining entries. |
| 137 | +5. Per-insert caches (`owns_blobs=0`): No behavioral change. `is_cached` stays 0 throughout. |
| 138 | + |
| 139 | +**Why this works:** |
| 140 | + |
| 141 | +- During a single insert: nodes + cache both reference same BlobSpots. `is_cached=1` prevents node teardown from freeing them. |
| 142 | +- On eviction during search (capacity exceeded): `is_cached=0` on evicted entry. Node still has valid pointer, frees it normally during context deinit. |
| 143 | +- Between inserts: All nodes freed via `diskann_search_ctx_deinit()`. Cached BlobSpots survive (is_cached=1). Evicted ones freed by node teardown (is_cached=0). No dangling pointers. |
| 144 | +- On batch end: `blob_cache_deinit()` frees all remaining BlobSpots. |
| 145 | + |
| 146 | +**Cache data freshness after Phase 2:** |
| 147 | + |
| 148 | +- Phase 2 modifies a node's in-memory buffer (adds back-edge) then flushes to disk. |
| 149 | +- Cached copy IS the latest version — in-memory buffer is authoritative. |
| 150 | +- `blob_spot_reload()` with same rowid + `is_initialized=1` returns immediately (no stale read). |
| 151 | +- This means the persistent cache is BETTER than re-reading from disk. |
| 152 | + |
| 153 | +**SAVEPOINT rollback safety:** |
| 154 | + |
| 155 | +- If `diskann_insert()` fails and SAVEPOINT is rolled back, cached BlobSpots may have stale data (Phase 2 mods rolled back on disk but still in memory). |
| 156 | +- Solution: On any insert error during a batch, caller should call `diskann_end_batch()` which clears the cache. The `diskann_end_batch()` function should handle this gracefully. |
| 157 | +- `diskann_begin_batch()` could also wrap everything in a single outer transaction (BEGIN/COMMIT) for atomicity. |
| 158 | + |
| 159 | +**New node caching:** |
| 160 | + |
| 161 | +- After flushing the newly inserted node's BlobSpot, put it in the batch cache. Future inserts visiting this node get a cache hit instead of re-reading from disk. |
| 162 | + |
| 163 | +**Files to modify:** |
| 164 | + |
| 165 | +- `src/diskann_blob.h` — Add `is_cached` field to BlobSpot |
| 166 | +- `src/diskann_blob.c` — Initialize `is_cached=0` in `blob_spot_create()` |
| 167 | +- `src/diskann_cache.h` — Add `owns_blobs` field to BlobCache |
| 168 | +- `src/diskann_cache.c` — Eviction frees with `is_cached=0`; deinit frees remaining with `owns_blobs=1` |
| 169 | +- `src/diskann_node.c` — `diskann_node_free()` checks `is_cached` |
| 170 | +- `src/diskann_internal.h` — Add `BlobCache *batch_cache` to DiskAnnIndex |
| 171 | +- `src/diskann.h` — Declare `diskann_begin_batch()` / `diskann_end_batch()` |
| 172 | +- `src/diskann_api.c` — Implement begin/end batch |
| 173 | +- `src/diskann_insert.c` — Use `idx->batch_cache` when available; cache new node's BlobSpot |
| 174 | +- `tests/c/test_cache.c` — New tests for owning cache eviction + deinit |
| 175 | +- `tests/c/test_insert.c` — New tests for batch insert (begin/end API, recall, error handling) |
| 176 | + |
| 177 | +### 2026-02-11: Phase 1a Implementation Complete |
| 178 | + |
| 179 | +**What was done:** |
| 180 | + |
| 181 | +1. Added `is_cached` flag to `BlobSpot` (diskann_blob.h) |
| 182 | +2. Added `owns_blobs` flag to `BlobCache` (diskann_cache.h) |
| 183 | +3. Modified `blob_cache_put()`: sets `is_cached=1` in owning mode |
| 184 | +4. Modified `get_free_slot()`: clears `is_cached=0` on eviction in owning mode |
| 185 | +5. Modified `blob_cache_deinit()`: frees all BlobSpots via LRU chain in owning mode |
| 186 | +6. Modified `diskann_node_free()`: skips `blob_spot_free()` when `is_cached=1` |
| 187 | +7. Added `batch_cache` field to `DiskAnnIndex` |
| 188 | +8. Implemented `diskann_begin_batch()` / `diskann_end_batch()` in diskann_api.c |
| 189 | +9. Modified `diskann_insert()`: uses `idx->batch_cache` when available via `active_cache` pointer |
| 190 | +10. New node BlobSpot cached in batch mode after flush (future inserts get cache hit) |
| 191 | +11. `diskann_close_index()` cleans up active batch cache (safety net) |
| 192 | +12. 12 new tests: 4 cache ownership + 8 batch API |
| 193 | +13. Forward declaration for `BlobCache` in diskann_internal.h to avoid circular include |
| 194 | + |
| 195 | +**Verification:** |
| 196 | + |
| 197 | +- `make test`: 204/204 pass (was 192) |
| 198 | +- `make asan`: clean (no errors) |
| 199 | +- `make clean && make valgrind`: 0 errors, 0 leaks |
| 200 | +- Extension builds: `make all` produces diskann.so |
| 201 | + |
| 202 | +**Next: benchmark to measure actual speedup (expected 10-20%)** |
0 commit comments