This repository was archived by the owner on Feb 18, 2026. It is now read-only.

Commit 27079ed

chore(tpp): checkpoint
1 parent 1df89f3 commit 27079ed

4 files changed

Lines changed: 567 additions & 12 deletions

File tree

_done/20260211-batched-inserts.md

Lines changed: 357 additions & 0 deletions
@@ -0,0 +1,357 @@
1+
# SUPERSEDED — Batched Vector Inserts for DiskANN
2+
3+
> **Superseded by:** `_todo/20260211-insert-profiling.md` and `_todo/20260211-serial-batch-insert.md`
4+
> **Reason:** Parallel beam search via multiple WAL readers is not viable — all target DB drivers (better-sqlite3, @photostructure/sqlite, node:sqlite) are single-threaded with a single connection. The graduated serial optimization approach replaces this plan. See `_research/write-performance-analysis.md` for full options analysis.
5+
6+
## Summary
7+
8+
DiskANN inserts are 20-100x slower than sqlite-vec, making PhotoStructure imports unacceptably slow. Users import 100-1000 photos at a time, often from the same event (high visual similarity). Implement a `diskann_insert_batch()` API that parallelizes beam search and deduplicates neighbor updates, targeting a 10-50x speedup. Memory stays low (~5 MB for a 500-vector batch) because only search results are buffered, not the entire index.
9+
10+
## Current Phase
11+
12+
- [ ] Research & Planning
13+
- [ ] Test Design
14+
- [ ] Implementation Design
15+
- [ ] Test-First Development
16+
- [ ] Implementation
17+
- [ ] Integration
18+
- [ ] Cleanup & Documentation
19+
- [ ] Final Review
20+
21+
## Required Reading
22+
23+
- `CLAUDE.md` - Project conventions
24+
- `TDD.md` - Testing methodology
25+
- `DESIGN-PRINCIPLES.md` - C coding standards
26+
- `src/diskann_insert.c` - Current serial insert implementation (lines 141-377)
27+
- `src/diskann_search.c` - Beam search algorithm (diskann_search_internal, lines 211-550)
28+
- `_research/parallel-indexing.md` - Academic paper analysis and design rationale
29+
- `_done/20260210-diskann-recall-fix-investigation.md` - Block size fix validation (context)
30+
31+
## Description
32+
33+
**Problem:** PhotoStructure users import photos in batches (100-1000), but serial `diskann_insert()` is prohibitively slow:
34+
35+
- 500 photos × 320ms = 160 seconds (~3 minutes)
36+
- sqlite-vec does the same in 2-4 seconds
37+
- 20-100x performance gap makes DiskANN impractical for production use
38+
39+
**Root cause:** Each insert performs:
40+
41+
1. Beam search (~100ms) - can be parallelized
42+
2. SAVEPOINT overhead (~2ms) - wasteful for batches
43+
3. Neighbor BLOB updates (~200ms) - massive duplication when vectors are similar
44+
45+
**Key insight:** Photos from the same import have high visual similarity, so their nearest neighbors overlap heavily. Neighbor A might appear in 40 different photos' neighbor lists, and serial insert then updates A's BLOB 40 separate times. Batching allows deduplication: update A once with all 40 back-edges.
46+
47+
**Constraints:**
48+
49+
- **Incremental updates only** - Add to existing 100k-500k vector index, no rebuilds
50+
- **Low memory** - Users have 4GB+ RAM, but can't load entire index (100MB-1GB)
51+
- **Batch sizes: 100-1000** - Not 300k bulk loads
52+
- **PhotoStructure use case** - Photos from same import are visually similar (high neighbor overlap)
53+
54+
**Success Criteria:**
55+
56+
- 500 similar vectors: <10 seconds (vs ~160s) = **16-50x speedup**
57+
- 500 random vectors: <20 seconds (vs ~160s) = **8-15x speedup**
58+
- Memory: <10MB for batch processing
59+
- Recall@10 >= 75% (maintain graph quality)
60+
- Thread-safe and memory-safe (ASan clean; note that ASan does not detect data races, which need ThreadSanitizer), no leaks (Valgrind clean)
61+
62+
## Tribal Knowledge
63+
64+
### Current Serial Insert Bottleneck (Validated 2026-02-11)
65+
66+
Each `diskann_insert()` call (lines 141-377 of `diskann_insert.c`) performs:
67+
68+
1. SELECT random start node
69+
2. **BEGIN SAVEPOINT** (lines 207-213: required because writable BLOBs need transaction context)
70+
3. `diskann_search_internal()` - beam search, ~200 BLOB reads (~100ms)
71+
4. INSERT shadow row
72+
5. Phase 1: Add edges to new node
73+
6. Phase 2: Update neighbor edges (~50 neighbors, BLOB flush per neighbor, ~200ms)
74+
7. RELEASE SAVEPOINT
75+
76+
**Bottlenecks:**
77+
78+
- 500 SAVEPOINT/RELEASE cycles (transaction overhead)
79+
- Neighbor overlap: Similar vectors share 60-80% of neighbors → same BLOBs updated 20-50 times
80+
- Example: 500 photos × 50 neighbors = 25,000 BLOB updates, but only ~6,000 unique neighbors
81+
82+
### Why Batching Works (Academic Support)
83+
84+
**FreshDiskANN (2021):**
85+
86+
- Designed for high-throughput streaming inserts
87+
- Lazy consolidation: buffer inserts, deduplicate neighbors, merge
88+
- Reports 10-100x speedup vs individual inserts
89+
- Exactly PhotoStructure's use case
90+
91+
**DiskANN paper (NeurIPS 2019), Section 5.3:**
92+
93+
- "Batching updates allows amortizing graph mutations across multiple inserts"
94+
- Neighbor overlap can be exploited for performance
95+
96+
**ParlayANN (PPoPP 2024):**
97+
98+
- Proves per-node locking has negligible contention (~1/n probability)
99+
- Updating a neighbor once with 40 back-edges gives the same result as 40 individual updates, but is much faster
100+
101+
### SQLite WAL Concurrency
102+
103+
- WAL mode: unlimited concurrent readers + 1 writer
104+
- Each thread can open read-only connection for parallel search
105+
- Writes must be serialized through main connection
106+
- Current codebase already uses WAL mode
107+
108+
### Memory Constraints (Validated)
109+
110+
For 500 vectors @ 256D:
111+
112+
- Batch input: 500 × 256 × 4 = 512 KB
113+
- Search results: 500 × 200 nodes × 12 bytes = 1.2 MB (node IDs and distances only, not vectors!)
114+
- Neighbor dedup map: ~6,000 neighbors × 40 bytes = 240 KB
115+
- **Total: ~2-5 MB** (not entire index - existing index stays on disk)
116+
117+
1000 vectors @ 256D: ~5-10 MB
118+
119+
**Critical:** We do NOT load the entire index into RAM. Existing vectors (100k-500k) stay in SQLite BLOBs and are read during beam search via BlobSpots, just like serial insert does.
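
The arithmetic above can be reproduced with a small helper. This is only a sketch: `batch_memory_bytes` and its parameter names are hypothetical, and the 12-byte and 40-byte entry sizes are this document's estimates, not measured struct sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Per-entry sizes taken from the estimates in this section (assumed, not measured). */
enum {
  BYTES_PER_FLOAT = 4,          /* one float32 vector component */
  BYTES_PER_SEARCH_RESULT = 12, /* visited node id + distance */
  BYTES_PER_DEDUP_ENTRY = 40    /* one neighbor dedup map entry */
};

/* Estimated working memory for one batch, in bytes.
 * The existing index is not counted: it stays on disk. */
size_t batch_memory_bytes(size_t count, size_t dims,
                          size_t visited_per_search,
                          size_t unique_neighbors) {
  assert(count > 0 && dims > 0);
  size_t input = count * dims * BYTES_PER_FLOAT;
  size_t results = count * visited_per_search * BYTES_PER_SEARCH_RESULT;
  size_t dedup = unique_neighbors * BYTES_PER_DEDUP_ENTRY;
  return input + results + dedup;
}
```

For 500 vectors at 256 dimensions, 200 visited nodes per search, and 6,000 unique neighbors, this sums to 1,952,000 bytes, which is the low end of the ~2-5 MB range quoted above.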
120+
121+
### Prior Investigation Context
122+
123+
**Original TPP (20260210-parallel-graph-construction.md):**
124+
125+
- Intern designed in-memory full rebuild (loading ALL vectors into RAM)
126+
- Dismissed incremental batch insert as "only 2-3x speedup, dead-end architecture"
127+
- **Analysis was wrong:** Missed neighbor deduplication benefit (3-5x) and underestimated parallelism (8-24x)
128+
- Approach was wrong for PhotoStructure (users don't need bulk rebuilds, they need fast incremental batches)
129+
130+
**Validation findings (2026-02-11):**
131+
132+
- Porting complexity verified: beam search 80% reusable, edge operations trivial to port
133+
- Test registration is manual (CRITICAL: tests must be registered in test_runner.c)
134+
- Pthread linking required: add `-lpthread` to Makefile
135+
136+
### FreshDiskANN Comparison (2026-02-11)
137+
138+
**Source examined:** `../DiskANN` repository (Rust rewrite, C++ on `cpp_main` branch)
139+
140+
**FreshDiskANN architecture:**
141+
142+
- **Streaming model:** `insert()`, `delete()`, `replace()`, `maintain()`, `needs_maintenance()`
143+
- **Lazy consolidation:** Inserts are buffered, graph updates deferred until `maintain()` is called
144+
- **Periodic maintenance:** When `needs_maintenance()` returns true, run consolidation pass
145+
- **Benefits:** Can handle continuous streaming inserts without blocking
146+
- **Complexity:** Background maintenance, non-deterministic latency, complex state management
147+
148+
**Our approach (explicit batching):**
149+
150+
- **Batch model:** `diskann_insert_batch(vectors[], count)` - process entire batch in one call
151+
- **Immediate processing:** Batch processed synchronously, no deferred maintenance
152+
- **Predictable latency:** A batch of 500 is estimated to take ~9 seconds, and the work is complete when the call returns
153+
- **Simpler code:** No background threads, no maintenance scheduling, no buffering state
154+
- **Better for PhotoStructure:** Photo imports are naturally batched (drag-and-drop folder), explicit batch call fits workflow
155+
156+
**Decision:** Stick with explicit batching - simpler, more predictable, fits PhotoStructure's import workflow perfectly. FreshDiskANN's streaming model solves a different problem (continuous high-throughput inserts with no natural batching).
157+
158+
## Solutions
159+
160+
### SELECTED: Incremental Batch Insert with Parallel Search + Deduplication
161+
162+
**Approach:**
163+
164+
**Phase 1: Store Vectors (Serial, ~1s for 500)**
165+
166+
- Single transaction: INSERT all N shadow rows
167+
- Write vectors to BLOBs with zero edges
168+
- Fast, no graph construction yet
169+
170+
**Phase 2: Parallel Beam Search (Read-only, 8-24 threads)**
171+
172+
- Each thread opens read-only SQLite connection (WAL allows unlimited readers)
173+
- Beam search for assigned slice of vectors
174+
- Existing index stays on disk, read via BLOBs (just like serial insert)
175+
- Store **results only** in memory (visited node IDs + distances, not vectors)
176+
- Time: ~3-4s for 500 vectors (vs 50s serial)
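
The "assigned slice of vectors" in Phase 2 could be computed as below. This is a minimal sketch of one possible partitioning; `BatchSlice` and `batch_slice` are hypothetical names, and the plan does not specify how slices are chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of batch indices owned by one worker. */
typedef struct BatchSlice {
  size_t begin;
  size_t end;
} BatchSlice;

/* Splits `count` vectors across `num_threads` workers as evenly as possible.
 * The first (count % num_threads) workers each receive one extra vector. */
BatchSlice batch_slice(size_t count, size_t num_threads, size_t t) {
  assert(num_threads > 0 && t < num_threads);
  size_t base = count / num_threads;
  size_t extra = count % num_threads;
  BatchSlice s;
  s.begin = t * base + (t < extra ? t : extra);
  s.end = s.begin + base + (t < extra ? 1 : 0);
  return s;
}
```

Slices are contiguous and cover the whole batch, so each worker can write its search results into its own region of a shared results array without locking.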
177+
178+
**Phase 3: Deduplicated Edge Updates (Serial, one transaction)**
179+
180+
- Build neighbor map: neighbor_id → list of back-edges from batch
181+
- **Deduplicate:** Neighbor A appears in 40 photos → update A's BLOB once with all 40 edges
182+
- Single transaction wraps all edge mutations
183+
- Time: ~4-6s for 500 vectors (vs 100s with duplication)
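
The neighbor map in Phase 3 could be as simple as a sorted array of pending back-edges. The sketch below assumes that representation (a hash map would also work); `BackEdge`, `NeighborGroupFn`, and `for_each_neighbor_group` are hypothetical names, not part of the existing code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One pending back-edge: `neighbor_id` should gain an edge to `new_id`. */
typedef struct BackEdge {
  int64_t neighbor_id;
  int64_t new_id;
} BackEdge;

/* Orders back-edges by neighbor id; comparison form avoids subtraction overflow. */
static int cmp_back_edge(const void *a, const void *b) {
  const BackEdge *x = a, *y = b;
  return (x->neighbor_id > y->neighbor_id) - (x->neighbor_id < y->neighbor_id);
}

/* Called once per distinct neighbor with all of its pending back-edges. */
typedef void (*NeighborGroupFn)(int64_t neighbor_id, const BackEdge *edges,
                                size_t n_edges, void *user_data);

/* Sorts `edges` in place by neighbor, then invokes `fn` (if non-NULL) once
 * per distinct neighbor. Returns the number of distinct neighbors, which is
 * the number of BLOB writes a deduplicated update pass would need. */
size_t for_each_neighbor_group(BackEdge *edges, size_t n,
                               NeighborGroupFn fn, void *user_data) {
  assert(edges != NULL || n == 0);
  if (n > 0) qsort(edges, n, sizeof edges[0], cmp_back_edge);
  size_t groups = 0;
  for (size_t i = 0; i < n;) {
    size_t j = i + 1;
    while (j < n && edges[j].neighbor_id == edges[i].neighbor_id) j++;
    if (fn) fn(edges[i].neighbor_id, &edges[i], j - i, user_data);
    groups++;
    i = j;
  }
  return groups;
}
```

In this sketch the callback is where a neighbor's BLOB would be opened once, given all of its back-edges, and flushed once. With the figures quoted earlier (25,000 back-edges over ~6,000 unique neighbors), that is roughly 4x fewer BLOB writes.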
184+
185+
**Why this works:**
186+
187+
- Parallelism: 8-24x speedup on beam search (CPU-bound)
188+
- Deduplication: 3-5x reduction in BLOB writes (neighbor overlap)
189+
- Transaction batching: 2x speedup (avoid 500 SAVEPOINTs)
190+
- **Combined: 16-64x speedup** depending on vector similarity
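
As a cross-check, the per-phase time estimates for 500 vectors can be turned into an end-to-end speedup. This is a sketch with a hypothetical helper name; the inputs are this plan's estimates, not measurements.

```c
#include <assert.h>

/* Estimated end-to-end speedup: serial time divided by the sum of the
 * three batch phases (store, search, edge update). All times in seconds. */
double batch_speedup(double serial_s, double store_s,
                     double search_s, double update_s) {
  double batch_s = store_s + search_s + update_s;
  assert(serial_s > 0.0 && batch_s > 0.0);
  return serial_s / batch_s;
}
```

Using ~160 s serial and phase times of 1 s, 3-4 s, and 4-6 s gives a batch time of 8-11 s, so about 14.5-20x for 500 vectors. That sits at the low end of the combined range above; the higher figures would require phase times below these estimates.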
191+
192+
**Memory:** Only batch results (~2-5 MB), not entire index
193+
194+
**API:**
195+
196+
```c
typedef struct DiskAnnBatchConfig {
  int num_threads;                    // 0 = auto-detect
  void (*progress)(int, int, void*);  // progress callback
  void *progress_user_data;
} DiskAnnBatchConfig;

int diskann_insert_batch(DiskAnnIndex *idx,
                         const int64_t *ids,
                         const float **vectors,
                         int count,
                         const DiskAnnBatchConfig *config);
```
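
One way the `num_threads` field might be interpreted is sketched below. Everything beyond the struct itself is an assumption: `batch_resolve_threads`, the `detected_cpus` parameter (supplied by the caller from whatever CPU-count query the platform offers), and the cap of 24 taken from the "8-24 threads" estimate.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DiskAnnBatchConfig {
  int num_threads;                    /* 0 = auto-detect */
  void (*progress)(int, int, void *); /* progress callback */
  void *progress_user_data;
} DiskAnnBatchConfig;

enum { BATCH_MAX_THREADS = 24 }; /* hypothetical upper bound */

/* Picks a worker count: the explicit setting if positive, otherwise the
 * detected CPU count; clamped to [1, BATCH_MAX_THREADS] and to `count`,
 * since more workers than vectors would leave some workers idle. */
int batch_resolve_threads(const DiskAnnBatchConfig *config, int count,
                          int detected_cpus) {
  assert(count > 0);
  int n = (config != NULL) ? config->num_threads : 0;
  if (n <= 0) n = detected_cpus;
  if (n < 1) n = 1;
  if (n > BATCH_MAX_THREADS) n = BATCH_MAX_THREADS;
  if (n > count) n = count;
  return n;
}
```

Passing a NULL config is treated the same as `num_threads = 0` here, which keeps the common call site short.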
209+
210+
**Pros:**
211+
212+
- ✅ Right use case (incremental batches, not bulk rebuilds)
213+
- ✅ Low memory (~5MB for 500 vectors, not entire index)
214+
- ✅ Significant speedup (10-50x) from parallelism + deduplication
215+
- ✅ Simpler than full rebuild (~600 lines vs ~1200 lines)
216+
- ✅ Narrows the gap to sqlite-vec (from 100x slower to 2-3x slower)
217+
218+
**Cons:**
219+
220+
- ⚠️ Edge updates still serial (SQLite single-writer limitation)
221+
- ⚠️ Requires WAL mode (already enabled)
222+
223+
**Status:** Chosen approach
224+
225+
---
226+
227+
### REJECTED: In-Memory Full Rebuild
228+
229+
The intern's original approach: load all vectors into RAM, build the graph in memory, then serialize it back.
230+
231+
**Why rejected:**
232+
233+
- ❌ Wrong use case (bulk load, not incremental batches)
234+
- ❌ High memory (190MB+ for 300k vectors - PhotoStructure users don't have "gobs of RAM")
235+
- ❌ Wasteful for adding 500 vectors to existing 300k index (rebuilds everything)
236+
- ❌ More complex implementation (~1200 lines)
237+
238+
**Status:** Rejected
239+
240+
## Tasks
241+
242+
### High-Level Phases (Detailed tasks in implementation plan)
243+
244+
- [ ] **Phase 1:** Serial batch (no threading) - prove deduplication gives 2-4x speedup
245+
- [ ] **Phase 2:** Add parallel search - prove 8-24x speedup from parallelism
246+
- [ ] **Phase 3:** Integration, benchmarks, TypeScript wrappers
247+
248+
### Key Implementation Files
249+
250+
**To create:**
251+
252+
- `src/diskann_batch.h` - API declarations, internal structs
253+
- `src/diskann_batch.c` - Three-phase implementation (~600 lines)
254+
- `tests/c/test_batch.c` - Batch insert tests (~400 lines)
255+
256+
**To modify:**
257+
258+
- `src/diskann.h` - Add batch API
259+
- `src/diskann_insert.c` - Make insert_shadow_row non-static
260+
- `Makefile` - Add diskann_batch.c, add -lpthread
261+
- `tests/c/test_runner.c` - Register batch tests (CRITICAL: manual registration required)
262+
263+
**Verification:**
264+
265+
```bash
# Phase 1: Serial batch
make test          # +12 tests passing
make valgrind      # No leaks

# Phase 2: Parallel search
make asan          # CRITICAL: memory errors under threading (ASan does not detect data races; those need ThreadSanitizer)
make valgrind      # No leaks with threading

# Phase 3: Final
make test-stress   # 10-50x speedup demonstrated
npm run test:ts    # TypeScript integration works
```
278+
279+
## Notes
280+
281+
### Handoff Notes (2026-02-11)
282+
283+
**What was done this session:**
284+
285+
1. **Validated original TPP (20260210-parallel-graph-construction.md):**
286+
- Intern's full rebuild approach was wrong for PhotoStructure use case
287+
- Missed neighbor deduplication (3-5x speedup)
288+
- Underestimated parallelism benefit (8-24x speedup)
289+
290+
2. **Identified correct approach:**
291+
- Incremental batch insert (FreshDiskANN-style)
292+
- Parallel search + deduplicated edge updates
293+
- Low memory (only batch results, not entire index)
294+
295+
3. **Created detailed implementation plan:**
296+
- Saved at `/home/mrm/.claude/plans/snuggly-imagining-puppy.md`
297+
- 3 phases: serial batch, parallel search, integration
298+
- ~700 lines implementation + ~400 lines tests
299+
300+
4. **Updated research documentation:**
301+
- `_research/parallel-indexing.md` - explains approach for CS students
302+
- Includes academic paper analysis and PhotoStructure context
303+
304+
**Key discoveries:**
305+
306+
- Neighbor deduplication is the key win for similar vectors (PhotoStructure's use case)
307+
- SQLite WAL allows unlimited read-only connections (enables parallelism)
308+
- Memory is NOT a constraint for batch sizes (5-10 MB is nothing on 4GB machines)
309+
310+
**Next session should:**
311+
312+
- Start Research & Planning phase
313+
- Read required files to understand current implementation
314+
- Validate approach against actual PhotoStructure usage patterns
315+
- Design detailed data structures and algorithm flow
316+
317+
---
318+
319+
### Session Update (2026-02-11 - Later)
320+
321+
**Additional findings:**
322+
323+
1. **FreshDiskANN source code examined** (`../DiskANN` repo):
324+
- Uses **lazy consolidation** model: `insert()` + `maintain()` + `needs_maintenance()`
325+
- Streaming approach: continuous inserts, periodic consolidation when threshold hit
326+
- Different from our explicit batching: we process entire batch immediately
327+
- **Our approach is simpler for PhotoStructure:** predictable latency, no background maintenance
328+
329+
2. **Memory constraints validated:**
330+
- User confirmed: 4GB+ RAM typical, 1-10 MB for batch processing is "nothing"
331+
- 1000 × 256 × 4 = 1 MB for input vectors
332+
- Total ~5-10 MB including search results and dedup map
333+
- **No need to be conservative** - was overthinking memory
334+
335+
3. **Problem statement finalized:**
336+
- PhotoStructure inserts are 20-100x slower than sqlite-vec (CRITICAL business problem)
337+
- Batch sizes: 100-1000 vectors from photo imports
338+
- High similarity expected (same event photos) → high neighbor overlap
339+
- Target: Close gap from 100x slower to 2-3x slower
340+
341+
4. **Documentation updated:**
342+
- Old TPP (20260210-parallel-graph-construction.md) marked as SUPERSEDED
343+
- Research doc (`_research/parallel-indexing.md`) updated to reflect batch approach
344+
- Implementation plan at `/home/mrm/.claude/plans/snuggly-imagining-puppy.md`
345+
346+
**Key decisions:**
347+
348+
- **Explicit batching** (our approach) vs **lazy consolidation** (FreshDiskANN)
349+
- Pros: Simpler, predictable latency, no background threads
350+
- Cons: User must explicitly call batch insert (acceptable for PhotoStructure workflow)
351+
352+
**Next session priorities:**
353+
354+
1. Review FreshDiskANN paper details to ensure we're not missing critical optimizations
355+
2. Start Research & Planning: read current insert/search implementation
356+
3. Validate neighbor deduplication assumption with small-scale test
357+
4. Design NeighborUpdateMap data structure (hashmap or array-based?)
