DiskANN indices store configuration parameters in the <tablename>_metadata shadow table. Understanding which parameters can be changed after index creation is critical for tuning performance.
These parameters determine the physical storage layout and cannot be changed without recreating the index.
- What: Vector dimensionality (e.g., 64, 128, 256, 768, 1536)
- Default: None (required)
- Why immutable: Determines binary layout of nodes and edges
- Recommended values:
- Small models (MiniLM, etc.): 384
- Medium models (BERT, etc.): 768
- Large models (OpenAI): 1536
- Custom embeddings: Match your model's output
- What: Distance function for similarity search
- Default:
euclidean(0) - Options:
euclidean(0): L2 distance, best for general usecosine(1): Normalized dot product, good for text embeddingsdot(2): Inner product, for pre-normalized vectors
- Why immutable: Changing distance function would invalidate all edge relationships
- Recommended: Use
cosinefor text embeddings,euclideanfor general vectors
- What: Maximum edges per node in the graph
- Default: 32
- Why immutable: Determines block size via formula:
node_overhead = 16 + (dimensions × 4) edge_overhead = (dimensions × 4) + 16 margin = max_neighbors + (max_neighbors / 10) block_size = node_overhead + (margin × edge_overhead) - Recommended values:
- Small datasets (<10k): 32
- Medium datasets (10k-100k): 64
- Large datasets (100k-1M): 64-96
- Very large (1M+): 96-128
- Trade-offs:
- Higher = better recall, slower build, larger index
- Lower = faster build, smaller index, lower recall
- What: SQLite BLOB size in bytes for storing nodes
- Default: 0 (auto-calculate from dimensions × max_neighbors)
- Why immutable: SQLite BLOB handles are opened with fixed size
- Recommended: Always use auto-calculate (0) unless you have specific requirements
- Manual override: Only if you need precise control over index size
These parameters affect graph construction quality. Changing mid-build creates inconsistency.
- What: Beam width during INSERT (how many candidates to explore when adding a node)
- Default: 200
- Stored in: Metadata table
- How to change:
UPDATE tablename_metadata SET value=100 WHERE key='insert_list_size'; -- Then clear and rebuild graph: DELETE FROM tablename_shadow; -- Re-insert all vectors
- When to change:
- Before building: Set via CREATE VIRTUAL TABLE parameter
- After building: Requires clearing and reinserting all vectors
- Recommended values:
- Small datasets (<10k): 50-100
- Medium datasets (10k-100k): 100-150
- Large datasets (100k+): 150-200
- Trade-offs:
- Higher = better graph quality, slower builds, more I/O
- Lower = faster builds, lower recall
- Performance impact: Each insert does
insert_list_sizeBLOB reads = major build time factor
- What: Diversity parameter for edge pruning (higher = more diverse neighbors)
- Default: 1.4
- Stored in: Metadata table as
pruning_alpha_x1000(e.g., 1400 = 1.4) - How to change:
UPDATE tablename_metadata SET value=1200 WHERE key='pruning_alpha_x1000'; -- Clear and rebuild graph
- Recommended values:
- Standard: 1.2-1.4
- High diversity needed: 1.5-2.0
- Maximum connectivity: 1.0-1.2
- Trade-offs:
- Higher = more diverse edges, better recall on diverse queries
- Lower = more similar edges, may improve recall on clustered data
- Note: Changing mid-build creates edges with different pruning criteria (inconsistent graph quality)
These parameters control search behavior and can be overridden without rebuilding.
-
What: Beam width during search (how many candidates to explore per query)
-
Default: 100 (with automatic scaling — see below)
-
Stored in: Metadata table (but can be overridden)
-
How to override:
-- Use auto-scaled default SELECT rowid, distance FROM vectors WHERE vector MATCH ? AND k = 10; -- Override to 300 for higher recall SELECT rowid, distance FROM vectors WHERE vector MATCH ? AND k = 10 AND search_list_size = 300;
-
Automatic scaling: The effective beam width is
max(configured_default, sqrt(index_size)). This ensures recall stays high as the index grows, without requiring manual tuning. For example:- 10k vectors:
max(100, 100)= 100 - 25k vectors:
max(100, 158)= 158 - 100k vectors:
max(100, 316)= 316 - 1M vectors:
max(100, 1000)= 1000
- 10k vectors:
-
Per-query overrides also benefit from auto-scaling: the effective beam is
max(your_override, sqrt(n)), so low values are automatically raised. To force a specific beam width, set it abovesqrt(n). -
Trade-offs:
- Higher = better recall, slower queries
- Lower = faster queries, lower recall
-
Rule of thumb: Auto-scaling handles most cases. Override only if you need faster queries and can tolerate lower recall.
-
Performance impact: Linear with beam width (2x beam = ~2x query time)
CREATE VIRTUAL TABLE embeddings USING diskann(
dimension=768, -- Match your model
metric=cosine, -- Standard for text
max_neighbors=64, -- Good balance
insert_list_size=100, -- Faster builds
search_list_size=150 -- Good recall
);Expected performance:
- Build: ~100ms per 1k vectors
- Recall@10: 90-95%
- QPS: 500-1000 (depends on hardware)
CREATE VIRTUAL TABLE images USING diskann(
dimension=512,
metric=euclidean, -- Better for image features
max_neighbors=96, -- Higher connectivity
insert_list_size=150,
search_list_size=200
);Expected performance:
- Build: ~150ms per 1k vectors
- Recall@10: 92-97%
- QPS: 300-600
CREATE VIRTUAL TABLE small USING diskann(
dimension=128,
metric=euclidean,
max_neighbors=32, -- Less overhead
insert_list_size=50, -- Faster builds
search_list_size=100
);Note: For datasets <10k, sqlite-vec (brute force) may be faster. DiskANN's overhead only pays off at scale.
CREATE VIRTUAL TABLE large USING diskann(
dimension=256,
metric=cosine,
max_neighbors=96, -- Better connectivity at scale
insert_list_size=200, -- Higher quality graph
search_list_size=300 -- Maintain recall
);Expected performance:
- Build: ~10-20 seconds per 10k vectors
- Recall@10: 93-98%
- QPS: 100-300
Override search_list_size per query:
-- TypeScript
const results = searchNearest(db, 'vectors', query, 10, {
searchListSize: 300 // Override default
});
-- SQL
SELECT rowid, distance FROM vectors
WHERE vector MATCH ? AND k = 10 AND search_list_size = 300;To change dimensions, metric, max_neighbors, or block_size:
-- 1. Export vectors
CREATE TEMP TABLE backup AS
SELECT rowid, vector FROM old_vectors_shadow;
-- 2. Drop old index
DROP TABLE old_vectors;
-- 3. Create new index with new parameters
CREATE VIRTUAL TABLE new_vectors USING diskann(
dimension=256, -- Changed from 128
-- ... other params
);
-- 4. Re-insert vectors
INSERT INTO new_vectors (rowid, vector)
SELECT rowid, vector FROM backup;
-- 5. Verify
SELECT COUNT(*) FROM new_vectors_shadow;To change insert_list_size or pruning_alpha:
-- 1. Update metadata
UPDATE vectors_metadata SET value=100 WHERE key='insert_list_size';
UPDATE vectors_metadata SET value=1200 WHERE key='pruning_alpha_x1000';
-- 2. Clear graph but keep vectors
DELETE FROM vectors_shadow;
-- 3. Re-insert vectors (faster than full rebuild since vectors exist)
INSERT INTO vectors (rowid, vector)
SELECT rowid, vector FROM vectors_shadow_backup;Note: This still requires re-building the graph structure, but avoids re-computing embeddings.
-- View all parameters
SELECT key, value FROM vectors_metadata ORDER BY key;
-- Check specific parameter
SELECT value FROM vectors_metadata WHERE key='search_list_size';-- Test different search_list_size values
.timer on
-- Baseline
SELECT rowid, distance FROM vectors
WHERE vector MATCH ? AND k=10;
-- Higher recall
SELECT rowid, distance FROM vectors
WHERE vector MATCH ? AND k=10 AND search_list_size=300;
-- Maximum recall
SELECT rowid, distance FROM vectors
WHERE vector MATCH ? AND k=10 AND search_list_size=1000;See benchmarks/ directory for comprehensive recall measurement tools.
cd benchmarks
npm run bench:quick -- --profile=recall-sweep-- This will NOT work - max_neighbors determines block_size!
UPDATE vectors_metadata SET value=128 WHERE key='max_neighbors';Why: Block size is fixed at index creation. Changing max_neighbors creates mismatch.
CREATE VIRTUAL TABLE bad USING diskann(
dimension=128,
insert_list_size=5 -- Too small!
);
-- Insert with k=10 will only explore 5 candidatesRule: insert_list_size ≥ max_neighbors recommended
-- Too low - poor recall
WHERE vector MATCH ? AND k=100 AND search_list_size=10
-- Too high - no benefit past ~10x k
WHERE vector MATCH ? AND k=10 AND search_list_size=10000Sweet spot: search_list_size = k × 2 to k × 10
-
dimensionsmatches your embedding model exactly -
metricappropriate for your data (cosine for text, euclidean for images) -
max_neighborsbalanced for dataset size (32-128 range) -
insert_list_sizeappropriate for build time tolerance (50-200) -
search_list_sizetuned for recall/speed tradeoff (test with different values) - Block size auto-calculated (unless you have specific needs)
- Tested recall on representative queries before production
- Monitored query latency under load
- Algorithm Details: See DiskANN paper (NeurIPS 2019)
- Benchmarks:
benchmarks/README.md - Index Format:
src/diskann_node.h(binary layout documentation) - Build Optimization:
_todo/20260211-build-speed-optimization.md(if available)