Skip to content

Latest commit

 

History

History
456 lines (334 loc) · 17 KB

File metadata and controls

456 lines (334 loc) · 17 KB

Vector Search Optimization - Study Notes

1. Vector Index Fundamentals

Why Vector Indexes Matter

Without an index, pgvector performs an exact nearest-neighbor scan — it computes the distance between the query vector and every row in the table. This is accurate but O(n). For tables with millions of embeddings, sequential scans take seconds or longer per query.

Vector indexes trade a small amount of accuracy (recall) for dramatically faster queries. Both IVFFlat and HNSW are approximate nearest neighbor (ANN) algorithms — they may miss some true nearest neighbors but return results orders of magnitude faster.

Exam tip: If a question asks about "exact nearest neighbor search" in pgvector, the answer involves no index (sequential scan). Any index-backed query is approximate.

Recall vs Latency Tradeoff

Term Definition
Recall Fraction of true nearest neighbors returned by the ANN query (1.0 = perfect)
Latency Time to execute the query
QPS Queries per second the system can sustain

Higher recall requires more computation (more probes, larger ef_search), which increases latency. The goal is to find the sweet spot for your workload — typically recall >= 0.95 is acceptable for most applications.


2. IVFFlat (Inverted File with Flat Compression)

How IVFFlat Works

  1. Build phase: k-means clustering divides all vectors into lists clusters based on proximity
  2. Each cluster stores a centroid (center point) and a flat list of all vectors assigned to that cluster
  3. Query phase: The query vector is compared to all centroids, the closest probes clusters are selected, and only vectors within those clusters are scanned
Build: vectors → k-means → lists clusters with centroids
Query: query_vec → find closest centroids → scan probes clusters → return top-k

IVFFlat Build Parameters

Parameter Description Guideline
lists Number of clusters to create < 1M rows: rows / 1000; >= 1M rows: sqrt(rows)
-- Example: 500K rows → lists = 500
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 500);

-- Example: 4M rows → lists = 2000 (sqrt(4000000) ≈ 2000)
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 2000);

IVFFlat Query Parameters

Parameter Default Description
ivfflat.probes 1 Number of clusters to scan at query time
-- Increase probes for better recall (default 1 is too low for production)
SET ivfflat.probes = 10;

-- Guideline: probes = sqrt(lists) is a good starting point
-- For lists = 500: probes ≈ 22
-- For lists = 2000: probes ≈ 45

IVFFlat Key Characteristics

  • Requires data before building: k-means needs vectors to compute centroids. Building on an empty table creates an empty index — queries will return no results.
  • Not incremental: New rows inserted after index creation are added to the closest cluster, but clusters are never rebalanced. Over time, clusters become uneven and recall degrades.
  • Rebuild periodically: After significant data changes (>20% new data), drop and recreate the index to rebalance clusters.
  • Faster build time: k-means is computationally cheaper than HNSW graph construction.

Exam tip: If a question describes a scenario where "new data is continuously inserted and recall has degraded over time," the answer involves rebuilding the IVFFlat index.


3. HNSW (Hierarchical Navigable Small World)

How HNSW Works

  1. Build phase: Constructs a multi-layer graph where each vector is a node
  2. Top layers are sparse (few nodes, long-range connections for coarse navigation)
  3. Bottom layer is dense (all nodes, short-range connections for fine-grained search)
  4. Query phase: Search starts at the top layer, greedily navigates toward the query vector, then descends to lower layers for refinement
Layer 3:  o -------- o  (few nodes, long jumps)
Layer 2:  o --- o --- o --- o
Layer 1:  o - o - o - o - o - o - o
Layer 0:  o o o o o o o o o o o o o  (all nodes, short connections)

HNSW Build Parameters

Parameter Default Description
m 16 Maximum number of connections per node per layer
ef_construction 64 Size of the dynamic candidate list during index build
-- Default parameters (good starting point)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Higher quality index (slower build, better recall)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);

Parameter effects:

Change Build Time Index Size Query Recall Query Speed
Higher m Slower Larger Better Slightly slower
Higher ef_construction Slower Same Better No effect

HNSW Query Parameters

Parameter Default Description
hnsw.ef_search 40 Size of the dynamic candidate list during query
-- Increase ef_search for better recall
SET hnsw.ef_search = 100;

-- ef_search must be >= the number of results requested (LIMIT)
-- If LIMIT 50, ef_search must be at least 50

HNSW Key Characteristics

  • Can build on an empty table: The graph structure accommodates incremental inserts without degradation.
  • Incremental by design: New vectors are inserted into the graph at all appropriate layers. No periodic rebuilds needed.
  • Higher memory during build: Constructing the graph requires more working memory than IVFFlat k-means.
  • Larger index size: Each node stores m connections per layer, making HNSW indexes larger than IVFFlat.
  • Better recall out of the box: HNSW typically achieves higher recall than IVFFlat with default parameters.

Exam tip: HNSW is the recommended index type for most workloads on Azure Database for PostgreSQL. Choose IVFFlat only when build time is the primary constraint.


4. IVFFlat vs HNSW Comparison

Property IVFFlat HNSW
Algorithm k-means clustering + flat scan Multi-layer navigable graph
Build on empty table No (needs data for k-means) Yes
Incremental inserts Degrades over time Maintains quality
Build time Faster Slower (2-3x typical)
Index size Smaller Larger (2-3x typical)
Default recall Lower (probes=1) Higher (ef_search=40)
Best recall achievable High (with many probes) Very high
Rebuild needed Yes, after significant data changes No
Query latency Good (depends on probes) Better (typically faster)
Memory during build Moderate Higher

When to Choose IVFFlat

  • Data is loaded in bulk and rarely changes
  • Index build time is a constraint (large dataset, tight maintenance window)
  • You can schedule periodic index rebuilds
  • Acceptable recall with tuned probes

When to Choose HNSW

  • Data arrives incrementally (streaming inserts)
  • Best possible recall is required
  • Cannot tolerate recall degradation between rebuilds
  • Query latency is the primary concern

5. Distance Metrics and Operator Classes

Operator Classes for Index Creation

The operator class in the CREATE INDEX statement must match the distance operator used in queries. If they do not match, PostgreSQL falls back to a sequential scan.

Distance Operator Operator Class Metric ORDER BY Direction
<-> vector_l2_ops L2 (Euclidean) distance ASC (smaller = closer)
<=> vector_cosine_ops Cosine distance ASC (0 = identical)
<#> vector_ip_ops Negative inner product ASC (most negative = highest similarity)
-- Index for cosine distance queries
CREATE INDEX idx_cosine ON items USING hnsw (embedding vector_cosine_ops);

-- This query USES the index:
SELECT * FROM items ORDER BY embedding <=> query_vec LIMIT 10;

-- This query DOES NOT use the index (wrong operator):
SELECT * FROM items ORDER BY embedding <-> query_vec LIMIT 10;

Exam tip: A common exam scenario presents an index created with vector_cosine_ops but a query using <-> (L2 distance). The index is not used because the operator class does not match. Always verify operator class alignment.

Choosing a Distance Metric

Metric Best For Notes
Cosine distance (<=>) Text embeddings, semantic search Insensitive to vector magnitude; most common for NLP
L2 distance (<->) Image embeddings, spatial data Sensitive to magnitude; measures absolute distance
Inner product (<#>) Pre-normalized vectors, recommendation Equivalent to cosine for normalized vectors; fastest computation

Exam tip: Azure OpenAI embeddings are L2-normalized. For normalized vectors, cosine distance and inner product produce the same ranking. Cosine distance is the conventional choice for text search.


6. Index Build Performance

Server Parameters for Index Builds

Parameter Default Recommendation Effect
maintenance_work_mem 64 MB 1-2 GB for large indexes Memory available for index build operations
max_parallel_maintenance_workers 2 4-8 (based on available cores) Parallel workers for index builds (HNSW only in pgvector 0.7+)
-- Increase memory for a large HNSW build
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 7;

-- Build the index (uses increased resources)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Reset to defaults after build
RESET maintenance_work_mem;
RESET max_parallel_maintenance_workers;

Build Time Estimates

Build time depends on dataset size, dimensions, index parameters, and available compute. Rough guidelines for 1536-dimension vectors on a Memory Optimized E4ds (4 vCPUs):

Rows IVFFlat Build HNSW Build
100K ~10 seconds ~30 seconds
1M ~2 minutes ~5 minutes
10M ~20 minutes ~60 minutes

Exam tip: For large datasets (>1M rows), increase maintenance_work_mem before building HNSW indexes. Insufficient memory causes the build to spill to disk and slow down dramatically.

Batch Build Strategy

For initial data loads, build the index after inserting all data, not before:

-- Step 1: Create table without index
CREATE TABLE items (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536)
);

-- Step 2: Bulk load all data
COPY items (content, embedding) FROM STDIN;
-- ... load data ...

-- Step 3: Build index on populated table
SET maintenance_work_mem = '2GB';
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
RESET maintenance_work_mem;

This is faster because:

  • IVFFlat computes better centroids with all data present
  • HNSW inserts are slightly faster during bulk graph construction than incremental inserts
  • No index maintenance overhead during the COPY operation

7. Query Performance Analysis

Using EXPLAIN ANALYZE

-- Check if an index is being used
EXPLAIN ANALYZE
SELECT id, content, embedding <=> '[0.1, 0.2, ...]' AS distance
FROM items
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;

What to look for in the output:

Plan Node Meaning
Index Scan using idx_hnsw HNSW index is being used
Index Scan using idx_ivfflat IVFFlat index is being used
Seq Scan No index used (sequential scan)
Sort Results are being sorted after scan (index not covering the ORDER BY)

Common Reasons for Sequential Scan

  1. No index exists on the vector column
  2. Wrong operator class: Index uses vector_cosine_ops but query uses <-> (L2)
  3. Missing ORDER BY ... LIMIT: Vector indexes require ORDER BY distance LIMIT k to be used
  4. Planner chooses seq scan: For small tables, PostgreSQL may decide a sequential scan is faster
  5. Filter before ordering: A restrictive WHERE clause may cause the planner to filter first, then sort
-- Force index usage for testing (not recommended for production)
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT ... ORDER BY embedding <=> query LIMIT 10;
SET enable_seqscan = on;

Exam tip: Vector indexes in pgvector only accelerate ORDER BY ... LIMIT k queries. A query without LIMIT or without ORDER BY the distance operator will not use the vector index.


8. Hybrid Search

Combining Vector and Full-Text Search

Hybrid search uses both semantic similarity (vector) and keyword matching (full-text search) to improve result relevance.

-- Add a tsvector column for full-text search
ALTER TABLE items ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
CREATE INDEX ON items USING GIN (content_tsv);

-- Hybrid query: combine vector similarity with text relevance
SELECT id, content,
    embedding <=> query_vec AS vector_distance,
    ts_rank(content_tsv, plainto_tsquery('english', 'azure openai')) AS text_rank
FROM items
WHERE content_tsv @@ plainto_tsquery('english', 'azure openai')
ORDER BY embedding <=> query_vec
LIMIT 10;

Pre-filtering vs Post-filtering

Strategy Approach Tradeoff
Pre-filter WHERE category = 'ai' before vector search Fewer vectors to search; may miss results if filter is too restrictive
Post-filter Vector search first, then filter results Better recall; may return fewer than k results after filtering
-- Pre-filter: filter first, then vector search
-- Works well when the filter is not too selective
SELECT id, content
FROM items
WHERE category = 'ai'
ORDER BY embedding <=> query_vec
LIMIT 10;

-- Post-filter approach (subquery)
SELECT * FROM (
    SELECT id, content, category
    FROM items
    ORDER BY embedding <=> query_vec
    LIMIT 100  -- over-fetch to account for filtering
) sub
WHERE category = 'ai'
LIMIT 10;

Exam tip: Pre-filtering with a WHERE clause on indexed columns (B-tree) combined with vector ORDER BY can leverage partial index scans. If the filter is very selective, PostgreSQL may choose a sequential scan because fewer rows need to be checked.


9. Scaling Strategies

Partitioning

For very large tables (>10M vectors), consider partitioning by a non-vector column:

-- Range partition by date
CREATE TABLE items (
    id SERIAL,
    content TEXT,
    embedding vector(1536),
    created_at DATE
) PARTITION BY RANGE (created_at);

CREATE TABLE items_2024 PARTITION OF items
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Each partition gets its own vector index
CREATE INDEX ON items_2024 USING hnsw (embedding vector_cosine_ops);

Benefits:

  • Smaller indexes per partition (faster builds, better cache utilization)
  • Queries that include a partition key in WHERE skip irrelevant partitions
  • Old partitions can be dropped without rebuilding indexes

Dimension Reduction

Reducing embedding dimensions lowers storage, index size, and query time:

# Use the dimensions parameter with text-embedding-3-* models
response = client.embeddings.create(
    input=["text to embed"],
    model="text-embedding-3-small",
    dimensions=512  # reduced from default 1536
)
Dimensions Storage per Vector Index Size Impact Recall Impact
1536 (default) 6,152 bytes Baseline Baseline
768 3,080 bytes ~50% smaller Slight decrease
512 2,056 bytes ~33% of baseline Moderate decrease
256 1,032 bytes ~17% of baseline Noticeable decrease

Exam tip: Only text-embedding-3-small and text-embedding-3-large support the dimensions parameter. text-embedding-ada-002 always outputs 1536 dimensions.


10. pgvector Version Considerations

Version-Specific Features

Feature pgvector 0.5.x pgvector 0.7.x
Max dimensions 2,000 16,000
HNSW index Yes Yes
IVFFlat index Yes Yes
Parallel HNSW build No Yes
Halfvec (float16) type No Yes
Sparse vectors (sparsevec) No Yes

Halfvec for Reduced Storage

pgvector 0.7+ supports halfvec (float16 vectors) which halve storage at the cost of precision:

-- Create a halfvec column (half the storage of vector)
CREATE TABLE items_half (
    id SERIAL PRIMARY KEY,
    embedding halfvec(1536)
);

-- Indexes work the same way
CREATE INDEX ON items_half USING hnsw (embedding halfvec_cosine_ops);
Type Storage per Value Storage per 1536-dim Vector
vector (float32) 4 bytes 6,152 bytes
halfvec (float16) 2 bytes 3,080 bytes

Exam tip: halfvec uses different operator classes: halfvec_l2_ops, halfvec_cosine_ops, halfvec_ip_ops. Do not mix vector_*_ops with halfvec columns.