Commit 2934367

Product quantization with IVF-PQ (#2)
1 parent d7613c4 commit 2934367

13 files changed

Lines changed: 1068 additions & 32 deletions

README.md

Lines changed: 15 additions & 2 deletions
@@ -66,6 +66,7 @@ await writeVectors({
   normalize: true, // L2-normalize on write; lets search skip sqrt for cosine
   binary: true, // also write 1-bit-per-dim sign column for binary+rerank search
   clusters: 128, // k-means clusters for phase-1 pruning (implies binary: true)
+  pq: true, // optional IVF-PQ index for approximate scoring before rerank
   vectors: myEmbedder(), // any sync or async iterable of { id, vector }
 })
 ```
@@ -155,6 +156,7 @@ const results = await searchVectors({
   source: 'https://example.com/vectors.parquet', // URL, local file path, or an open AsyncBuffer
   query: queryVec, // Float32Array of length `dimension`
   topK: 10,
+  algorithm: 'auto', // 'auto' | 'exact' | 'binary' | 'pq'
   rerankFactor: 10, // candidate pool = topK * rerankFactor (default 10). Set to 0 to force exact full scan.
   probe: 0.25, // fraction of clusters to scan in phase 1 (default 0.25). Set to 1 to scan all clusters; pass an integer > 1 for an absolute count.
 })
@@ -166,11 +168,11 @@ const results = await searchVectors({
 
 ### How it works
 
-Three columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw float32 bytes, `UNCOMPRESSED`), and — when `binary: true` — `vector_bin` (`FIXED_LEN_BYTE_ARRAY(dim/8)`, 1 bit per dim).
+Core columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw float32 bytes, `UNCOMPRESSED`), and optional ANN columns: `vector_bin` (`FIXED_LEN_BYTE_ARRAY(dim/8)`, 1 bit per dim) when `binary: true`, and `vector_pq` (`FIXED_LEN_BYTE_ARRAY(pqSegments)`) when `pq: true`.
 
 **Exact search path** (no binary column, or `rerankFactor: 0`): single pass over the float32 column via `parquetRead({ onChunk })`. Each row-group's decoded `Uint8Array[]` shares a backing buffer, so we view it as one aligned `Float32Array` and stride by `dim` — zero per-row allocations.
 
-**Binary + cluster + rerank path** (default when `binary: true`):
+**Binary + cluster + rerank path** (default when `binary: true` and no PQ column is present):
 
 1. **Build-time clustering** (when `clusters > 0`): k-means on the 1-bit codes using Hamming distance and bit-majority voting. Cluster ids are then renumbered via a greedy nearest-neighbor walk so that adjacent ids = similar centroids — this makes the top-N nearest clusters at query time tend to land in fewer contiguous row ranges. Rows are sorted by the new cluster id. Centroids and per-cluster row counts go into KV metadata.
 2. **Phase 1 — cluster pruning**: rank clusters by Hamming(query, centroid), pick the top `probe` fraction, and Hamming-scan only those clusters' row ranges. With 32 KB pages and `useOffsetIndex`, hyparquet fetches only the pages covering each cluster's rows.
@@ -179,6 +181,8 @@ Three columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw fl
 
 A `cachedAsyncBuffer` deduplicates footer / offset-index byte ranges across all the parallel `parquetRead` calls.
 
+**IVF-PQ + rerank path** (`algorithm: 'pq'`, or `auto` when a file has PQ but no binary column): rank stored float IVF centroids against the query, scan compact residual `vector_pq` codes over the selected IVF row groups, approximate-score candidates with lookup tables built from the query, IVF centroid, and residual PQ codebooks, then fetch full float32 vectors only for the candidate pool and exact-rerank as above. IVF-PQ uses its own row ordering and should not be combined with binary `clusters`.
+
 For pre-normalized vectors with `metric: 'cosine'`, the search normalizes the query once and scores via dot product to skip the per-candidate sqrt loop.
 
 ### File layout
@@ -188,6 +192,7 @@ For pre-normalized vectors with `metric: 'cosine'`, the search normalizes the qu
 | `id` | `STRING` (UTF8) | variable | always |
 | `vector` | `FIXED_LEN_BYTE_ARRAY(4 × dim)` | `4 × dim` | always |
 | `vector_bin` | `FIXED_LEN_BYTE_ARRAY(dim/8)` | `dim/8` | when `binary: true` |
+| `vector_pq` | `FIXED_LEN_BYTE_ARRAY(pqSegments)` | `pqSegments` | when `pq: true` |
 
 Key-value metadata:
 
@@ -198,10 +203,18 @@ Key-value metadata:
 | `hypvector.metric` | `cosine` \| `dot` \| `euclidean` |
 | `hypvector.normalized` | `true` if vectors were L2-normalized on write |
 | `hypvector.binary` | `true` if the `vector_bin` column is present |
+| `hypvector.pq` | `true` if the `vector_pq` column is present |
 | `hypvector.count` | number of vectors |
 | `hypvector.clusters` | number of k-means clusters (0 if not clustered) |
 | `hypvector.centroids` | base64-encoded centroid binary codes (`clusters × dim/8` bytes); present when `clusters > 0` |
 | `hypvector.clusterCounts` | base64-encoded `Uint32Array` of per-cluster row counts; present when `clusters > 0` |
+| `hypvector.pq.mode` | `ivf`; present when `pq: true` |
+| `hypvector.pq.segments` | number of PQ sub-vectors / bytes per code; present when `pq: true` |
+| `hypvector.pq.centroids` | centroids per PQ sub-vector; present when `pq: true` |
+| `hypvector.pq.codebooks` | base64-encoded residual `Float32Array` codebooks (`pq.centroids × dim` floats); present when `pq: true` |
+| `hypvector.ivf.clusters` | number of non-empty IVF lists; present when `pq: true` |
+| `hypvector.ivf.centroids` | base64-encoded float IVF centroids (`ivf.clusters × dim` float32 values); present when `pq: true` |
+| `hypvector.ivf.counts` | base64-encoded `Uint32Array` of per-IVF-list row counts; present when `pq: true` |
 
 ### CLI
 
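Review note: the lookup-table scoring step named in the README's IVF-PQ paragraph can be illustrated with a minimal sketch. It assumes the dot-product metric, a codebook laid out as `[segment][centroid][subDim]`, and a `dim` that divides evenly by the segment count. The function names `buildDotTables` and `approxDot` are hypothetical and do not appear in this commit.

```js
// Hypothetical sketch of asymmetric scoring with per-segment lookup tables.
// Assumed layout: codebooks[(s * centroids + c) * subDim + d].

/** Precompute dot(query segment s, codeword c) for every (s, c) pair. */
function buildDotTables(query, codebooks, segments, centroids) {
  const subDim = query.length / segments
  const tables = new Float32Array(segments * centroids)
  for (let s = 0; s < segments; s += 1) {
    for (let c = 0; c < centroids; c += 1) {
      const base = (s * centroids + c) * subDim
      let dot = 0
      for (let d = 0; d < subDim; d += 1) {
        dot += query[s * subDim + d] * codebooks[base + d]
      }
      tables[s * centroids + c] = dot
    }
  }
  return tables
}

/**
 * Approximate dot(query, vector) for one residual PQ code.
 * centroidDot is dot(query, IVF centroid) for the list the row belongs to.
 */
function approxDot(code, tables, centroids, centroidDot) {
  let score = centroidDot
  for (let s = 0; s < code.length; s += 1) {
    score += tables[s * centroids + code[s]]
  }
  return score
}
```

Because the codes quantize the residual (vector minus IVF centroid), the approximate score is the query's dot product with the centroid plus one table lookup per code byte. The tables are built once per query, so scanning a row costs `segments` additions rather than `dim` multiplications.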

bin/inspect.js

Lines changed: 6 additions & 0 deletions
@@ -22,6 +22,12 @@ export async function inspect({ path }) {
   console.log(`Metric: ${meta.metric}`)
   console.log(`Normalized: ${meta.normalized}`)
   console.log(`Binary column: ${meta.hasBinary}`)
+  console.log(`PQ column: ${meta.hasPq}`)
+  if (meta.hasPq) {
+    console.log(`PQ segments: ${meta.pqSegments}`)
+    console.log(`PQ centroids: ${meta.pqCentroids}`)
+    console.log(`IVF clusters: ${meta.ivfClusters}`)
+  }
   console.log(`Row groups: ${metadata.row_groups.length.toLocaleString()}`)
   console.log(`Raw float32 size: ${rawSize.toLocaleString()} bytes`)
   console.log(`Overhead: ${(ratio * 100).toFixed(1)}% of raw`)
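Review note: `inspect` prints only the scalar PQ fields. The base64-encoded fields listed in the README (`hypvector.pq.codebooks`, `hypvector.ivf.centroids`, `hypvector.ivf.counts`) could be decoded along the lines below. This is a sketch under stated assumptions: the helper names `decodeBase64Typed` and `listRowRanges` are hypothetical, the bytes are assumed to be little-endian on a little-endian host, and rows are assumed to be stored list by list in list order. None of this is confirmed by the diff.

```js
// Hypothetical helpers; the commit's own decoding code is not shown in this diff.

/** Decode a base64 string into a typed array (e.g. Float32Array, Uint32Array). */
function decodeBase64Typed(b64, TypedArray) {
  const binary = atob(b64) // one character per byte
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i += 1) bytes[i] = binary.charCodeAt(i)
  if (bytes.length % TypedArray.BYTES_PER_ELEMENT !== 0) {
    throw new Error('byte length is not a multiple of the element size')
  }
  // bytes.buffer is a fresh ArrayBuffer, so the view starts at an aligned offset 0.
  return new TypedArray(bytes.buffer)
}

/** Turn per-list row counts into [start, end) row ranges via a running sum. */
function listRowRanges(counts) {
  const ranges = []
  let start = 0
  for (const n of counts) {
    ranges.push([start, start + n])
    start += n
  }
  return ranges
}
```

For example, `listRowRanges(decodeBase64Typed(kv['hypvector.ivf.counts'], Uint32Array))` would give one row range per IVF list, where `kv` stands for a plain object of the file's key-value metadata (also an assumption).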

scripts/ablation.js

Lines changed: 4 additions & 2 deletions
@@ -6,8 +6,8 @@
  * Variants:
  * A) base vector + id only (search must use exact full scan)
  * B) +binary adds vector_bin column (binary phase 1 + per-cand phase 2 reads)
- * C) +cluster B plus k-means clustering + cluster_id col + centroids/counts KV
- * D) +int8 C plus vector_i8 column (int8 cascade between phases 1 and 2)
+ * C) +cluster B plus k-means clustering + centroids/counts KV
+ * D) IVF-PQ vector_pq column + IVF centroids + residual PQ codebooks
  *
  * Page size is held at 32 KB for B-D so we isolate the feature contribution
  * from the page-size knob.
@@ -41,6 +41,7 @@ const variants = [
   { name: 'A_base', label: 'A) base (vec only)', opts: { binary: false } },
   { name: 'B_binary', label: 'B) +binary', opts: { binary: true } },
   { name: 'C_cluster', label: 'C) +cluster', opts: { binary: true, clusters: 128 } },
+  { name: 'D_ivfpq', label: 'D) IVF-PQ', opts: { pq: true, ivfClusters: 128 }, search: { algorithm: 'pq' } },
 ]
 
 for (const v of variants) {
@@ -130,6 +131,7 @@ for (const v of variants) {
   const opts = {}
   // For base file, rerankFactor=0 forces exact path. For others, default rerank/probe.
   if (v.name === 'A_base') opts.rerankFactor = 0
+  Object.assign(opts, v.search)
   const r = await bench(v.path, opts)
   let hits = 0, total = 0
   for (let q = 0; q < ref.tops.length; q += 1) {

src/constants.js

Lines changed: 15 additions & 0 deletions
@@ -13,6 +13,9 @@ export const defaultVectorColumn = 'vector'
 // Default name of the binary (sign-bit) rerank column
 export const defaultBinaryColumn = 'vector_bin'
 
+// Default name of the product-quantized vector code column
+export const defaultPqColumn = 'vector_pq'
+
 // Default name of the id column
 export const defaultIdColumn = 'id'
 
@@ -29,3 +32,15 @@ export const defaultClusterIterations = 6
 // Default fraction of clusters scanned in phase 1 at query time when the
 // file has cluster metadata. Lower = faster but lower recall.
 export const defaultClusterProbeFraction = 0.25
+
+// Default residual product quantization settings. The IVF-PQ path stores
+// one code byte per segment, with values in [0, defaultPqCentroids).
+export const defaultPqSegments = 32
+export const defaultPqCentroids = 64
+export const defaultPqIterations = 8
+export const defaultPqSampleSize = 4096
+
+// Default IVF coarse quantizer settings for the IVF-PQ path.
+export const defaultIvfClusters = 128
+export const defaultIvfIterations = 6
+export const defaultIvfSampleSize = 4096
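Review note: the storage cost implied by these defaults can be worked out from the sizes stated in the README (one code byte per segment, `pq.centroids × dim` codebook floats, `ivf.clusters × dim` centroid floats). The sketch below is illustrative only: `pqFootprint` is a hypothetical helper, and `ivfClusters` is an upper bound because the README counts only non-empty IVF lists.

```js
// Hypothetical helper: estimates IVF-PQ storage from the sizes given in the README.
function pqFootprint({ dim, segments = 32, centroids = 64, ivfClusters = 128 }) {
  if (centroids > 256) {
    throw new Error('one code byte per segment allows at most 256 centroids')
  }
  return {
    rawBytesPerVector: 4 * dim, // float32 `vector` column
    codeBytesPerVector: segments, // `vector_pq` column, one byte per segment
    compression: (4 * dim) / segments, // raw bytes per code byte
    codebookBytes: 4 * centroids * dim, // `pq.centroids × dim` float32 values
    ivfCentroidBytes: 4 * ivfClusters * dim, // at most `ivfClusters × dim` float32 values
  }
}
```

For an assumed `dim` of 768, `pqFootprint({ dim: 768 })` reports 3072 raw bytes versus 32 code bytes per vector (a 96× reduction for the scanned column), plus 196608 bytes of codebooks and up to 393216 bytes of IVF centroids in the key-value metadata.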

src/index.d.ts

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ export type {
   HypVectorMetadata,
   PrefetchBinaryOptions,
   ReadVectorsOptions,
+  SearchAlgorithm,
   SearchResult,
   SearchVectorsOptions,
   VectorRecord,
