Commit 92e09bc

Remove the IVF-PQ index path
Benchmarks showed IVF-PQ never beats the binary+cluster path on the target workloads: it loses on latency and recall at 384-dim, and even at 3072-dim (text-embedding-3-large) it only reads fewer phase-1 bytes while the float32 rerank column — which both paths keep — dominates file size and bytes-read. PQ would only pay off as a float-free lossy mode, which isn't built. Drops src/pq.js and src/search/pq.js plus all `pq`/`ivf` write options, search algorithm, KV metadata, and the CLI/README references. The findings and the reasoning are retained in PLAN_AUTO.md. Also gitignore .env (holds OPENAI_API_KEY used by the model sweep).
1 parent ff89376 commit 92e09bc
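
The size argument in the commit message can be checked with per-row byte arithmetic. The sketch below uses the column widths stated in the README (`4 × dim` for `vector`, `dim/8` for `vector_bin`, `pqSegments` for the removed `vector_pq`). The helper name `bytesPerRow` and the value of 96 PQ segments are assumptions for illustration, and the `id` column is ignored.

```javascript
// Per-row byte widths, taken from the column definitions in the README.
// `bytesPerRow` is a hypothetical helper, not part of hypvector.
function bytesPerRow(dim, pqSegments) {
  return {
    float32: 4 * dim, // `vector`: FIXED_LEN_BYTE_ARRAY(4 × dim)
    binary: dim / 8, // `vector_bin`: 1 bit per dimension
    pq: pqSegments, // removed `vector_pq`: 1 byte per PQ segment
  }
}

// 3072 dimensions matches text-embedding-3-large.
// 96 PQ segments is an assumed value, not one taken from the benchmarks.
const row = bytesPerRow(3072, 96)

// Share of the per-row bytes taken by the compact phase-1 column.
const binaryShare = row.binary / (row.float32 + row.binary)
const pqShare = row.pq / (row.float32 + row.pq)

console.log(row)
console.log(binaryShare.toFixed(4), pqShare.toFixed(4))
```

Under these assumptions either phase-1 column is a few percent of the row at most, so swapping one for the other barely moves file size while the float32 column is kept. This models file size only, not the bytes read per query.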

16 files changed

Lines changed: 393 additions & 1089 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ data
 *.tgz
 *.parquet
 .DS_Store
+.env

CLAUDE.md

Lines changed: 20 additions & 20 deletions
@@ -22,32 +22,32 @@ npm run benchmark # write + search benchmark

 Vectors are stored in a single Parquet file with two columns:

-- `id` (STRING) caller-supplied identifier, coerced to string
-- `vector` (FIXED_LEN_BYTE_ARRAY, `type_length = 4 * dimension`) raw little-endian float32 bytes
+- `id` (STRING): caller-supplied identifier, coerced to string
+- `vector` (FIXED_LEN_BYTE_ARRAY, `type_length = 4 * dimension`): raw little-endian float32 bytes

 Format-level info lives in Parquet KV metadata so readers don't need out-of-band coordination:

-- `hypvector.version` index format version
-- `hypvector.dimension` length of each vector
-- `hypvector.metric` intended similarity metric (`cosine` | `dot` | `euclidean`)
-- `hypvector.normalized` whether vectors were L2-normalized on write
-- `hypvector.count` vector count
+- `hypvector.version`: index format version
+- `hypvector.dimension`: length of each vector
+- `hypvector.metric`: intended similarity metric (`cosine` | `dot` | `euclidean`)
+- `hypvector.normalized`: whether vectors were L2-normalized on write
+- `hypvector.count`: vector count

 ### Core modules (`src/`)

-- `writeVectors.js` packs each vector to float32 bytes and writes to a Parquet `BYTE_ARRAY` column. Accepts sync or async iterables.
-- `readVectors.js` async generator that yields `{ id, vector }` records, unpacking bytes back to `Float32Array`.
-- `searchVectors.js` linear-scan top-k search. Streams every vector, computes the chosen metric, keeps a bounded result set.
-- `utils.js` `cosineSimilarity`, `dotProduct`, `euclideanDistance`, `l2Normalize`, plus `packFloat32` / `unpackFloat32` / `parseKvMetadata`.
-- `constants.js` version and defaults.
+- `writeVectors.js`: packs each vector to float32 bytes and writes to a Parquet `BYTE_ARRAY` column. Accepts sync or async iterables.
+- `readVectors.js`: async generator that yields `{ id, vector }` records, unpacking bytes back to `Float32Array`.
+- `searchVectors.js`: linear-scan top-k search. Streams every vector, computes the chosen metric, keeps a bounded result set.
+- `utils.js`: `cosineSimilarity`, `dotProduct`, `euclideanDistance`, `l2Normalize`, plus `packFloat32` / `unpackFloat32` / `parseKvMetadata`.
+- `constants.js`: version and defaults.

 ### Known limitations (intentional for v0)

-- **Linear scan only** no ANN index, no partitioning, no inverted lists.
-- **Full file read for search** every query reads the entire `vector` column.
-- **No quantization** float32 only; int8 / binary / product quantization are future experiments.
-- **PLAIN encoding** no `BYTE_STREAM_SPLIT` or other float-friendly encoding yet.
-- **No batching API** `writeVectors` materializes all packed bytes before writing.
+- **Linear scan only**: no ANN index, no partitioning, no inverted lists.
+- **Full file read for search**: every query reads the entire `vector` column.
+- **No quantization**: float32 only; int8 / binary / product quantization are future experiments.
+- **PLAIN encoding**: no `BYTE_STREAM_SPLIT` or other float-friendly encoding yet.
+- **No batching API**: `writeVectors` materializes all packed bytes before writing.

 These are intentional starting points; each one is a candidate for a future experiment.

@@ -68,8 +68,8 @@ These are intentional starting points; each one is a candidate for a future expe

 ## Dependencies

-- **hyparquet** Parquet reading
-- **hyparquet-writer** Parquet writing
-- **hyparquet-compressors** Compression codecs
+- **hyparquet**: Parquet reading
+- **hyparquet-writer**: Parquet writing
+- **hyparquet-compressors**: Compression codecs

 All three are maintained by Hyperparam.

PLAN_AUTO.md

Lines changed: 79 additions & 20 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 19 additions & 29 deletions
@@ -6,7 +6,7 @@

 ## What is hypvector?

-**HypVector** is a JavaScript library for storing and querying embedding vectors directly out of [Apache Parquet](https://parquet.apache.org) files. It builds on [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer) so that a Parquet file on S3 (or local disk) acts as the vector database — any client can run similarity search over HTTP range requests, without a server in between.
+**HypVector** is a JavaScript library for storing and querying embedding vectors directly out of [Apache Parquet](https://parquet.apache.org) files. It builds on [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer) so that a Parquet file on S3 (or local disk) acts as the vector database. Any client can run similarity search over HTTP range requests, without a server in between.

 - Works in browsers and node.js
 - Self-describing files (dimension, metric, normalization, cluster centroids in Parquet KV metadata)
@@ -68,15 +68,15 @@ await writeVectors({
 })
 ```

-By default, `writeVectors` adds the binary sign-bit column and clusters rows automatically once the corpus crosses ~10k vectors. Below that, files are written as plain id + vector columns and search uses an exact full scan. To control these manually, pass `binary: true/false` and `clusters: <n>`; passing either disables the auto behavior for that knob. When the binary column is written, `pageSize` defaults to 32 KB so offset-index reads during search fetch tight ranges. Pass `pq: true` to additionally write an IVF-PQ index for approximate scoring before rerank (mutually exclusive with binary `clusters`).
+By default, `writeVectors` adds the binary sign-bit column and clusters rows automatically once the corpus crosses ~10k vectors. Below that, files are written as plain id + vector columns and search uses an exact full scan. To control these manually, pass `binary: true/false` and `clusters: <n>`; passing either disables the auto behavior for that knob. When the binary column is written, `pageSize` defaults to 32 KB so offset-index reads during search fetch tight ranges.

 ### Producing vectors

 HypVector is BYO-embedding: you decide which model produces the vectors. It just stores `{ id, vector }` pairs and queries them. The only contracts are:

 1. **Same model on write and query.** Embeddings from different models aren't comparable.
 2. **Same `dimension`** for every record (must match the `dimension` you pass to `writeVectors`).
-3. **`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.
+3. **`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.

 The natural shape is an async generator that yields embedded records as you batch them through your embedder.

@@ -151,7 +151,7 @@ const results = await searchVectors({
 source: 'https://example.com/vectors.parquet', // URL, local file path, or an open AsyncBuffer
 query: queryVec, // Float32Array of length `dimension`
 topK: 10,
-algorithm: 'auto', // 'auto' | 'exact' | 'binary' | 'pq'
+algorithm: 'auto', // 'auto' | 'exact' | 'binary'
 rerankFactor: 10, // candidate pool = topK * rerankFactor (default 10). Set to 0 to force exact full scan.
 probe: 0.25, // fraction of clusters to scan in phase 1 (default 0.25). Set to 1 to scan all clusters; pass an integer > 1 for an absolute count.
 })
@@ -163,21 +163,19 @@ const results = await searchVectors({

 ### How it works

-Core columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw float32 bytes, `UNCOMPRESSED`), and optional ANN columns: `vector_bin` (`FIXED_LEN_BYTE_ARRAY(dim/8)`, 1 bit per dim) when `binary: true`, and `vector_pq` (`FIXED_LEN_BYTE_ARRAY(pqSegments)`) when `pq: true`.
+Core columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw float32 bytes, `UNCOMPRESSED`), and an optional ANN column: `vector_bin` (`FIXED_LEN_BYTE_ARRAY(dim/8)`, 1 bit per dim) when `binary: true`.

-**Exact search path** (no binary column, or `rerankFactor: 0`): single pass over the float32 column via `parquetRead({ onChunk })`. Each row-group's decoded `Uint8Array[]` shares a backing buffer, so we view it as one aligned `Float32Array` and stride by `dim` zero per-row allocations.
+**Exact search path** (no binary column, or `rerankFactor: 0`): single pass over the float32 column via `parquetRead({ onChunk })`. Each row-group's decoded `Uint8Array[]` shares a backing buffer, so we view it as one aligned `Float32Array` and stride by `dim`, with zero per-row allocations.

-**Binary + cluster + rerank path** (default when `binary: true` and no PQ column is present):
+**Binary + cluster + rerank path** (default when `binary: true`):

-1. **Build-time clustering** (when `clusters > 0`): k-means on the 1-bit codes using Hamming distance and bit-majority voting. Cluster ids are then renumbered via a greedy nearest-neighbor walk so that adjacent ids = similar centroids — this makes the top-N nearest clusters at query time tend to land in fewer contiguous row ranges. Rows are sorted by the new cluster id. Centroids and per-cluster row counts go into KV metadata.
-2. **Phase 1 cluster pruning**: rank clusters by Hamming(query, centroid), pick the top `probe` fraction, and Hamming-scan only those clusters' row ranges. With 32 KB pages and `useOffsetIndex`, hyparquet fetches only the pages covering each cluster's rows.
-3. **Phase 2 float32 rerank**: collect the top `topK × rerankFactor` candidate row indices, coalesce them into contiguous runs (merging gaps ≤ 64 rows), and issue one ranged `parquetRead` per run for the `vector` column only. Score under the exact metric.
-4. **Phase 3 id lookup**: fetch the `id` column for *only* the top-K winners (the id column is variable-length and reading it for every candidate doubles phase-2 cost).
+1. **Build-time clustering** (when `clusters > 0`): k-means on the 1-bit codes using Hamming distance and bit-majority voting. Cluster ids are then renumbered via a greedy nearest-neighbor walk so that adjacent ids = similar centroids. This makes the top-N nearest clusters at query time tend to land in fewer contiguous row ranges. Rows are sorted by the new cluster id. Centroids and per-cluster row counts go into KV metadata.
+2. **Phase 1, cluster pruning**: rank clusters by Hamming(query, centroid), pick the top `probe` fraction, and Hamming-scan only those clusters' row ranges. With 32 KB pages and `useOffsetIndex`, hyparquet fetches only the pages covering each cluster's rows.
+3. **Phase 2, float32 rerank**: collect the top `topK × rerankFactor` candidate row indices, coalesce them into contiguous runs (merging gaps ≤ 64 rows), and issue one ranged `parquetRead` per run for the `vector` column only. Score under the exact metric.
+4. **Phase 3, id lookup**: fetch the `id` column for *only* the top-K winners (the id column is variable-length and reading it for every candidate doubles phase-2 cost).

 A `cachedAsyncBuffer` deduplicates footer / offset-index byte ranges across all the parallel `parquetRead` calls.

-**IVF-PQ + rerank path** (`algorithm: 'pq'`, or `auto` when a file has PQ but no binary column): rank stored float IVF centroids against the query, scan compact residual `vector_pq` codes over the selected IVF row groups, approximate-score candidates with lookup tables built from the query, IVF centroid, and residual PQ codebooks, then fetch full float32 vectors only for the candidate pool and exact-rerank as above. IVF-PQ uses its own row ordering and should not be combined with binary `clusters`.
-
 For pre-normalized vectors with `metric: 'cosine'`, the search normalizes the query once and scores via dot product to skip the per-candidate sqrt loop.

 ### File layout
@@ -187,7 +185,6 @@ For pre-normalized vectors with `metric: 'cosine'`, the search normalizes the qu
 | `id` | `STRING` (UTF8) | variable | always |
 | `vector` | `FIXED_LEN_BYTE_ARRAY(4 × dim)` | `4 × dim` | always |
 | `vector_bin` | `FIXED_LEN_BYTE_ARRAY(dim/8)` | `dim/8` | when `binary: true` |
-| `vector_pq` | `FIXED_LEN_BYTE_ARRAY(pqSegments)` | `pqSegments` | when `pq: true` |

 Key-value metadata:

@@ -198,17 +195,10 @@ Key-value metadata:
 | `hypvector.metric` | `cosine` \| `dot` \| `euclidean` |
 | `hypvector.normalized` | `true` if vectors were L2-normalized on write |
 | `hypvector.binary` | `true` if the `vector_bin` column is present |
-| `hypvector.pq` | `true` if the `vector_pq` column is present |
 | `hypvector.count` | number of vectors |
 | `hypvector.clusters` | number of k-means clusters (0 if not clustered) |
 | `hypvector.centroids` | base64-encoded centroid binary codes (`clusters × dim/8` bytes); present when `clusters > 0` |
 | `hypvector.clusterCounts` | base64-encoded `Uint32Array` of per-cluster row counts; present when `clusters > 0` |
-| `hypvector.pq.segments` | number of PQ sub-vectors / bytes per code; present when `pq: true` |
-| `hypvector.pq.centroids` | centroids per PQ sub-vector; present when `pq: true` |
-| `hypvector.pq.codebooks` | base64-encoded residual `Float32Array` codebooks (`pq.centroids × dim` floats); present when `pq: true` |
-| `hypvector.ivf.clusters` | number of non-empty IVF lists; present when `pq: true` |
-| `hypvector.ivf.centroids` | base64-encoded float IVF centroids (`ivf.clusters × dim` float32 values); present when `pq: true` |
-| `hypvector.ivf.counts` | base64-encoded `Uint32Array` of per-IVF-list row counts; present when `pq: true` |

 ### CLI

@@ -229,7 +219,7 @@ The default `rerankFactor` of 10 is tuned for the hundreds-of-thousands range. A
 | 100 | 1,000 | 155 | 68% |
 | 300 | 3,000 | 443 | 98% |

-Rough rule: `rerankFactor ≈ max(10, N / 3000)`. At 1M that's ~333, giving ~98% recall at ~440 ms still about an order of magnitude faster than the 950 ms exact scan.
+Rough rule: `rerankFactor ≈ max(10, N / 3000)`. At 1M that's ~333, giving ~98% recall at ~440 ms, still about an order of magnitude faster than the 950 ms exact scan.

 ## Performance

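The rule of thumb in the hunk above, `rerankFactor ≈ max(10, N / 3000)`, is simple enough to sketch directly. The function names are hypothetical and rounding to an integer is an assumption of this sketch. The candidate pool size follows the `topK * rerankFactor` definition given in the `searchVectors` options.

```javascript
// Rule of thumb from the README: rerankFactor ≈ max(10, N / 3000).
// `suggestRerankFactor` is a hypothetical helper; rounding is an assumption.
function suggestRerankFactor(count) {
  return Math.max(10, Math.round(count / 3000))
}

// Candidate pool size as described for searchVectors: topK * rerankFactor.
function candidatePoolSize(topK, rerankFactor) {
  return topK * rerankFactor
}

const factor = suggestRerankFactor(1_000_000)
console.log(factor, candidatePoolSize(10, factor))
```

For small corpora the floor of 10 applies, which matches the documented default.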
@@ -239,7 +229,7 @@ From `scripts/ablation.js` (write-side optimizations):

 | Variant | File MB | Query ms | Fetches | MB read | Recall@10 |
 |---|---:|---:|---:|---:|---:|
-| base (`vector` + `id`) forced exact scan | 241.5 | 108 | 33 | 242.0 | 100% |
+| base (`vector` + `id`), forced exact scan | 241.5 | 108 | 33 | 242.0 | 100% |
 | `+ binary` (phase 1 + 2 rerank) | 249.3 | 48 | 136 | 11.7 | 93% |
 | `+ cluster` (default; `probe=0.25`, `clusters=128`) | 249.4 | 15 | 162 | 6.2 | 91% |

@@ -276,8 +266,8 @@ hypvector isn't a hosted service. The closest peers are:

 | Engine | Server? | Cold p50 | Warm p50 | Fixed $/mo |
 |---|---|---:|---:|---:|
-| **hypvector** | nonefile on S3 | ~500 ms (CloudFront, home WAN) | same no cache | $0 |
-| **LanceDB** (S3 mode) | none embedded | bandwidth-bound | sub-50 ms (local) | $0 |
+| **hypvector** | none, file on S3 | ~500 ms (CloudFront, home WAN) | same, no cache | $0 |
+| **LanceDB** (S3 mode) | none, embedded | bandwidth-bound | sub-50 ms (local) | $0 |
 | **turbopuffer** | hosted | ~440 ms p90 | ~8 ms | $64 min |
 | **Pinecone Serverless** | hosted | 200 ms – 2 s | 50–100 ms | $0 + per-RU |
 | **Cloudflare Vectorize** | hosted (edge) | needs pre-warm | edge-fast | $0 + per-op |
@@ -286,10 +276,10 @@ Use hypvector for static datasets, browser-side search, or low-QPS where a hoste

 ## References

-- [hyparquet](https://github.com/hyparam/hyparquet) Parquet reading
-- [hyparquet-writer](https://github.com/hyparam/hyparquet-writer) Parquet writing
-- [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) Compression codecs
-- [Apache Parquet](https://parquet.apache.org) Columnar storage format
+- [hyparquet](https://github.com/hyparam/hyparquet): Parquet reading
+- [hyparquet-writer](https://github.com/hyparam/hyparquet-writer): Parquet writing
+- [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors): Compression codecs
+- [Apache Parquet](https://parquet.apache.org): Columnar storage format

 ## Contributions

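Phase 2 in the README diff above describes coalescing candidate row indices into contiguous runs, merging gaps of at most 64 rows, so that each run costs one ranged read. A minimal sketch of that step follows. The function name `coalesceRows`, the `{ start, end }` shape with an exclusive `end`, and the reading of "gap" as the number of unselected rows between two candidates are all assumptions, not the library's actual code.

```javascript
// Coalesce candidate row indices into contiguous [start, end) runs,
// merging neighbours separated by at most `maxGap` unselected rows.
// Hypothetical helper; hypvector's real implementation may differ.
function coalesceRows(rows, maxGap = 64) {
  // Deduplicate and sort ascending so runs can be built in one pass.
  const sorted = [...new Set(rows)].sort((a, b) => a - b)
  const runs = []
  for (const row of sorted) {
    const last = runs[runs.length - 1]
    // `last.end` is exclusive, so `row - last.end` counts the skipped rows.
    if (last && row - last.end <= maxGap) {
      last.end = row + 1
    } else {
      runs.push({ start: row, end: row + 1 })
    }
  }
  return runs
}

const runs = coalesceRows([500, 12, 10, 50, 11])
console.log(runs)
```

Merging small gaps trades a few extra rows read for fewer range requests, which is usually the better deal over HTTP where each request carries fixed latency.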
bin/inspect.js

Lines changed: 1 addition & 6 deletions
@@ -22,12 +22,7 @@ export async function inspect({ path }) {
 console.log(`Metric: ${meta.metric}`)
 console.log(`Normalized: ${meta.normalized}`)
 console.log(`Binary column: ${meta.hasBinary}`)
-console.log(`PQ column: ${meta.hasPq}`)
-if (meta.hasPq) {
-console.log(`PQ segments: ${meta.pqSegments}`)
-console.log(`PQ centroids: ${meta.pqCentroids}`)
-console.log(`IVF clusters: ${meta.ivfClusters}`)
-}
+console.log(`Clusters: ${meta.clusters}`)
 console.log(`Row groups: ${metadata.row_groups.length.toLocaleString()}`)
 console.log(`Raw float32 size: ${rawSize.toLocaleString()} bytes`)
 console.log(`Overhead: ${(ratio * 100).toFixed(1)}% of raw`)

scripts/ablation.js

Lines changed: 1 addition & 4 deletions
@@ -7,9 +7,8 @@
 * A) base vector + id only (search must use exact full scan)
 * B) +binary adds vector_bin column (binary phase 1 + per-cand phase 2 reads)
 * C) +cluster B plus k-means clustering + centroids/counts KV
-* D) IVF-PQ vector_pq column + IVF centroids + residual PQ codebooks
 *
-* Page size is held at 32 KB for B-D so we isolate the feature contribution
+* Page size is held at 32 KB for B-C so we isolate the feature contribution
 * from the page-size knob.
 */
 import { promises as fs } from 'node:fs'
@@ -41,7 +40,6 @@ const variants = [
 { name: 'A_base', label: 'A) base (vec only)', opts: { binary: false } },
 { name: 'B_binary', label: 'B) +binary', opts: { binary: true } },
 { name: 'C_cluster', label: 'C) +cluster', opts: { binary: true, clusters: 128 } },
-{ name: 'D_ivfpq', label: 'D) IVF-PQ', opts: { pq: true, ivfClusters: 128 }, search: { algorithm: 'pq' } },
 ]

 for (const v of variants) {
@@ -131,7 +129,6 @@ for (const v of variants) {
 const opts = {}
 // For base file, rerankFactor=0 forces exact path. For others, default rerank/probe.
 if (v.name === 'A_base') opts.rerankFactor = 0
-Object.assign(opts, v.search)
 const r = await bench(v.path, opts)
 let hits = 0, total = 0
 for (let q = 0; q < ref.tops.length; q += 1) {
