You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Benchmarks showed IVF-PQ never beats the binary+cluster path on the target
workloads: it loses on latency and recall at 384-dim, and even at 3072-dim
(text-embedding-3-large) it only reads fewer phase-1 bytes while the float32
rerank column — which both paths keep — dominates file size and bytes-read.
PQ would only pay off as a float-free lossy mode, which isn't built.
Drops src/pq.js and src/search/pq.js plus all `pq`/`ivf` write options,
search algorithm, KV metadata, and the CLI/README references. The findings
and the reasoning are retained in PLAN_AUTO.md.
Also gitignore .env (holds OPENAI_API_KEY used by the model sweep).
Copy file name to clipboardExpand all lines: README.md
+19-29Lines changed: 19 additions & 29 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@
6
6
7
7
## What is hypvector?
8
8
9
-
**HypVector** is a JavaScript library for storing and querying embedding vectors directly out of [Apache Parquet](https://parquet.apache.org) files. It builds on [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer) so that a Parquet file on S3 (or local disk) acts as the vector database — any client can run similarity search over HTTP range requests, without a server in between.
9
+
**HypVector** is a JavaScript library for storing and querying embedding vectors directly out of [Apache Parquet](https://parquet.apache.org) files. It builds on [`hyparquet`](https://github.com/hyparam/hyparquet) and [`hyparquet-writer`](https://github.com/hyparam/hyparquet-writer) so that a Parquet file on S3 (or local disk) acts as the vector database. Any client can run similarity search over HTTP range requests, without a server in between.
By default, `writeVectors` adds the binary sign-bit column and clusters rows automatically once the corpus crosses ~10k vectors. Below that, files are written as plain id + vector columns and search uses an exact full scan. To control these manually, pass `binary: true/false` and `clusters: <n>`; passing either disables the auto behavior for that knob. When the binary column is written, `pageSize` defaults to 32 KB so offset-index reads during search fetch tight ranges. Pass `pq: true` to additionally write an IVF-PQ index for approximate scoring before rerank (mutually exclusive with binary `clusters`).
71
+
By default, `writeVectors` adds the binary sign-bit column and clusters rows automatically once the corpus crosses ~10k vectors. Below that, files are written as plain id + vector columns and search uses an exact full scan. To control these manually, pass `binary: true/false` and `clusters: <n>`; passing either disables the auto behavior for that knob. When the binary column is written, `pageSize` defaults to 32 KB so offset-index reads during search fetch tight ranges.
72
72
73
73
### Producing vectors
74
74
75
75
HypVector is BYO-embedding: you decide which model produces the vectors. It just stores `{ id, vector }` pairs and queries them. The only contracts are:
76
76
77
77
1.**Same model on write and query.** Embeddings from different models aren't comparable.
78
78
2.**Same `dimension`** for every record (must match the `dimension` you pass to `writeVectors`).
79
-
3.**`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine — it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.
79
+
3.**`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.
80
80
81
81
The natural shape is an async generator that yields embedded records as you batch them through your embedder.
Core columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw float32 bytes, `UNCOMPRESSED`), and optional ANN columns: `vector_bin` (`FIXED_LEN_BYTE_ARRAY(dim/8)`, 1 bit per dim) when `binary: true`, and `vector_pq` (`FIXED_LEN_BYTE_ARRAY(pqSegments)`) when `pq: true`.
166
+
Core columns: `id` (STRING), `vector` (`FIXED_LEN_BYTE_ARRAY(4 × dim)`, raw float32 bytes, `UNCOMPRESSED`), and an optional ANN column: `vector_bin` (`FIXED_LEN_BYTE_ARRAY(dim/8)`, 1 bit per dim) when `binary: true`.
167
167
168
-
**Exact search path** (no binary column, or `rerankFactor: 0`): single pass over the float32 column via `parquetRead({ onChunk })`. Each row-group's decoded `Uint8Array[]` shares a backing buffer, so we view it as one aligned `Float32Array` and stride by `dim` — zero per-row allocations.
168
+
**Exact search path** (no binary column, or `rerankFactor: 0`): single pass over the float32 column via `parquetRead({ onChunk })`. Each row-group's decoded `Uint8Array[]` shares a backing buffer, so we view it as one aligned `Float32Array` and stride by `dim`, with zero per-row allocations.
169
169
170
-
**Binary + cluster + rerank path** (default when `binary: true` and no PQ column is present):
170
+
**Binary + cluster + rerank path** (default when `binary: true`):
171
171
172
-
1.**Build-time clustering** (when `clusters > 0`): k-means on the 1-bit codes using Hamming distance and bit-majority voting. Cluster ids are then renumbered via a greedy nearest-neighbor walk so that adjacent ids = similar centroids — this makes the top-N nearest clusters at query time tend to land in fewer contiguous row ranges. Rows are sorted by the new cluster id. Centroids and per-cluster row counts go into KV metadata.
173
-
2.**Phase 1 — cluster pruning**: rank clusters by Hamming(query, centroid), pick the top `probe` fraction, and Hamming-scan only those clusters' row ranges. With 32 KB pages and `useOffsetIndex`, hyparquet fetches only the pages covering each cluster's rows.
174
-
3.**Phase 2 — float32 rerank**: collect the top `topK × rerankFactor` candidate row indices, coalesce them into contiguous runs (merging gaps ≤ 64 rows), and issue one ranged `parquetRead` per run for the `vector` column only. Score under the exact metric.
175
-
4.**Phase 3 — id lookup**: fetch the `id` column for *only* the top-K winners (the id column is variable-length and reading it for every candidate doubles phase-2 cost).
172
+
1.**Build-time clustering** (when `clusters > 0`): k-means on the 1-bit codes using Hamming distance and bit-majority voting. Cluster ids are then renumbered via a greedy nearest-neighbor walk so that adjacent ids = similar centroids. This makes the top-N nearest clusters at query time tend to land in fewer contiguous row ranges. Rows are sorted by the new cluster id. Centroids and per-cluster row counts go into KV metadata.
173
+
2.**Phase 1, cluster pruning**: rank clusters by Hamming(query, centroid), pick the top `probe` fraction, and Hamming-scan only those clusters' row ranges. With 32 KB pages and `useOffsetIndex`, hyparquet fetches only the pages covering each cluster's rows.
174
+
3.**Phase 2, float32 rerank**: collect the top `topK × rerankFactor` candidate row indices, coalesce them into contiguous runs (merging gaps ≤ 64 rows), and issue one ranged `parquetRead` per run for the `vector` column only. Score under the exact metric.
175
+
4.**Phase 3, id lookup**: fetch the `id` column for *only* the top-K winners (the id column is variable-length and reading it for every candidate doubles phase-2 cost).
176
176
177
177
A `cachedAsyncBuffer` deduplicates footer / offset-index byte ranges across all the parallel `parquetRead` calls.
178
178
179
-
**IVF-PQ + rerank path** (`algorithm: 'pq'`, or `auto` when a file has PQ but no binary column): rank stored float IVF centroids against the query, scan compact residual `vector_pq` codes over the selected IVF row groups, approximate-score candidates with lookup tables built from the query, IVF centroid, and residual PQ codebooks, then fetch full float32 vectors only for the candidate pool and exact-rerank as above. IVF-PQ uses its own row ordering and should not be combined with binary `clusters`.
180
-
181
179
For pre-normalized vectors with `metric: 'cosine'`, the search normalizes the query once and scores via dot product to skip the per-candidate sqrt loop.
182
180
183
181
### File layout
@@ -187,7 +185,6 @@ For pre-normalized vectors with `metric: 'cosine'`, the search normalizes the qu
|`hypvector.ivf.counts`| base64-encoded `Uint32Array` of per-IVF-list row counts; present when `pq: true`|
212
202
213
203
### CLI
214
204
@@ -229,7 +219,7 @@ The default `rerankFactor` of 10 is tuned for the hundreds-of-thousands range. A
229
219
| 100 | 1,000 | 155 | 68% |
230
220
| 300 | 3,000 | 443 | 98% |
231
221
232
-
Rough rule: `rerankFactor ≈ max(10, N / 3000)`. At 1M that's ~333, giving ~98% recall at ~440 ms — still about an order of magnitude faster than the 950 ms exact scan.
222
+
Rough rule: `rerankFactor ≈ max(10, N / 3000)`. At 1M that's ~333, giving ~98% recall at ~440 ms, still about an order of magnitude faster than the 950 ms exact scan.
233
223
234
224
## Performance
235
225
@@ -239,7 +229,7 @@ From `scripts/ablation.js` (write-side optimizations):
0 commit comments