Skip to content

Commit ff89376

Browse files
committed
Auto-tune binary and clusters at write time
writeVectors now picks `binary` and `clusters` automatically based on the input size: binary turns on at N >= 10000, clusters defaults to round(sqrt(N)/2) when the caller leaves both flags unset. Passing either flag explicitly disables the auto behavior for that knob, so existing callers see no change. The thresholds come from new sweeps on a 100k LLM-log corpus and the existing 156k wiki corpus (see PARAMETERS.md / PLAN_AUTO.md). probe and rerankFactor stay at their current defaults — the wiki sanity check showed lowering probe regresses recall there.
1 parent 2934367 commit ff89376

24 files changed

Lines changed: 600 additions & 256 deletions

CLAUDE.md

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,6 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
66

77
HypVector is a library for storing and querying embedding vectors in Parquet files. It targets serverless similarity search: clients fetch a Parquet file (over HTTP range requests or from local disk) and run search directly, without a vector database.
88

9-
The current implementation is a **naive baseline (v0)**. It is deliberately simple so that future experiments can establish clear baselines for storage size, query latency, and recall.
10-
119
## Build and Test Commands
1210

1311
```bash
@@ -18,7 +16,7 @@ npm run lint:fix # eslint --fix
1816
npm run benchmark # write + search benchmark
1917
```
2018

21-
## Architecture (v0 naive)
19+
## Architecture
2220

2321
### Storage layout
2422

PARAMETERS.md

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
# PARAMETERS
2+
3+
Every user-facing knob in hypvector, what it does, and where the value lives at query time. The companion file [PLAN_AUTO.md](PLAN_AUTO.md) tracks how each one becomes automatic.
4+
5+
## Write-side (`writeVectors`)
6+
7+
| Param | Default | What it does |
8+
|---|---|---|
9+
| `dimension` | required | Length of each vector. Stored in KV `hypvector.dimension`. |
10+
| `metric` | `'cosine'` | Intended similarity metric. Hint stored in KV; search reads it. |
11+
| `normalize` | `false` | L2-normalize on write; lets cosine score via dot product. Stored in KV `hypvector.normalized`. |
12+
| `binary` | auto (on at `N ≥ 10000`) | Also write a 1-bit-per-dim sign column (`vector_bin`) for the Hamming phase-1 rerank path. Adds ~`dim/8` bytes/row (~1.5% at 384-dim). Pass `false` to force-off. |
13+
| `clusters` | auto (`round(sqrt(N)/2)` in full auto mode) | Number of k-means clusters. Implies `binary: true`. Rows are reordered by cluster id; centroids + per-cluster counts go into KV. Enables phase-1 cluster skipping. Pass `0` to force-off, or an integer to set explicitly. |
14+
| `clusterIterations` | `6` | k-means iterations over the 1-bit codes. |
15+
| `clusterSeed` | `1` | RNG seed for deterministic clustering. |
16+
| `codec` | `'UNCOMPRESSED'` | Parquet codec. SNAPPY/ZSTD rarely shrink float embeddings and cost query latency. |
17+
| `pageSize` | `1 MB` (or 32 KB when `binary`) | Parquet page size. Smaller pages let `useOffsetIndex` fetch tighter byte ranges during rerank phase 2. |
18+
| `rowGroupSize` | `10000` (or per-cluster sizes when clustered) | Rows per row group. When clustering, each cluster becomes its own row group. |
19+
20+
## Search-side (`searchVectors`)
21+
22+
| Param | Default | What it does |
23+
|---|---|---|
24+
| `query` | required | The query vector. Must match `dimension`. |
25+
| `source` | required | URL, file path, AsyncBuffer, or array of any of those (parallel multi-file search). |
26+
| `topK` | `10` | Number of nearest neighbors to return. |
27+
| `metric` | from KV | Override the stored metric. Almost never needed. |
28+
| `rerankFactor` | `10` | Candidate pool size = `topK × rerankFactor`. `0` forces exact full scan. Higher = more recall, more bytes fetched. Suggested `~max(10, N/3000)`. |
29+
| `probe` | `0.25` | Fraction (or integer count) of clusters to scan in phase 1. Lower = faster, lower recall. Ignored if file has no centroids. |
30+
| `binary` | none | Pre-fetched binary column (from `prefetchBinary`). When provided, phase-1 Hamming scan runs from memory. |
31+
| `metadata` | none | Pre-parsed parquet metadata, reused across queries. Pure latency win. |
32+
| `signal` | none | AbortSignal. |
33+
| `asyncBufferFactory` | `cachedAsyncBuffer` wrapper | How to open a string `source`. |
34+
| `compressors` | none | Custom decompressor map. |
35+
36+
## Where each parameter is decided
37+
38+
- **Stored in KV metadata, read implicitly at query time**: `dimension`, `metric`, `normalized`, `binary` (presence of column), `clusters`, centroids, cluster counts. The caller never restates these on search.
39+
- **Search-side, must be passed every query**: `topK`, `rerankFactor`, `probe`. These are the per-query trade-offs and the main targets for auto-tuning.
40+
- **Pure performance (no correctness implications)**: `pageSize`, `rowGroupSize`, `codec`, `metadata` reuse, `binary` prefetch, `asyncBufferFactory`. Defaults already cover the common case; ablations exist for `pageSize` and `codec`.
41+
- **Build-time only**: `clusterIterations`, `clusterSeed`. Set once at write.

PLAN_AUTO.md

Lines changed: 180 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,180 @@
1+
# Auto-tuning plan
2+
3+
Goal: make hypvector's knobs disappear for the common case. Caller passes `vectors`, `query`, `topK`. Everything else is either picked from the inputs or burned into the file at write time.
4+
5+
For every parameter in [PARAMETERS.md](PARAMETERS.md), we pick exactly one of four strategies:
6+
7+
- **Fixed** — one value that's better than alternatives across realistic regimes. No knob exposed (or expose only as an escape hatch).
8+
- **Derive(inputs)** — compute at call time from things we already have: `N`, `dimension`, `topK`.
9+
- **KV-metadata** — write-time decision is recorded in the parquet, search reads it transparently. No restatement at query time.
10+
- **Document** — keep the knob, but tell people clearly when to reach for it. Falls back to a sensible default.
11+
12+
Each parameter below has a current state, a target strategy, and the experiments needed to lock in the strategy.
13+
14+
## Decision table
15+
16+
### Write-side
17+
18+
| Param | Today | Target | Why / experiment |
19+
|---|---|---|---|
20+
| `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
21+
| `metric` | `'cosine'` arg | **KV-metadata** (already) | Already stored. Make `'cosine'` the default and stop asking. |
22+
| `normalize` | `false` arg | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. We should flip the default and just normalize if the caller doesn't say otherwise. Cheap, harmless if vectors are already unit-length. **Needed**: confirm there's no observable downside on the LLM log corpus. |
23+
| `binary` | `false` arg | **Derive(N, dimension): on when worth it** | At ~1.5% extra bytes for ~50× fewer bytes-read in phase 2, binary is almost always worth it past ~10k vectors. **Needed**: write-time check using `N` — turn on automatically for `N ≥ ~10k`; below that, exact scan is fine. Ablate on LLM log to confirm threshold. |
24+
| `clusters` | `0` arg | **Derive(N)** | Roughly `clusters ≈ sqrt(N)` is the IVF folklore rule (and matches our 128 for 156k = ~395 floor). **Needed**: sweep `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` on LLM logs at 50k / 100k / 500k. Lock in a formula. |
25+
| `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
26+
| `clusterSeed` | `1` | **Fixed (1)** | Determinism is the only reason this exists. No reason to expose. |
27+
| `codec` | `'UNCOMPRESSED'` | **Fixed** | Already ablated (`scripts/test-encoding.js`, `data/enc_*`). Float embeddings don't compress; SNAPPY/ZSTD costs latency. Hide. |
28+
| `pageSize` | `1 MB` / `32 KB` when binary | **Derive(binary)** | Already automatic — keep the rule, hide the knob from the public API unless a test rig needs it. |
29+
| `rowGroupSize` | `10000` / per-cluster | **Derive(clusters)** | Already automatic — clustered files use per-cluster row groups, unclustered uses 10k. Hide the knob. |
30+
31+
### Search-side
32+
33+
| Param | Today | Target | Why / experiment |
34+
|---|---|---|---|
35+
| `topK` | `10` | **Required**, default 10 | Caller intent. Keep. |
36+
| `query`, `source`, `metadata`, `binary`, `signal` || **Required / passthrough** | These aren't tuning knobs. |
37+
| `metric` | from KV | **KV-metadata** (already) | Already automatic. The argument exists only as an override; demote to "rarely needed". |
38+
| `rerankFactor` | `10` | **Derive(N, topK)** | The README already documents `~max(10, N/3000)`. Make this the default — read `N` from KV and compute. Caller can still override for the recall/latency knob. **Needed**: confirm the `N/3000` rule on LLM logs at 100k / 500k / 1M. The wiki benchmark only validates it at 1M synthetic. |
39+
| `probe` | `0.25` | **Derive(N, clusters)** | Probe is tightly coupled to recall. **Needed**: sweep `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` on LLM logs, plot recall vs. ms. If the recall@10 curve is well-behaved (monotonic, knee in a predictable place), pick a default that gives ≥90% recall; expose `probe` only when caller wants more/less recall. |
40+
41+
## What we need to actually run
42+
43+
Most parameters above resolve via existing evidence (the README ablations) or trivial code changes. The genuinely open questions all need the **same dataset** and the **same sweep harness**:
44+
45+
1. **The dataset**: `AmanPriyanshu/tool-reasoning-sft-CODING-jupyter-agent-dataset-sft-tool-use-agent-data-cleaned-rectified` from Hugging Face. LLM tool/code logs — repetitive, long-tailed, structurally different from wiki titles. If our defaults look wrong on this, we know they're tuned to wiki.
46+
2. **Embed at 384-dim with MiniLM**, normalized — same model as the wiki baseline, so numbers compare directly.
47+
3. **Sweep, at 50k / 100k / 500k row subsets**:
48+
- `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` (write-side, expensive)
49+
- `rerankFactor ∈ {0, 10, 30, 100, max(10, N/3000), 300}` (cheap; redo per query set)
50+
- `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` (cheap; same)
51+
4. **Report** recall@10, ms/query, fetches, MB read — same table format as the existing README ablation, so they're directly comparable.
52+
53+
If LLM log results agree with wiki, we adopt the `sqrt(N)` / `N/3000` / `probe=0.25` defaults and document. If they disagree, we keep the knobs as "tune for your corpus" and write up the difference.
54+
55+
## Empirical results — LLM logs, 100k × 384-dim MiniLM
56+
57+
From `scripts/sweep-llmlog.js`. Corpus is 100k messages from the tool-reasoning-sft dataset, embedded with `Xenova/all-MiniLM-L6-v2`, normalized. 20 in-corpus queries; reference top-10 from exact full scan.
58+
59+
### Clusters sweep (probe=0.25, rerankFactor=10)
60+
61+
| clusters | size MB | ms | fetches | MB read | recall |
62+
|---:|---:|---:|---:|---:|---:|
63+
| 0 (no clustering, binary only) | 160.6 | 31.9 | 104 | 8.37 | 93.0% |
64+
| 158 (≈ √N/2) | 160.7 | **8.4** | 71 | 3.80 | 94.0% |
65+
| 316 (≈ √N) | 160.7 | 9.4 | 105 | 3.29 | 94.0% |
66+
| 632 (≈ 2√N) | 160.8 | 13.1 | 187 | 3.18 | 94.0% |
67+
| 1264 (≈ 4√N) | 161.1 | 20.0 | 346 | 2.99 | 94.0% |
68+
69+
**Reads**: clustering wins big — 4× speedup over unclustered. The latency optimum is `√N/2`, not `√N`, because with `probe=0.25` more clusters means more row-ranges to fetch. The MB-read optimum keeps dropping with more clusters (tighter ranges), so the right `clusters` value depends on whether you optimize wall-time or bandwidth.
70+
71+
### rerankFactor sweep (clusters=316, probe=0.25)
72+
73+
| rerankFactor | ms | fetches | MB read | recall |
74+
|---:|---:|---:|---:|---:|
75+
| **10** | **9.4** | 105 | 3.30 | **94.0%** |
76+
| 30 | 16.3 | 138 | 5.73 | 94.5% |
77+
| 33 (N/3000 rule) | 17.7 | 142 | 6.08 | 94.5% |
78+
| 100 | 39.6 | 188 | 12.83 | 94.5% |
79+
| 300 | 100.1 | 226 | 26.12 | 94.5% |
80+
81+
**Read**: at 100k the `N/3000` rule from the README is overcautious for this corpus — `rf=10` is already at 94% recall, and bumping to 33 buys 0.5pp at +8ms. The rule was tuned on synthetic 1M data where binary collisions dominate; LLM logs are well-clustered enough that the default 10 holds longer.
82+
83+
### Probe sweep (clusters=316, rerankFactor=10)
84+
85+
| probe | ms | fetches | MB read | recall |
86+
|---:|---:|---:|---:|---:|
87+
| 0.05 | 4.4 | 35 | 2.21 | 93.0% |
88+
| **0.10** | **5.4** | 55 | 2.54 | **94.0%** |
89+
| 0.25 (current default) | 9.0 | 105 | 3.29 | 94.0% |
90+
| 0.50 | 15.1 | 185 | 4.53 | 94.0% |
91+
| 1.00 | 27.2 | 343 | 6.91 | 94.0% |
92+
93+
**Read**: `probe=0.10` matches the recall of `probe=0.25` at ~60% of the latency. The 0.25 default is overcautious — at least for this corpus and `clusters ≈ √N`.
94+
95+
### What this changes in the plan
96+
97+
1. **`probe` default → 0.10** (was 0.25). Same recall, ~40% faster. Worth re-confirming on wiki to make sure we're not regressing there.
98+
2. **`rerankFactor` default → keep 10**, not `max(10, N/3000)`. At 100k LLM log, 10 is already saturated. The `N/3000` rule should be reframed as "scale up only if you observe recall below your target", not a default.
99+
3. **`clusters` rule → `√N/2`**, not `√N`. Better latency at the same recall on this corpus. Sanity-check on wiki before locking in.
100+
4. **All three sweeps recall-cap at 94%.** This is suspiciously flat across configs — likely the corpus has many near-duplicate tool/code messages, so top-10 is "easy". A second pass with stricter recall@1 or recall@100 metrics would be more discriminating, but the relative *ranking* across params should hold.
101+
102+
## Sanity check — wiki, 156k × 384-dim MiniLM
103+
104+
From `scripts/sweep-llmlog.js data/wiki_en.vectors.parquet`. Same sweeps, same code path, 20 in-corpus queries.
105+
106+
### Clusters (probe=0.25, rerankFactor=10)
107+
108+
| clusters | ms | fetches | MB read | recall |
109+
|---:|---:|---:|---:|---:|
110+
| 0 (no clustering, binary only) | 42.0 | 87 | 11.6 | 97.0% |
111+
| 198 (≈ √N/2) | **13.5** | 122 | 5.6 | 93.0% |
112+
| 395 (≈ √N) | 14.6 | 182 | 5.4 | 93.0% |
113+
| 790 (≈ 2√N) | 19.2 | 283 | 5.0 | 92.5% |
114+
| 1580 (≈ 4√N) | 30.9 | 491 | 5.0 | 94.5% |
115+
116+
### rerankFactor (clusters=395, probe=0.25)
117+
118+
| rerankFactor | ms | recall |
119+
|---:|---:|---:|
120+
| 10 | 14.6 | 93.0% |
121+
| 30 | 27.2 | **95.0%** |
122+
| 52 (N/3000 rule) | 39.1 | 95.5% |
123+
| 100 | 69.2 | 96.5% |
124+
| 300 | 173.5 | 96.5% |
125+
126+
### Probe (clusters=395, rerankFactor=10)
127+
128+
| probe | ms | recall |
129+
|---:|---:|---:|
130+
| 0.05 | 6.4 | **72.5%** ← regression |
131+
| 0.10 | 8.3 | **84.0%** ← regression |
132+
| 0.25 (default) | 14.5 | 93.0% |
133+
| 0.50 | 23.8 | 96.5% |
134+
| 1.00 | 41.6 | 97.0% |
135+
136+
### What the sanity check changed
137+
138+
The wiki numbers reverse two of the three LLM-log recommendations:
139+
140+
| Knob | LLM log says | Wiki says | Final |
141+
|---|---|---|---|
142+
| `clusters` | √N/2 wins on ms | √N/2 wins on ms (same recall as √N) | **Adopt `√N/2`** |
143+
| `probe` default | 0.10 enough for 94% | 0.10 = 84% (regression of 9pp) | **Keep 0.25 as default** |
144+
| `rerankFactor` | 10 is fine | 10→30 gains 2pp recall on wiki | **Keep 10 as default**, document `~max(10, N/3000)` as the recall-pressure knob (the README rule was right) |
145+
146+
The disagreement on `probe` is the most interesting finding: LLM-log retrievals are dominated by near-duplicate tool/code messages, so even probe=0.05 finds 9 of the 10 "right" answers because there are many right answers per query. Wiki has more diverse content, so cluster probing actually matters. **The 0.25 default is correct precisely because it's tuned for the harder distribution.** Don't change it.
147+
148+
### Final defaults (post-sanity-check)
149+
150+
- `clusters` write-time default: `Math.round(Math.sqrt(N) / 2)` (when binary is on). Saves wall-time at the same or near-same recall on both corpora.
151+
- `binary` write-time default: on when `N ≥ ~10k` (not yet measured at small N — assumption based on existing wiki ablation showing it's a clear win past hundreds of thousands).
152+
- `probe` search default: stays at `0.25`. The LLM-log data tempted us to drop it; the wiki data showed why we shouldn't.
153+
- `rerankFactor` search default: stays at `10`. The `N/3000` rule moves into the documentation as "raise this if you observe sub-target recall", not as a default.
154+
155+
### Still-open experiments
156+
157+
- Repeat the clusters sweep at 500k and 1M LLM-log row counts to confirm `√N/2` across sizes.
158+
- Recall@100 to discriminate the LLM-log 94% ceiling.
159+
- Find the small-N crossover where the binary column stops being worth ~1.5% bytes.
160+
161+
162+
## End state for the public API
163+
164+
After the experiments above, the common case should look like:
165+
166+
```js
167+
await writeVectors({
168+
writer: fileWriter('vectors.parquet'),
169+
dimension: 384,
170+
vectors: embed(docs),
171+
}) // normalize=true, binary if N≥~10k, clusters≈sqrt(N), all automatic
172+
173+
const results = await searchVectors({
174+
source: 'vectors.parquet',
175+
query,
176+
topK: 10,
177+
}) // rerankFactor and probe derived from KV count
178+
```
179+
180+
The advanced knobs (`rerankFactor`, `probe`, `binary` write-flag, `clusters`) stay available — but they move into an "Advanced" subsection of the README, not the quick start.

README.md

Lines changed: 2 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -63,15 +63,12 @@ import { writeVectors } from 'hypvector'
6363
await writeVectors({
6464
writer: fileWriter('vectors.parquet'),
6565
dimension: 384,
66-
normalize: true, // L2-normalize on write; lets search skip sqrt for cosine
67-
binary: true, // also write 1-bit-per-dim sign column for binary+rerank search
68-
clusters: 128, // k-means clusters for phase-1 pruning (implies binary: true)
69-
pq: true, // optional IVF-PQ index for approximate scoring before rerank
66+
normalize: true, // L2-normalize on write; lets search skip sqrt for cosine
7067
vectors: myEmbedder(), // any sync or async iterable of { id, vector }
7168
})
7269
```
7370

74-
When `binary: true`, the default `pageSize` drops to 32 KB so that offset-index reads during search fetch tight ranges. Override with explicit `pageSize` / `codec` / `rowGroupSize` if needed.
71+
By default, `writeVectors` adds the binary sign-bit column and clusters rows automatically once the corpus crosses ~10k vectors. Below that, files are written as plain id + vector columns and search uses an exact full scan. To control these manually, pass `binary: true/false` and `clusters: <n>`; passing either disables the auto behavior for that knob. When the binary column is written, `pageSize` defaults to 32 KB so offset-index reads during search fetch tight ranges. Pass `pq: true` to additionally write an IVF-PQ index for approximate scoring before rerank (mutually exclusive with binary `clusters`).
7572

7673
### Producing vectors
7774

@@ -106,8 +103,6 @@ await writeVectors({
106103
writer: fileWriter('vectors.parquet'),
107104
dimension: 384,
108105
normalize: true,
109-
binary: true,
110-
clusters: 128,
111106
vectors: embed(docs),
112107
})
113108
```
@@ -208,7 +203,6 @@ Key-value metadata:
208203
| `hypvector.clusters` | number of k-means clusters (0 if not clustered) |
209204
| `hypvector.centroids` | base64-encoded centroid binary codes (`clusters × dim/8` bytes); present when `clusters > 0` |
210205
| `hypvector.clusterCounts` | base64-encoded `Uint32Array` of per-cluster row counts; present when `clusters > 0` |
211-
| `hypvector.pq.mode` | `ivf`; present when `pq: true` |
212206
| `hypvector.pq.segments` | number of PQ sub-vectors / bytes per code; present when `pq: true` |
213207
| `hypvector.pq.centroids` | centroids per PQ sub-vector; present when `pq: true` |
214208
| `hypvector.pq.codebooks` | base64-encoded residual `Float32Array` codebooks (`pq.centroids × dim` floats); present when `pq: true` |

0 commit comments

Comments
 (0)