Commit 62a9b2c

Update docs and deps
1 parent 5a07164 commit 62a9b2c

4 files changed

Lines changed: 61 additions & 72 deletions

PARAMETERS.md

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,8 @@ Every user-facing knob in hypvector, what it does, and where the value lives at
66

77
| Param | Default | What it does |
88
|---|---|---|
9+
| `writer` | required | Output parquet `Writer` (e.g. from `fileWriter('vectors.parquet')`). Where the bytes go. |
10+
| `vectors` | required | Sync or async iterable of `{ id, vector }` records to write. |
911
| `dimension` | required | Length of each vector. Stored in KV `hypvector.dimension`. |
1012
| `metric` | `'cosine'` | Intended similarity metric. Hint stored in KV; search reads it. |
1113
| `normalize` | `false` | L2-normalize on write; lets cosine score via dot product. Stored in KV `hypvector.normalized`. |
@@ -25,6 +27,7 @@ Every user-facing knob in hypvector, what it does, and where the value lives at
2527
| `source` | required | URL, file path, AsyncBuffer, or array of any of those (parallel multi-file search). |
2628
| `topK` | `10` | Number of nearest neighbors to return. |
2729
| `metric` | from KV | Override the stored metric. Almost never needed. |
30+
| `algorithm` | `'auto'` | Search path. `'auto'` uses binary+rerank when the file has a binary column, else exact full scan. `'exact'` forces a full scan; `'binary'` forces the rerank path (errors if the file has no binary column). |
2831
| `rerankFactor` | `10` | Candidate pool size = `topK × rerankFactor`. `0` forces exact full scan. Higher = more recall, more bytes fetched. Suggested `~max(10, N/3000)`. |
2932
| `probe` | `0.25` | Fraction (or integer count) of clusters to scan in phase 1. Lower = faster, lower recall. Ignored if file has no centroids. |
3033
| `binary` | none | Pre-fetched binary column (from `prefetchBinary`). When provided, phase-1 Hamming scan runs from memory. |
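
The search-side rows above can be read as a few small rules. The sketch below restates them in code; the helper names are hypothetical and not hypvector's internals, the rounding in `suggestedRerankFactor` is an assumption, and the interaction between `rerankFactor: 0` and `algorithm` is deliberately not modeled.

```js
// Hypothetical helpers that restate the table rows; not hypvector's actual code.

/** Resolve the `algorithm` option against what the file contains. */
function resolveAlgorithm(algorithm, hasBinaryColumn) {
  if (algorithm === 'exact') return 'exact'
  if (algorithm === 'binary') {
    // 'binary' forces the rerank path and errors without a binary column
    if (!hasBinaryColumn) throw new Error("algorithm 'binary' requires a binary column")
    return 'binary'
  }
  // 'auto': binary+rerank when the file has a binary column, else exact full scan
  return hasBinaryColumn ? 'binary' : 'exact'
}

/** Candidate pool size for the rerank phase: topK × rerankFactor. */
function candidatePoolSize(topK, rerankFactor) {
  return topK * rerankFactor
}

/** Suggested rerankFactor, ~max(10, N/3000). Rounding is an assumption. */
function suggestedRerankFactor(n) {
  return Math.max(10, Math.round(n / 3000))
}

console.log(resolveAlgorithm('auto', false), candidatePoolSize(10, 10), suggestedRerankFactor(1000000))
```

With the defaults (`topK: 10`, `rerankFactor: 10`) the rerank phase scores 100 candidates.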

PLAN_AUTO.md

Lines changed: 31 additions & 67 deletions
@@ -18,13 +18,13 @@ Each parameter below has a current state, a target strategy, and the experiments
1818
| Param | Today | Target | Why / experiment |
1919
|---|---|---|---|
2020
| `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
21-
| `metric` | `'cosine'` arg | **KV-metadata** (already) | Already stored. Make `'cosine'` the default and stop asking. |
22-
| `normalize` | `false` arg | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. We should flip the default and just normalize if the caller doesn't say otherwise. Cheap, harmless if vectors are already unit-length. **Needed**: confirm there's no observable downside on the LLM log corpus. |
23-
| `binary` | `false` arg | **Derive(N, dimension): on when worth it** | At ~1.5% extra bytes for ~50× fewer bytes-read in phase 2, binary is almost always worth it past ~10k vectors. **Needed**: write-time check using `N`; turn on automatically for `N ≥ ~10k`. Below that, exact scan is fine. Ablate on LLM log to confirm threshold. |
24-
| `clusters` | `0` arg | **Derive(N)** | Roughly `clusters ≈ sqrt(N)` is the IVF folklore rule (and matches our 128 for 156k = ~395 floor). **Needed**: sweep `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` on LLM logs at 50k / 100k / 500k. Lock in a formula. |
21+
| `metric` | `'cosine'` default, in KV | **KV-metadata** (done) | Defaults to `'cosine'`, stored in KV, read transparently at search. |
22+
| `normalize` | `false` arg | **KV-metadata, default `true` (not yet flipped)** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside, and the README/quickstart already pass `true`. Open: flip the *code* default so callers can omit it. Harmless if vectors are already unit-length. |
23+
| `binary` | **Auto (shipped)** | **Derive(N): on at N ≥ 10k** | Shipped: auto-on at `defaultAutoBinaryThreshold = 10000` (~1.5% extra bytes for ~50× fewer bytes-read in phase 2). Below threshold, exact scan is fine. Small-N crossover still unmeasured (see open experiments). |
24+
| `clusters` | **Auto (shipped)** | **Derive(N): `round(√N/2)`** | Shipped: `round(√N/2)` when binary is auto-enabled (`writeVectors.js`). The sweep below locked in `√N/2` over `√N` (better latency, same recall on both corpora). Caller can still pass an explicit count or `0`. |
2525
| `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
2626
| `clusterSeed` | `1` | **Fixed (1)** | Determinism is the only reason this exists. No reason to expose. |
27-
| `codec` | `'UNCOMPRESSED'` | **Fixed** | Already ablated (`scripts/test-encoding.js`, `data/enc_*`). Float embeddings don't compress; SNAPPY/ZSTD costs latency. Hide. |
27+
| `codec` | `'UNCOMPRESSED'` | **Fixed** | Already ablated (`scripts/test-encoding.js`). Float embeddings don't compress; SNAPPY/ZSTD costs latency. Hide. |
2828
| `pageSize` | `1 MB` / `32 KB` when binary | **Derive(binary)** | Already automatic: keep the rule, hide the knob from the public API unless a test rig needs it. |
2929
| `rowGroupSize` | `10000` / per-cluster | **Derive(clusters)** | Already automatic: clustered files use per-cluster row groups, unclustered uses 10k. Hide the knob. |
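
As a minimal sketch of the two shipped write-side derivations (`binary` and `clusters`), assuming an explicit caller value always wins and that `clusters` is derived whenever binary ends up enabled. Only the constant name `defaultAutoBinaryThreshold` comes from the table; `deriveWriteDefaults` is hypothetical and does not mirror the code in `writeVectors.js`.

```js
// Illustrative only: restates the table rules, not the real writeVectors.js logic.
const defaultAutoBinaryThreshold = 10000

/**
 * @param {number} n number of vectors being written
 * @param {{ binary?: boolean, clusters?: number }} [overrides] explicit caller choices
 */
function deriveWriteDefaults(n, overrides = {}) {
  // binary: on automatically once N reaches the threshold
  const binary = overrides.binary ?? (n >= defaultAutoBinaryThreshold)
  // clusters: round(√N / 2) when binary is on; an explicit count (including 0) wins
  const clusters = overrides.clusters ?? (binary ? Math.round(Math.sqrt(n) / 2) : 0)
  return { binary, clusters }
}

console.log(deriveWriteDefaults(100000)) // the 100k LLM-log corpus
console.log(deriveWriteDefaults(5000))   // below the threshold: exact scan, no clusters
```

For the 100k corpus this gives 158 clusters, the same figure the sweeps use for `√N/2`.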
3030

@@ -35,22 +35,17 @@ Each parameter below has a current state, a target strategy, and the experiments
3535
| `topK` | `10` | **Required**, default 10 | Caller intent. Keep. |
3636
| `query`, `source`, `metadata`, `binary`, `signal` | n/a | **Required / passthrough** | These aren't tuning knobs. |
3737
| `metric` | from KV | **KV-metadata** (already) | Already automatic. The argument exists only as an override; demote to "rarely needed". |
38-
| `rerankFactor` | `10` | **Derive(N, topK)** | The README already documents `~max(10, N/3000)`. Make this the default: read `N` from KV and compute. Caller can still override for the recall/latency knob. **Needed**: confirm the `N/3000` rule on LLM logs at 100k / 500k / 1M. The wiki benchmark only validates it at 1M synthetic. |
39-
| `probe` | `0.25` | **Derive(N, clusters)** | Probe is tightly coupled to recall. **Needed**: sweep `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` on LLM logs, plot recall vs. ms. If the recall@10 curve is well-behaved (monotonic, knee in a predictable place), pick a default that gives ≥90% recall; expose `probe` only when caller wants more/less recall. |
38+
| `rerankFactor` | `10` | **Fixed (10), document override** | Sweeps below kept the default at 10 (already saturated at 100k LLM-log; +2pp on wiki only at rf=30). The `~max(10, N/3000)` rule lives in the README as "raise if you see sub-target recall", not as a derived default. |
39+
| `probe` | `0.25` | **Fixed (0.25), document override** | Sweeps below kept 0.25: LLM-log tempted 0.10, but wiki showed 0.10 → 84% recall (−9pp). 0.25 is tuned for the harder distribution. Expose only when the caller wants more/less recall. |
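
To make the `probe` default concrete, here is a hedged sketch of turning it into a number of clusters to scan. How hypvector tells a fraction from an integer count is not shown in this document, so the rule below (values up to 1 are fractions, larger values are counts, fractions round up) is an assumption, and `clustersToScan` is a hypothetical name.

```js
// Hypothetical helper. Assumption: 0 < probe <= 1 is a fraction of clusters,
// anything above 1 is an absolute cluster count. Rounding up is also assumed.
function clustersToScan(probe, clusterCount) {
  if (clusterCount === 0) return 0 // no centroids in the file: probe is ignored
  if (!(probe > 0)) throw new RangeError('probe must be positive')
  const wanted = probe <= 1 ? Math.ceil(probe * clusterCount) : Math.floor(probe)
  // always scan at least one cluster, never more than exist
  return Math.min(clusterCount, Math.max(1, wanted))
}

console.log(clustersToScan(0.25, 158)) // default probe on the 100k corpus
console.log(clustersToScan(0.1, 158))  // the faster setting that lost recall on wiki
```

Under these assumptions the 0.25 default scans 40 of 158 clusters, while 0.10 scans 16.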
4040

41-
## What we need to actually run
41+
## Method
4242

43-
Most parameters above resolve via existing evidence (the README ablations) or trivial code changes. The genuinely open questions all need the **same dataset** and the **same sweep harness**:
43+
The open questions (clusters formula, rerank/probe defaults) were settled with one dataset and one harness, re-runnable as perf work continues:
4444

45-
1. **The dataset**: `AmanPriyanshu/tool-reasoning-sft-CODING-jupyter-agent-dataset-sft-tool-use-agent-data-cleaned-rectified` from Hugging Face. LLM tool/code logs: repetitive, long-tailed, structurally different from wiki titles. If our defaults look wrong on this, we know they're tuned to wiki.
46-
2. **Embed at 384-dim with MiniLM**, normalized: same model as the wiki baseline, so numbers compare directly.
47-
3. **Sweep, at 50k / 100k / 500k row subsets**:
48-
- `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` (write-side, expensive)
49-
- `rerankFactor ∈ {0, 10, 30, 100, max(10, N/3000), 300}` (cheap; redo per query set)
50-
- `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` (cheap; same)
51-
4. **Report** recall@10, ms/query, fetches, MB read, using the same table format as the existing README ablation, so they're directly comparable.
45+
- **Dataset**: `AmanPriyanshu/tool-reasoning-sft-...` (Hugging Face). LLM tool/code logs, structurally unlike wiki titles, so defaults that overfit wiki show up here. Embedded with 384-dim MiniLM and normalized, so numbers compare directly with the wiki baseline.
46+
- **Harness**: `scripts/sweep-llmlog.js` (takes an optional file arg, e.g. a wiki parquet) sweeps `clusters` / `rerankFactor` / `probe` and reports recall@10, ms/query, fetches, MB read.
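
For readers unfamiliar with the headline metric, recall@10 is conventionally the fraction of the exact top-10 neighbors that the approximate search also returns, averaged over queries. The sketch below shows that standard definition; it is not taken from `scripts/sweep-llmlog.js`, and the function names are hypothetical.

```js
// Standard recall@k; an illustration, not the harness's actual implementation.

/** Fraction of the exact top-k ids that also appear in the approximate top-k. */
function recallAtK(exactIds, approxIds, k) {
  const truth = new Set(exactIds.slice(0, k))
  if (truth.size === 0) return 0
  let hits = 0
  for (const id of approxIds.slice(0, k)) {
    if (truth.has(id)) hits++
  }
  return hits / truth.size
}

/** Mean recall@k over a set of queries, each { exact: id[], approx: id[] }. */
function meanRecallAtK(queries, k) {
  if (queries.length === 0) return 0
  const total = queries.reduce((sum, q) => sum + recallAtK(q.exact, q.approx, k), 0)
  return total / queries.length
}

console.log(recallAtK([1, 2, 3, 4], [1, 2, 9, 4], 4))
```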
5247

53-
If LLM log results agree with wiki, we adopt the `sqrt(N)` / `N/3000` / `probe=0.25` defaults and document. If they disagree, we keep the knobs as "tune for your corpus" and write up the difference.
48+
Results below; conclusions folded into **Final defaults**.
5449

5550
## Empirical results: LLM logs, 100k × 384-dim MiniLM
5651

@@ -92,16 +87,13 @@ From `scripts/sweep-llmlog.js`. Corpus is 100k messages from the tool-reasoning-
9287

9388
**Read**: `probe=0.10` matches the recall of `probe=0.25` at ~60% of the latency. The 0.25 default is overcautious, at least for this corpus and `clusters ≈ √N`.
9489

95-
### What this changes in the plan
90+
### Reading (LLM-log alone)
9691

97-
1. **`probe` default → 0.10** (was 0.25). Same recall, ~40% faster. Worth re-confirming on wiki to make sure we're not regressing there.
98-
2. **`rerankFactor` default → keep 10**, not `max(10, N/3000)`. At 100k LLM log, 10 is already saturated. The `N/3000` rule should be reframed as "scale up only if you observe recall below your target", not a default.
99-
3. **`clusters` rule → `√N/2`**, not `√N`. Better latency at the same recall on this corpus. Sanity-check on wiki before locking in.
100-
4. **All three sweeps recall-cap at 94%.** This is suspiciously flat across configs; likely the corpus has many near-duplicate tool/code messages, so top-10 is "easy". A second pass with stricter recall@1 or recall@100 metrics would be more discriminating, but the relative *ranking* across params should hold.
92+
Taken on its own, this corpus suggested `probe → 0.10` (same recall, ~40% faster), keeping `rerankFactor` at 10 (already saturated), and `clusters → √N/2`. The wiki sanity check below **reverses the probe call** (see Final defaults). One caveat that holds: all three sweeps cap at ~94% recall. That is suspiciously flat, likely because the corpus has many near-duplicate tool/code messages, so top-10 is "easy". A stricter metric such as recall@1 or recall@100 would discriminate better, though the relative ranking across params should hold.
10193

10294
## Sanity check: wiki, 156k × 384-dim MiniLM
10395

104-
From `scripts/sweep-llmlog.js data/wiki_en.vectors.parquet`. Same sweeps, same code path, 20 in-corpus queries.
96+
From `scripts/sweep-llmlog.js` against the 156k wiki corpus. Same sweeps, same code path, 20 in-corpus queries.
10597

10698
### Clusters (probe=0.25, rerankFactor=10)
10799

@@ -159,38 +151,14 @@ The disagreement on `probe` is the most interesting finding: LLM-log retrievals
159151
- Find the small-N crossover where the binary column stops being worth ~1.5% bytes.
160152

161153

162-
## PQ tuning: does IVF-PQ ever beat binary+cluster?
154+
## Product quantization: evaluated, removed
163155

164-
From `scripts/sweep-pq.js` + `scripts/sweep-pq-probe.js` on the 100k LLM-log corpus (384-dim). Swept `pqSegments × pqCentroids × ivfClusters`, then probe/rerankFactor on the best config.
156+
An IVF-PQ path was built, swept at 384-dim and 3072-dim (`sweep-pq.js`, `hidim-pq.js`), and **removed** (commit `92e09bc`). The lesson, kept so we don't rebuild it:
165157

166-
Best PQ configs vs. the binary+cluster default (clusters=√N/2 = 158 → 8.4 ms / 94% recall / 3.8 MB):
167-
168-
| Path | Config | ms | recall | MB read |
169-
|---|---|---:|---:|---:|
170-
| **binary+cluster (default)** | clusters=158 | **8.4** | **94%** | 3.80 |
171-
| PQ — fastest decent | s32/c64/ivf128 | 12.6 | 90.5% | 3.65 |
172-
| PQ — best recall @ probe 0.25 | s64/c256/ivf316 | 28.9 | 94% | 4.60 |
173-
| PQ — full probe + rf=30 | s64/c64/ivf128 | 74.0 | 94.5% | 11.2 |
174-
| PQ — bandwidth optimum | s16/c64/ivf316 | 14.1 | 73.5% | **2.48** |
175-
176-
PQ's recall ceiling is ~94% even at probe=1.0 (residual codes lose top-10 signal); matching binary+cluster's recall costs 1.5–9× the latency. **At 384-dim, tuned PQ still loses on every axis except raw bytes-read at low recall** — and binary+cluster's `probe` knob beats it there too (probe=0.05 → 4.4 ms / 93% / 2.21 MB).
177-
178-
### High dimension: tested, PQ still loses
179-
180-
The remaining hope for PQ was high dimension — the binary column grows as `dim/8` while a PQ code stays at `pqSegments` bytes, so PQ's phase-1 scan should read far less. Tested at **3072-dim** (`text-embedding-3-large`), 30k LLM-log messages, via `scripts/hidim-pq.js`:
181-
182-
| variant | file MB | ms | fetches | MB read | recall |
183-
|---|---:|---:|---:|---:|---:|
184-
| **binary+cluster** | 381.2 | **11.6** | 48 | 15.6 | **95.6%** |
185-
| pq s32/c64 | 374.5 | 22.0 | 53 | **9.2** | 66.0% |
186-
| pq s64/c64 | 375.5 | 22.3 | 54 | 9.8 | 73.4% |
187-
| pq s96/c256 | 379.6 | 64.8 | 57 | 14.9 | 87.2% |
188-
189-
PQ *does* read fewer bytes (9.2 vs 15.6 MB) — the bandwidth hypothesis was real — but at catastrophic recall loss (66% vs 95.6%) and 2–6× the wall-time (building PQ distance tables across IVF cells is CPU-heavy at high dim). **So no: PQ does not win at OpenAI scale.**
190-
191-
The reason is structural and kills the whole premise: at 3072-dim the **float32 rerank column is 12,288 bytes/row**, so it dominates the file (369 of 381 MB). The binary column is only 384 bytes/row — already negligible — so shrinking phase-1 to a 32-byte PQ code saves nothing meaningful on total size (374 vs 381 MB), and phase-2 *float* fetches (which both paths keep, for exact rerank) dominate bytes-read regardless. PQ optimizes the cheap part.
192-
193-
**The actual high-dim cost driver is keeping the full float column at all.** The only way PQ pays off is a *lossy* mode — PQ codes with **no** float column, accepting approximate scores — which would shrink a 381 MB file to ~3 MB. That's a different feature (lossy/quantized-only storage) than the current PQ-then-float-rerank, and isn't built. Conclusion: **drop or de-emphasize the current IVF-PQ path; if PQ comes back, it should be as a float-free lossy mode, justified by its own benchmark.**
158+
- **384-dim**: tuned PQ lost on every axis except raw bytes-read at low recall, and binary+cluster's `probe` knob beats it there too (probe=0.05 → 4.4 ms / 93% / 2.21 MB).
159+
- **3072-dim**: PQ read fewer phase-1 bytes (9.2 vs 15.6 MB, so the bandwidth hypothesis was real) but at catastrophic recall (66% vs 95.6%) and 2-6× wall-time.
160+
- **Why it can't win as built**: PQ-then-float-rerank keeps the full float32 column, which dominates the file at high dim (369 of 381 MB at 3072-dim). Shrinking the phase-1 code saves nothing meaningful; phase-2 float fetches dominate bytes-read regardless. PQ optimizes the cheap part.
161+
- **The only way PQ pays off** is a *float-free lossy* mode (codes only, approximate scores), for a ~100× smaller file. That's a different feature, justified by its own benchmark, and isn't built.
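
The size argument in the bullets above is plain arithmetic. The sketch below reworks it for the 3072-dim run (30k rows, PQ at one byte per segment, here the `s32` config), assuming MB means 10^6 bytes and ignoring parquet overhead and the id column.

```js
// Worked arithmetic for the 3072-dim, 30k-row run. Assumption: 1 MB = 1e6 bytes.
const rows = 30000
const dimension = 3072
const MB = 1e6

const floatBytesPerRow = dimension * 4  // float32 rerank column
const binaryBytesPerRow = dimension / 8 // 1 bit per dimension
const pqBytesPerRow = 32                // s32: one byte per PQ segment

const columnMB = (bytesPerRow) => (rows * bytesPerRow) / MB

const floatMB = columnMB(floatBytesPerRow)
const binaryMB = columnMB(binaryBytesPerRow)
const pqMB = columnMB(pqBytesPerRow)

console.log({ floatBytesPerRow, binaryBytesPerRow, floatMB, binaryMB, pqMB })
// Swapping the binary column for PQ codes only removes (binaryMB - pqMB) from the file;
// the float column is untouched, which is why total size barely moves.
console.log('saved by PQ-with-rerank (MB):', binaryMB - pqMB)
```

The float column alone accounts for roughly 369 MB, so replacing the phase-1 column cannot change the file size by more than the ~11.5 MB the binary column occupies.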
194162

195163
## Embedding model comparison
196164

@@ -206,34 +174,30 @@ From `scripts/sweep-models.js`, 300 conversations → 5,412 messages, 300 labele
206174
| oai-3-small | 1536 | 33.3 | 8.0 | 33.3% | 41.3% | 0.364 |
207175
| oai-3-large | 3072 | 66.6 | 15.1 | **34.3%** | **42.0%** | **0.373** |
208176

209-
(`oai-*` from the OpenAI embeddings API via `OPENAI_API_KEY`; higher dims use the `dimensions` Matryoshka-truncation param. API embedding is ~150–375 msg/s vs MiniLM's 55.)
210-
211-
**The headline: embedding model choice barely moves the needle on this task, and dimension cost dominates.** From MiniLM-L6 (free, local, 384-dim) to `text-embedding-3-large` (OpenAI's best, 3072-dim), hits@1 moves 33.7% → 34.3% and hits@10 moves 40.7% → 42.0% — within noise. But oai-3-large costs **8× the file size (66.6 vs 8.4 MB) and 6× the per-query latency (15.1 vs 2.5 ms)** for that ~1 pp. Dim-matched at 384, OpenAI's small model *ties* MiniLM exactly (33.0% / 41.0%). So the only thing that materially changes the hypvector cost profile is the embedding **dimension**, not the model's pedigree.
212-
213-
**The code-specialized model actively hurts.** `jinaai/jina-embeddings-v2-base-code` (768-dim) dropped hits@1 from 33.7% to 12.3% and embeds ~10× slower (4–5 msg/s). Reason: **the eval task is natural-language → natural-language** — a prose user question ("Which feature has the most outliers?") retrieving a prose answer. A code encoder tunes its space for code structure (code↔code, code↔docstring), the wrong objective for NL Q&A. It would likely win on a *different* task — NL intent → retrieve the `tool_call`/code cell — but that's not what user→answer retrieval measures.
177+
(`oai-*` from the OpenAI API via `OPENAI_API_KEY`; higher dims use the `dimensions` Matryoshka-truncation param.)
214178

215-
Takeaways:
216-
- **Keep MiniLM-L6 as the documented default.** Nothing tested beats it on quality-per-byte; the SOTA paid model adds ~1 pp at 8× the storage.
217-
- **Dimension is the real cost lever.** If a user brings a 1536- or 3072-dim model, the linear-scan file and query both grow proportionally (visible above: ms/q 2.5 → 8.0 → 15.1, size 8.4 → 33.3 → 66.6 MB). This is exactly the regime where dimensionality reduction (Matryoshka truncation — oai-3-small→384 keeps the quality) or PQ compression earns its keep. **Recommend 384-dim models, or truncating, in the docs.**
218-
- Model choice is task-dependent: a code encoder may still help when retrieving *code/tool messages* rather than NL answers — worth a separate code-retrieval eval before recommending one there.
219-
- The eval (user→answer within-conversation) is a rough proxy; treat the absolute ~41% as a *relative* yardstick, not a quality bar.
179+
Lessons:
180+
- **Model pedigree barely moves the needle; dimension dominates cost.** MiniLM-L6 → `text-embedding-3-large` gains about 1pp (hits@1 33.7% → 34.3%, hits@10 40.7% → 42.0%) for 8× the file size and 6× the latency. Dim-matched at 384, oai-3-small *ties* MiniLM. → **Keep MiniLM-L6 as the documented default; recommend 384-dim models (or Matryoshka truncation, which preserves the quality).**
181+
- **A code encoder actively hurts here**: jina-code dropped hits@1 to 12.3%, because the eval is NL→NL (prose question → prose answer). It might win on NL→code retrieval, which is worth a separate eval before recommending.
182+
- The eval (user→answer within-conversation) is a rough proxy; treat ~41% as a *relative* yardstick, not a quality bar.
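
Where an API does not offer a `dimensions` parameter, Matryoshka truncation can be done client-side: keep the leading components and re-normalize to unit length so cosine can still be scored as a dot product. This is a minimal sketch of that general technique, not code from hypvector; `truncateAndNormalize` is a hypothetical name, and it only makes sense for models trained to support truncation.

```js
// Client-side Matryoshka truncation sketch: keep the first `dimension` components,
// then L2-normalize. Only valid for models trained for truncation (assumption).
function truncateAndNormalize(vector, dimension) {
  if (dimension > vector.length) throw new RangeError('dimension exceeds vector length')
  const head = Float32Array.from(Array.from(vector).slice(0, dimension))
  let sumSquares = 0
  for (const x of head) sumSquares += x * x
  const norm = Math.sqrt(sumSquares)
  if (norm === 0) return head // all-zero prefix: nothing to normalize
  for (let i = 0; i < head.length; i++) head[i] /= norm
  return head
}

const truncated = truncateAndNormalize([3, 4, 12], 2)
console.log(truncated.length, Array.from(truncated))
```

The result can be passed to `writeVectors` with `dimension` set to the truncated length.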
220183

221184
## End state for the public API
222185

223-
After the experiments above, the common case should look like:
186+
The common case is now (one open item: flip the `normalize` default to `true`):
224187

225188
```js
226189
await writeVectors({
227190
writer: fileWriter('vectors.parquet'),
228191
dimension: 384,
192+
normalize: true, // still required explicitly; flipping the default is the last open write-side item
229193
vectors: embed(docs),
230-
}) // normalize=true, binary if N≥~10k, clusters≈sqrt(N), all automatic
194+
}) // binary auto at N≥10k, clusters≈√N/2 (both automatic)
231195

232196
const results = await searchVectors({
233197
source: 'vectors.parquet',
234198
query,
235199
topK: 10,
236-
}) // rerankFactor and probe derived from KV count
200+
}) // rerankFactor=10 and probe=0.25 are fixed defaults (not derived); override for recall pressure
237201
```
238202

239-
The advanced knobs (`rerankFactor`, `probe`, `binary` write-flag, `clusters`) stay available, but they move into an "Advanced" subsection of the README, not the quick start.
203+
The advanced knobs (`rerankFactor`, `probe`, `binary` write-flag, `clusters`) stay available but live in an "Advanced" subsection of the README, not the quick start.
