PARAMETERS.md
+3 lines changed: 3 additions & 0 deletions
@@ -6,6 +6,8 @@ Every user-facing knob in hypvector, what it does, and where the value lives at
| Param | Default | What it does |
|---|---|---|
+
|`writer`| required | Output parquet `Writer` (e.g. from `fileWriter('vectors.parquet')`). Where the bytes go. |
+
|`vectors`| required | Sync or async iterable of `{ id, vector }` records to write. |
|`dimension`| required | Length of each vector. Stored in KV `hypvector.dimension`. |
|`metric`|`'cosine'`| Intended similarity metric. Hint stored in KV; search reads it. |
|`normalize`|`false`| L2-normalize on write; lets cosine score via dot product. Stored in KV `hypvector.normalized`. |
@@ -25,6 +27,7 @@ Every user-facing knob in hypvector, what it does, and where the value lives at
|`source`| required | URL, file path, AsyncBuffer, or array of any of those (parallel multi-file search). |
|`topK`|`10`| Number of nearest neighbors to return. |
|`metric`| from KV | Override the stored metric. Almost never needed. |
+
|`algorithm`|`'auto'`| Search path. `'auto'` uses binary+rerank when the file has a binary column, else exact full scan. `'exact'` forces a full scan; `'binary'` forces the rerank path (errors if the file has no binary column). |
|`rerankFactor`|`10`| Candidate pool size = `topK × rerankFactor`. `0` forces exact full scan. Higher = more recall, more bytes fetched. Suggested `~max(10, N/3000)`. |
|`probe`|`0.25`| Fraction (or integer count) of clusters to scan in phase 1. Lower = faster, lower recall. Ignored if file has no centroids. |
|`binary`| none | Pre-fetched binary column (from `prefetchBinary`). When provided, phase-1 Hamming scan runs from memory. |
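The `algorithm` and `rerankFactor` rows above boil down to a small dispatch plus one multiplication. A minimal sketch, assuming the candidate pool is capped at the row count and the suggested factor is rounded up; `resolveAlgorithm`, `candidatePoolSize`, and `suggestedRerankFactor` are hypothetical names for illustration, not hypvector exports:

```js
// Hypothetical helpers mirroring the table above; not the library's actual code.

/** Pick the search path from the `algorithm` option and whether the file has a binary column. */
function resolveAlgorithm(algorithm, hasBinaryColumn) {
  if (algorithm === 'exact') return 'exact'
  if (algorithm === 'binary') {
    // The table says this case errors; the message text here is made up.
    if (!hasBinaryColumn) throw new Error('algorithm "binary" needs a binary column')
    return 'binary'
  }
  // 'auto': binary+rerank when the file has a binary column, otherwise exact full scan
  return hasBinaryColumn ? 'binary' : 'exact'
}

/** Phase-2 candidate pool: topK × rerankFactor, capped at n rows (the cap is an assumption). */
function candidatePoolSize(topK, rerankFactor, n) {
  if (rerankFactor === 0) return n // 0 forces an exact full scan
  return Math.min(n, topK * rerankFactor)
}

/** Rule of thumb ~max(10, N/3000); rounding up is an assumption. */
function suggestedRerankFactor(n) {
  return Math.max(10, Math.ceil(n / 3000))
}

console.log(resolveAlgorithm('auto', true))    // binary
console.log(candidatePoolSize(10, 10, 100000)) // 100
console.log(suggestedRerankFactor(1000000))    // 334
```

With the defaults (`topK: 10`, `rerankFactor: 10`) the second phase reranks 100 candidates regardless of corpus size, which is why the rule of thumb only matters once recall drops below target.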
PLAN_AUTO.md
+31 -67 lines changed: 31 additions & 67 deletions
@@ -18,13 +18,13 @@ Each parameter below has a current state, a target strategy, and the experiments
| Param | Today | Target | Why / experiment |
|---|---|---|---|
|`dimension`| required |**Required**| Caller's model dictates this. No automation possible. |
-
|`metric`|`'cosine'` arg |**KV-metadata** (already) |Already stored. Make `'cosine'` the default and stop asking. |
-
|`normalize`|`false` arg |**KV-metadata, default `true`**| Cosine + normalized = dot, which dominates everywhere. We should flip the default and just normalize if the caller doesn't say otherwise. Cheap, harmless if vectors are already unit-length. **Needed**: confirm there's no observable downside on the LLM log corpus. |
-
|`binary`|`false` arg|**Derive(N, dimension): on when worth it**|At ~1.5% extra bytes for ~50× fewer bytes-read in phase 2, binary is almost always worth it past ~10k vectors. **Needed**: write-time check using `N`; turn on automatically for `N ≥ ~10k`. Below that, exact scan is fine. Ablate on LLM log to confirm threshold. |
-
|`clusters`|`0` arg|**Derive(N)**|Roughly `clusters ≈ sqrt(N)` is the IVF folklore rule (and matches our 128 for 156k = ~395 floor). **Needed**: sweep `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` on LLM logs at 50k / 100k / 500k. Lock in a formula. |
+
|`metric`|`'cosine'` default, in KV |**KV-metadata** (done) |Defaults to `'cosine'`, stored in KV, read transparently at search. |
+
|`normalize`|`false` arg |**KV-metadata, default `true` (not yet flipped)**| Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside, and the README/quickstart already pass `true`. Open: flip the *code* default so callers can omit it. Harmless if vectors are already unit-length. |
+
|`binary`|**Auto (shipped)**|**Derive(N): on at N ≥ 10k**|Shipped: auto-on at `defaultAutoBinaryThreshold = 10000` (~1.5% extra bytes for ~50× fewer bytes-read in phase 2). Below threshold, exact scan is fine. Small-N crossover still unmeasured (see open experiments). |
+
|`clusters`|**Auto (shipped)**|**Derive(N): `round(√N/2)`**|Shipped: `round(√N/2)` when binary auto-on (`writeVectors.js`). The sweep below locked in `√N/2` over `√N` (better latency, same recall on both corpora). Caller can still pass an explicit count or `0`. |
|`clusterIterations`|`6`|**Fixed (6)**| The existing ablations show diminishing returns past 6. Hide the knob. |
|`clusterSeed`|`1`|**Fixed (1)**| Determinism is the only reason this exists. No reason to expose. |
|`pageSize`|`1 MB` / `32 KB` when binary |**Derive(binary)**| Already automatic: keep the rule, hide the knob from the public API unless a test rig needs it. |
|`rowGroupSize`|`10000` / per-cluster |**Derive(clusters)**| Already automatic: clustered files use per-cluster row groups, unclustered uses 10k. Hide the knob. |
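The two shipped write-side rules in this table (`binary` auto-on at N ≥ 10k, `clusters = round(√N/2)`) fit in a few lines. A minimal sketch: the threshold name and value are taken from the table, but `deriveWriteDefaults` is a hypothetical function, and returning `0` clusters below the threshold is an assumption rather than confirmed behavior:

```js
// Threshold name and value come from the plan; everything else is an illustrative sketch.
const defaultAutoBinaryThreshold = 10000

/** Derive the write-side defaults from the vector count n. */
function deriveWriteDefaults(n) {
  const binary = n >= defaultAutoBinaryThreshold
  // Assumption: no clustering when binary is off (small files use an exact scan).
  const clusters = binary ? Math.round(Math.sqrt(n) / 2) : 0
  return { binary, clusters }
}

console.log(deriveWriteDefaults(100000)) // { binary: true, clusters: 158 }
console.log(deriveWriteDefaults(5000))   // { binary: false, clusters: 0 }
```

The 100k case reproduces the 158 clusters quoted for the LLM-log corpus elsewhere in this plan.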
@@ -35,22 +35,17 @@ Each parameter below has a current state, a target strategy, and the experiments
|`metric`| from KV |**KV-metadata** (already) | Already automatic. The argument exists only as an override; demote to "rarely needed". |
-
|`rerankFactor`|`10`|**Derive(N, topK)**|The README already documents `~max(10, N/3000)`. Make this the default: read `N` from KV and compute. Caller can still override for the recall/latency knob. **Needed**: confirm the `N/3000` rule on LLM logs at 100k / 500k / 1M. The wiki benchmark only validates it at 1M synthetic. |
-
|`probe`|`0.25`|**Derive(N, clusters)**|Probe is tightly coupled to recall. **Needed**: sweep `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` on LLM logs, plot recall vs. ms. If the recall@10 curve is well-behaved (monotonic, knee in a predictable place), pick a default that gives ≥90% recall; expose `probe` only when caller wants more/less recall. |
+
|`rerankFactor`|`10`|**Fixed (10), document override**|Sweeps below kept the default at 10 (already saturated at 100k LLM-log; +2pp on wiki only at rf=30). The `~max(10, N/3000)` rule lives in the README as "raise if you see sub-target recall", not as a derived default. |
+
|`probe`|`0.25`|**Fixed (0.25), document override**|Sweeps below kept 0.25: LLM-log tempted 0.10, but wiki showed 0.10 → 84% recall (−9pp). 0.25 is tuned for the harder distribution. Expose only when the caller wants more/less recall. |
-
## What we need to actually run
+
## Method
-
Most parameters above resolve via existing evidence (the README ablations) or trivial code changes. The genuinely open questions all need the **same dataset** and the **same sweep harness**:
+
The open questions (clusters formula, rerank/probe defaults) were settled with one dataset and one harness, re-runnable as perf work continues:
-
1. **The dataset**: `AmanPriyanshu/tool-reasoning-sft-CODING-jupyter-agent-dataset-sft-tool-use-agent-data-cleaned-rectified` from Hugging Face. LLM tool/code logs: repetitive, long-tailed, structurally different from wiki titles. If our defaults look wrong on this, we know they're tuned to wiki.
-
2. **Embed at 384-dim with MiniLM**, normalized: same model as the wiki baseline, so numbers compare directly.
4. **Report** recall@10, ms/query, fetches, MB read, using the same table format as the existing README ablation, so they're directly comparable.
+
- **Dataset**: `AmanPriyanshu/tool-reasoning-sft-...` (Hugging Face). LLM tool/code logs, structurally unlike wiki titles, so defaults that overfit wiki show up here. Embedded 384-dim MiniLM, normalized, to compare directly with the wiki baseline.
+
- **Harness**: `scripts/sweep-llmlog.js` (takes an optional file arg, e.g. a wiki parquet) sweeps `clusters` / `rerankFactor` / `probe` and reports recall@10, ms/query, fetches, MB read.
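For readers unfamiliar with the headline metric, recall@10 can be sketched as below. This assumes the harness scores each approximate result against an exact full scan of the same query, which the plan does not spell out; `recallAtK` is a hypothetical helper, not a function from `scripts/sweep-llmlog.js`:

```js
/**
 * recall@k: share of the exact top-k ids that the approximate search also returned.
 * Both inputs are arrays of record ids ordered best-first.
 */
function recallAtK(exactIds, approxIds, k = 10) {
  const truth = exactIds.slice(0, k)
  if (truth.length === 0) return 0
  const found = new Set(approxIds.slice(0, k))
  const hits = truth.filter(id => found.has(id)).length
  return hits / truth.length
}

// Two of the four true neighbors were found.
console.log(recallAtK(['a', 'b', 'c', 'd'], ['a', 'b', 'x', 'y'], 4)) // 0.5
```

Averaging this value over the query set gives the percentage reported in the sweep tables.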
-
If LLM log results agree with wiki, we adopt the `sqrt(N)` / `N/3000` / `probe=0.25` defaults and document. If they disagree, we keep the knobs as "tune for your corpus" and write up the difference.
+
Results below; conclusions folded into **Final defaults**.
@@ -92,16 +87,13 @@ From `scripts/sweep-llmlog.js`. Corpus is 100k messages from the tool-reasoning-
**Read**: `probe=0.10` matches the recall of `probe=0.25` at ~60% of the latency. The 0.25 default is overcautious, at least for this corpus and `clusters ≈ √N`.
-
### What this changes in the plan
+
### Reading (LLM-log alone)
-
1. **`probe` default → 0.10** (was 0.25). Same recall, ~40% faster. Worth re-confirming on wiki to make sure we're not regressing there.
-
2. **`rerankFactor` default → keep 10**, not `max(10, N/3000)`. At 100k LLM log, 10 is already saturated. The `N/3000` rule should be reframed as "scale up only if you observe recall below your target", not a default.
-
3. **`clusters` rule → `√N/2`**, not `√N`. Better latency at the same recall on this corpus. Sanity-check on wiki before locking in.
-
4. **All three sweeps recall-cap at 94%.** This is suspiciously flat across configs; likely the corpus has many near-duplicate tool/code messages, so top-10 is "easy". A second pass with stricter recall@1 or recall@100 metrics would be more discriminating, but the relative *ranking* across params should hold.
+
Taken on its own, this corpus suggested `probe → 0.10` (same recall, ~40% faster), `rerankFactor` stays 10 (already saturated), and `clusters → √N/2`. The wiki sanity check below **reverses the probe call** (see Final defaults). One caveat that holds: all three sweeps cap at ~94% recall. That is suspiciously flat, likely because the corpus has many near-duplicate messages, so top-10 is "easy"; a recall@100 metric would discriminate better.
## Sanity check: wiki, 156k × 384-dim MiniLM
-
From `scripts/sweep-llmlog.js data/wiki_en.vectors.parquet`. Same sweeps, same code path, 20 in-corpus queries.
+
From `scripts/sweep-llmlog.js` against the 156k wiki corpus. Same sweeps, same code path, 20 in-corpus queries.
### Clusters (probe=0.25, rerankFactor=10)
@@ -159,38 +151,14 @@ The disagreement on `probe` is the most interesting finding: LLM-log retrievals
- Find the small-N crossover where the binary column stops being worth ~1.5% bytes.
-
## PQ tuning: does IVF-PQ ever beat binary+cluster?
+
## Product quantization: evaluated, removed
-
From `scripts/sweep-pq.js` + `scripts/sweep-pq-probe.js` on the 100k LLM-log corpus (384-dim). Swept `pqSegments × pqCentroids × ivfClusters`, then probe/rerankFactor on the best config.
+
An IVF-PQ path was built, swept at 384-dim and 3072-dim (`sweep-pq.js`, `hidim-pq.js`), and **removed** (commit `92e09bc`). The lesson, kept so we don't rebuild it:
-
Best PQ configs vs. the binary+cluster default (clusters=√N/2 = 158 → 8.4 ms / 94% recall / 3.8 MB):
PQ's recall ceiling is ~94% even at probe=1.0 (residual codes lose top-10 signal); matching binary+cluster's recall costs 1.5–9× the latency. **At 384-dim, tuned PQ still loses on every axis except raw bytes-read at low recall** — and binary+cluster's `probe` knob beats it there too (probe=0.05 → 4.4 ms / 93% / 2.21 MB).
-
### High dimension: tested, PQ still loses
-
The remaining hope for PQ was high dimension — the binary column grows as `dim/8` while a PQ code stays at `pqSegments` bytes, so PQ's phase-1 scan should read far less. Tested at **3072-dim** (`text-embedding-3-large`), 30k LLM-log messages, via `scripts/hidim-pq.js`:
PQ *does* read fewer bytes (9.2 vs 15.6 MB) — the bandwidth hypothesis was real — but at catastrophic recall loss (66% vs 95.6%) and 2–6× the wall-time (building PQ distance tables across IVF cells is CPU-heavy at high dim). **So no: PQ does not win at OpenAI scale.**
-
The reason is structural and kills the whole premise: at 3072-dim the **float32 rerank column is 12,288 bytes/row**, so it dominates the file (369 of 381 MB). The binary column is only 384 bytes/row — already negligible — so shrinking phase-1 to a 32-byte PQ code saves nothing meaningful on total size (374 vs 381 MB), and phase-2 *float* fetches (which both paths keep, for exact rerank) dominate bytes-read regardless. PQ optimizes the cheap part.
-
**The actual high-dim cost driver is keeping the full float column at all.** The only way PQ pays off is a *lossy* mode — PQ codes with **no** float column, accepting approximate scores — which would shrink a 381 MB file to ~3 MB. That's a different feature (lossy/quantized-only storage) than the current PQ-then-float-rerank, and isn't built. Conclusion: **drop or de-emphasize the current IVF-PQ path; if PQ comes back, it should be as a float-free lossy mode, justified by its own benchmark.**
+
- **384-dim**: tuned PQ lost on every axis except raw bytes-read at low recall, and binary+cluster's `probe` knob beats it there too (probe=0.05 → 4.4 ms / 93% / 2.21 MB).
+
- **3072-dim**: PQ read fewer phase-1 bytes (9.2 vs 15.6 MB, so the bandwidth hypothesis was real) but at catastrophic recall (66% vs 95.6%) and 2-6× wall-time.
+
- **Why it can't win as built**: PQ-then-float-rerank keeps the full float32 column, which dominates the file at high dim (369 of 381 MB at 3072-dim). Shrinking the phase-1 code saves nothing meaningful; phase-2 float fetches dominate bytes-read regardless. PQ optimizes the cheap part.
+
- **The only way PQ pays off** is a *float-free lossy* mode (codes only, approximate scores), for a ~100× smaller file. That's a different feature, justified by its own benchmark, and isn't built.
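The structural argument is easiest to see as per-row arithmetic. A minimal sketch using the figures quoted in this section (3072-dim, 30k rows, 32-byte PQ code); `bytesPerRow` is a hypothetical helper, and the one-byte-per-segment PQ code assumes at most 256 centroids per segment:

```js
/** Bytes stored per vector for each representation discussed above. */
function bytesPerRow(dimension, pqSegments = 32) {
  return {
    float32: dimension * 4, // exact rerank column, 4 bytes per component
    binary: dimension / 8,  // 1 bit per component
    pq: pqSegments,         // assumption: 1 byte per segment (≤256 centroids each)
  }
}

const rows = 30000
const perRow = bytesPerRow(3072)
const toMB = bytes => (bytes / 1e6).toFixed(2)

console.log(perRow)                       // { float32: 12288, binary: 384, pq: 32 }
console.log(toMB(rows * perRow.float32))  // 368.64
console.log(toMB(rows * perRow.binary))   // 11.52
console.log(toMB(rows * perRow.pq))       // 0.96
```

Swapping the 11.52 MB binary column for a 0.96 MB PQ column changes the total by about 3%, while the 368.64 MB float column stays put, which is the sense in which PQ optimizes the cheap part.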
(`oai-*` from the OpenAI embeddings API via `OPENAI_API_KEY`; higher dims use the `dimensions` Matryoshka-truncation param. API embedding is ~150–375 msg/s vs MiniLM's 55.)
-
**The headline: embedding model choice barely moves the needle on this task, and dimension cost dominates.** From MiniLM-L6 (free, local, 384-dim) to `text-embedding-3-large` (OpenAI's best, 3072-dim), hits@1 moves 33.7% → 34.3% and hits@10 moves 40.7% → 42.0% — within noise. But oai-3-large costs **8× the file size (66.6 vs 8.4 MB) and 6× the per-query latency (15.1 vs 2.5 ms)** for that ~1 pp. Dim-matched at 384, OpenAI's small model *ties* MiniLM exactly (33.0% / 41.0%). So the only thing that materially changes the hypvector cost profile is the embedding **dimension**, not the model's pedigree.
-
**The code-specialized model actively hurts.** `jinaai/jina-embeddings-v2-base-code` (768-dim) dropped hits@1 from 33.7% to 12.3% and embeds ~10× slower (4–5 msg/s). Reason: **the eval task is natural-language → natural-language**: a prose user question ("Which feature has the most outliers?") retrieving a prose answer. A code encoder tunes its space for code structure (code↔code, code↔docstring), the wrong objective for NL Q&A. It would likely win on a *different* task (NL intent → retrieve the `tool_call`/code cell), but that's not what user→answer retrieval measures.
+
(`oai-*` from the OpenAI API via `OPENAI_API_KEY`; higher dims use the `dimensions` Matryoshka-truncation param.)
-
Takeaways:
-
- **Keep MiniLM-L6 as the documented default.** Nothing tested beats it on quality-per-byte; the SOTA paid model adds ~1 pp at 8× the storage.
-
- **Dimension is the real cost lever.** If a user brings a 1536- or 3072-dim model, the linear-scan file and query both grow proportionally (visible above: ms/q 2.5 → 8.0 → 15.1, size 8.4 → 33.3 → 66.6 MB). This is exactly the regime where dimensionality reduction (Matryoshka truncation: oai-3-small→384 keeps the quality) or PQ compression earns its keep. **Recommend 384-dim models, or truncating, in the docs.**
-
- Model choice is task-dependent: a code encoder may still help when retrieving *code/tool messages* rather than NL answers — worth a separate code-retrieval eval before recommending one there.
-
- The eval (user→answer within-conversation) is a rough proxy; treat the absolute ~41% as a *relative* yardstick, not a quality bar.
+
Lessons:
+
- **Model pedigree barely moves the needle; dimension dominates cost.** MiniLM-L6 → `text-embedding-3-large` gains ~1pp (hits@1 33.7 → 34.3) for 8× the file size and 6× the latency. Dim-matched at 384, oai-3-small *ties* MiniLM. → **Keep MiniLM-L6 as the documented default; recommend 384-dim models (or Matryoshka truncation, which preserves the quality).**
+
- **A code encoder actively hurts here**: jina-code dropped hits@1 to 12.3%, because the eval is NL→NL (prose question → prose answer). It might win on NL→code retrieval, which is worth a separate eval before recommending.
+
- The eval (user→answer within-conversation) is a rough proxy; treat ~41% as a *relative* yardstick, not a quality bar.
## End state for the public API
-
After the experiments above, the common case should look like:
+
The common case is now (one open item: flip the `normalize` default to `true`):
```js
await writeVectors({
  writer: fileWriter('vectors.parquet'),
  dimension: 384,
+
  normalize: true, // still required explicitly; flipping the default is the last open write-side item
  vectors: embed(docs),
-
}) // normalize=true, binary if N≥~10k, clusters≈sqrt(N), all automatic
+
}) // binary auto at N≥10k, clusters≈√N/2 (both automatic)

const results = await searchVectors({
  source: 'vectors.parquet',
  query,
  topK: 10,
-
}) // rerankFactor and probe derived from KV count
+
}) // rerankFactor=10 and probe=0.25 are fixed defaults (not derived); override for recall pressure
```
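Why `normalize: true` matters in the write call above: once every vector has unit length, cosine similarity equals the plain dot product, so search can skip the norm computation. A minimal sketch of that identity; `l2Normalize`, `dot`, and `cosine` are hypothetical helpers written for illustration, not hypvector functions:

```js
/** Scale v to unit L2 length; a zero vector is returned unchanged. */
function l2Normalize(v) {
  const norm = Math.hypot(...v)
  return norm === 0 ? v.slice() : v.map(x => x / norm)
}

/** Plain dot product of two equal-length vectors. */
function dot(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

/** Cosine similarity computed the long way, with both norms. */
function cosine(a, b) {
  return dot(a, b) / (Math.hypot(...a) * Math.hypot(...b))
}

const a = [3, 4]
const b = [4, 3]
const viaDot = dot(l2Normalize(a), l2Normalize(b))
// Both routes agree up to floating-point rounding.
console.log(Math.abs(viaDot - cosine(a, b)) < 1e-12) // true
```

Normalizing a vector that is already unit-length leaves it unchanged up to rounding, which is why flipping the default is described as harmless.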
-
The advanced knobs (`rerankFactor`, `probe`, `binary` write-flag, `clusters`) stay available, but they move into an "Advanced" subsection of the README, not the quick start.
+
The advanced knobs (`rerankFactor`, `probe`, `binary` write-flag, `clusters`) stay available but live in an "Advanced" subsection of the README, not the quick start.