hyparam
diff --git a/‎CLAUDE.md‎
Lines changed: 1 addition & 3 deletions b/‎CLAUDE.md‎
diff --git a/‎PARAMETERS.md‎
Lines changed: 41 additions & 0 deletions b/‎PARAMETERS.md‎
diff --git a/‎PLAN_AUTO.md‎
Lines changed: 180 additions & 0 deletions b/‎PLAN_AUTO.md‎
diff --git a/‎README.md‎
Lines changed: 2 additions & 8 deletions b/‎README.md‎
@@ -6,8 +6,6 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 HypVector is a library for storing and querying embedding vectors in Parquet files. It targets serverless similarity search: clients fetch a Parquet file (over HTTP range requests or from local disk) and run search directly, without a vector database.
 
-The current implementation is a **naive baseline (v0)**. It is deliberately simple so that future experiments can establish clear baselines for storage size, query latency, and recall.
-
 ## Build and Test Commands
 
 ```bash
@@ -18,7 +16,7 @@ npm run lint:fix    # eslint --fix
 npm run benchmark   # write + search benchmark
 ```
 
-## Architecture (v0 naive)
+## Architecture
 
 ### Storage layout
 
 
@@ -0,0 +1,41 @@
+# PARAMETERS
+
+Every user-facing knob in hypvector, what it does, and where the value lives at query time. The companion file [PLAN_AUTO.md](PLAN_AUTO.md) tracks how each one becomes automatic.
+
+## Write-side (`writeVectors`)
+
+| Param | Default | What it does |
+|---|---|---|
+| `dimension` | required | Length of each vector. Stored in KV `hypvector.dimension`. |
+| `metric` | `'cosine'` | Intended similarity metric. Hint stored in KV; search reads it. |
+| `normalize` | `false` | L2-normalize on write; lets cosine score via dot product. Stored in KV `hypvector.normalized`. |
+| `binary` | auto (on at `N ≥ 10000`) | Also write a 1-bit-per-dim sign column (`vector_bin`) for the phase-1 Hamming scan that precedes the exact rerank. Adds `dim/8` bytes/row (48 bytes at 384-dim, ~3% of a float32 vector). Pass `false` to force-off. |
+| `clusters` | auto (`round(sqrt(N)/2)` in full auto mode) | Number of k-means clusters. Implies `binary: true`. Rows are reordered by cluster id; centroids + per-cluster counts go into KV. Enables phase-1 cluster skipping. Pass `0` to force-off, or an integer to set explicitly. |
+| `clusterIterations` | `6` | k-means iterations over the 1-bit codes. |
+| `clusterSeed` | `1` | RNG seed for deterministic clustering. |
+| `codec` | `'UNCOMPRESSED'` | Parquet codec. SNAPPY/ZSTD rarely shrink float embeddings and cost query latency. |
+| `pageSize` | `1 MB` (or 32 KB when `binary`) | Parquet page size. Smaller pages let `useOffsetIndex` fetch tighter byte ranges during rerank phase 2. |
+| `rowGroupSize` | `10000` (or per-cluster sizes when clustered) | Rows per row group. When clustering, each cluster becomes its own row group. |
+
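The automatic write-side defaults in the table can be summarized in a few lines. This is a minimal sketch rather than hypvector's actual code: `deriveWriteDefaults` is a hypothetical helper, "1 MB" and "32 KB" are assumed to be binary units, and forcing `binary` on when `clusters > 0` follows the "Implies `binary: true`" note.

```javascript
// Hypothetical helper mirroring the documented defaults; not part of the hypvector API.
function deriveWriteDefaults(n, opts = {}) {
  // An explicit caller value always wins over the automatic rule.
  let binary = opts.binary ?? (n >= 10000)
  const clusters = opts.clusters ?? (binary ? Math.round(Math.sqrt(n) / 2) : 0)
  // Clustering runs over the 1-bit codes, so it implies the binary column.
  if (clusters > 0) binary = true
  return {
    binary,
    clusters,
    // Assumption: "32 KB" = 32 * 1024 bytes and "1 MB" = 1024 * 1024 bytes.
    pageSize: opts.pageSize ?? (binary ? 32 * 1024 : 1024 * 1024),
    // Clustered files use one row group per cluster, so there is no single size.
    rowGroupSize: clusters > 0 ? 'per-cluster' : (opts.rowGroupSize ?? 10000),
  }
}
```

For example, `deriveWriteDefaults(100000)` turns the binary column on and picks 158 clusters, while `deriveWriteDefaults(5000)` leaves both off and keeps the 1 MB page size.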
+## Search-side (`searchVectors`)
+
+| Param | Default | What it does |
+|---|---|---|
+| `query` | required | The query vector. Must match `dimension`. |
+| `source` | required | URL, file path, AsyncBuffer, or array of any of those (parallel multi-file search). |
+| `topK` | `10` | Number of nearest neighbors to return. |
+| `metric` | from KV | Override the stored metric. Almost never needed. |
+| `rerankFactor` | `10` | Candidate pool size = `topK × rerankFactor`. `0` forces exact full scan. Higher = more recall, more bytes fetched. Suggested `~max(10, N/3000)`. |
+| `probe` | `0.25` | Fraction (or integer count) of clusters to scan in phase 1. Lower = faster, lower recall. Ignored if file has no centroids. |
+| `binary` | none | Pre-fetched binary column (from `prefetchBinary`). When provided, phase-1 Hamming scan runs from memory. |
+| `metadata` | none | Pre-parsed parquet metadata, reused across queries. Pure latency win. |
+| `signal` | none | AbortSignal. |
+| `asyncBufferFactory` | `cachedAsyncBuffer` wrapper | How to open a string `source`. |
+| `compressors` | none | Custom decompressor map. |
+
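The `rerankFactor` arithmetic from the table can be sketched as follows. Both functions are hypothetical illustrations, not hypvector exports; rounding `N/3000` to an integer and capping the pool at `N` are assumptions.

```javascript
// Hypothetical helpers illustrating the rerankFactor arithmetic; not hypvector API.

// Suggested rerankFactor: ~max(10, N / 3000). Rounding is an assumption.
function suggestedRerankFactor(n) {
  return Math.max(10, Math.round(n / 3000))
}

// Phase 1 keeps topK * rerankFactor candidates for the exact phase-2 rerank.
// rerankFactor = 0 means "exact full scan", so every row is a candidate.
function candidatePoolSize(topK, rerankFactor, n) {
  if (rerankFactor === 0) return n
  return Math.min(n, topK * rerankFactor)
}
```

With the defaults (`topK = 10`, `rerankFactor = 10`) the rerank phase scores 100 candidates regardless of corpus size; the suggested rule only starts raising that once `N` passes roughly 30k.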
+## Where each parameter is decided
+
+- **Stored in KV metadata, read implicitly at query time**: `dimension`, `metric`, `normalized`, `binary` (presence of column), `clusters`, centroids, cluster counts. The caller never restates these on search.
+- **Search-side, chosen per query**: `topK`, `rerankFactor`, `probe`. Each falls back to its default when omitted. These are the per-query trade-offs and the main targets for auto-tuning.
+- **Pure performance (no correctness implications)**: `pageSize`, `rowGroupSize`, `codec`, `metadata` reuse, `binary` prefetch, `asyncBufferFactory`. Defaults already cover the common case; ablations exist for `pageSize` and `codec`.
+- **Build-time only**: `clusterIterations`, `clusterSeed`. Set once at write.
@@ -0,0 +1,180 @@
+# Auto-tuning plan
+
+Goal: make hypvector's knobs disappear for the common case. Caller passes `vectors`, `query`, `topK`. Everything else is either picked from the inputs or burned into the file at write time.
+
+For every parameter in [PARAMETERS.md](PARAMETERS.md), we pick exactly one of four strategies:
+
+- **Fixed** — one value that's better than alternatives across realistic regimes. No knob exposed (or expose only as an escape hatch).
+- **Derive(inputs)** — compute at call time from things we already have: `N`, `dimension`, `topK`.
+- **KV-metadata** — write-time decision is recorded in the parquet, search reads it transparently. No restatement at query time.
+- **Document** — keep the knob, but tell people clearly when to reach for it. Falls back to a sensible default.
+
+Each parameter below has a current state, a target strategy, and the experiments needed to lock in the strategy.
+
+## Decision table
+
+### Write-side
+
+| Param | Today | Target | Why / experiment |
+|---|---|---|---|
+| `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
+| `metric` | `'cosine'` arg | **KV-metadata** (already) | Already stored. Make `'cosine'` the default and stop asking. |
+| `normalize` | `false` arg | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. We should flip the default and just normalize if the caller doesn't say otherwise. Cheap, harmless if vectors are already unit-length. **Needed**: confirm there's no observable downside on the LLM log corpus. |
+| `binary` | `false` arg | **Derive(N, dimension): on when worth it** | At ~3% extra bytes (one bit per float32 value) for ~50× fewer bytes read in phase 2, binary is almost always worth it past ~10k vectors. **Needed**: a write-time check on `N` that turns it on automatically for `N ≥ ~10k`; below that, exact scan is fine. Ablate on the LLM-log corpus to confirm the threshold. |
+| `clusters` | `0` arg | **Derive(N)** | `clusters ≈ sqrt(N)` is the IVF folklore rule; for the 156k-vector wiki file that gives ~395, versus the 128 we use today. **Needed**: sweep `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` on LLM logs at 50k / 100k / 500k. Lock in a formula. |
+| `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
+| `clusterSeed` | `1` | **Fixed (1)** | Determinism is the only reason this exists. No reason to expose. |
+| `codec` | `'UNCOMPRESSED'` | **Fixed** | Already ablated (`scripts/test-encoding.js`, `data/enc_*`). Float embeddings don't compress; SNAPPY/ZSTD costs latency. Hide. |
+| `pageSize` | `1 MB` / `32 KB` when binary | **Derive(binary)** | Already automatic — keep the rule, hide the knob from the public API unless a test rig needs it. |
+| `rowGroupSize` | `10000` / per-cluster | **Derive(clusters)** | Already automatic — clustered files use per-cluster row groups, unclustered uses 10k. Hide the knob. |
+
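The "cosine + normalized = dot" claim in the `normalize` row can be checked directly. The sketch below uses hypothetical helper names (`l2Normalize`, `dot`, `cosine`) that are not taken from hypvector; it shows that once both vectors are unit-length, the dot product equals the cosine similarity, and that normalizing an already unit-length vector leaves it unchanged up to rounding.

```javascript
// Illustrative helpers only; these names are not hypvector exports.

// L2-normalize a vector. A zero vector is returned unchanged to avoid dividing by zero.
function l2Normalize(v) {
  let sumSq = 0
  for (const x of v) sumSq += x * x
  const norm = Math.sqrt(sumSq)
  if (norm === 0) return Float32Array.from(v)
  return Float32Array.from(v, x => x / norm)
}

function dot(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

// Full cosine similarity: needs two square roots and a division per comparison.
function cosine(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
}

// After normalizing once at write time, search only needs dot().
const a = l2Normalize([3, 4])
const b = l2Normalize([4, 3])
console.log(dot(a, b), cosine([3, 4], [4, 3])) // both are approximately 0.96
```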
+### Search-side
+
+| Param | Today | Target | Why / experiment |
+|---|---|---|---|
+| `topK` | `10` | **Required**, default 10 | Caller intent. Keep. |
+| `query`, `source`, `metadata`, `binary`, `signal` | — | **Required / passthrough** | These aren't tuning knobs. |
+| `metric` | from KV | **KV-metadata** (already) | Already automatic. The argument exists only as an override; demote to "rarely needed". |
+| `rerankFactor` | `10` | **Derive(N, topK)** | The README already documents `~max(10, N/3000)`. Make this the default — read `N` from KV and compute. Caller can still override for the recall/latency knob. **Needed**: confirm the `N/3000` rule on LLM logs at 100k / 500k / 1M. The wiki benchmark only validates it at 1M synthetic. |
+| `probe` | `0.25` | **Derive(N, clusters)** | Probe is tightly coupled to recall. **Needed**: sweep `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` on LLM logs, plot recall vs. ms. If the recall@10 curve is well-behaved (monotonic, knee in a predictable place), pick a default that gives ≥90% recall; expose `probe` only when caller wants more/less recall. |
+
+## What we need to actually run
+
+Most parameters above resolve via existing evidence (the README ablations) or trivial code changes. The genuinely open questions all need the **same dataset** and the **same sweep harness**:
+
+1. **The dataset**: `AmanPriyanshu/tool-reasoning-sft-CODING-jupyter-agent-dataset-sft-tool-use-agent-data-cleaned-rectified` from Hugging Face. LLM tool/code logs — repetitive, long-tailed, structurally different from wiki titles. If our defaults look wrong on this, we know they're tuned to wiki.
+2. **Embed at 384-dim with MiniLM**, normalized — same model as the wiki baseline, so numbers compare directly.
+3. **Sweep, at 50k / 100k / 500k row subsets**:
+   - `clusters ∈ {0, sqrt(N)/2, sqrt(N), 2·sqrt(N), 4·sqrt(N)}` (write-side, expensive)
+   - `rerankFactor ∈ {0, 10, 30, 100, max(10, N/3000), 300}` (cheap; redo per query set)
+   - `probe ∈ {0.05, 0.1, 0.25, 0.5, 1.0}` (cheap; same)
+4. **Report** recall@10, ms/query, fetches, MB read — same table format as the existing README ablation, so they're directly comparable.
+
+If LLM log results agree with wiki, we adopt the `sqrt(N)` / `N/3000` / `probe=0.25` defaults and document. If they disagree, we keep the knobs as "tune for your corpus" and write up the difference.
+
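The sweep grid in step 3 can be generated from `N` alone. This is a sketch of one way to build it, not the contents of the actual harness: `sweepGrid` is a hypothetical name, and taking multiples of the rounded `sqrt(N)` (rather than rounding each multiple separately) is an assumption.

```javascript
// Hypothetical sweep-grid builder; the real harness may differ.
function sweepGrid(n) {
  // Assumption: cluster counts are multiples of the rounded sqrt(N).
  const base = Math.round(Math.sqrt(n))
  const clusters = [0, Math.round(base / 2), base, 2 * base, 4 * base]

  // The N/3000 rule value is slotted in among the fixed rerankFactor points.
  const rule = Math.max(10, Math.round(n / 3000))
  const rerankFactor = [...new Set([0, 10, 30, 100, rule, 300])].sort((x, y) => x - y)

  const probe = [0.05, 0.1, 0.25, 0.5, 1.0]
  return { clusters, rerankFactor, probe }
}

console.log(sweepGrid(100000).clusters) // [ 0, 158, 316, 632, 1264 ]
```

Only the `clusters` axis requires rewriting the file; the `rerankFactor` and `probe` axes can be replayed against each written file.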
+## Empirical results — LLM logs, 100k × 384-dim MiniLM
+
+From `scripts/sweep-llmlog.js`. Corpus is 100k messages from the tool-reasoning-sft dataset, embedded with `Xenova/all-MiniLM-L6-v2`, normalized. 20 in-corpus queries; reference top-10 from exact full scan.
+
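For reference, recall in the tables below is measured against the exact full-scan top-10. A minimal sketch of that metric, assuming each run is an object with hypothetical `approx` and `exact` id arrays (the real script's data shapes may differ):

```javascript
// recall@k for one query: the fraction of the exact top-k ids
// that the approximate search also returned in its top-k.
function recallAtK(approxIds, exactIds, k = 10) {
  const truth = new Set(exactIds.slice(0, k))
  let hits = 0
  for (const id of approxIds.slice(0, k)) {
    if (truth.has(id)) hits++
  }
  return hits / truth.size
}

// Mean recall over a set of queries. `runs` is a hypothetical shape:
// [{ approx: [...ids], exact: [...ids] }, ...]
function meanRecall(runs, k = 10) {
  const total = runs.reduce((sum, r) => sum + recallAtK(r.approx, r.exact, k), 0)
  return total / runs.length
}
```

With 20 queries and `k = 10` there are 200 result slots, so reported recall can only move in 0.5 percentage-point steps.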
+### Clusters sweep (probe=0.25, rerankFactor=10)
+
+| clusters | size MB | ms | fetches | MB read | recall |
+|---:|---:|---:|---:|---:|---:|
+| 0 (no clustering, binary only) | 160.6 | 31.9 | 104 | 8.37 | 93.0% |
+| 158 (≈ √N/2) | 160.7 | **8.4** | 71 | 3.80 | 94.0% |
+| 316 (≈ √N) | 160.7 | 9.4 | 105 | 3.29 | 94.0% |
+| 632 (≈ 2√N) | 160.8 | 13.1 | 187 | 3.18 | 94.0% |
+| 1264 (≈ 4√N) | 161.1 | 20.0 | 346 | 2.99 | 94.0% |
+
+**Read**: clustering wins big, a ~3.8× speedup over unclustered (31.9 ms → 8.4 ms). The latency optimum is `√N/2`, not `√N`, because at a fixed `probe=0.25` more clusters means more row ranges to fetch (71 fetches at 158 clusters vs. 346 at 1264). MB read keeps dropping with more clusters (tighter ranges), so the right `clusters` value depends on whether you optimize wall time or bandwidth.
+
+### rerankFactor sweep (clusters=316, probe=0.25)
+
+| rerankFactor | ms | fetches | MB read | recall |
+|---:|---:|---:|---:|---:|
+| **10** | **9.4** | 105 | 3.30 | **94.0%** |
+| 30 | 16.3 | 138 | 5.73 | 94.5% |
+| 33 (N/3000 rule) | 17.7 | 142 | 6.08 | 94.5% |
+| 100 | 39.6 | 188 | 12.83 | 94.5% |
+| 300 | 100.1 | 226 | 26.12 | 94.5% |
+
+**Read**: at 100k the `N/3000` rule from the README is overcautious for this corpus — `rf=10` is already at 94% recall, and bumping to 33 buys 0.5pp at +8ms. The rule was tuned on synthetic 1M data where binary collisions dominate; LLM logs are well-clustered enough that the default 10 holds longer.
+
+### Probe sweep (clusters=316, rerankFactor=10)
+
+| probe | ms | fetches | MB read | recall |
+|---:|---:|---:|---:|---:|
+| 0.05 | 4.4 | 35 | 2.21 | 93.0% |
+| **0.10** | **5.4** | 55 | 2.54 | **94.0%** |
+| 0.25 (current default) | 9.0 | 105 | 3.29 | 94.0% |
+| 0.50 | 15.1 | 185 | 4.53 | 94.0% |
+| 1.00 | 27.2 | 343 | 6.91 | 94.0% |
+
+**Read**: `probe=0.10` matches the recall of `probe=0.25` at ~60% of the latency. The 0.25 default is overcautious — at least for this corpus and `clusters ≈ √N`.
+
+### What this changes in the plan
+
+1. **`probe` default → 0.10** (was 0.25). Same recall, ~40% faster. Worth re-confirming on wiki to make sure we're not regressing there.
+2. **`rerankFactor` default → keep 10**, not `max(10, N/3000)`. At 100k LLM log, 10 is already saturated. The `N/3000` rule should be reframed as "scale up only if you observe recall below your target", not a default.
+3. **`clusters` rule → `√N/2`**, not `√N`. Better latency at the same recall on this corpus. Sanity-check on wiki before locking in.
+4. **All three sweeps plateau at 94–94.5% recall.** This is suspiciously flat across configs, most likely because the corpus has many near-duplicate tool/code messages, which makes top-10 "easy". A second pass with stricter recall@1 or recall@100 metrics would be more discriminating, but the relative *ranking* across params should hold.
+
+## Sanity check — wiki, 156k × 384-dim MiniLM
+
+From `scripts/sweep-llmlog.js data/wiki_en.vectors.parquet`. Same sweeps, same code path, 20 in-corpus queries.
+
+### Clusters (probe=0.25, rerankFactor=10)
+
+| clusters | ms | fetches | MB read | recall |
+|---:|---:|---:|---:|---:|
+| 0 (no clustering, binary only) | 42.0 | 87 | 11.6 | 97.0% |
+| 198 (≈ √N/2) | **13.5** | 122 | 5.6 | 93.0% |
+| 395 (≈ √N) | 14.6 | 182 | 5.4 | 93.0% |
+| 790 (≈ 2√N) | 19.2 | 283 | 5.0 | 92.5% |
+| 1580 (≈ 4√N) | 30.9 | 491 | 5.0 | 94.5% |
+
+### rerankFactor (clusters=395, probe=0.25)
+
+| rerankFactor | ms | recall |
+|---:|---:|---:|
+| 10 | 14.6 | 93.0% |
+| 30 | 27.2 | **95.0%** |
+| 52 (N/3000 rule) | 39.1 | 95.5% |
+| 100 | 69.2 | 96.5% |
+| 300 | 173.5 | 96.5% |
+
+### Probe (clusters=395, rerankFactor=10)
+
+| probe | ms | recall |
+|---:|---:|---:|
+| 0.05 | 6.4 | **72.5%** ← regression |
+| 0.10 | 8.3 | **84.0%** ← regression |
+| 0.25 (default) | 14.5 | 93.0% |
+| 0.50 | 23.8 | 96.5% |
+| 1.00 | 41.6 | 97.0% |
+
+### What the sanity check changed
+
+The wiki numbers reverse one of the three LLM-log recommendations (`probe`) and temper a second (`rerankFactor`); the `clusters` recommendation holds:
+
+| Knob | LLM log says | Wiki says | Final |
+|---|---|---|---|
+| `clusters` | √N/2 wins on ms | √N/2 wins on ms (same recall as √N) | **Adopt `√N/2`** |
+| `probe` default | 0.10 enough for 94% | 0.10 = 84% (regression of 9pp) | **Keep 0.25 as default** |
+| `rerankFactor` | 10 is fine | 10→30 gains 2pp recall on wiki | **Keep 10 as default**, document `~max(10, N/3000)` as the recall-pressure knob (the README rule was right) |
+
+The disagreement on `probe` is the most interesting finding: LLM-log retrievals are dominated by near-duplicate tool/code messages, so even probe=0.05 finds 9 of the 10 "right" answers because there are many right answers per query. Wiki has more diverse content, so cluster probing actually matters. **The 0.25 default is correct precisely because it's tuned for the harder distribution.** Don't change it.
+
+### Final defaults (post-sanity-check)
+
+- `clusters` write-time default: `Math.round(Math.sqrt(N) / 2)` (when binary is on). Saves wall-time at the same or near-same recall on both corpora.
+- `binary` write-time default: on when `N ≥ ~10k` (not yet measured at small N — assumption based on existing wiki ablation showing it's a clear win past hundreds of thousands).
+- `probe` search default: stays at `0.25`. The LLM-log data tempted us to drop it; the wiki data showed why we shouldn't.
+- `rerankFactor` search default: stays at `10`. The `N/3000` rule moves into the documentation as "raise this if you observe sub-target recall", not as a default.
+
+### Still-open experiments
+
+- Repeat the clusters sweep at 500k and 1M LLM-log row counts to confirm `√N/2` across sizes.
+- Recall@100 to discriminate the LLM-log 94% ceiling.
+- Find the small-N crossover where the binary column stops being worth its ~3% extra bytes.
+
+
+## End state for the public API
+
+After the experiments above, the common case should look like:
+
+```js
+await writeVectors({
+  writer: fileWriter('vectors.parquet'),
+  dimension: 384,
+  vectors: embed(docs),
+}) // normalize=true, binary if N≥~10k, clusters≈sqrt(N)/2, all automatic
+
+const results = await searchVectors({
+  source: 'vectors.parquet',
+  query,
+  topK: 10,
+}) // rerankFactor=10 and probe=0.25 by default; metric and cluster layout read from KV
+```
+
+The advanced knobs (`rerankFactor`, `probe`, `binary` write-flag, `clusters`) stay available — but they move into an "Advanced" subsection of the README, not the quick start.
@@ -63,15 +63,12 @@ import { writeVectors } from 'hypvector'
 await writeVectors({
   writer: fileWriter('vectors.parquet'),
   dimension: 384,
-  normalize: true,    // L2-normalize on write; lets search skip sqrt for cosine
-  binary: true,       // also write 1-bit-per-dim sign column for binary+rerank search
-  clusters: 128,      // k-means clusters for phase-1 pruning (implies binary: true)
-  pq: true,           // optional IVF-PQ index for approximate scoring before rerank
+  normalize: true,       // L2-normalize on write; lets search skip sqrt for cosine
   vectors: myEmbedder(), // any sync or async iterable of { id, vector }
 })
 ```
 
-When `binary: true`, the default `pageSize` drops to 32 KB so that offset-index reads during search fetch tight ranges. Override with explicit `pageSize` / `codec` / `rowGroupSize` if needed.
+By default, `writeVectors` adds the binary sign-bit column and clusters rows automatically once the corpus crosses ~10k vectors. Below that, files are written as plain id + vector columns and search uses an exact full scan. To control these manually, pass `binary: true/false` and `clusters: <n>`; passing either disables the auto behavior for that knob. When the binary column is written, `pageSize` defaults to 32 KB so offset-index reads during search fetch tight ranges. Pass `pq: true` to additionally write an IVF-PQ index for approximate scoring before rerank (mutually exclusive with binary `clusters`).
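
To make the binary sign-bit column concrete, here is a rough sketch of packing one bit per dimension and comparing two codes by Hamming distance. The function names are hypothetical, and the bit layout (bit set when the value is positive, least-significant bit first) is an assumption rather than hypvector's documented on-disk format.

```javascript
// Pack the sign of each dimension into one bit: dim / 8 bytes per vector.
// Assumptions: a bit is 1 when the value is > 0, packed least-significant bit first.
function packSignBits(vector) {
  const out = new Uint8Array(Math.ceil(vector.length / 8))
  for (let i = 0; i < vector.length; i++) {
    if (vector[i] > 0) out[i >> 3] |= 1 << (i & 7)
  }
  return out
}

// Hamming distance between two equal-length codes: the number of differing bits.
function hammingDistance(a, b) {
  let dist = 0
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]
    while (x) {
      x &= x - 1 // clear the lowest set bit
      dist++
    }
  }
  return dist
}
```

A 384-dim vector packs into 48 bytes, so a phase-1 scan over these codes reads far less than the full float vectors that the rerank phase fetches.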
 
 ### Producing vectors
 
@@ -106,8 +103,6 @@ await writeVectors({
   writer: fileWriter('vectors.parquet'),
   dimension: 384,
   normalize: true,
-  binary: true,
-  clusters: 128,
   vectors: embed(docs),
 })
 ```
@@ -208,7 +203,6 @@ Key-value metadata:
 | `hypvector.clusters` | number of k-means clusters (0 if not clustered) |
 | `hypvector.centroids` | base64-encoded centroid binary codes (`clusters × dim/8` bytes); present when `clusters > 0` |
 | `hypvector.clusterCounts` | base64-encoded `Uint32Array` of per-cluster row counts; present when `clusters > 0` |
-| `hypvector.pq.mode` | `ivf`; present when `pq: true` |
 | `hypvector.pq.segments` | number of PQ sub-vectors / bytes per code; present when `pq: true` |
 | `hypvector.pq.centroids` | centroids per PQ sub-vector; present when `pq: true` |
 | `hypvector.pq.codebooks` | base64-encoded residual `Float32Array` codebooks (`pq.centroids × dim` floats); present when `pq: true` |
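
The base64-encoded cluster entries above can be decoded with standard typed-array tools. The sketch below is an assumption-laden illustration: the function names are hypothetical, and little-endian byte order for `hypvector.clusterCounts` is assumed rather than confirmed by the table.

```javascript
// Decode a base64 string into bytes. atob is available in browsers and Node 16+.
function base64ToBytes(b64) {
  const bin = atob(b64)
  const bytes = new Uint8Array(bin.length)
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i)
  return bytes
}

// Per-cluster row counts. Assumption: uint32 values are stored little-endian.
function decodeClusterCounts(b64) {
  const bytes = base64ToBytes(b64)
  const view = new DataView(bytes.buffer)
  const counts = new Uint32Array(bytes.length / 4)
  for (let i = 0; i < counts.length; i++) counts[i] = view.getUint32(i * 4, true)
  return counts
}

// Centroids are binary codes of dim / 8 bytes each; split them into one code per cluster.
function decodeCentroids(b64, clusters, dimension) {
  const bytes = base64ToBytes(b64)
  const stride = dimension / 8
  return Array.from({ length: clusters }, (_, c) => bytes.subarray(c * stride, (c + 1) * stride))
}
```

Because rows are reordered by cluster id, a running sum over the decoded counts gives each cluster's starting row.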