Default normalize to true

platypii · platypii · commit 725d60275333 · 2026-06-20T16:32:28.000-07:00
Flip the writeVectors normalize default from false to true. Cosine on
normalized vectors reduces to a dot product, and every benchmark ran
normalized with no downside, so the common case no longer needs the flag.
Kept as an opt-out rather than forced: dot and euclidean are
magnitude-sensitive, so always normalizing would silently change their results.
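
For reviewers who want to check the reasoning, here is a minimal standalone
sketch of why the default is safe for cosine but stays an opt-out for dot. The
helpers `dot`, `l2Normalize`, and `cosine` are hypothetical and written only
for this illustration; they are not hypvector's internals and may differ from
what writeVectors does.

```js
// Hypothetical helpers for illustration only; not part of hypvector's API.
function dot(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

function l2Normalize(v) {
  const norm = Math.sqrt(dot(v, v))
  // Leave the zero vector unchanged to avoid dividing by zero.
  return norm === 0 ? v.slice() : v.map(x => x / norm)
}

function cosine(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
}

const a = [3, 4]
const b = [6, 8] // same direction as a, twice the magnitude
const c = [4, 3]

const na = l2Normalize(a)
const nb = l2Normalize(b)
const nc = l2Normalize(c)

// Cosine on the raw vectors matches a plain dot product on the normalized
// ones, so no per-candidate sqrt is needed once vectors are unit-length.
const cosineRaw = cosine(a, c)
const dotNormalized = dot(na, nc)
console.log(cosineRaw, dotNormalized) // both ≈ 0.96

// Raw dot is magnitude-sensitive: b outscores a against c.
console.log(dot(a, c), dot(b, c)) // 24 48

// After normalization the magnitude difference is gone and the two tie,
// which is why a caller who wants raw dot must pass normalize: false.
console.log(dot(na, nc), dot(nb, nc))
```

The last two log lines show the opt-out case: the ranking that raw dot would
produce is lost once vectors are normalized on write.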

Tests that asserted byte-exact round-trips now pass normalize: false
explicitly; the KV-metadata test expects normalized=true.
diff --git a/PLAN_AUTO.md b/PLAN_AUTO.md
@@ -19,7 +19,7 @@ Each parameter below has a current state, a target strategy, and the experiments
 |---|---|---|---|
 | `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
 | `metric` | `'cosine'` default, in KV | **KV-metadata** (done) | Defaults to `'cosine'`, stored in KV, read transparently at search. |
-| `normalize` | `false` arg | **KV-metadata, default `true` (not yet flipped)** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside, and the README/quickstart already pass `true`. Open: flip the *code* default so callers can omit it. Harmless if vectors are already unit-length. |
+| `normalize` | **default `true` (shipped)** | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside. Code default flipped to `true`; callers can omit it. Harmless if already unit-length. Kept as a flag (not forced) because `dot`/`euclidean` are magnitude-sensitive and would silently break if always normalized. |
 | `binary` | **Auto (shipped)** | **Derive(N): on at N ≥ 10k** | Shipped: auto-on at `defaultAutoBinaryThreshold = 10000` (~1.5% extra bytes for ~50× fewer bytes-read in phase 2). Below threshold, exact scan is fine. Small-N crossover still unmeasured (see open experiments). |
 | `clusters` | **Auto (shipped)** | **Derive(N): `round(√N/2)`** | Shipped: `round(√N/2)` when binary auto-on (`writeVectors.js`). The sweep below locked in `√N/2` over `√N` (better latency, same recall on both corpora). Caller can still pass an explicit count or `0`. |
 | `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
@@ -183,15 +183,14 @@ Lessons:
 
 ## End state for the public API
 
-The common case is now (one open item: flip the `normalize` default to `true`):
+The common case is now:
 
 ```js
 await writeVectors({
   writer: fileWriter('vectors.parquet'),
   dimension: 384,
-  normalize: true, // still required explicitly; flipping the default is the last open write-side item
   vectors: embed(docs),
-}) // binary auto at N≥10k, clusters≈√N/2 (both automatic)
+}) // normalize defaults to true; binary auto at N≥10k, clusters≈√N/2 (all automatic)
 
 const results = await searchVectors({
   source: 'vectors.parquet',
diff --git a/README.md b/README.md
@@ -70,7 +70,8 @@ import { writeVectors } from 'hypvector'
 await writeVectors({
   writer: fileWriter('vectors.parquet'),
   dimension: 384,
-  normalize: true,       // L2-normalize on write; lets search skip sqrt for cosine
+  // normalize defaults to true: L2-normalize on write, lets search skip sqrt for cosine.
+  // Pass normalize: false only if you need raw magnitudes (e.g. dot/euclidean on unnormalized vectors).
   vectors: myEmbedder(), // any sync or async iterable of { id, vector }
 })
 ```
@@ -83,7 +84,7 @@ HypVector is BYO-embedding: you decide which model produces the vectors. It just
 
 1. **Same model on write and query.** Embeddings from different models aren't comparable.
 2. **Same `dimension`** for every record (must match the `dimension` you pass to `writeVectors`).
-3. **`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.
+3. **`normalize` defaults to `true`**, the right choice for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), the default is harmless and records the flag in KV metadata. Pass `normalize: false` only when you want to preserve raw magnitudes for `dot`/`euclidean`.
 
 The natural shape is an async generator that yields embedded records as you batch them through your embedder.
 
diff --git a/src/types.d.ts b/src/types.d.ts
@@ -16,7 +16,7 @@ export interface WriteVectorsOptions {
   dimension: number // length of every vector (must match)
   rowGroupSize?: number // rows per row group (default: 10000)
   metric?: DistanceMetric // hint stored in kv metadata (default: 'cosine')
-  normalize?: boolean // l2-normalize vectors on write (default: false)
+  normalize?: boolean // l2-normalize vectors on write (default: true). Harmless if vectors are already unit-length. Pass `false` to preserve raw magnitudes (e.g. for dot/euclidean on unnormalized vectors).
   codec?: CompressionCodec // parquet codec (default: 'UNCOMPRESSED'; SNAPPY rarely shrinks float embeddings and costs ~2-3x query latency. ZSTD on write isn't supported here — hyparquet-compressors only ships decompressors.)
   binary?: boolean // also write a 1-bit-per-dim sign-bit column for binary+rerank search (default: auto — on when N ≥ 10000; adds ~1.5% file size at 384-dim). Pass `false` to force-disable.
   pageSize?: number // target page size in bytes (default: 1 MB). Smaller pages let `useOffsetIndex` fetch tighter ranges in rerank phase 2 at the cost of more page-header overhead.
diff --git a/src/writeVectors.js b/src/writeVectors.js
@@ -38,7 +38,7 @@ export async function writeVectors({
   dimension,
   rowGroupSize,
   metric = 'cosine',
-  normalize = false,
+  normalize = true,
   codec = 'UNCOMPRESSED',
   binary,
   pageSize,
diff --git a/test/readVectors.test.js b/test/readVectors.test.js
@@ -17,7 +17,7 @@ describe('readVectors', () => {
     const dimension = 16
     const original = makeVectors(25, dimension, 42)
     const writer = fileWriter(TEST_FILE)
-    await writeVectors({ writer, vectors: original, dimension })
+    await writeVectors({ writer, vectors: original, dimension, normalize: false })
 
     const file = await asyncBufferFromFile(TEST_FILE)
     const read = []
diff --git a/test/writeVectors.test.js b/test/writeVectors.test.js
@@ -36,7 +36,7 @@ describe('writeVectors', () => {
     expect(find('hypvector.version')).toBe('0')
     expect(find('hypvector.dimension')).toBe('8')
     expect(find('hypvector.metric')).toBe('cosine')
-    expect(find('hypvector.normalized')).toBe('false')
+    expect(find('hypvector.normalized')).toBe('true')
   })
 
   it('rejects vectors with the wrong dimension', async () => {
@@ -131,7 +131,7 @@ describe('writeVectors', () => {
 
     // binary: false takes the streaming fast path: row-group-sized batches are
     // packed and flushed without materializing the whole dataset.
-    await writeVectors({ writer, dimension, vectors: source, binary: false })
+    await writeVectors({ writer, dimension, vectors: source, binary: false, normalize: false })
 
     const file = await asyncBufferFromFile(TEST_FILE)
     const meta = await parquetMetadataAsync(file)