Commit 725d602

Default normalize to true
Flip the writeVectors normalize default from false to true. Cosine on normalized vectors reduces to dot product and every benchmark ran normalized with no downside, so the common case no longer needs the flag. Kept as an opt-out: dot/euclidean are magnitude-sensitive and would break if normalization were forced. Tests that asserted byte-exact round-trips now pass normalize: false explicitly; the KV-metadata test expects normalized=true.
1 parent 62a9b2c commit 725d602
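
The commit message leans on two facts: cosine similarity on unit-length vectors equals a plain dot product, and forcing normalization would change results for magnitude-sensitive metrics. The sketch below checks both on toy data. The helpers `dot`, `l2Normalize`, and `cosine` are hypothetical and written only for this illustration; they are not hypvector's API, and the library's internal implementation may differ.

```js
// Hypothetical helpers for illustration only; not part of hypvector.

/** Dot product of two equal-length numeric arrays. */
function dot(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

/** Return a unit-length copy of v (a zero vector is returned unchanged). */
function l2Normalize(v) {
  const norm = Math.sqrt(dot(v, v))
  return norm === 0 ? v.slice() : v.map(x => x / norm)
}

/** Cosine similarity: dot product divided by the product of magnitudes. */
function cosine(a, b) {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b))
}

const query = [1, 0]
const small = [1, 0] // same direction as the query, magnitude 1
const large = [3, 4] // different direction, magnitude 5

// 1. On unit-length vectors the dot product matches cosine similarity.
const cosRaw = cosine(query, large)
const dotUnit = dot(l2Normalize(query), l2Normalize(large))
console.log(Math.abs(cosRaw - dotUnit) < 1e-12) // true

// 2. Raw dot product rewards magnitude, so `large` outranks `small`...
console.log(dot(query, small), dot(query, large)) // 1 3

// ...but after normalization the ranking flips, which is why the flag
// stays an opt-out instead of being forced for every metric.
console.log(dot(query, l2Normalize(small)), dot(query, l2Normalize(large))) // 1 0.6
```

Euclidean distance behaves the same way on this data: on raw vectors `large` is far from the query mainly because of its length, and that information is lost once every vector is scaled to unit length.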

6 files changed

Lines changed: 11 additions & 11 deletions

PLAN_AUTO.md

Lines changed: 3 additions & 4 deletions
@@ -19,7 +19,7 @@ Each parameter below has a current state, a target strategy, and the experiments
 |---|---|---|---|
 | `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
 | `metric` | `'cosine'` default, in KV | **KV-metadata** (done) | Defaults to `'cosine'`, stored in KV, read transparently at search. |
-| `normalize` | `false` arg | **KV-metadata, default `true` (not yet flipped)** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside, and the README/quickstart already pass `true`. Open: flip the *code* default so callers can omit it. Harmless if vectors are already unit-length. |
+| `normalize` | **default `true` (shipped)** | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside. Code default flipped to `true`; callers can omit it. Harmless if already unit-length. Kept as a flag (not forced) because `dot`/`euclidean` are magnitude-sensitive and would silently break if always normalized. |
 | `binary` | **Auto (shipped)** | **Derive(N): on at N ≥ 10k** | Shipped: auto-on at `defaultAutoBinaryThreshold = 10000` (~1.5% extra bytes for ~50× fewer bytes-read in phase 2). Below threshold, exact scan is fine. Small-N crossover still unmeasured (see open experiments). |
 | `clusters` | **Auto (shipped)** | **Derive(N): `round(√N/2)`** | Shipped: `round(√N/2)` when binary auto-on (`writeVectors.js`). The sweep below locked in `√N/2` over `√N` (better latency, same recall on both corpora). Caller can still pass an explicit count or `0`. |
 | `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
@@ -183,15 +183,14 @@ Lessons:

 ## End state for the public API

-The common case is now (one open item: flip the `normalize` default to `true`):
+The common case is now:

 ```js
 await writeVectors({
 writer: fileWriter('vectors.parquet'),
 dimension: 384,
-normalize: true, // still required explicitly; flipping the default is the last open write-side item
 vectors: embed(docs),
-}) // binary auto at N≥10k, clusters≈√N/2 (both automatic)
+}) // normalize defaults to true; binary auto at N≥10k, clusters≈√N/2 (all automatic)

 const results = await searchVectors({
 source: 'vectors.parquet',

README.md

Lines changed: 3 additions & 2 deletions
@@ -70,7 +70,8 @@ import { writeVectors } from 'hypvector'
 await writeVectors({
 writer: fileWriter('vectors.parquet'),
 dimension: 384,
-normalize: true, // L2-normalize on write; lets search skip sqrt for cosine
+// normalize defaults to true: L2-normalize on write, lets search skip sqrt for cosine.
+// Pass normalize: false only if you need raw magnitudes (e.g. dot/euclidean on unnormalized vectors).
 vectors: myEmbedder(), // any sync or async iterable of { id, vector }
 })
 ```
@@ -83,7 +84,7 @@ HypVector is BYO-embedding: you decide which model produces the vectors. It just

 1. **Same model on write and query.** Embeddings from different models aren't comparable.
 2. **Same `dimension`** for every record (must match the `dimension` you pass to `writeVectors`).
-3. **`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.
+3. **`normalize` defaults to `true`**, the right choice for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), the default is harmless and records the flag in KV metadata. Pass `normalize: false` only when you want to preserve raw magnitudes for `dot`/`euclidean`.

 The natural shape is an async generator that yields embedded records as you batch them through your embedder.

src/types.d.ts

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ export interface WriteVectorsOptions {
 dimension: number // length of every vector (must match)
 rowGroupSize?: number // rows per row group (default: 10000)
 metric?: DistanceMetric // hint stored in kv metadata (default: 'cosine')
-normalize?: boolean // l2-normalize vectors on write (default: false)
+normalize?: boolean // l2-normalize vectors on write (default: true). Harmless if vectors are already unit-length. Pass `false` to preserve raw magnitudes (e.g. for dot/euclidean on unnormalized vectors).
 codec?: CompressionCodec // parquet codec (default: 'UNCOMPRESSED'; SNAPPY rarely shrinks float embeddings and costs ~2-3x query latency. ZSTD on write isn't supported here — hyparquet-compressors only ships decompressors.)
 binary?: boolean // also write a 1-bit-per-dim sign-bit column for binary+rerank search (default: auto — on when N ≥ 10000; adds ~1.5% file size at 384-dim). Pass `false` to force-disable.
 pageSize?: number // target page size in bytes (default: 1 MB). Smaller pages let `useOffsetIndex` fetch tighter ranges in rerank phase 2 at the cost of more page-header overhead.

src/writeVectors.js

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ export async function writeVectors({
 dimension,
 rowGroupSize,
 metric = 'cosine',
-normalize = false,
+normalize = true,
 codec = 'UNCOMPRESSED',
 binary,
 pageSize,

test/readVectors.test.js

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ describe('readVectors', () => {
 const dimension = 16
 const original = makeVectors(25, dimension, 42)
 const writer = fileWriter(TEST_FILE)
-await writeVectors({ writer, vectors: original, dimension })
+await writeVectors({ writer, vectors: original, dimension, normalize: false })

 const file = await asyncBufferFromFile(TEST_FILE)
 const read = []

test/writeVectors.test.js

Lines changed: 2 additions & 2 deletions
@@ -36,7 +36,7 @@ describe('writeVectors', () => {
 expect(find('hypvector.version')).toBe('0')
 expect(find('hypvector.dimension')).toBe('8')
 expect(find('hypvector.metric')).toBe('cosine')
-expect(find('hypvector.normalized')).toBe('false')
+expect(find('hypvector.normalized')).toBe('true')
 })

 it('rejects vectors with the wrong dimension', async () => {
@@ -131,7 +131,7 @@ describe('writeVectors', () => {

 // binary: false takes the streaming fast path: row-group-sized batches are
 // packed and flushed without materializing the whole dataset.
-await writeVectors({ writer, dimension, vectors: source, binary: false })
+await writeVectors({ writer, dimension, vectors: source, binary: false, normalize: false })

 const file = await asyncBufferFromFile(TEST_FILE)
 const meta = await parquetMetadataAsync(file)
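
The two round-trip tests above now pass `normalize: false` because normalizing rewrites the stored floats, so a byte-exact comparison against the input can no longer hold under the new default. A minimal sketch of that effect follows. `l2NormalizeF32` and `sameBytes` are hypothetical stand-ins written for this illustration; the actual write path in `writeVectors.js` is not shown in this diff and may differ.

```js
// Hypothetical stand-in for normalize-on-write; not hypvector's implementation.
function l2NormalizeF32(v) {
  let sumSq = 0
  for (const x of v) sumSq += x * x
  const norm = Math.sqrt(sumSq)
  const out = new Float32Array(v.length)
  for (let i = 0; i < v.length; i++) out[i] = norm === 0 ? v[i] : v[i] / norm
  return out
}

/** True when two Float32Arrays hold identical bytes. */
function sameBytes(a, b) {
  if (a.length !== b.length) return false
  const ua = new Uint8Array(a.buffer, a.byteOffset, a.byteLength)
  const ub = new Uint8Array(b.buffer, b.byteOffset, b.byteLength)
  return ua.every((byte, i) => byte === ub[i])
}

const original = new Float32Array([3, 4])
const stored = l2NormalizeF32(original) // approximately [0.6, 0.8]

console.log(sameBytes(original, original.slice())) // true: an untouched copy round-trips exactly
console.log(sameBytes(original, stored))           // false: normalization changed the values
```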
