Flip the writeVectors normalize default from false to true. Cosine on
normalized vectors reduces to dot product and every benchmark ran
normalized with no downside, so the common case no longer needs the flag.
Kept as an opt-out: dot/euclidean are magnitude-sensitive and would break
if normalization were forced.
Tests that asserted byte-exact round-trips now pass normalize: false
explicitly; the KV-metadata test expects normalized=true.
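
A minimal sketch of the reasoning above, using plain arrays. The helpers `dot`, `l2Normalize`, and `cosine` are hypothetical names written for this illustration and are not hypvector's actual implementation. It shows that cosine on the raw vectors matches a plain dot product on the L2-normalized ones, and that the raw dot product changes once vectors are normalized, which is why the flag stays available as an opt-out.

```js
// Hypothetical helpers for illustration only; not hypvector's internals.
function dot(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

function l2Normalize(v) {
  const norm = Math.sqrt(dot(v, v))
  // Return an all-zero vector unchanged to avoid dividing by zero.
  return norm === 0 ? v.slice() : v.map(x => x / norm)
}

function cosine(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
}

const a = [3, 4]
const b = [1, 2]
const ua = l2Normalize(a)
const ub = l2Normalize(b)

// Cosine on raw vectors needs two square roots per comparison.
const cosineRaw = cosine(a, b)
// On unit vectors the same score is a plain dot product.
const dotUnit = dot(ua, ub)
// The raw dot product is magnitude-sensitive, so normalizing changes it.
const dotRaw = dot(a, b)

console.log({ cosineRaw, dotUnit, dotRaw })
```

Under these assumptions, `cosineRaw` and `dotUnit` agree up to floating-point rounding, while `dotRaw` differs from both.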
PLAN_AUTO.md: 3 additions & 4 deletions
@@ -19,7 +19,7 @@ Each parameter below has a current state, a target strategy, and the experiments
 |---|---|---|---|
 | `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
 | `metric` | `'cosine'` default, in KV | **KV-metadata** (done) | Defaults to `'cosine'`, stored in KV, read transparently at search. |
-| `normalize` | `false` arg | **KV-metadata, default `true` (not yet flipped)** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside, and the README/quickstart already pass `true`. Open: flip the *code* default so callers can omit it. Harmless if vectors are already unit-length. |
+| `normalize` | **default `true` (shipped)** | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside. Code default flipped to `true`; callers can omit it. Harmless if already unit-length. Kept as a flag (not forced) because `dot`/`euclidean` are magnitude-sensitive and would silently break if always normalized. |
 | `binary` | **Auto (shipped)** | **Derive(N): on at N ≥ 10k** | Shipped: auto-on at `defaultAutoBinaryThreshold = 10000` (~1.5% extra bytes for ~50× fewer bytes-read in phase 2). Below threshold, exact scan is fine. Small-N crossover still unmeasured (see open experiments). |
 | `clusters` | **Auto (shipped)** | **Derive(N): `round(√N/2)`** | Shipped: `round(√N/2)` when binary auto-on (`writeVectors.js`). The sweep below locked in `√N/2` over `√N` (better latency, same recall on both corpora). Caller can still pass an explicit count or `0`. |
 | `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
@@ -183,15 +183,14 @@ Lessons:
 
 ## End state for the public API
 
-The common case is now (one open item: flip the `normalize` default to `true`):
+The common case is now:
 
 ```js
 await writeVectors({
   writer: fileWriter('vectors.parquet'),
   dimension: 384,
-  normalize: true, // still required explicitly; flipping the default is the last open write-side item
   vectors: embed(docs),
-}) // binary auto at N≥10k, clusters≈√N/2 (both automatic)
+}) // normalize defaults to true; binary auto at N≥10k, clusters≈√N/2 (all automatic)
README.md: 3 additions & 2 deletions
@@ -70,7 +70,8 @@ import { writeVectors } from 'hypvector'
 await writeVectors({
   writer: fileWriter('vectors.parquet'),
   dimension: 384,
-  normalize: true, // L2-normalize on write; lets search skip sqrt for cosine
+  // normalize defaults to true: L2-normalize on write, lets search skip sqrt for cosine.
+  // Pass normalize: false only if you need raw magnitudes (e.g. dot/euclidean on unnormalized vectors).
   vectors: myEmbedder(), // any sync or async iterable of { id, vector }
 })
 ```
@@ -83,7 +84,7 @@ HypVector is BYO-embedding: you decide which model produces the vectors. It just
 
 1. **Same model on write and query.** Embeddings from different models aren't comparable.
 2. **Same `dimension`** for every record (must match the `dimension` you pass to `writeVectors`).
-3. **`normalize: true`** is the right default for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), still pass `normalize: true` so the flag is recorded in KV metadata.
+3. **`normalize` defaults to `true`**, the right choice for any model whose vectors aren't already unit-length and you intend to query with cosine; it saves the per-candidate sqrt at query time. If your model already normalizes (most modern sentence-transformer models do), the default is harmless and records the flag in KV metadata. Pass `normalize: false` only when you want to preserve raw magnitudes for `dot`/`euclidean`.
 
 The natural shape is an async generator that yields embedded records as you batch them through your embedder.

   dimension: number // length of every vector (must match)
   rowGroupSize?: number // rows per row group (default: 10000)
   metric?: DistanceMetric // hint stored in kv metadata (default: 'cosine')
-  normalize?: boolean // l2-normalize vectors on write (default: false)
+  normalize?: boolean // l2-normalize vectors on write (default: true). Harmless if vectors are already unit-length. Pass `false` to preserve raw magnitudes (e.g. for dot/euclidean on unnormalized vectors).
   codec?: CompressionCodec // parquet codec (default: 'UNCOMPRESSED'; SNAPPY rarely shrinks float embeddings and costs ~2-3x query latency. ZSTD on write isn't supported here: hyparquet-compressors only ships decompressors.)
   binary?: boolean // also write a 1-bit-per-dim sign-bit column for binary+rerank search (default: auto, on when N ≥ 10000; adds ~1.5% file size at 384-dim). Pass `false` to force-disable.
   pageSize?: number // target page size in bytes (default: 1 MB). Smaller pages let `useOffsetIndex` fetch tighter ranges in rerank phase 2 at the cost of more page-header overhead.