Commit 3d516c0

Optimization plan
1 parent 70aad8c commit 3d516c0

3 files changed

Lines changed: 334 additions & 0 deletions

OPTIMIZE.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
# OPTIMIZE.md — reducing roundtrips and bytes for S3-backed search

## Goal

HypVector is meant to run fully serverless: the index is one Parquet file on
S3 (or any HTTP range source), and *all* compute happens in the client over the
network. The cost function we are optimizing is therefore:

```
query cost ≈ (number of dependent network roundtrips) × (cold latency ~100–250 ms each)
           + (bytes transferred) / (bandwidth)
```
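
To show how the two terms of this cost function trade off, here is a minimal sketch. The helper name `estimateQueryCostMs`, its parameter names, and the example inputs (150 ms latency, a 12.5 MB/s link, 6 MB transferred) are assumptions for illustration, not measurements from HypVector.

```javascript
// Hypothetical helper: evaluates the cost function above in milliseconds.
// Parameter names are illustrative and do not come from the codebase.
function estimateQueryCostMs({ dependentRoundtrips, latencyMs, bytes, bandwidthBytesPerSec }) {
  const latencyTerm = dependentRoundtrips * latencyMs
  const transferTerm = (bytes / bandwidthBytesPerSec) * 1000
  return latencyTerm + transferTerm
}

// Assumed inputs: 3 dependent roundtrips at 150 ms each, 6 MB transferred
// over a 12.5 MB/s (100 Mbit/s) link.
const costMs = estimateQueryCostMs({
  dependentRoundtrips: 3,
  latencyMs: 150,
  bytes: 6e6,
  bandwidthBytesPerSec: 12.5e6,
})
console.log(`${costMs.toFixed(0)} ms`)
```

With these assumed inputs the latency term (450 ms) and the transfer term (480 ms) are of similar size, which is why both roundtrips and bytes show up as levers.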

Three levers, in priority order of impact for *cold* object-storage reads:

1. **Roundtrips** — each dependent fetch is ~100–250 ms cold. Fewer, larger,
   parallel range GETs beat many small serial ones.
2. **Bytes on wire** — dominated by the float32 `vector` column. Quantization is
   the only big lever; Parquet codecs barely move embeddings.
3. **Query latency** — keep or improve client-side scan/rerank speed.

This file is a backlog of investigations. Each item states what it is, the
concrete expected win, the implementation cost, and a way to validate it.
PLAN_AUTO.md covers the *already-shipped* auto-tuning decisions; this file is
about the next frontier.

---

## What we already do (baseline — don't re-investigate)

Several things the literature recommends are already in the code. Stating them
so we don't waste an experiment re-discovering them:

- **IVF-style binary k-means clustering**, `round(√N/2)` clusters by default,
  centroids + per-cluster counts in Parquet KV metadata (`src/cluster.js`,
  `src/writeVectors.js`).
- **Rows sorted by cluster**, and **each cluster written as its own row group**
  (`rowGroupSize` = array of per-cluster counts). A probed list is already a
  contiguous row range.
- **Clusters renumbered by a greedy Hamming walk** (`reorderClustersByHamming`)
  so the nearest clusters to any query tend to land in adjacent id ranges,
  which `mergeRanges` then coalesces into fewer reads.
- **Two-phase search**: phase-1 Hamming scan over the 1-bit `vector_bin`
  column, phase-2 float32 rerank over `rerankFactor × topK` candidates
  (`src/search/rerank.js`).
- **`useOffsetIndex: true` in phase 2 and the id fetch**, with run coalescing
  (64-row gap tolerance) so scattered candidates become a few range GETs.
- **Uncompressed PLAIN** float32 (correct default — see Experiment B).

So the IVF instinct, the contiguous-list layout, and offset-index page seeking
in the rerank phase are done. The open work is below.
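
The run coalescing mentioned in the baseline list can be sketched as follows. This is not the project's `mergeRanges` or rerank code: `coalesceRows` is a hypothetical name, and treating the gap as the number of skipped rows between runs is an assumption about how the 64-row tolerance is counted.

```javascript
// Hypothetical sketch of run coalescing; not the project's actual code.
// Merges candidate row indices into half-open [start, end) row ranges,
// bridging gaps of up to `maxGap` skipped rows so each range is one GET.
function coalesceRows(rows, maxGap = 64) {
  const sorted = [...new Set(rows)].sort((a, b) => a - b)
  const ranges = []
  for (const row of sorted) {
    const last = ranges[ranges.length - 1]
    if (last && row - last.end <= maxGap) {
      last.end = row + 1 // extend the current run across the gap
    } else {
      ranges.push({ start: row, end: row + 1 })
    }
  }
  return ranges
}

// Six scattered candidates collapse into three ranges.
const ranges = coalesceRows([10, 12, 70, 500, 505, 2000])
console.log(ranges)
```

The trade is deliberate: reading a few unwanted rows inside a gap costs some bytes but saves a roundtrip, which the cost function in the Goal section prices much higher.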

### Already tried and removed — do not rebuild as-is

Two quantization schemes were built, benchmarked, and **deleted** as net
negatives. The shared lesson governs everything in Tier 2:

- **int8 cascade tier** (commit `e3e37f8`): an int8 column between phase-1
  binary and phase-2 float32. Saved only ~0.3 MB of phase-2 reads but added
  ~38 MB of file size and ~22 extra fetches per query. Net negative.
- **IVF-PQ** (commit `92e09bc`, documented in PLAN_AUTO.md): lost on every axis
  except raw phase-1 bytes; at 3072-dim it read fewer phase-1 bytes but at 66%
  recall and 2–6× wall-time.

**The lesson:** any quantizer that *adds a tier while keeping the full float32
`vector` column* optimizes the cheap part. Phase-2 float fetches dominate
bytes-read regardless, so shrinking phase-1 codes saves nothing meaningful, and
a new column only adds size and fetches. **The only quantization that can win is
a float-free lossy mode** — codes only, approximate final scores, no float32
column at all — for a multiplicatively smaller file. That reframes Tier 2 below:
the bar is "replace float32," never "add a tier beside it."

---

## Tier 1 — highest leverage

### Experiment A: RaBitQ in place of raw sign bits (bytes-neutral recall win)

**What.** Our `vector_bin` column is the raw sign bit per dimension. RaBitQ
(Gao & Long, SIGMOD 2024, arXiv 2405.12497) keeps the *same 1 bit/dim, same
32× compression relative to float32* but first applies a random orthogonal
rotation (Johnson–Lindenstrauss) and uses an *unbiased* distance estimator with
a provable `O(1/√D)` error bound. It is the same byte cost as what we ship, with
a strictly better phase-1 estimator that doesn't collapse on hard distributions
the way PQ can.

**Win.** Higher phase-1 recall at fixed candidate budget → we can lower
`rerankFactor` (fewer phase-2 bytes) at equal end recall, or raise recall at
fixed `rerankFactor`. Pure upside at the same on-wire size for `vector_bin`.

**Cost.** Medium. Need: a fixed random rotation (seedable, stored in KV
metadata so the reader reproduces it), encode = rotate then sign, and a phase-1
scorer that uses the RaBitQ estimator instead of raw Hamming. The rotation is
the only new moving part; everything else is our existing pipeline. Reference:
github.com/VectorDB-NTU/RaBitQ-Library.

**Validate.** Reuse `scripts/validate-params.js` recall harness: compare
recall@10/@100 of raw-sign vs RaBitQ at identical `rerankFactor` and probe, on
wiki (384-dim) and a 1024-dim corpus. Win = higher recall, or equal recall at
lower `rerankFactor`.
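
A minimal sketch of the "seedable rotation, then sign" encoding step, under stated assumptions. It covers only the encoding and a plain Hamming comparison; the RaBitQ distance estimator itself is not reproduced here. The dense Gram-Schmidt rotation is a simple stand-in chosen for clarity, and all function names (`mulberry32`, `randomRotation`, `encodeSigns`, `hamming`) are hypothetical rather than taken from HypVector or the RaBitQ library.

```javascript
// Small seedable PRNG so a writer and a reader can derive the same rotation
// from one stored seed.
function mulberry32(seed) {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Random orthogonal matrix: Gaussian rows orthonormalized by Gram-Schmidt.
// O(dim^3), so only suitable as an illustration at small dim.
function randomRotation(dim, seed) {
  const rand = mulberry32(seed)
  const gauss = () => Math.sqrt(-2 * Math.log(1 - rand())) * Math.cos(2 * Math.PI * rand())
  const rows = []
  for (let i = 0; i < dim; i++) {
    const row = Float64Array.from({ length: dim }, gauss)
    for (const prev of rows) {
      let dot = 0
      for (let j = 0; j < dim; j++) dot += row[j] * prev[j]
      for (let j = 0; j < dim; j++) row[j] -= dot * prev[j]
    }
    let norm = 0
    for (let j = 0; j < dim; j++) norm += row[j] * row[j]
    norm = Math.sqrt(norm)
    for (let j = 0; j < dim; j++) row[j] /= norm
    rows.push(row)
  }
  return rows
}

// Encode = rotate, then keep one sign bit per dimension (packed 8 per byte).
function encodeSigns(vector, rotation) {
  const dim = vector.length
  const code = new Uint8Array(Math.ceil(dim / 8))
  for (let i = 0; i < dim; i++) {
    let rotated = 0
    for (let j = 0; j < dim; j++) rotated += rotation[i][j] * vector[j]
    if (rotated > 0) code[i >> 3] |= 1 << (i & 7)
  }
  return code
}

// Hamming distance between two packed codes of equal length.
function hamming(a, b) {
  let dist = 0
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]
    while (x) { x &= x - 1; dist++ }
  }
  return dist
}

const dim = 16
const rotation = randomRotation(dim, 42)
const v = Array.from({ length: dim }, (_, i) => Math.sin(i + 1))
const code = encodeSigns(v, rotation)
console.log(code.length, hamming(code, code))
```

Only the seed needs to be stored for the reader to rebuild the same rotation, which is the property the Cost paragraph asks for. A production version would likely replace the dense matrix with a faster structured transform.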

### Experiment C: phase-1 offset-index page skipping — RESOLVED (no change)

**Outcome (2026-06-21): already handled by design; not an opportunity.**

The premise was wrong. Phase 1 deliberately reads whole binary column chunks,
and `rerank.js:51-57` documents why: the binary column is `dim/8` bytes/row, so
per-page `useOffsetIndex` seeking costs an extra roundtrip to read the offset
index without saving meaningful bytes. The 32 KB binary page size exists for
*phase 2* candidate seeking, not phase 1. Moreover there is a `prefetchBinary`
path (`src/prefetch.js`) that loads the entire small binary column into RAM
once, making phase 1 *zero-network* — strictly better than page-seeking it.
Nothing to do here.

### Experiment D: nprobe — RESOLVED (cap the fraction at scale)

**Outcome (2026-06-21): keep the fraction, but add an absolute cap.**

Measured probe sweeps (`scripts/validate-params.js probe`) on wiki (384-dim,
N=20k–156k) and tpuf (1024-dim, N=250k/1M), clusters at the shipped √N/2:

- **The "switch to absolute probe" idea is refuted.** A fixed absolute count
  lets recall *slide* as N grows, because clusters grow as √N/2 so a constant
  count is a shrinking fraction. probe=16 → 91% @20k, 79% @80k, 81% @156k.
  The 0.25 *fraction* holds recall steady (91→90→93%) across 8× scale, so it is
  the correct parameterization here, not absolute count. (This is why the
  literature's "~16–32 probes" rule doesn't transfer: it assumes
  nlist≈C·√N with large C; we use √N/2, far fewer/bigger lists.)

- **But the fraction over-probes at large N.** At 1M (500 clusters), probed
  list count vs cost/recall:

  | lists | fetches | MB read | recall@10 |
  |------:|--------:|--------:|----------:|
  | 48 | 118 | 17.8 | 89.0% |
  | 64 | 137 | 21.8 | 91.0% |
  | 80 | 155 | 25.7 | 92.0% |
  | 96 | 172 | 29.9 | 92.5% |
  | **125 (=0.25 frac)** | **202** | **37.0** | **93.0%** |

  Recall knees at ~80 lists (92%). Relative to 80 lists, the fraction's last
  1pp (92→93%) costs +47 fetches and +11 MB: ~30% more roundtrips and ~44% more
  bytes for marginal recall.

**Recommended change:** `probe = min(ceil(fraction × nlist), cap)` with
`cap ≈ 80–96`. The cap only binds above ~400k vectors (where 0.25·√N/2 > 80),
so all current small/medium-N behavior is unchanged; at 1M a cap of 80 trims
~23% of roundtrips and ~30% of bytes for ~1pp recall. Backward-compatible, low
risk. Open question: exact cap value (80 vs 96) and whether it's user-overridable.
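
The recommended rule can be sketched as below. `probeCount` and `nlistFor` are hypothetical names, and the default `cap = 80` is an assumption, since the cap value is still an open question. The loop shows that the cap leaves 20k and 156k unchanged and only binds at 1M.

```javascript
// Hypothetical helper for probe = min(ceil(fraction × nlist), cap).
// The default cap of 80 is an assumption; the text leaves 80 vs 96 open.
function probeCount(nlist, fraction = 0.25, cap = 80) {
  const byFraction = Math.max(1, Math.ceil(fraction * nlist))
  return Math.min(byFraction, cap, nlist)
}

// Shipped default cluster count: round(sqrt(N) / 2).
const nlistFor = n => Math.round(Math.sqrt(n) / 2)

for (const n of [20000, 156000, 1000000]) {
  const nlist = nlistFor(n)
  console.log(`N=${n} nlist=${nlist} probe=${probeCount(nlist)}`)
}
```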

---

## Tier 2 — meaningful, more work

### Experiment E: float-free lossy mode (the only quantization that can win)

**What.** A search mode with **no float32 column at all** — final scores come
from a multi-bit code. Candidate codec: Extended RaBitQ (SIGMOD 2025, arxiv
2409.09913), B bits/dim, reported **B=5 → >95% recall at 6.4×, B=7 → >99% at
4.5×**, beating scalar quantization at equal bits and good enough that there is
nothing to rerank against. This is the *float-free lossy* feature PLAN_AUTO
named as "the only way quantization pays off," now with a codec that might
actually hit the recall bar.

**Win.** Multiplicatively smaller file — the float32 `vector` column is ~3/4 of
the bytes and the bulk of phase-2 reads. Removing it (not shrinking it, not
adding a tier beside it) is the single biggest bytes-on-wire reduction
available. This is a *different feature* from today's exact-rerank index, with
its own recall/size contract, not a drop-in tier.

**Cost.** High. New multi-bit codec, new scorer, a new file mode, and a clear
API story that this trades exactness for ~5–6× smaller files. Reuses the RaBitQ
rotation from Experiment A. Gate strictly behind its own benchmark.

**Validate.** This is the make-or-break number for the whole quantization line:
does a *float-free* index hold ≥95% recall@10 on real corpora (384- and
1024-dim, ≥500k)? Compare file size, MB read, and recall against today's
binary+float32. If float-free can't clear the recall bar, quantization stays
shelved — adding a tier beside float32 is already proven net-negative (see
"Already tried and removed").

> **Rejected: int8 / any tier beside float32.** An int8 cascade tier was built
> and removed (`e3e37f8`) for exactly the "optimizes the cheap part" reason.
> Do not re-propose int8, PQ, or RaBitQ *as an added column* — only as a
> float32 *replacement* per Experiment E.

### Experiment G: two-level centroid index for large nlist

**What.** Centroids live in KV metadata and are scanned linearly to rank
clusters (`ranges.js`). Fine for √N/2 clusters at small N; at 1M+ vectors
(500+ clusters at √N/2, growing) that linear scan and the metadata size both
grow. SPANN's answer: a small index *over the centroids* so finding the K
nearest clusters is sub-ms, plus optional boundary-vector replication into a few
nearby lists to lift recall without raising nprobe.

**Win.** Keeps cluster selection cheap as nlist grows, and bounds KV-metadata
size. Mostly a scale concern (>1M).

**Cost.** Medium–high. New in-file structure for centroids; replication
inflates data ~20%. Only worth it once nlist is large enough that linear
centroid scan or metadata bloat actually shows up.

**Validate.** Measure centroid-scan time and KV-metadata bytes vs N; only pursue
if either becomes material at target corpus sizes.

---

## Tier 3 — measure first, likely small or negative

### Experiment B: float column encoding & compression (probably a no-op)

**What.** We ship PLAIN + UNCOMPRESSED float32. Candidates: BYTE_STREAM_SPLIT
encoding, and zstd/snappy compression.

**Expected.** **Small or zero.** Unit-norm float32 embeddings are
near-incompressible: the mantissa is ~7.3 bits/byte, so lossless ratios sit
around 1.08–1.20×, and snappy/zstd cost decode latency for ~5–10%.
BYTE_STREAM_SPLIT averages ~30% on *scientific* floats but is unproven on
embeddings and has gone *negative* on some data. This is why UNCOMPRESSED is the
current default and the right one.

**Cost.** Low to test (writer flags), but **gate any change behind a write-time
sample A/B** — never always-on. Most likely outcome: confirm the default and
move on.

**Validate.** On real corpora, write the float column under PLAIN, PLAIN+zstd,
and BYTE_STREAM_SPLIT; compare file size and phase-2 decode time. Adopt only if
a corpus shows a clear net win.
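
For readers unfamiliar with BYTE_STREAM_SPLIT, here is a minimal sketch of the transform for float32 values. It is illustrative only: the function names are hypothetical, it is not a Parquet writer, and it relies on the platform's native byte order (little-endian on common hardware, which matches Parquet). The transform does not shrink anything by itself; it groups the k-th byte of every value together so that a later codec may find more redundancy.

```javascript
// BYTE_STREAM_SPLIT sketch for float32: byte k of every value goes to
// stream k, and the 4 streams are concatenated. Output size equals input size.
function byteStreamSplit(floats) {
  const src = new Uint8Array(floats.buffer, floats.byteOffset, floats.byteLength)
  const n = floats.length
  const out = new Uint8Array(src.length)
  for (let i = 0; i < n; i++) {
    for (let k = 0; k < 4; k++) out[k * n + i] = src[i * 4 + k]
  }
  return out
}

// Inverse transform: interleave the 4 streams back into float32 values.
function byteStreamJoin(bytes) {
  const n = bytes.length / 4
  const out = new Uint8Array(bytes.length)
  for (let i = 0; i < n; i++) {
    for (let k = 0; k < 4; k++) out[i * 4 + k] = bytes[k * n + i]
  }
  return new Float32Array(out.buffer)
}

const values = new Float32Array([0.5, -0.25, 0.125, 1])
const split = byteStreamSplit(values)
const restored = byteStreamJoin(split)
console.log(split.length, Array.from(restored))
```

Whether the regrouped streams compress better on embeddings is exactly what the validation step above has to measure.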

### Experiment H: cold-open roundtrip floor

**What.** Confirm the very first fetch sequence is minimal: over-read the file
tail (~64 KB) in one GET to grab the footer + KV metadata in a single roundtrip,
then the page index, then coalesced data ranges. Target ~3 roundtrips for a
selective cold query.

**Win.** Shaves fixed startup latency off every cold query. Small but every
query pays it.

**Cost.** Low. Mostly verifying what hyparquet already does on a real S3/HTTP
source and adding a tail over-read hint if it issues a separate tiny footer GET.

**Validate.** Count `fetches` on a cold `asyncBuffer` for a single query; aim to
drive the fixed overhead to ~2 (footer+index) before data reads begin.
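
Whether a tail over-read actually captured the whole footer can be checked from the tail bytes alone, because a Parquet file ends with the metadata, a 4-byte little-endian metadata length, and the magic `PAR1`. The sketch below performs that check on a synthetic buffer with no network access. `footerInTail` is a hypothetical helper and does not show what hyparquet does internally; in a real client the tail would come from something like an HTTP suffix range request (`Range: bytes=-65536`).

```javascript
// Hypothetical check: given the last bytes of a Parquet file, report whether
// the footer metadata is fully contained in that tail.
// File layout at the end: <metadata> <4-byte LE metadata length> "PAR1".
function footerInTail(tail) {
  if (tail.length < 8) throw new Error('tail too short')
  const magic = String.fromCharCode(...tail.subarray(tail.length - 4))
  if (magic !== 'PAR1') throw new Error('not a Parquet file tail')
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  const metadataLength = view.getUint32(tail.length - 8, true)
  const needed = metadataLength + 8 // metadata + length field + magic
  return { metadataLength, needed, complete: needed <= tail.length }
}

// Synthetic tail: 100 bytes of placeholder metadata, then length and magic.
const fakeTail = new Uint8Array(108)
new DataView(fakeTail.buffer).setUint32(100, 100, true)
fakeTail.set([0x50, 0x41, 0x52, 0x31], 104) // "PAR1"

const whole = footerInTail(fakeTail)
const partial = footerInTail(fakeTail.subarray(58)) // only the last 50 bytes
console.log(whole.complete, partial.complete)
```

When `complete` is false, a second GET for the missing `needed - tail.length` bytes is unavoidable, so the over-read size should be chosen to make that case rare for typical KV-metadata sizes.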

### Explicitly out of scope / rejected

- **Bloom filters** — answer equality/membership only, never nearest-neighbor.
  Irrelevant to the vector path.
- **Dictionary encoding** — embeddings are continuous/high-cardinality; Parquet
  falls back to PLAIN anyway, and PLAIN is what we want for SIMD scan.
- **Graph indexes (HNSW / DiskANN / Vamana)** — dozens of *serial dependent*
  hops per query, each a cold fetch. Catastrophic on object storage. IVF is the
  correct family for S3 and we already use it. Do not pursue graph indexes for
  the cold tier.

---

## Suggested sequencing

1. ~~**D (nprobe)** and **C (phase-1 offset index)**~~ — DONE (2026-06-21).
   C was already handled by design (no change). D → keep the fraction but add an
   absolute cap (~80–96) that bounds over-probing above ~400k vectors. The cap
   is the one remaining code change from this tier; everything else here was a
   confirm-the-default.
2. **A (RaBitQ pre-rank)** — bytes-neutral recall win; also builds the rotation
   machinery E depends on.
3. **B (encoding A/B)** — quick confirm-or-reject, probably confirms the default.
4. **E (float-free lossy mode)** — the only quantization that can win, since a
   tier beside float32 is already proven net-negative. One make-or-break recall
   benchmark decides whether the whole quantization line is alive.
5. **G (two-level centroids)** and **H (cold-open floor)** — scale and
   fixed-overhead polish, pursue when measurements say they matter.

All experiments validate through `scripts/validate-params.js` (extend its
subcommands) using `recall@10` / `recall@100`, average `fetches`, and average
`MB read` on real corpora (wiki 384-dim, a 1024-dim set, and a ≥500k corpus for
scale). A change ships only when it improves bytes or roundtrips at
equal-or-better recall.

README.md

Lines changed: 15 additions & 0 deletions
@@ -23,6 +23,21 @@

At 156k 384-dim wiki embeddings (249 MB), a single top-10 query reads **~6 MB across ~160 ranged HTTP fetches** with ~91% recall against an exact full scan. Over a localhost HTTP server with 20 ms of injected per-request latency, the rerank path lands at **~140 ms/query** vs ~360 ms for an exact full scan.

## Benchmarks

Vector search over 3,199,860 OpenAI embeddings (1024-dim) of real LLM conversations ([WildChat-4.8M](https://huggingface.co/datasets/allenai/WildChat-4.8M)), top-10 recall against exact truth. Every competitor was queried over the network, the way it is actually deployed. hypvector keeps the vectors in object storage and runs the query in the client, so there is no server and no idle cost.

| Engine | Storage | Recall@10 | Warm query (p50) | All-in / mo | Server |
|---|---:|---:|---:|---:|---|
| **hypvector** | 13.7 GB | 0.925 | 147 ms | **~$0.32** | none |
| Pinecone | 13.1 GB | 0.920 | 85 ms | $50 min | managed |
| turbopuffer | 13.1 GB | 0.915 | 198 ms | $64 min | managed |
| S3 Vectors | 13.1 GB | 0.905 | 133 ms | ~$0.79 | serverless |
| pgvector | 41.9 GB | 0.870 | 80 ms | $372 | r5.2xlarge 24/7 |
| Qdrant | 13.1 GB | 0.865 | 70 ms | $186 | r5.xlarge 24/7 |

The managed and always-on engines keep the index hot to answer fast, which is what the monthly bill pays for. hypvector trades a little latency for zero idle cost and no infrastructure.

## Quick Start

### Browser Example

scripts/validate-params.js

Lines changed: 42 additions & 0 deletions
@@ -237,7 +237,49 @@ async function runScale() {
  }
}

// --- probe sweep: absolute count vs fraction across N ---------------------
// Question (OPTIMIZE.md Experiment D): the default probe is a FRACTION (0.25).
// IVF theory says recall@10 depends on the ABSOLUTE number of probed lists
// (~16-32 for 95-99%), roughly independent of N. If true, a fixed fraction
// over-probes as N grows — spending roundtrips/bytes for recall already had.
// This sweeps absolute probe counts and the 0.25 fraction at several N, on a
// file clustered at the shipped default (round(sqrt(N)/2)).
async function runProbe() {
  const Ns = (ARG ?? '').split(',').filter(Boolean).map(Number)
  if (!Ns.length) { console.error('probe needs comma-separated N list'); process.exit(1) }
  const maxN = Math.max(...Ns)
  console.log(`probe: loading up to ${maxN.toLocaleString()} from ${SRC}...`)
  const { records: all, meta } = await loadRecords(SRC, maxN)
  console.log(` loaded ${all.length.toLocaleString()} × ${meta.dimension}-dim`)
  const absProbes = [2, 4, 8, 16, 24, 32, 48]
  for (const N of Ns) {
    const records = all.slice(0, N)
    const clusters = Math.round(Math.sqrt(N) / 2)
    const queries = pickQueries(records, QUERY_COUNT)
    const base = `${SRC.replace(/\.parquet$/, '').split('/').pop()}_pb${N}`
    console.log(`\n=== N=${N.toLocaleString()} clusters=${clusters} (√N/2) ===`)
    // Reference: exact top-10 on a binary, no-cluster file.
    const refPath = await writeVariant(`${base}_c0`, records, meta, 0, true)
    const path = await writeVariant(`${base}_c${clusters}`, records, meta, clusters, true)
    const ref = await bench(refPath, queries, 10, { rerankFactor: 0 })
    console.log(`${'probe'.padStart(12)} ${'lists'.padStart(6)} ${'ms'.padStart(7)} ${'fetches'.padStart(8)} ${'MB read'.padStart(9)} ${'recall'.padStart(8)}`)
    console.log('-'.repeat(56))
    // Each absolute count, then the shipped 0.25 fraction for comparison.
    const variants = [
      ...absProbes.filter(p => p <= clusters).map(p => ({ probe: p, label: String(p) })),
      { probe: 0.25, label: '0.25 frac' },
    ]
    for (const v of variants) {
      const lists = v.probe > 1 ? Math.min(v.probe, clusters) : Math.max(1, Math.ceil(clusters * v.probe))
      const r = await bench(path, queries, 10, { probe: v.probe })
      const rec = recallAt(ref.tops, r.tops, 10)
      console.log(`${v.label.padStart(12)} ${String(lists).padStart(6)} ${r.ms.toFixed(1).padStart(7)} ${r.fetches.toFixed(0).padStart(8)} ${r.mb.toFixed(2).padStart(9)} ${(rec * 100).toFixed(1).padStart(7)}%`)
    }
  }
}

if (MODE === 'recall') await runRecall()
else if (MODE === 'smalln') await runSmallN()
else if (MODE === 'scale') await runScale()
else if (MODE === 'probe') await runProbe()
else { console.error(`unknown mode: ${MODE}`); process.exit(1) }
