|
| 1 | +# OPTIMIZE.md — reducing roundtrips and bytes for S3-backed search |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +HypVector is meant to run fully serverless: the index is one Parquet file on |
| 6 | +S3 (or any HTTP range source), and *all* compute happens in the client over the |
| 7 | +network. The cost function we are optimizing is therefore: |
| 8 | + |
| 9 | +``` |
| 10 | +query cost ≈ (number of dependent network roundtrips) × (cold latency ~100–250 ms each) |
| 11 | + + (bytes transferred) / (bandwidth) |
| 12 | +``` |
| 13 | + |
| 14 | +Three levers, in priority order of impact for *cold* object-storage reads: |
| 15 | + |
| 16 | +1. **Roundtrips** — each dependent fetch is ~100–250 ms cold. Fewer, larger, |
| 17 | + parallel range GETs beat many small serial ones. |
| 18 | +2. **Bytes on wire** — dominated by the float32 `vector` column. Quantization is |
| 19 | + the only big lever; Parquet codecs barely move embeddings. |
| 20 | +3. **Query latency** — keep or improve client-side scan/rerank speed. |
| 21 | + |
| 22 | +This file is a backlog of investigations. Each item states what it is, the |
| 23 | +concrete expected win, the implementation cost, and a way to validate it. |
| 24 | +PLAN_AUTO.md covers the *already-shipped* auto-tuning decisions; this file is |
| 25 | +about the next frontier. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## What we already do (baseline — don't re-investigate) |
| 30 | + |
| 31 | +Several things the literature recommends are already in the code. Stating them |
| 32 | +so we don't waste an experiment re-discovering them: |
| 33 | + |
| 34 | +- **IVF-style binary k-means clustering**, `round(√N/2)` clusters by default, |
| 35 | + centroids + per-cluster counts in Parquet KV metadata (`src/cluster.js`, |
| 36 | + `src/writeVectors.js`). |
| 37 | +- **Rows sorted by cluster**, and **each cluster written as its own row group** |
| 38 | + (`rowGroupSize` = array of per-cluster counts). A probed list is already a |
| 39 | + contiguous row range. |
| 40 | +- **Clusters renumbered by a greedy Hamming walk** (`reorderClustersByHamming`) |
| 41 | + so the nearest clusters to any query tend to land in adjacent id ranges, |
| 42 | + which `mergeRanges` then coalesces into fewer reads. |
| 43 | +- **Two-phase search**: phase-1 Hamming scan over the 1-bit `vector_bin` |
| 44 | + column, phase-2 float32 rerank over `rerankFactor × topK` candidates |
| 45 | + (`src/search/rerank.js`). |
| 46 | +- **`useOffsetIndex: true` in phase 2 and the id fetch**, with run coalescing |
| 47 | + (64-row gap tolerance) so scattered candidates become a few range GETs. |
| 48 | +- **Uncompressed PLAIN** float32 (correct default — see Experiment B). |
| 49 | + |
| 50 | +So the IVF instinct, the contiguous-list layout, and offset-index page seeking |
| 51 | +in the rerank phase are done. The open work is below. |
| 52 | + |
| 53 | +### Already tried and removed — do not rebuild as-is |
| 54 | + |
| 55 | +Two quantization schemes were built, benchmarked, and **deleted** as net |
| 56 | +negatives. The shared lesson governs everything in Tier 2: |
| 57 | + |
| 58 | +- **int8 cascade tier** (commit `e3e37f8`): an int8 column between phase-1 |
| 59 | + binary and phase-2 float32. Saved only ~0.3 MB of phase-2 reads but added |
| 60 | + ~38 MB of file size and ~22 extra fetches per query. Net negative. |
| 61 | +- **IVF-PQ** (commit `92e09bc`, documented in PLAN_AUTO.md): lost on every axis |
| 62 | + except raw phase-1 bytes; at 3072-dim it read fewer phase-1 bytes but at 66% |
| 63 | + recall and 2–6× wall-time. |
| 64 | + |
| 65 | +**The lesson:** any quantizer that *adds a tier while keeping the full float32 |
| 66 | +`vector` column* optimizes the cheap part. Phase-2 float fetches dominate |
| 67 | +bytes-read regardless, so shrinking phase-1 codes saves nothing meaningful, and |
| 68 | +a new column only adds size and fetches. **The only quantization that can win is |
| 69 | +a float-free lossy mode** — codes only, approximate final scores, no float32 |
| 70 | +column at all — for a multiplicatively smaller file. That reframes Tier 2 below: |
| 71 | +the bar is "replace float32," never "add a tier beside it." |
| 72 | + |
| 73 | +--- |
| 74 | + |
| 75 | +## Tier 1 — highest leverage |
| 76 | + |
| 77 | +### Experiment A: RaBitQ in place of raw sign bits (bytes-neutral recall win) |
| 78 | + |
| 79 | +**What.** Our `vector_bin` column is the raw sign bit per dimension. RaBitQ |
| 80 | +(Gao & Long, SIGMOD 2024, arxiv 2405.12497) keeps the *same 1 bit/dim, same |
| 81 | +32× size* but first applies a random orthogonal rotation (Johnson–Lindenstrauss) |
| 82 | +and uses an *unbiased* distance estimator with a provable `O(1/√D)` error bound. |
| 83 | +It is the same byte cost as what we ship, with a strictly better phase-1 |
| 84 | +estimator that doesn't collapse on hard distributions the way PQ can. |
| 85 | + |
| 86 | +**Win.** Higher phase-1 recall at fixed candidate budget → we can lower |
| 87 | +`rerankFactor` (fewer phase-2 bytes) at equal end recall, or raise recall at |
| 88 | +fixed `rerankFactor`. Pure upside at the same on-wire size for `vector_bin`. |
| 89 | + |
| 90 | +**Cost.** Medium. Need: a fixed random rotation (seedable, stored in KV |
| 91 | +metadata so the reader reproduces it), encode = rotate then sign, and a phase-1 |
| 92 | +scorer that uses the RaBitQ estimator instead of raw Hamming. The rotation is |
| 93 | +the only new moving part; everything else is our existing pipeline. Reference: |
| 94 | +github.com/VectorDB-NTU/RaBitQ-Library. |
| 95 | + |
| 96 | +**Validate.** Reuse `scripts/validate-params.js` recall harness: compare |
| 97 | +recall@10/@100 of raw-sign vs RaBitQ at identical `rerankFactor` and probe, on |
| 98 | +wiki (384-dim) and a 1024-dim corpus. Win = higher recall, or equal recall at |
| 99 | +lower `rerankFactor`. |
| 100 | + |
| 101 | +### Experiment C: phase-1 offset-index page skipping — RESOLVED (no change) |
| 102 | + |
| 103 | +**Outcome (2026-06-21): already handled by design; not an opportunity.** |
| 104 | + |
| 105 | +The premise was wrong. Phase 1 deliberately reads whole binary column chunks, |
| 106 | +and `rerank.js:51-57` documents why: the binary column is `dim/8` bytes/row, so |
| 107 | +per-page `useOffsetIndex` seeking costs an extra roundtrip to read the offset |
| 108 | +index without saving meaningful bytes. The 32 KB binary page size exists for |
| 109 | +*phase 2* candidate seeking, not phase 1. Moreover there is a `prefetchBinary` |
| 110 | +path (`src/prefetch.js`) that loads the entire small binary column into RAM |
| 111 | +once, making phase 1 *zero-network* — strictly better than page-seeking it. |
| 112 | +Nothing to do here. |
| 113 | + |
| 114 | +### Experiment D: nprobe — RESOLVED (cap the fraction at scale) |
| 115 | + |
| 116 | +**Outcome (2026-06-21): keep the fraction, but add an absolute cap.** |
| 117 | + |
| 118 | +Measured probe sweeps (`scripts/validate-params.js probe`) on wiki (384-dim, |
| 119 | +N=20k–156k) and tpuf (1024-dim, N=250k/1M), clusters at the shipped √N/2: |
| 120 | + |
| 121 | +- **The "switch to absolute probe" idea is refuted.** A fixed absolute count |
| 122 | + lets recall *slide* as N grows, because clusters grow as √N/2 so a constant |
| 123 | + count is a shrinking fraction. probe=16 → 91% @20k, 79% @80k, 81% @156k. |
| 124 | + The 0.25 *fraction* holds recall steady (91→90→93%) across 8× scale — it is |
| 125 | + the correct parameterization here, not absolute count. (This is why the |
| 126 | + literature's "~16–32 probes" rule doesn't transfer: it assumes |
| 127 | + nlist≈C·√N with large C; we use √N/2, far fewer/bigger lists.) |
| 128 | + |
| 129 | +- **But the fraction over-probes at large N.** At 1M (500 clusters), probed |
| 130 | + list count vs cost/recall: |
| 131 | + |
| 132 | + | lists | fetches | MB read | recall@10 | |
| 133 | + |------:|--------:|--------:|----------:| |
| 134 | + | 48 | 118 | 17.8 | 89.0% | |
| 135 | + | 64 | 137 | 21.8 | 91.0% | |
| 136 | + | 80 | 155 | 25.7 | 92.0% | |
| 137 | + | 96 | 172 | 29.9 | 92.5% | |
| 138 | + | **125 (=0.25 frac)** | **202** | **37.0** | **93.0%** | |
| 139 | + |
| 140 | + Recall knees at ~80 lists (92%). The fraction's last 1pp (92→93%) costs +47 |
| 141 | + fetches and +11 MB — ~30% more roundtrips and bytes for marginal recall. |
| 142 | + |
| 143 | +**Recommended change:** `probe = min(ceil(fraction × nlist), cap)` with |
| 144 | +`cap ≈ 80–96`. The cap only binds above ~400k vectors (where 0.25·√N/2 > 80), |
| 145 | +so all current small/medium-N behavior is unchanged; at 1M it trims ~25% of |
| 146 | +roundtrips and ~30% of bytes for ~1pp recall. Backward-compatible, low risk. |
| 147 | +Open question: exact cap value (80 vs 96) and whether it's user-overridable. |
| 148 | + |
| 149 | +--- |
| 150 | + |
| 151 | +## Tier 2 — meaningful, more work |
| 152 | + |
| 153 | +### Experiment E: float-free lossy mode (the only quantization that can win) |
| 154 | + |
| 155 | +**What.** A search mode with **no float32 column at all** — final scores come |
| 156 | +from a multi-bit code. Candidate codec: Extended RaBitQ (SIGMOD 2025, arxiv |
| 157 | +2409.09913), B bits/dim, reported **B=5 → >95% recall at 6.4×, B=7 → >99% at |
| 158 | +4.5×**, beating scalar quantization at equal bits and good enough that there is |
| 159 | +nothing to rerank against. This is the *float-free lossy* feature PLAN_AUTO |
| 160 | +named as "the only way quantization pays off," now with a codec that might |
| 161 | +actually hit the recall bar. |
| 162 | + |
| 163 | +**Win.** Multiplicatively smaller file — the float32 `vector` column is ~3/4 of |
| 164 | +the bytes and the bulk of phase-2 reads. Removing it (not shrinking it, not |
| 165 | +adding a tier beside it) is the single biggest bytes-on-wire reduction |
| 166 | +available. This is a *different feature* from today's exact-rerank index, with |
| 167 | +its own recall/size contract, not a drop-in tier. |
| 168 | + |
| 169 | +**Cost.** High. New multi-bit codec, new scorer, a new file mode, and a clear |
| 170 | +API story that this trades exactness for ~5–6× smaller files. Reuses the RaBitQ |
| 171 | +rotation from Experiment A. Gate strictly behind its own benchmark. |
| 172 | + |
| 173 | +**Validate.** This is the make-or-break number for the whole quantization line: |
| 174 | +does a *float-free* index hold ≥95% recall@10 on real corpora (384- and |
| 175 | +1024-dim, ≥500k)? Compare file size, MB read, and recall against today's |
| 176 | +binary+float32. If float-free can't clear the recall bar, quantization stays |
| 177 | +shelved — adding a tier beside float32 is already proven net-negative (see |
| 178 | +"Already tried and removed"). |
| 179 | + |
| 180 | +> **Rejected: int8 / any tier beside float32.** An int8 cascade tier was built |
| 181 | +> and removed (`e3e37f8`) for exactly the "optimizes the cheap part" reason. |
| 182 | +> Do not re-propose int8, PQ, or RaBitQ *as an added column* — only as a |
| 183 | +> float32 *replacement* per Experiment E. |
| 184 | +
|
| 185 | +### Experiment G: two-level centroid index for large nlist |
| 186 | + |
| 187 | +**What.** Centroids live in KV metadata and are scanned linearly to rank |
| 188 | +clusters (`ranges.js`). Fine for √N/2 clusters at small N; at 1M+ vectors |
| 189 | +(~700+ clusters, growing) that linear scan and the metadata size both grow. |
| 190 | +SPANN's answer: a small index *over the centroids* so finding the K nearest |
| 191 | +clusters is sub-ms, plus optional boundary-vector replication into a few nearby |
| 192 | +lists to lift recall without raising nprobe. |
| 193 | + |
| 194 | +**Win.** Keeps cluster selection cheap as nlist grows, and bounds KV-metadata |
| 195 | +size. Mostly a scale concern (>1M). |
| 196 | + |
| 197 | +**Cost.** Medium–high. New in-file structure for centroids; replication |
| 198 | +inflates data ~20%. Only worth it once nlist is large enough that linear |
| 199 | +centroid scan or metadata bloat actually shows up. |
| 200 | + |
| 201 | +**Validate.** Measure centroid-scan time and KV-metadata bytes vs N; only pursue |
| 202 | +if either becomes material at target corpus sizes. |
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## Tier 3 — measure first, likely small or negative |
| 207 | + |
| 208 | +### Experiment B: float column encoding & compression (probably a no-op) |
| 209 | + |
| 210 | +**What.** We ship PLAIN + UNCOMPRESSED float32. Candidates: BYTE_STREAM_SPLIT |
| 211 | +encoding, and zstd/snappy compression. |
| 212 | + |
| 213 | +**Expected.** **Small or zero.** Unit-norm float32 embeddings are near- |
| 214 | +incompressible: the mantissa is ~7.3 bits/byte, so lossless ratios sit around |
| 215 | +1.08–1.20×, and snappy/zstd cost decode latency for ~5–10%. BYTE_STREAM_SPLIT |
| 216 | +averages ~30% on *scientific* floats but is unproven on embeddings and has gone |
| 217 | +*negative* on some data. This is why UNCOMPRESSED is the current default and the |
| 218 | +right one. |
| 219 | + |
| 220 | +**Cost.** Low to test (writer flags), but **gate any change behind a write-time |
| 221 | +sample A/B** — never always-on. Most likely outcome: confirm the default and |
| 222 | +move on. |
| 223 | + |
| 224 | +**Validate.** On real corpora, write the float column under PLAIN, PLAIN+zstd, |
| 225 | +and BYTE_STREAM_SPLIT; compare file size and phase-2 decode time. Adopt only if |
| 226 | +a corpus shows a clear net win. |
| 227 | + |
| 228 | +### Experiment H: cold-open roundtrip floor |
| 229 | + |
| 230 | +**What.** Confirm the very first fetch sequence is minimal: over-read the file |
| 231 | +tail (~64 KB) in one GET to grab the footer + KV metadata in a single roundtrip, |
| 232 | +then the page index, then coalesced data ranges. Target ~3 roundtrips for a |
| 233 | +selective cold query. |
| 234 | + |
| 235 | +**Win.** Shaves fixed startup latency off every cold query. Small but every |
| 236 | +query pays it. |
| 237 | + |
| 238 | +**Cost.** Low. Mostly verifying what hyparquet already does on a real S3/HTTP |
| 239 | +source and adding a tail over-read hint if it issues a separate tiny footer GET. |
| 240 | + |
| 241 | +**Validate.** Count `fetches` on a cold `asyncBuffer` for a single query; aim to |
| 242 | +drive the fixed overhead to ~2 (footer+index) before data reads begin. |
| 243 | + |
| 244 | +### Explicitly out of scope / rejected |
| 245 | + |
| 246 | +- **Bloom filters** — answer equality/membership only, never nearest-neighbor. |
| 247 | + Irrelevant to the vector path. |
| 248 | +- **Dictionary encoding** — embeddings are continuous/high-cardinality; Parquet |
| 249 | + falls back to PLAIN anyway, and PLAIN is what we want for SIMD scan. |
| 250 | +- **Graph indexes (HNSW / DiskANN / Vamana)** — dozens of *serial dependent* |
| 251 | + hops per query, each a cold fetch. Catastrophic on object storage. IVF is the |
| 252 | + correct family for S3 and we already use it. Do not pursue graph indexes for |
| 253 | + the cold tier. |
| 254 | + |
| 255 | +--- |
| 256 | + |
| 257 | +## Suggested sequencing |
| 258 | + |
| 259 | +1. ~~**D (nprobe)** and **C (phase-1 offset index)**~~ — DONE (2026-06-21). |
| 260 | + C was already handled by design (no change). D → keep the fraction but add an |
| 261 | + absolute cap (~80–96) that bounds over-probing above ~400k vectors. The cap |
| 262 | + is the one remaining code change from this tier; everything else here was a |
| 263 | + confirm-the-default. |
| 264 | +2. **A (RaBitQ pre-rank)** — bytes-neutral recall win; also builds the rotation |
| 265 | + machinery E depends on. |
| 266 | +3. **B (encoding A/B)** — quick confirm-or-reject, probably confirms the default. |
| 267 | +4. **E (float-free lossy mode)** — the only quantization that can win, since a |
| 268 | + tier beside float32 is already proven net-negative. One make-or-break recall |
| 269 | + benchmark decides whether the whole quantization line is alive. |
| 270 | +5. **G (two-level centroids)** and **H (cold-open floor)** — scale and |
| 271 | + fixed-overhead polish, pursue when measurements say they matter. |
| 272 | + |
| 273 | +All experiments validate through `scripts/validate-params.js` (extend its |
| 274 | +subcommands) using `recall@10` / `recall@100`, average `fetches`, and average |
| 275 | +`MB read` on real corpora (wiki 384-dim, a 1024-dim set, and a ≥500k corpus for |
| 276 | +scale). A change ships only when it improves bytes or roundtrips at equal-or- |
| 277 | +better recall. |
0 commit comments