Fix clustered Hamming scan on non-contiguous chunks; validate auto-tuning params

platypii · platypii · commit 70aad8c9d210 · 2026-06-20T23:38:15.000-07:00
While validating the open experiments in PLAN_AUTO.md, found that the
clustered binary scan (hammingScoreChunk) assumed every decoded row was a
tightly-packed slice of one backing buffer. A clustered row range can be
assembled from several parquet pages, so on some queries the flat Uint32Array
view ran past the buffer (RangeError) or silently scored the wrong bytes.
Gate the fast path on an O(1) contiguity check, else fall back to the
always-correct per-row scratch copy. Regression test in test/chunks.test.js.
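
For illustration, the contiguity gate can be exercised on its own. This is a minimal sketch, not the shipped code: `isContiguous` is a hypothetical helper name, and it assumes `rows` is a non-empty array of `Uint8Array` views and that `bytesPerRow` is a multiple of 4.

```javascript
/**
 * Hypothetical helper mirroring the O(1) contiguity check described above.
 * @param {Uint8Array[]} rows decoded binary rows (assumed non-empty)
 * @param {number} bytesPerRow bytes per row (assumed to be a multiple of 4)
 * @returns {boolean} true if one flat Uint32Array view over all rows is safe
 */
function isContiguous(rows, bytesPerRow) {
  const first = rows[0]
  const last = rows[rows.length - 1]
  return first.byteOffset % 4 === 0 // Uint32Array needs 4-byte alignment
    && last.buffer === first.buffer // same backing buffer
    && last.byteOffset === first.byteOffset + (rows.length - 1) * bytesPerRow // tightly packed
    && first.byteOffset + rows.length * bytesPerRow <= first.buffer.byteLength // in bounds
}

const bytesPerRow = 8

// Three rows sliced from one page buffer: eligible for the fast path.
const page = new ArrayBuffer(3 * bytesPerRow)
const packed = [0, 1, 2].map(i => new Uint8Array(page, i * bytesPerRow, bytesPerRow))

// Two rows that each live in their own buffer, as when a row range spans pages.
const split = [
  new Uint8Array(new ArrayBuffer(bytesPerRow)),
  new Uint8Array(new ArrayBuffer(bytesPerRow)),
]

console.log(isContiguous(packed, bytesPerRow)) // true
console.log(isContiguous(split, bytesPerRow)) // false
```

Only the first and last rows are inspected, which keeps the check O(1); chunks that fail it take the per-row scratch copy.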

Validation harness scripts/validate-params.js with three modes:
- recall:  recall@100 shows the 94% recall@10 ceiling is a near-duplicate
           artifact; clustered recall@100 reaches 95-97%.
- smalln:  binary-rerank only beats exact scan above ~5k rows (clear win at
           10k); column overhead is ~3.2% at 384-dim. 10k threshold confirmed.
- scale:   sqrt(N)/2 is fastest-or-tied at 100k/500k/1M across 384 and
           1024-dim corpora. Confirmed across sizes and distributions.
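
The defaults these modes validate can be written down as a small sketch. `deriveDefaults` and `binaryOverhead` are hypothetical names used only for illustration; the threshold value and the `round(√N/2)` rule come from PLAN_AUTO.md, and the overhead calculation assumes one bit per dimension against 4-byte floats.

```javascript
// Threshold taken from the plan; the function names below are hypothetical.
const defaultAutoBinaryThreshold = 10000

/**
 * Derive the auto-tuned write defaults for a corpus of n rows.
 * @param {number} n row count
 * @returns {{ binary: boolean, clusters: number }}
 */
function deriveDefaults(n) {
  const binary = n >= defaultAutoBinaryThreshold
  // Clusters are only derived when binary is auto-on.
  const clusters = binary ? Math.round(Math.sqrt(n) / 2) : 0
  return { binary, clusters }
}

/**
 * Raw per-row overhead of the binary column relative to the float column.
 * @param {number} dimension vector dimension
 * @returns {number} fraction, e.g. 0.03125 for 384-dim
 */
function binaryOverhead(dimension) {
  const binaryBytes = dimension / 8 // 1 bit per dimension
  const floatBytes = dimension * 4 // float32 per dimension
  return binaryBytes / floatBytes
}

console.log(deriveDefaults(5000)) // { binary: false, clusters: 0 }
console.log(deriveDefaults(100000)) // { binary: true, clusters: 158 }
console.log(binaryOverhead(384)) // 0.03125
```

The raw ratio is 1/32 regardless of dimension; the ~3.2% reported by smalln is a measured file-size difference, so it can differ slightly from this figure.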

Re-ran the headline sweeps post-fix: wiki reproduces the plan exactly
(probe=0.25, rerankFactor=10 confirmed); llmlog orderings are unchanged.
diff --git a/PLAN_AUTO.md b/PLAN_AUTO.md
@@ -20,7 +20,7 @@ Each parameter below has a current state, a target strategy, and the experiments
 | `dimension` | required | **Required** | Caller's model dictates this. No automation possible. |
 | `metric` | `'cosine'` default, in KV | **KV-metadata** (done) | Defaults to `'cosine'`, stored in KV, read transparently at search. |
 | `normalize` | **default `true` (shipped)** | **KV-metadata, default `true`** | Cosine + normalized = dot, which dominates everywhere. Every benchmark ran normalized with no downside. Code default flipped to `true`; callers can omit it. Harmless if already unit-length. Kept as a flag (not forced) because `dot`/`euclidean` are magnitude-sensitive and would silently break if always normalized. |
-| `binary` | **Auto (shipped)** | **Derive(N): on at N ≥ 10k** | Shipped: auto-on at `defaultAutoBinaryThreshold = 10000` (~1.5% extra bytes for ~50× fewer bytes-read in phase 2). Below threshold, exact scan is fine. Small-N crossover still unmeasured (see open experiments). |
+| `binary` | **Auto (shipped)** | **Derive(N): on at N ≥ 10k** | Shipped: auto-on at `defaultAutoBinaryThreshold = 10000`. Small-N crossover now measured (`validate-params.js smalln`): binary-rerank is *slower* than exact scan below ~5k (0.76–0.80×) and only a clear win at 10k (1.07×) and 20k (1.30×). Column overhead is **~3.2% at 384-dim** (48 binary bytes / 1536 float bytes), not the ~1.5% guessed here. The 10k threshold is well-placed: it's where binary turns net-positive *and* where clustering (its real payoff) kicks in. |
 | `clusters` | **Auto (shipped)** | **Derive(N): `round(√N/2)`** | Shipped: `round(√N/2)` when binary auto-on (`writeVectors.js`). The sweep below locked in `√N/2` over `√N` (better latency, same recall on both corpora). Caller can still pass an explicit count or `0`. |
 | `clusterIterations` | `6` | **Fixed (6)** | The existing ablations show diminishing returns past 6. Hide the knob. |
 | `clusterSeed` | `1` | **Fixed (1)** | Determinism is the only reason this exists. No reason to expose. |
@@ -144,11 +144,38 @@ The disagreement on `probe` is the most interesting finding: LLM-log retrievals
 - `probe` search default: stays at `0.25`. The LLM-log data tempted us to drop it; the wiki data showed why we shouldn't.
 - `rerankFactor` search default: stays at `10`. The `N/3000` rule moves into the documentation as "raise this if you observe sub-target recall", not as a default.
 
-### Still-open experiments
+### Validation pass (2026-06-20)
 
-- Repeat the clusters sweep at 500k and 1M LLM-log row counts to confirm `√N/2` across sizes.
-- Recall@100 to discriminate the LLM-log 94% ceiling.
-- Find the small-N crossover where the binary column stops being worth ~1.5% bytes.
+The three open experiments were run via `scripts/validate-params.js` (subcommands `recall`, `smalln`, `scale`). All three defaults held. One latent bug surfaced and was fixed.
+
+**Bug found while validating (fixed):** the clustered binary scan (`src/search/chunks.js`, `hammingScoreChunk`) built a single flat `Uint32Array` view over a decoded row chunk, assuming all rows are tightly packed in one backing buffer. A clustered row-range read can be assembled from several parquet pages, so on some queries that view ran past the buffer end (`RangeError`) or, when the first buffer was large enough, silently scored the wrong bytes. Now the fast path is gated on an O(1) contiguity check (last row's buffer + offset + bounds), falling back to a per-row scratch copy otherwise. Regression test in `test/chunks.test.js`. The wiki sweep below reproduces the plan's pre-bug numbers exactly; the LLM-log sweep shifted (see below), consistent with the fix only touching the non-contiguous chunks LLM-log clustering produced.
+
+**1. Recall@100 on LLM-log (`recall` mode, 100k × 384).** The recall@10 ceiling (~89–90%) is flat across all cluster counts — top-10 is saturated by near-duplicate messages, so it can't discriminate. recall@100 *does* discriminate and stays healthy:
+
+| clusters | ms | recall@10 | recall@100 |
+|---:|---:|---:|---:|
+| 0 | 214 | 89.5% | 94.3% |
+| 158 (√N/2) | 41 | 89.0% | 95.7% |
+| 316 (√N) | 40 | 89.0% | 96.8% |
+| 632 (2√N) | 46 | 89.0% | 97.0% |
+
+More clusters buy ~1pp recall@100 per step, but √N/2 is fastest and within ~1pp of √N. The 94% "ceiling" was a recall@10 artifact, not an index limit — recall@100 reaches 95–97%. **√N/2 holds.**
+
+**2. Small-N binary crossover (`smalln` mode, LLM-log subsets, 384-dim).** Binary column overhead is a flat ~3.2%. Binary-rerank vs exact scan: 500–2k → 0.76–0.80× (slower) at 99.5% recall; 5k → 1.01× / 97%; 10k → 1.07× / 96.5%; 20k → 1.30× / 96.5%. Crossover is ~5k; binary becomes a clear win at 10k. **The 10k auto-on threshold holds** (and is where clustering, the bigger lever, also turns on). The crossover is dimension-driven (Hamming vs float scan cost), so it generalizes across 384-dim corpora.
+
+**3. Clusters √N/2 at scale (`scale` mode).** LLM-log only has 51,389 raw messages, so a 500k/1M LLM-log corpus isn't available. Substituted the 1M × 1024-dim `tpuf-bench` corpus (different distribution and dimension — a stronger generalization test of the formula's constant). Probe/rerank at defaults, recall@10 vs exact full scan:
+
+| N (dim) | √N/2 | √N |
+|---|---|---|
+| 100k (384) | 8.4 ms / 89% | 9.4 ms / 89% |
+| 500k (1024) | 49.8 ms / 89.5% | 49.3 ms / 90.5% |
+| 1M (1024) | 72.8 ms / 92.5% | 76.4 ms / 93.5% |
+
+`√N/2` is fastest-or-tied on latency at every size and issues fewer fetches; `√N` consistently buys ~1pp recall for more bandwidth. At 1024-dim the latency gap narrows (phase-2 float fetches dominate, so cluster count matters less for wall-time), but `√N/2` is never more than 0.5 ms slower (49.8 vs 49.3 ms at 500k). **`√N/2` holds across sizes and across a different dimension/distribution.** (The `2√N` point at 1M was skipped: its k-means write exceeded 20 min, and `√N/2` vs `√N` already answers the question.)
+
+**Re-validated headline sweeps (post-fix):**
+- LLM-log clusters/rerank/probe: √N/2 fastest at equal recall; rf=10 saturated (+0.5pp at rf=30 for +7ms); probe 0.10 ties 0.25 on LLM-log alone.
+- Wiki clusters/rerank/probe: reproduces the plan exactly — probe 0.10 → 84% (−9pp regression), rf=10→30 gains 2pp. **probe=0.25 and rerankFactor=10 confirmed.**
 
 
 ## Product quantization: evaluated, removed
diff --git a/scripts/validate-params.js b/scripts/validate-params.js
@@ -0,0 +1,243 @@
+/**
+ * Validation harness for the still-open experiments in PLAN_AUTO.md.
+ *
+ * Subcommands:
+ *
+ *   recall <src> [N]   — clusters sweep reporting BOTH recall@10 and recall@100,
+ *                        to discriminate the ~94% recall@10 ceiling on LLM logs.
+ *   smalln <src>       — binary-column crossover: for small N, compare file size
+ *                        and search latency/recall of binary-rerank vs exact scan.
+ *   scale <src> <Ns>   — clusters sweep at one or more N subsets (comma-separated),
+ *                        to confirm the sqrt(N)/2 latency optimum holds across sizes.
+ *
+ * All files are written under data/_vp_*.parquet and reused if present.
+ *
+ * Usage:
+ *   node scripts/validate-params.js recall data/llmlog.vectors.parquet
+ *   node scripts/validate-params.js smalln data/llmlog.vectors.parquet
+ *   node scripts/validate-params.js scale  data/tpuf-bench-1000k.parquet 250000,500000,1000000
+ */
+import { promises as fs } from 'node:fs'
+import { asyncBufferFromFile, cachedAsyncBuffer, parquetMetadataAsync } from 'hyparquet'
+import { fileWriter } from 'hyparquet-writer'
+import { readVectors } from '../src/readVectors.js'
+import { searchVectors } from '../src/searchVectors.js'
+import { parseKvMetadata } from '../src/utils.js'
+import { writeVectors } from '../src/writeVectors.js'
+
+/** @import { AsyncBuffer } from 'hyparquet' */
+
+const MODE = process.argv[2]
+const SRC = process.argv[3]
+const ARG = process.argv[4]
+const QUERY_COUNT = 20
+
+if (!MODE || !SRC) {
+  console.error('Usage: node scripts/validate-params.js <recall|smalln|scale> <src> [arg]')
+  process.exit(1)
+}
+
+/**
+ * Read up to `limit` records from a vectors parquet into memory.
+ * @param {string} src
+ * @param {number} [limit]
+ * @returns {Promise<{ records: { id: string, vector: Float32Array }[], meta: any }>}
+ */
+async function loadRecords(src, limit) {
+  const file = await asyncBufferFromFile(src)
+  const metadata = await parquetMetadataAsync(file)
+  const meta = parseKvMetadata(metadata)
+  const records = []
+  for await (const record of readVectors({ file, metadata, includeMetadata: false })) {
+    records.push(record)
+    if (limit && records.length >= limit) break
+  }
+  return { records, meta }
+}
+
+/**
+ * Pick evenly spaced query vectors from the corpus.
+ * @param {{ vector: Float32Array }[]} records
+ * @param {number} count
+ * @returns {Float32Array[]}
+ */
+function pickQueries(records, count) {
+  const queries = []
+  const step = Math.max(1, Math.floor(records.length / (count + 1)))
+  for (let i = step; i < records.length && queries.length < count; i += step) {
+    queries.push(records[i].vector)
+  }
+  return queries
+}
+
+/**
+ * @param {AsyncBuffer} buf
+ * @returns {AsyncBuffer & { bytes: number, fetches: number }}
+ */
+function instrument(buf) {
+  const slice = buf.slice.bind(buf)
+  const w = {
+    byteLength: buf.byteLength, bytes: 0, fetches: 0,
+    slice(s, e) { w.bytes += (e ?? buf.byteLength) - s; w.fetches += 1; return slice(s, e) },
+  }
+  return w
+}
+
+function avg(a) { let s = 0; for (const x of a) s += x; return s / a.length }
+
+/**
+ * Run a search over every query and collect timing + the returned id lists.
+ * @param {string} path
+ * @param {Float32Array[]} queries
+ * @param {number} topK
+ * @param {object} extra
+ * @returns {Promise<{ ms: number, mb: number, fetches: number, tops: string[][] }>}
+ */
+async function bench(path, queries, topK, extra) {
+  const times = [], bytesA = [], fetchesA = [], tops = []
+  for (const q of queries) {
+    const raw = instrument(await asyncBufferFromFile(path))
+    const cached = cachedAsyncBuffer(raw)
+    const start = performance.now()
+    const r = await searchVectors({ source: cached, query: q, topK, ...extra })
+    times.push(performance.now() - start)
+    bytesA.push(raw.bytes); fetchesA.push(raw.fetches); tops.push(r.map(x => String(x.id)))
+  }
+  return { ms: avg(times), mb: avg(bytesA) / 1e6, fetches: avg(fetchesA), tops }
+}
+
+/**
+ * Recall of `tops` against reference `refTops`, truncating both to `k`.
+ * @param {string[][]} refTops
+ * @param {string[][]} tops
+ * @param {number} k
+ * @returns {number}
+ */
+function recallAt(refTops, tops, k) {
+  let hits = 0, total = 0
+  for (let i = 0; i < refTops.length; i += 1) {
+    const refSet = new Set(refTops[i].slice(0, k))
+    for (const id of tops[i].slice(0, k)) if (refSet.has(id)) hits += 1
+    total += refSet.size
+  }
+  return hits / total
+}
+
+/**
+ * Write a file variant for the given cluster count and binary flag, reusing any existing file at that path.
+ * @param {string} tag
+ * @param {{ id: string, vector: Float32Array }[]} records
+ * @param {any} meta
+ * @param {number} clusters
+ * @param {boolean} binary
+ * @returns {Promise<string>}
+ */
+async function writeVariant(tag, records, meta, clusters, binary) {
+  const path = `data/_vp_${tag}.parquet`
+  if (await fs.stat(path).catch(() => undefined)) return path
+  const start = performance.now()
+  await writeVectors({
+    writer: fileWriter(path),
+    dimension: meta.dimension,
+    metric: meta.metric,
+    normalize: meta.normalized,
+    vectors: records,
+    binary,
+    clusters,
+  })
+  console.log(`  wrote ${path} (clusters=${clusters}, binary=${binary}) in ${((performance.now() - start) / 1000).toFixed(1)}s`)
+  return path
+}
+
+// --- recall@10 + recall@100 sweep ---------------------------------------
+async function runRecall() {
+  const limit = ARG ? Number(ARG) : undefined
+  const { records, meta } = await loadRecords(SRC, limit)
+  const N = records.length
+  const sqrtN = Math.round(Math.sqrt(N))
+  const queries = pickQueries(records, QUERY_COUNT)
+  console.log(`recall: ${SRC} N=${N.toLocaleString()} dim=${meta.dimension} sqrtN=${sqrtN}`)
+
+  const clusterValues = [0, Math.round(sqrtN / 2), sqrtN, 2 * sqrtN]
+  const base = `${SRC.replace(/\.parquet$/, '').split('/').pop()}_N${N}`
+  const paths = {}
+  for (const c of clusterValues) paths[c] = await writeVariant(`${base}_c${c}`, records, meta, c, true)
+
+  // Reference: exact full scan, top-100.
+  console.log('Reference: exact top-100 full scan...')
+  const ref = await bench(paths[0], queries, 100, { rerankFactor: 0 })
+
+  console.log('\n=== clusters sweep, recall@10 vs recall@100 (probe/rerank default) ===')
+  console.log(`${'clusters'.padStart(10)} ${'ms'.padStart(7)} ${'fetches'.padStart(8)} ${'MB read'.padStart(9)} ${'r@10'.padStart(7)} ${'r@100'.padStart(7)}`)
+  console.log('-'.repeat(58))
+  for (const c of clusterValues) {
+    const opts = c === 0 ? { rerankFactor: 10 } : {}
+    const r = await bench(paths[c], queries, 100, opts)
+    const r10 = recallAt(ref.tops, r.tops, 10)
+    const r100 = recallAt(ref.tops, r.tops, 100)
+    console.log(`${String(c).padStart(10)} ${r.ms.toFixed(1).padStart(7)} ${r.fetches.toFixed(0).padStart(8)} ${r.mb.toFixed(2).padStart(9)} ${(r10 * 100).toFixed(1).padStart(6)}% ${(r100 * 100).toFixed(1).padStart(6)}%`)
+  }
+}
+
+// --- small-N binary crossover -------------------------------------------
+async function runSmallN() {
+  const sizes = (ARG ?? '500,1000,2000,5000,10000,20000').split(',').map(Number)
+  const maxN = Math.max(...sizes)
+  const { records: all, meta } = await loadRecords(SRC, maxN)
+  console.log(`smalln: ${SRC} dim=${meta.dimension}, sizes=${sizes.join(',')}`)
+  console.log(`\n${'N'.padStart(7)} ${'noBin MB'.padStart(9)} ${'bin MB'.padStart(8)} ${'+%'.padStart(6)} ${'exact ms'.padStart(9)} ${'rerank ms'.padStart(10)} ${'speedup'.padStart(8)} ${'recall'.padStart(7)}`)
+  console.log('-'.repeat(74))
+  for (const N of sizes) {
+    const records = all.slice(0, N)
+    const queries = pickQueries(records, Math.min(QUERY_COUNT, N))
+    const tag = `${SRC.replace(/\.parquet$/, '').split('/').pop()}_sn${N}`
+    // No-binary file (exact scan only) and binary file (no clusters, rerank path).
+    const exactPath = await writeVariant(`${tag}_nobin`, records, meta, 0, false)
+    const binPath = await writeVariant(`${tag}_bin`, records, meta, 0, true)
+    const exactSize = (await fs.stat(exactPath)).size
+    const binSize = (await fs.stat(binPath)).size
+    // Reference = exact top-10 on the no-binary file.
+    const ref = await bench(exactPath, queries, 10, { rerankFactor: 0 })
+    const rerank = await bench(binPath, queries, 10, {})
+    const recall = recallAt(ref.tops, rerank.tops, 10)
+    const pct = (binSize - exactSize) / exactSize * 100
+    const speedup = ref.ms / rerank.ms
+    console.log(`${String(N).padStart(7)} ${(exactSize / 1e6).toFixed(2).padStart(9)} ${(binSize / 1e6).toFixed(2).padStart(8)} ${pct.toFixed(1).padStart(5)}% ${ref.ms.toFixed(2).padStart(9)} ${rerank.ms.toFixed(2).padStart(10)} ${speedup.toFixed(2).padStart(7)}x ${(recall * 100).toFixed(1).padStart(6)}%`)
+  }
+}
+
+// --- clusters sweep at scale --------------------------------------------
+async function runScale() {
+  const Ns = (ARG ?? '').split(',').filter(Boolean).map(Number)
+  if (!Ns.length) { console.error('scale needs comma-separated N list'); process.exit(1) }
+  const maxN = Math.max(...Ns)
+  console.log(`scale: loading up to ${maxN.toLocaleString()} from ${SRC}...`)
+  const { records: all, meta } = await loadRecords(SRC, maxN)
+  console.log(`  loaded ${all.length.toLocaleString()} × ${meta.dimension}-dim`)
+  for (const N of Ns) {
+    const records = all.slice(0, N)
+    const sqrtN = Math.round(Math.sqrt(N))
+    const queries = pickQueries(records, QUERY_COUNT)
+    const clusterValues = [Math.round(sqrtN / 2), sqrtN, 2 * sqrtN]
+    const base = `${SRC.replace(/\.parquet$/, '').split('/').pop()}_sc${N}`
+    console.log(`\n=== N=${N.toLocaleString()} (sqrtN=${sqrtN}) ===`)
+    const paths = {}
+    // c=0 reference file (binary, no clusters) for exact top-10.
+    const refPath = await writeVariant(`${base}_c0`, records, meta, 0, true)
+    for (const c of clusterValues) paths[c] = await writeVariant(`${base}_c${c}`, records, meta, c, true)
+    const ref = await bench(refPath, queries, 10, { rerankFactor: 0 })
+    console.log(`${'clusters'.padStart(10)} ${'ms'.padStart(7)} ${'fetches'.padStart(8)} ${'MB read'.padStart(9)} ${'recall'.padStart(8)}`)
+    console.log('-'.repeat(50))
+    for (const c of clusterValues) {
+      const r = await bench(paths[c], queries, 10, {})
+      const rec = recallAt(ref.tops, r.tops, 10)
+      const label = c === Math.round(sqrtN / 2) ? `${c} (√N/2)` : c === sqrtN ? `${c} (√N)` : `${c} (2√N)`
+      console.log(`${label.padStart(10)} ${r.ms.toFixed(1).padStart(7)} ${r.fetches.toFixed(0).padStart(8)} ${r.mb.toFixed(2).padStart(9)} ${(rec * 100).toFixed(1).padStart(7)}%`)
+    }
+  }
+}
+
+if (MODE === 'recall') await runRecall()
+else if (MODE === 'smalln') await runSmallN()
+else if (MODE === 'scale') await runScale()
+else { console.error(`unknown mode: ${MODE}`); process.exit(1) }
diff --git a/src/search/chunks.js b/src/search/chunks.js
@@ -51,9 +51,20 @@ export function hammingScoreChunk(columnData, rowStart, bytesPerRow, queryU32, h
   if (rows.length === 0) return
   const wordsPerRow = bytesPerRow >> 2
   const first = rows[0]
-  const aligned = first.byteOffset % 4 === 0
-  const flat = aligned ? new Uint32Array(first.buffer, first.byteOffset, rows.length * wordsPerRow) : null
-  const scratchU32 = aligned ? null : new Uint32Array(wordsPerRow)
+  const last = rows[rows.length - 1]
+  // The flat-view fast path is only valid when every row is a 4-byte-aligned,
+  // tightly-packed slice of a single backing buffer. A row range read from a
+  // clustered file can be assembled from several page buffers, in which case
+  // `rows` is not one contiguous block — building a span over it would read
+  // out of bounds (RangeError) or, if the first buffer is large enough, score
+  // the wrong bytes. Verify contiguity in O(1) via the last row, else fall
+  // back to the per-row scratch copy (always correct).
+  const contiguous = first.byteOffset % 4 === 0
+    && last.buffer === first.buffer
+    && last.byteOffset === first.byteOffset + (rows.length - 1) * bytesPerRow
+    && first.byteOffset + rows.length * bytesPerRow <= first.buffer.byteLength
+  const flat = contiguous ? new Uint32Array(first.buffer, first.byteOffset, rows.length * wordsPerRow) : null
+  const scratchU32 = flat ? null : new Uint32Array(wordsPerRow)
   const scratchBytes = scratchU32 ? new Uint8Array(scratchU32.buffer) : null
 
   for (let i = 0; i < rows.length; i += 1) {
diff --git a/test/chunks.test.js b/test/chunks.test.js