Commit c5147f0

Lower default cluster probe cap from 96 to 48
A WildChat 1024-dim sweep finds 48, 72, and 96 probed lists give statistically indistinguishable recall@10 at 1M and 3.2M (within ~1pp vs exact scan, no consistent direction), while 48 reads ~42% fewer bytes at scale. Only the default path (probe omitted) is affected; an explicit probe is honored literally.
1 parent 87fa3ce commit c5147f0
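
To make the effect of the cap concrete, here is a minimal sketch of how a default probe count could be resolved. Only the two constants mirror `src/constants.js`; the function name `resolveProbeCount`, its signature, and the `Math.ceil` rounding are assumptions made for illustration, not the library's actual code.

```javascript
// Values taken from src/constants.js after this commit.
const defaultClusterProbeFraction = 0.25
const defaultClusterProbeCap = 48

/**
 * Hypothetical helper: decide how many clusters to probe.
 *
 * @param {number} nlist total number of clusters in the index
 * @param {number} [probe] explicit probe count, if the caller set one
 * @param {number} [cap] ceiling applied on the default path only
 * @returns {number} clusters to probe
 */
function resolveProbeCount(nlist, probe, cap = defaultClusterProbeCap) {
  // An explicit probe is honored literally; the cap is not consulted.
  if (probe !== undefined) return probe
  // Default path: a fraction of nlist (rounding rule assumed), at least 1.
  const fromFraction = Math.max(1, Math.ceil(defaultClusterProbeFraction * nlist))
  // The cap binds once the fraction exceeds it.
  return Math.min(fromFraction, cap)
}

console.log(resolveProbeCount(500))      // 48: 0.25 * 500 = 125, so the cap binds
console.log(resolveProbeCount(500, 125)) // 125: explicit probe bypasses the cap
console.log(resolveProbeCount(100))      // 25: below the cap, fraction is used as-is
```

With `nlist = 500` (roughly `sqrt(N)/2` at 1M vectors), the uncapped default would probe 125 clusters, the previous cap 96, and the new cap 48.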

3 files changed

Lines changed: 16 additions & 9 deletions

package.json

Lines changed: 2 additions & 2 deletions
@@ -50,10 +50,10 @@
     "hyparquet-writer": "0.16.1"
   },
   "devDependencies": {
-    "@types/node": "26.0.0",
+    "@types/node": "26.0.1",
     "@vitest/coverage-v8": "4.1.9",
     "eslint": "9.39.4",
-    "eslint-plugin-jsdoc": "63.0.7",
+    "eslint-plugin-jsdoc": "63.0.8",
     "typescript": "6.0.3",
     "vitest": "4.1.9"
   }

src/constants.js

Lines changed: 13 additions & 6 deletions
@@ -31,12 +31,19 @@ export const defaultClusterIterations = 6
 export const defaultClusterProbeFraction = 0.25
 
 // Upper bound on clusters probed under the *default* fraction. Clusters grow
-// as ~sqrt(N)/2, so 0.25 x nlist keeps rising with N; measured recall knees
-// well before that at scale (~92% at 80-96 lists on 1M x 1024, vs 93% at the
-// uncapped 125). Capping the default trims ~25% of roundtrips and ~30% of
-// bytes above ~400k vectors for ~1pp recall. Only applies when `probe` is
-// left default; an explicit `probe` is honored literally.
-export const defaultClusterProbeCap = 96
+// as ~sqrt(N)/2, so 0.25 x nlist keeps rising with N, but the clusters needed
+// to reach the recall ceiling stay roughly flat (~25-45) regardless of N. A
+// WildChat 1024-dim sweep found 48, 72, and 96 lists give statistically
+// indistinguishable recall@10 at 1M and 3.2M (within ~1pp over 20 exact-scan
+// queries, no consistent direction). Their top-10 sets are not bit-identical:
+// over 200 queries, cap 48 matches cap 96 on ~93% (1M) to ~97% (3.2M), the
+// rest reshuffling near-ties at the list boundary, not losing true neighbors.
+// Capping at 48 reads ~42% fewer bytes than 96 at scale with no measurable
+// recall loss; structurally, shrinking the cap can only lose recall, never
+// gain it, since probed clusters are a subset. Residual misses are a
+// rerankFactor limit, not a probe limit. Only applies when `probe` is left
+// default; an explicit `probe` is honored literally.
+export const defaultClusterProbeCap = 48
 
 // When `binary` is not specified at write time, the column is added once
 // the corpus is at least this large. Below the threshold, exact full scan
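
The comment above cites two different measurements: recall@10 against an exact scan, and top-10 set agreement between cap 48 and cap 96. The sketch below shows one plausible way to compute each. The function names and the sample ids are hypothetical, and treating the set comparison as order-insensitive is an assumption; the sweep's actual harness is not part of this commit.

```javascript
/**
 * Fraction of the exact top-k ids that the approximate result also returned.
 * Hypothetical helper, not part of the library.
 */
function recallAtK(approxIds, exactIds, k = 10) {
  const truth = exactIds.slice(0, k)
  if (truth.length === 0) return 0
  const found = new Set(approxIds.slice(0, k))
  let hits = 0
  for (const id of truth) if (found.has(id)) hits++
  return hits / truth.length
}

/**
 * True when two result lists hold the same ids in their top k, in any order.
 * Hypothetical helper, not part of the library.
 */
function sameTopKSet(a, b, k = 10) {
  const setA = new Set(a.slice(0, k))
  const setB = new Set(b.slice(0, k))
  if (setA.size !== setB.size) return false
  for (const id of setA) if (!setB.has(id)) return false
  return true
}

// Made-up row ids, for illustration only.
const exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
const cap96 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
const cap48 = [2, 1, 3, 4, 5, 6, 7, 8, 9, 12]

console.log(recallAtK(cap96, exact))   // 0.9
console.log(recallAtK(cap48, exact))   // 0.9
console.log(sameTopKSet(cap48, cap96)) // false
```

In this toy case both caps score the same recall against the exact scan while their top-10 sets differ in one id, which is the kind of outcome the comment describes: equal recall does not imply identical result sets.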

test/ranges.test.js

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ describe('selectClusterRowRanges probe cap', () => {
   const query = Uint8Array.from([0, 0])
 
   it('caps the default fraction at the absolute ceiling for large nlist', () => {
-    // 0.25 * 500 = 125 clusters, but the default cap (96) should bind.
+    // 0.25 * 500 = 125 clusters, but the default cap should bind.
     const ranges = selectClusterRowRanges(makeMeta(500), query, undefined)
     expect(rowsCovered(ranges)).toBe(defaultClusterProbeCap)
   })
