Commit d8bcfeb

libsql-sqlite3: Break vector search distance ties by rowid deterministically
The DiskANN top-k buffer ordered candidates by distance using a strict less-than comparison (distanceBufferInsertIdx), so when two vectors are exactly equidistant from the query, their relative order in the result was determined by the search/visit order. That visit order depends on float rounding in the distance computation, which varies with compiler/build flags -- e.g. the same sources built as testfixture vs the sqlite3 shell resolved such ties differently, and the Makefile.in changes in the 3.47.0 merge were enough to flip the testfixture build's result for vector-index-v2-query-4 (query vector [-1,1,1,1] is exactly equidistant from rows 'b' and 'c').

Add topCandidateInsertIdx(), which breaks exact-distance ties in the top candidates buffer by rowid (ascending). Equidistant results now come back in a stable, reproducible order independent of build flags or how the search happened to reach them. Non-tied results are unaffected (the distance comparison still decides first).

Update vector-index-v2-query-4 to assert the deterministic ordering ({d a b}, 'b' being the smaller rowid of the tied pair) instead of the build-dependent value it previously hard-coded.
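The visit-order dependence described above can be reproduced in isolation. The following is a minimal sketch, not the libsql code: the `TopK` struct, the capacity of 3, the `tie_break` switch, and the rowids used in the usage note are all hypothetical, chosen only to mirror the shape of the problem.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOPK 3  /* hypothetical buffer capacity */

/* Hypothetical, simplified stand-in for the top candidates buffer. */
typedef struct {
  float    dist[TOPK];
  uint64_t rowid[TOPK];
  int      n;
} TopK;

/* Insert position: smaller distance first. With tie_break != 0, exact
** distance ties are ordered by ascending rowid; with tie_break == 0 this
** is the plain strict less-than comparison. Returns -1 if the candidate
** does not fit into a full buffer. */
static int insert_idx(const TopK *p, float d, uint64_t rowid, int tie_break){
  int i;
  for(i = 0; i < p->n; i++){
    if( d < p->dist[i] ) return i;
    if( tie_break && d == p->dist[i] && rowid < p->rowid[i] ) return i;
  }
  return p->n < TOPK ? p->n : -1;
}

/* Insert one candidate, shifting later slots right and dropping the last
** slot when the buffer is already full. */
static void topk_insert(TopK *p, float d, uint64_t rowid, int tie_break){
  int i, last;
  int at = insert_idx(p, d, rowid, tie_break);
  if( at < 0 ) return;
  assert( p->n <= TOPK );
  last = p->n < TOPK ? p->n : TOPK - 1;
  for(i = last; i > at; i--){
    p->dist[i] = p->dist[i-1];
    p->rowid[i] = p->rowid[i-1];
  }
  p->dist[at] = d;
  p->rowid[at] = rowid;
  if( p->n < TOPK ) p->n++;
}

/* Feed n candidates in the given visit order and return the rowid that
** ends up in the last occupied slot (0 if nothing was inserted). */
static uint64_t last_rowid_after(
  const float *d, const uint64_t *rowid, int n, int tie_break
){
  TopK buf;
  int i;
  memset(&buf, 0, sizeof(buf));
  for(i = 0; i < n; i++) topk_insert(&buf, d[i], rowid[i], tie_break);
  return buf.n > 0 ? buf.rowid[buf.n - 1] : 0;
}
```

With distances {0.5, 1.0, 1.5, 1.5} and made-up rowids, visiting the tied pair as (2, 3) or as (3, 2) leaves a different rowid in the last slot when `tie_break` is 0, and rowid 2 in both cases when `tie_break` is 1.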
1 parent ebd1563 commit d8bcfeb

2 files changed

Lines changed: 39 additions & 2 deletions

libsql-sqlite3/src/vectordiskann.c

Lines changed: 33 additions & 1 deletion
@@ -923,6 +923,38 @@ int distanceBufferInsertIdx(const float *aDistances, int nSize, int nMaxSize, fl
   return nSize < nMaxSize ? nSize : -1;
 }
 
+/*
+** Like distanceBufferInsertIdx() above, but for the buffer of top candidates,
+** where each slot also has an associated node (and therefore a rowid).
+**
+** When two candidates are exactly the same distance from the query vector the
+** plain distance comparison leaves their relative order up to the search/visit
+** order. That order depends on float-rounding of the distance computation,
+** which in turn varies with compiler/build flags, so equidistant results would
+** come back in a different order from one build to another. Breaking such ties
+** by rowid (ascending) makes the ordering of equidistant results deterministic
+** and reproducible regardless of how the search happened to reach them.
+*/
+static int topCandidateInsertIdx(
+  const float *aDistances,
+  DiskAnnNode *const *aCandidates,
+  int nSize,
+  int nMaxSize,
+  float distance,
+  u64 nRowid
+){
+  int i;
+  for(i = 0; i < nSize; i++){
+    if( distance < aDistances[i] ){
+      return i;
+    }
+    if( distance == aDistances[i] && nRowid < aCandidates[i]->nRowid ){
+      return i;
+    }
+  }
+  return nSize < nMaxSize ? nSize : -1;
+}
+
 void bufferInsert(u8 *aBuffer, int nSize, int nMaxSize, int iInsert, int nItemSize, const u8 *pItem, u8 *pLast) {
   int itemsToMove;
 
@@ -1100,7 +1132,7 @@ static void diskAnnSearchCtxMarkVisited(DiskAnnSearchCtx *pCtx, DiskAnnNode *pNo
   pNode->pNext = pCtx->visitedList;
   pCtx->visitedList = pNode;
 
-  iInsert = distanceBufferInsertIdx(pCtx->aTopDistances, pCtx->nTopCandidates, pCtx->maxTopCandidates, distance);
+  iInsert = topCandidateInsertIdx(pCtx->aTopDistances, pCtx->aTopCandidates, pCtx->nTopCandidates, pCtx->maxTopCandidates, distance, pNode->nRowid);
   if( iInsert < 0 ){
     return;
   }

libsql-sqlite3/test/libsql_vector_index.test

Lines changed: 6 additions & 1 deletion
@@ -562,6 +562,11 @@ do_test vector-index-v2-query-3 {
   execsql { SELECT t.id FROM vector_top_k('t_idx', vector('[1,1,-1,-1]'), 3) i INNER JOIN t ON t.rowid = i.id; } dbv2
 } {c b a}
 
+# Query vector [-1,1,1,1] is exactly equidistant (cosine) from rows 'b'
+# ([-100,-100,-100,-100]) and 'c' ([10,10,-10,-10]) -- both at distance 1.5.
+# The ANN search breaks distance ties by rowid (ascending), so 'b' (the
+# smaller rowid) is the deterministic 3rd nearest neighbour regardless of
+# build/float-rounding.
 do_test vector-index-v2-query-4 {
   execsql { SELECT t.id FROM vector_top_k('t_idx', vector('[-1,1,1,1]'), 3) i INNER JOIN t ON t.rowid = i.id; } dbv2
-} {d a c}
+} {d a b}
