@@ -661,7 +661,7 @@ dot_contribution_k = ‖query_k‖ × ‖data_k‖ × Σ_j dist_table[q_code[j]]
 On GPU, this becomes:

 1. **Stage distance table in shared memory**: `dist_table[i][j] = centroids[i] ×
-   centroids[j]`, 16×16 = 1 KB at b=4. Fits trivially in shared memory.
+   centroids[j]`, 16×16 = 1 KB at b=4. Fits trivially in shared memory.

 2. **Stream code bytes from HBM**: For each 64-vector × 64-dim tile (matching
    the PDX layout), gather from the distance table and accumulate in registers.
@@ -706,6 +706,7 @@ The GPU decode pipeline reads compressed data from Vortex files:
 The BlockTurboQuant encoding's child arrays (codes, norms, rotation signs) are
 individually compressed by the cascading compressor. For GPU decode, we need
 either:
+
 - Host-side decompression of the cascade, then GPU transfer of the raw children
 - Direct GPU decompression of FastLanes/ALP (if GPU decompression kernels exist)

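The distance-table steps in the first hunk can be illustrated with a small CPU-side reference. This is only a sketch under stated assumptions, not the kernel the document describes: the evenly spaced codebook, the function names, and the choice to index the table by both the query code and the data code are all hypothetical, and the 1 KB figure assumes 4-byte (f32) entries.

```python
def build_dist_table(centroids):
    """Precompute dist_table[i][j] = centroids[i] * centroids[j]."""
    return [[ci * cj for cj in centroids] for ci in centroids]


def dot_contribution(dist_table, q_codes, d_codes, q_norm, d_norm):
    """Approximate one block's dot product using table lookups only.

    Assumes the lookup is indexed by (query code, data code); the
    accumulator stands in for the per-thread registers on a GPU.
    """
    acc = 0.0
    for qc, dc in zip(q_codes, d_codes):
        acc += dist_table[qc][dc]
    return q_norm * d_norm * acc


# Hypothetical 4-bit codebook: 16 evenly spaced centroids in [-1, 1].
centroids = [-1.0 + 2.0 * i / 15 for i in range(16)]
dist_table = build_dist_table(centroids)

# 16 x 16 entries; at 4 bytes each (assuming f32) this is 1 KB.
table_bytes = len(dist_table) * len(dist_table[0]) * 4
```

Because the table depends only on the codebook, it can be built once and reused for every tile, which is what makes staging it in shared memory attractive.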