Commit 9a1caeb

nsm: replace alignment-unsafe u8->f32 cast with safe LE decode
build_distance_matrix_from_cam reinterpreted a &[u8] codebook buffer as &[f32] via `as_ptr() as *const f32` + from_raw_parts. A &[u8] carries no alignment guarantee, so on an unaligned buffer (mmap'd file, sub-slice) the f32 reinterpret is UB. Every other byte cast in the repo goes [u64]->[u8], which lowers the alignment requirement and is therefore sound; this one raised the alignment requirement and was the lone genuine soundness risk found in the unsafe audit.

Replace with chunks_exact(4) + f32::from_le_bytes: alignment-free, endian-correct (matches the workspace LE contract), no unsafe. The codebook is read-only downstream, so owning a Vec<f32> is fine.

The CAM-PQ codebook centroids are f32 by definition (6 subspaces x 256 centroids x subspace_dim); the stored word distance remains the u8-quantized L2 in WordDistanceMatrix.

https://claude.ai/code/session_0147hSzjmWZDuy2MSQNrhEK5
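As a standalone illustration of the decode described in the message, here is a minimal sketch. The names `decode_le_f32` and `unaligned_payload` are hypothetical and do not appear in the repository; only the `chunks_exact(4)` + `f32::from_le_bytes` pattern comes from the commit. The second helper builds a buffer whose payload starts one byte into the allocation, which is typically not 4-byte aligned, to mimic the sub-slice case.

```rust
/// Decode a byte buffer as little-endian f32 values using only safe code.
/// Bytes left over after the last full 4-byte chunk are ignored, which is
/// how `chunks_exact` behaves.
/// (Hypothetical helper name; the pattern mirrors the commit.)
pub fn decode_le_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Hypothetical test helper: serialize `values` as little-endian bytes after
/// one padding byte, so that `&buf[1..]` starts at an odd offset within the
/// allocation.
pub fn unaligned_payload(values: &[f32]) -> Vec<u8> {
    let mut buf = vec![0u8]; // padding byte shifts the payload by one
    for v in values {
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}
```

Calling `decode_le_f32(&buf[1..])` on such a buffer is well-defined regardless of where the slice starts, because each value is assembled from four individual bytes rather than read through a `*const f32`.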
1 parent 4e537c7 commit 9a1caeb

1 file changed

Lines changed: 10 additions & 7 deletions

crates/lance-graph/src/nsm/nsm_word.rs

@@ -171,13 +171,16 @@ fn build_distance_matrix_from_cam(
         return WordDistanceMatrix::new(0);
     }
 
-    // Parse codebook: 6 subspaces x 256 centroids x subspace_dim floats
-    // We interpret codebook_bytes as f32 array
-    let codebook_floats: &[f32] = unsafe {
-        let ptr = codebook_bytes.as_ptr() as *const f32;
-        let len = codebook_bytes.len() / 4;
-        std::slice::from_raw_parts(ptr, len)
-    };
+    // Parse codebook: 6 subspaces x 256 centroids x subspace_dim floats.
+    // Decode the bytes as a little-endian f32 array. A `&[u8]` carries no
+    // alignment guarantee, so a `*const f32` reinterpret would be UB on an
+    // unaligned buffer (mmap'd file, sub-slice). `from_le_bytes` over 4-byte
+    // chunks is alignment-free and endian-correct (matches the workspace LE
+    // contract). The codebook is only read below, so owning a Vec is fine.
+    let codebook_floats: Vec<f32> = codebook_bytes
+        .chunks_exact(4)
+        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
+        .collect();
 
     // Determine subspace_dim: total_floats / (6 * 256)
     let total_floats = codebook_floats.len();
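
The `subspace_dim` computation that follows the decode can be sketched on its own. This is an illustration under assumptions, not the code in nsm_word.rs: the constants, the function names, and the explicit validation are hypothetical, and the diff does not show how the real function handles a length that is not a multiple of 6 * 256 or a byte buffer with trailing bytes.

```rust
// Layout stated in the commit: 6 subspaces x 256 centroids x subspace_dim.
// Constant names are assumptions made for this sketch.
const NUM_SUBSPACES: usize = 6;
const NUM_CENTROIDS: usize = 256;

/// Returns `Some(subspace_dim)` when `total_floats` is a nonzero multiple of
/// 6 * 256, and `None` otherwise. (Hypothetical helper.)
pub fn subspace_dim(total_floats: usize) -> Option<usize> {
    let floats_per_dim = NUM_SUBSPACES * NUM_CENTROIDS;
    if total_floats == 0 || total_floats % floats_per_dim != 0 {
        return None;
    }
    Some(total_floats / floats_per_dim)
}

/// Reports whether `chunks_exact(4)` would silently drop bytes at the end of
/// the buffer, i.e. whether the length is not a multiple of 4.
/// (Hypothetical helper.)
pub fn has_trailing_bytes(bytes: &[u8]) -> bool {
    !bytes.chunks_exact(4).remainder().is_empty()
}
```

Both the old `len / 4` cast and the new `chunks_exact(4)` decode ignore a partial trailing chunk, so a check like `has_trailing_bytes` is only needed if a truncated codebook should be reported rather than tolerated.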
