
Commit b429fc1

Dandandan and claude committed
perf(parquet): vectorise dict-index bounds check in RleDecoder::get_batch_with_dict
Replace `idx_chunk.iter().all(|&i| (i as usize) < dict_len)` with a u32
max-reduction (`fold(0u32, |acc, &i| acc.max(i as u32))`).

`.all` short-circuits and so blocks autovectorisation; on aarch64 the old
form compiled to eight serialised `ldrsw` + `cmp` + `b.ls` pairs per
8-index chunk, followed by eight separate scalar gather loads. The
max-reduction has no early exit, so LLVM now lowers the check to a single
`ldp q1, q0` + `umax.4s` + `umaxv.4s` + one `cmp` + `b.ls`, then reuses
the loaded NEON registers for the gather that follows.

Negative `i32` values cast to `u32` become large, so the bounds check
still rejects them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
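The trade-off the commit message describes can be sketched as a standalone pair of functions (a minimal illustration, not the arrow-rs source; the function names `check_all` and `check_max` are hypothetical):

```rust
/// Short-circuiting form: `.all` may exit on the first failing index,
/// so the compiler must keep the comparisons serialised.
fn check_all(idx_chunk: &[i32], dict_len: usize) -> bool {
    idx_chunk.iter().all(|&i| (i as usize) < dict_len)
}

/// Max-reduction form: no early exit, so the reduction can be
/// vectorised (e.g. umax.4s / umaxv.4s on aarch64). A negative i32
/// cast to u32 becomes a huge value, so it still fails the check.
fn check_max(idx_chunk: &[i32], dict_len: usize) -> bool {
    let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
    (max_idx as usize) < dict_len
}

fn main() {
    let good = [0i32, 3, 7, 2, 1, 5, 6, 4];
    let bad = [0i32, 3, -1, 2, 1, 5, 6, 4]; // negative index must be rejected

    // Both forms agree on valid and invalid chunks.
    assert!(check_all(&good, 8) && check_max(&good, 8));
    assert!(!check_all(&bad, 8) && !check_max(&bad, 8));
    println!("ok");
}
```

Note that on a 64-bit target `(i as usize)` also maps negative `i32` values to very large integers (sign extension then reinterpretation), which is why both forms reject them; the max-reduction additionally preserves that behaviour through the `u32` cast.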
1 parent 89b1497 commit b429fc1

1 file changed

Lines changed: 8 additions & 1 deletion

File tree

parquet/src/encodings/rle.rs

@@ -520,8 +520,15 @@ impl RleDecoder {
         let idx_chunks = idx.chunks_exact(8);
         for (out_chunk, idx_chunk) in out_chunks.by_ref().zip(idx_chunks) {
             let dict_len = dict.len();
+            // u32 max-reduction instead of `.all(|&i| ..)`: `.all`
+            // short-circuits and blocks autovectorisation. Negative
+            // i32 cast to u32 becomes a large value so the bounds
+            // check still rejects it.
+            let max_idx = idx_chunk
+                .iter()
+                .fold(0u32, |acc, &i| acc.max(i as u32));
             assert!(
-                idx_chunk.iter().all(|&i| (i as usize) < dict_len),
+                (max_idx as usize) < dict_len,
                 "dictionary index out of bounds"
             );
             for (b, i) in out_chunk.iter_mut().zip(idx_chunk.iter()) {

0 commit comments
