Commit ab6e019

Dandandan and claude committed
perf(parquet): vectorise dict-index bounds check in RleDecoder::get_batch_with_dict
Replace `idx_chunk.iter().all(|&i| (i as usize) < dict_len)` with a u32 max-reduction (`fold(0u32, |acc, &i| acc.max(i as u32))`).

`.all` short-circuits and so blocks autovectorisation; on aarch64 the old form compiled to eight serialised `ldrsw` + `cmp` + `b.ls` pairs per 8-index chunk, followed by eight separate scalar gather loads. The max-reduction has no early exit, so LLVM now lowers the check to a single `ldp q1, q0` + `umax.4s` + `umaxv.4s` + one `cmp` + `b.ls`, then reuses the loaded NEON registers for the gather that follows.

Negative `i32` values cast to `u32` become large, so the bounds check still rejects them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
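The equivalence the message relies on can be sketched as a standalone program. This is only an illustration: the function names `in_bounds_all` and `in_bounds_max` are hypothetical and do not exist in the parquet crate, and the index type is assumed to be `i32`, as the message implies.

```rust
/// Original form (hypothetical wrapper): stops at the first bad index.
fn in_bounds_all(idx_chunk: &[i32], dict_len: usize) -> bool {
    idx_chunk.iter().all(|&i| (i as usize) < dict_len)
}

/// Max-reduction form (hypothetical wrapper): visits every index,
/// then performs a single comparison.
fn in_bounds_max(idx_chunk: &[i32], dict_len: usize) -> bool {
    // A negative i32 wraps to a value >= 2^31 when cast to u32,
    // so it dominates the maximum and fails the comparison below.
    let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
    (max_idx as usize) < dict_len
}

fn main() {
    let dict_len = 4;
    let ok = [0i32, 1, 2, 3, 3, 2, 1, 0];
    let too_big = [0i32, 1, 2, 4, 0, 0, 0, 0];
    let negative = [0i32, -1, 2, 3, 0, 0, 0, 0];

    // Both forms accept the valid chunk.
    assert!(in_bounds_all(&ok, dict_len) && in_bounds_max(&ok, dict_len));
    // Both forms reject an index equal to dict_len.
    assert!(!in_bounds_all(&too_big, dict_len) && !in_bounds_max(&too_big, dict_len));
    // Both forms reject a negative index.
    assert!(!in_bounds_all(&negative, dict_len) && !in_bounds_max(&negative, dict_len));
}
```

The two forms agree for any non-empty chunk as long as `dict_len` does not exceed `u32::MAX`. Beyond that size, a negative index cast to `u32` would no longer be guaranteed to fail the comparison, whereas the `as usize` cast on a 64-bit target would still reject it.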
1 parent 89b1497 commit ab6e019

1 file changed

Lines changed: 6 additions & 1 deletion

parquet/src/encodings/rle.rs

@@ -520,8 +520,13 @@ impl RleDecoder {
             let idx_chunks = idx.chunks_exact(8);
             for (out_chunk, idx_chunk) in out_chunks.by_ref().zip(idx_chunks) {
                 let dict_len = dict.len();
+                // u32 max-reduction instead of `.all(|&i| ..)`: `.all`
+                // short-circuits and blocks autovectorisation. Negative
+                // i32 cast to u32 becomes a large value so the bounds
+                // check still rejects it.
+                let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
                 assert!(
-                    idx_chunk.iter().all(|&i| (i as usize) < dict_len),
+                    (max_idx as usize) < dict_len,
                     "dictionary index out of bounds"
                 );
                 for (b, i) in out_chunk.iter_mut().zip(idx_chunk.iter()) {
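The hunk above ends before the body of the inner loop, so the overall pattern (validate a chunk of eight indices once, then gather) may be easier to follow as a self-contained sketch. Everything here is an assumption for illustration: `gather_chunked` is a hypothetical helper, not the signature of `RleDecoder::get_batch_with_dict`, and the loop body and tail handling are not taken from the commit.

```rust
/// Hypothetical helper: writes dict[idx[k]] into out[k], checking the
/// indices eight at a time with a max-reduction.
fn gather_chunked<T: Copy>(dict: &[T], idx: &[i32], out: &mut [T]) {
    assert_eq!(idx.len(), out.len(), "index and output lengths differ");
    let dict_len = dict.len();

    let mut out_chunks = out.chunks_exact_mut(8);
    let idx_chunks = idx.chunks_exact(8);
    // Indices left over after the last full chunk of eight.
    let idx_tail = idx_chunks.remainder();

    for (out_chunk, idx_chunk) in out_chunks.by_ref().zip(idx_chunks) {
        // One reduction and one comparison per chunk, with no early exit.
        let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
        assert!((max_idx as usize) < dict_len, "dictionary index out of bounds");

        for (b, i) in out_chunk.iter_mut().zip(idx_chunk.iter()) {
            // Checked indexing keeps this sketch sound on its own;
            // the assert above only documents the hoisted check.
            *b = dict[*i as usize];
        }
    }

    // Tail of fewer than eight elements, handled with plain checked indexing.
    for (b, i) in out_chunks.into_remainder().iter_mut().zip(idx_tail) {
        *b = dict[*i as usize];
    }
}

fn main() {
    let dict = [10u64, 20, 30, 40];
    let idx = [0i32, 1, 2, 3, 3, 2, 1, 0, 2, 2];
    let mut out = [0u64; 10];
    gather_chunked(&dict, &idx, &mut out);
    assert_eq!(out, [10, 20, 30, 40, 40, 30, 20, 10, 30, 30]);
}
```

The sketch keeps checked indexing inside the loops, so it stays memory-safe regardless of the hoisted assertion. Whether the real decoder does the same is not visible in this diff.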
