perf(parquet): Vectorize dict-index bounds check in `RleDecoder::get_batch_with_dict` (up to -7.9%) #9746
Conversation
Force-pushed b3b75e2 to b429fc1
run benchmark arrow_reader_clickbench
Force-pushed 0f17b76 to b429fc1
🤖 Arrow criterion benchmark running (GKE): comparing rle-dict-max-reduction (b429fc1) to 89b1497 (merge-base)
…batch_with_dict`

Replace `idx_chunk.iter().all(|&i| (i as usize) < dict_len)` with a u32 max-reduction (`fold(0u32, |acc, &i| acc.max(i as u32))`).

`.all` short-circuits, which blocks autovectorisation; on aarch64 the old form compiled to eight serialised `ldrsw` + `cmp` + `b.ls` pairs per 8-index chunk, followed by eight separate scalar gather loads. The max-reduction has no early exit, so LLVM now lowers the check to a single `ldp q1, q0` + `umax.4s` + `umaxv.4s` + one `cmp` + `b.ls`, then reuses the loaded NEON registers for the gather that follows.

Negative `i32` values cast to `u32` become large, so the bounds check still rejects them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
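A minimal standalone sketch of the two check shapes the commit message contrasts. The function names and slice sizes are illustrative; the real code lives in `parquet/src/encodings/rle.rs`:

```rust
/// Old form: `.all` short-circuits on the first out-of-bounds index,
/// which blocks autovectorisation of the comparison loop.
fn in_bounds_all(idx_chunk: &[i32], dict_len: usize) -> bool {
    idx_chunk.iter().all(|&i| (i as usize) < dict_len)
}

/// New form: a branch-free max-reduction over the chunk, followed by a
/// single comparison. Negative i32 values cast to u32 become large, so
/// they still fail the bounds check.
fn in_bounds_max(idx_chunk: &[i32], dict_len: usize) -> bool {
    let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
    (max_idx as usize) < dict_len
}

fn main() {
    let ok = [0i32, 3, 7, 2];
    let bad = [0i32, 3, -1, 2]; // -1 as u32 == u32::MAX, rejected

    assert!(in_bounds_all(&ok, 8) && in_bounds_max(&ok, 8));
    assert!(!in_bounds_all(&bad, 8) && !in_bounds_max(&bad, 8));
    println!("both forms agree");
}
```

Because the fold always consumes the whole chunk, LLVM is free to lower it to vector `umax` instructions; the early-exit semantics of `.all` prevent that transformation.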
Force-pushed b429fc1 to ab6e019
🤖 Arrow criterion benchmark completed (GKE)
Follow-up to ab6e019. Two changes on the same RLE dict-gather loop:

- `CHUNK = 16` (was 8): lets LLVM widen the `umax.4s` + `umaxv.4s` reduction to cover the whole chunk with one vector max per pair of loads, keeping the ratio of (cheap) bounds check to (cheap) gather in the loop's favour.
- `#[cold] #[inline(never)] fn oob` replaces `assert!`: `max_idx` no longer has to stay live on the hot path for the panic format string, removing a per-chunk stack spill. Panic behaviour on out-of-bounds indices is preserved.

Measured on aarch64 (Apple Silicon) with `cargo bench -p parquet --bench arrow_reader -- "dictionary encoded, mandatory, no NULLs"`:

| bench            | chunk 8 | chunk 16 + cold oob | Δ       |
| ---------------- | ------- | ------------------- | ------- |
| BinaryViewArray  | 101 µs  | 88 µs               | -13.0%  |
| StringViewArray  | 101 µs  | 88 µs               | -12.6%  |
| INT64 Decimal128 | 98 µs   | 94 µs               | -3.5%   |
| INT32 Decimal128 | 87 µs   | 84 µs               | -2.9%   |
| UInt8Array       | 52 µs   | 51 µs               | -1.8%   |

View arrays see the biggest win because the inner copy is cheaper, so the bounds check is a larger share of the loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
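A standalone sketch of the `#[cold] #[inline(never)]` panic-helper pattern applied to a chunked gather. The names `oob`, `gather`, and the chunk handling here are illustrative stand-ins, not the exact code in `rle.rs`:

```rust
/// Cold, never-inlined panic path: the format-string machinery and the
/// `max_idx` argument only materialise when the check actually fails,
/// so the hot loop carries no spill for them.
#[cold]
#[inline(never)]
fn oob(max_idx: u32, dict_len: usize) -> ! {
    panic!("dict index {max_idx} out of bounds for dictionary of len {dict_len}")
}

fn gather<T: Clone>(dict: &[T], indices: &[i32], out: &mut Vec<T>) {
    const CHUNK: usize = 16;
    for idx_chunk in indices.chunks(CHUNK) {
        // One vectorisable max-reduction per chunk, one branch.
        let max_idx = idx_chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
        if max_idx as usize >= dict.len() {
            oob(max_idx, dict.len());
        }
        // Bounds already verified for the whole chunk.
        out.extend(idx_chunk.iter().map(|&i| dict[i as usize].clone()));
    }
}

fn main() {
    let dict = ["a", "b", "c"];
    let mut out = Vec::new();
    gather(&dict, &[2, 0, 1, 1], &mut out);
    assert_eq!(out, ["c", "a", "b", "b"]);
    println!("{out:?}");
}
```

Moving the panic out of line is what lets the compiler keep only live vector registers in the loop body; an inline `assert!` would force `max_idx` to survive into the format-argument setup on every iteration.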
run benchmarks
🤖 Benchmark running (GKE): comparing rle-dict-max-reduction (0ca4c50) to 89b1497 (merge-base)
Benchmark for this request failed.
…Decoder::get_batch_with_dict`

Replace the `panic!` in the cold `oob` helper and the unchecked `dict[..]` indexing in the RLE and remainder paths with `ParquetError` returns. The vectorised hot loop (SIMD max-reduction + 16-way gather) is unchanged — only the error/cold paths differ.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
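A sketch of the panic-to-`Result` change described above. The `ParquetError` enum here is a local stand-in for the real error type in the `parquet` crate, and `oob_err`/`gather` are illustrative names:

```rust
#[derive(Debug)]
enum ParquetError {
    General(String),
}

/// Still cold and out of line, but now it builds an error value
/// instead of panicking, so callers can propagate it with `?`.
#[cold]
#[inline(never)]
fn oob_err(max_idx: u32, dict_len: usize) -> ParquetError {
    ParquetError::General(format!(
        "dictionary index {max_idx} out of bounds for dictionary of length {dict_len}"
    ))
}

fn gather<T: Clone>(
    dict: &[T],
    indices: &[i32],
    out: &mut Vec<T>,
) -> Result<(), ParquetError> {
    for chunk in indices.chunks(16) {
        let max_idx = chunk.iter().fold(0u32, |acc, &i| acc.max(i as u32));
        if max_idx as usize >= dict.len() {
            return Err(oob_err(max_idx, dict.len()));
        }
        out.extend(chunk.iter().map(|&i| dict[i as usize].clone()));
    }
    Ok(())
}

fn main() {
    let dict = [10i64, 20, 30];
    let mut out = Vec::new();
    assert!(gather(&dict, &[0, 2, 1], &mut out).is_ok());
    assert_eq!(out, [10, 30, 20]);
    assert!(gather(&dict, &[5], &mut Vec::new()).is_err());
    println!("out-of-bounds input now yields Err instead of a panic");
}
```

The hot-loop shape is identical to the panicking version; only the failure arm changes, so the codegen for the in-bounds path should be unaffected.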
Thanks, looks good to me!
@jhorstmann: thank you! Could you ✅ the PR? :)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
run benchmark arrow_reader
🤖 Arrow criterion benchmark running (GKE): comparing rle-dict-max-reduction (963f2a2) to 89b1497 (merge-base)
alamb left a comment
Thanks @Dandandan and @jhorstmann
I went through this and had Codex do it too, and it looks good to us
```rust
    .zip(idx.chunks_exact(CHUNK).remainder().iter())
{
    b.clone_from(&dict[*i as usize]);
    let dict_idx = *i as usize;
```
it is fascinating that the rust compiler doesn't already do this
```rust
let num_values = cmp::min(max_values - values_read, self.rle_left as usize);
let dict_idx = self.current_value.unwrap() as usize;
let dict_value = dict[dict_idx].clone();
let dict_value = dict
```
This is a nice change to return errors rather than panic on input
Which issue does this PR close?
Rationale for this change
Rewrite the code to generate more SIMD instructions / amortize loop/branching overhead.
What changes are included in this PR?
One-file change in `parquet/src/encodings/rle.rs`:

- `CHUNK = 16` indices
- `#[cold] #[inline(never)] fn oob` for the panic path
- `get_unchecked` gather after the vectorised check

Are these changes tested?
Existing tests
Are there any user-facing changes?
No — no API change; decoded output is identical, panic behaviour on out-of-bounds indices is preserved.
🤖 Generated with Claude Code