Skip to content

perf[buffer]: iteration for fallible operations with validity#8120

Merged
joseph-isaacs merged 27 commits into
developfrom
ji/fast-iter-valid
Jun 17, 2026
Merged

perf[buffer]: iteration for fallible operations with validity#8120
joseph-isaacs merged 27 commits into
developfrom
ji/fast-iter-valid

Conversation

@joseph-isaacs

@joseph-isaacs joseph-isaacs commented May 27, 2026

Copy link
Copy Markdown
Contributor

Currently use (and arrow) handle fallible operations with scalar (non-SIMD) code.

This PR add a trait and methods to have fast SIMD checked operations (includes cast) but verified else where that checked_add benefits

@codspeed-hq

codspeed-hq Bot commented May 27, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 8 improved benchmarks
❌ 12 regressed benchmarks
✅ 1524 untouched benchmarks
🆕 3 new benchmarks
⏩ 11 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)] 120.7 µs 183 µs -34.05%
Simulation encode_varbin[(1000, 2)] 176.1 µs 237.1 µs -25.74%
Simulation take_10k_random 196.9 µs 255.6 µs -22.97%
Simulation patched_take_10k_contiguous_patches 230.6 µs 290.8 µs -20.69%
Simulation patched_take_10k_random 242.6 µs 302.8 µs -19.89%
Simulation chunked_varbinview_canonical_into[(1000, 10)] 161.8 µs 198 µs -18.28%
Simulation chunked_varbinview_into_canonical[(1000, 10)] 177.1 µs 213.5 µs -17.06%
Simulation bench_many_codes_few_values[1024] 393.2 µs 465.6 µs -15.55%
Simulation decompress_rd[f64, (100000, 0.0)] 845.5 µs 982.8 µs -13.97%
Simulation varbinview_large 112.2 µs 130.4 µs -13.97%
Simulation chunked_varbinview_canonical_into[(100, 100)] 273.8 µs 307.9 µs -11.08%
Simulation chunked_varbinview_into_canonical[(100, 100)] 326.4 µs 365 µs -10.58%
Simulation sum_i32_nullable_all_valid 69.2 µs 35.4 µs +95.64%
Simulation null_count_run_end[(10000, 4, 0.01)] 125.4 µs 91.6 µs +36.98%
Simulation encode_varbinview[(1000, 2)] 189 µs 157.1 µs +20.26%
Simulation chunked_varbinview_opt_into_canonical[(1000, 10)] 229.3 µs 192.7 µs +18.96%
Simulation and_bool_nullable 93.7 µs 82.6 µs +13.41%
Simulation baseline_lt[4, 1024] 78.5 µs 69.9 µs +12.23%
Simulation decompress_rd[f64, (100000, 0.01)] 981.2 µs 890.4 µs +10.2%
Simulation decompress_rd[f64, (100000, 0.1)] 981.2 µs 890.4 µs +10.19%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ji/fast-iter-valid (7ed76bc) with develop (679e2c5)2

Open in CodSpeed

Footnotes

  1. 11 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on develop (9f494a1) during the generation of this report, so 679e2c5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs changed the title faster iteration infra perf[buffer]: iteration for fallible operations with validity May 27, 2026
@joseph-isaacs joseph-isaacs marked this pull request as ready for review May 27, 2026 15:13
@joseph-isaacs

Copy link
Copy Markdown
Contributor Author

Open question is where to put this code?

f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label May 27, 2026
@robert3005

Copy link
Copy Markdown
Contributor

Sounds like we want a crate in between the array and vortex-buffer or this could be a feature flag in vortex-buffer

@joseph-isaacs

Copy link
Copy Markdown
Contributor Author

I was thinking vortex-compute what only works with dtype, buffers and rust native types?

I cannot remember if this was a problem before?

f
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested a review from robert3005 May 28, 2026 11:30
@robert3005

Copy link
Copy Markdown
Contributor

there was no problem with vortex-compute iirc. There were some constructs that made it hard in the past but I think we're fine now.

@github-actions

Copy link
Copy Markdown
Contributor

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days

@github-actions github-actions Bot added the stale This PR is stale and will be auto-closed soon label Jun 12, 2026
@joseph-isaacs joseph-isaacs removed the stale This PR is stale and will be auto-closed soon label Jun 12, 2026
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs requested a review from a team June 12, 2026 11:23
Comment thread vortex-array/benches/cast_primitive.rs
/// The kernels in this crate require this trait instead of `Iterator` so that lane
/// reads carry no inter-iteration data dependency — the autovectorizer treats each
/// lane independently.
pub trait IndexedSource {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These traits really look like Buf/BufMut from bytes which we already implement for buffers and are implemented for similar things

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the random access read/write

///
/// Use this to drive a binary kernel from two columns. Length equality is enforced
/// at construction.
pub struct LaneZip<A, B>(pub A, pub B);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good candidate for implementing Iterator and ExactSizeIterator

/// `impl<S: IndexedSource> IndexedSourceExt for S` below. Bring the trait into
/// scope (`use vortex_compute::lane_kernels::IndexedSourceExt;`) to call
/// them with method syntax: `values.try_map_masked_into(&mask, &mut out, f)`.
pub trait IndexedSourceExt: IndexedSource + Sized {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we skip the ext traits? just increases the mental load of tracking where things happen IMO

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's another way of doing this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't it be default functions on the core trait?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't think its worth having default functions here we likely want more of these

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do this really matter?

Comment thread vortex-compute/src/lane_kernels/map_in_place.rs Outdated
// Skip the fallible kernel when type widening or (cached) min/max prove every value fits.
let target_dtype = DType::Primitive(T::PTYPE, Nullability::NonNullable);
let infallible = casts_losslessly_to(F::PTYPE, T::PTYPE)
|| cached_values_fit_in(array, &target_dtype) == Some(true);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - unwrap_or_default

Mask::AllTrue(_) => BufferMut::try_from_trusted_len_iter(

// Returns `true` if every value of `from` is representable in `to` without loss.
fn casts_losslessly_to(from: PType, to: PType) -> bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't need to be a function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this only that the body does easily read like that?

Comment thread vortex-array/src/arrays/primitive/compute/cast.rs Outdated
w
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Comment thread vortex-compute/src/lib.rs
@@ -0,0 +1,15 @@
// SPDX-License-Identifier: Apache-2.0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in vortex-buffer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about putting this there, but it just seemed like the wrong place

w
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
w
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
w
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs merged commit f0b8fb9 into develop Jun 17, 2026
64 of 65 checks passed
@joseph-isaacs joseph-isaacs deleted the ji/fast-iter-valid branch June 17, 2026 13:26
@robert3005

Copy link
Copy Markdown
Contributor

we need to add the crate to the benchmarks, since it's a new one it has to be enumerate in the ci matrix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants