Skip to content

perf[gpu]: export arrow device validity on the gpu#8440

Merged
0ax1 merged 10 commits into
developfrom
ad/cuda-validity-export
Jun 17, 2026
Merged

perf[gpu]: export arrow device validity on the gpu#8440
0ax1 merged 10 commits into
developfrom
ad/cuda-validity-export

Conversation

@0ax1

@0ax1 0ax1 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Move canonicalization of the validity buffer from the CPU to the GPU for arrow device array. As part of that this change adds a null count kernel, as the count is required by cuDF. cuDF does not support consuming -1 (unknown true count) for passed in arrow device arrays.

@0ax1 0ax1 added the changelog/performance A performance improvement label Jun 16, 2026
@codspeed-hq

codspeed-hq Bot commented Jun 16, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 8 improved benchmarks
❌ 12 regressed benchmarks
✅ 1524 untouched benchmarks
🆕 3 new benchmarks
⏩ 11 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(1000, 10)] 20.6 µs 35.7 µs -42.29%
Simulation chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)] 120.7 µs 182.9 µs -34%
Simulation encode_varbin[(1000, 2)] 176.1 µs 236 µs -25.4%
Simulation chunked_varbinview_canonical_into[(1000, 10)] 161.8 µs 198.1 µs -18.29%
Simulation chunked_varbinview_into_canonical[(1000, 10)] 177.1 µs 214 µs -17.25%
Simulation bench_many_codes_few_values[1024] 393.2 µs 468.7 µs -16.1%
Simulation decompress_rd[f64, (100000, 0.0)] 845.5 µs 982.8 µs -13.97%
Simulation varbinview_large 112.2 µs 130.3 µs -13.89%
Simulation bitwise_not_vortex_buffer_mut[128] 186.1 ns 215.3 ns -13.55%
Simulation chunked_varbinview_canonical_into[(100, 100)] 273.8 µs 308.8 µs -11.33%
Simulation bitwise_not_vortex_buffer_mut[1024] 246.4 ns 275.6 ns -10.58%
Simulation chunked_varbinview_into_canonical[(100, 100)] 326.4 µs 364.9 µs -10.55%
Simulation sum_i32_nullable_all_valid 69.2 µs 35.3 µs +95.96%
Simulation null_count_run_end[(10000, 4, 0.01)] 125.4 µs 91.6 µs +36.92%
Simulation encode_varbinview[(1000, 2)] 189 µs 156.7 µs +20.57%
Simulation take_10k_contiguous 252.8 µs 218.1 µs +15.89%
Simulation and_bool_nullable 93.7 µs 82.7 µs +13.21%
Simulation baseline_lt[4, 1024] 78.5 µs 69.6 µs +12.76%
Simulation decompress_rd[f64, (100000, 0.01)] 981.2 µs 890.4 µs +10.2%
Simulation decompress_rd[f64, (100000, 0.1)] 981.2 µs 890.4 µs +10.19%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ad/cuda-validity-export (637a07d) with develop (679e2c5)2

Open in CodSpeed

Footnotes

  1. 11 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on develop (8058097) during the generation of this report, so 679e2c5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@0ax1 0ax1 requested review from onursatici and robert3005 June 16, 2026 10:46
@0ax1 0ax1 marked this pull request as ready for review June 16, 2026 10:46
@0ax1 0ax1 requested a review from a team June 16, 2026 10:46
Comment thread vortex-cuda/kernels/src/arrow_validity.cu Outdated
Comment thread vortex-cuda/kernels/src/arrow_validity.cu Outdated
Comment thread vortex-cuda/kernels/src/arrow_validity.cu Outdated
Comment thread vortex-cuda/kernels/src/arrow_validity.cu
Comment thread vortex-cuda/benches/arrow_validity_cuda.rs
Comment thread vortex-cuda/src/arrow/canonical.rs
Comment thread vortex-cuda/src/arrow/mod.rs Outdated
Comment thread vortex-cuda/src/arrow/canonical.rs
@0ax1 0ax1 requested a review from joseph-isaacs June 16, 2026 15:49
@0ax1 0ax1 changed the title perf[gpu]: export arrow device array on the gpu perf[gpu]: export arrow device validity on the gpu Jun 16, 2026
@0ax1

0ax1 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@claude review

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @0ax1's task in 2m 51s —— View job


Review: GPU validity export for Arrow device arrays

I reviewed the diff against origin/develop. Overall this is a clean, well-commented change that correctly moves validity canonicalization and null counting onto the GPU. The kernel comments added in response to @joseph-isaacs's feedback are a real improvement. A few findings below.

Correctness — looks solid

  • arrow_validity_byte_mask correctly masks both leading Arrow-offset padding and trailing padding, including the case where both apply to the same byte (small len). Verified by hand for e.g. len=2, arrow_offset=3.
  • Word/byte sizing: output_bytes.div_ceil(8) == old validity_bits.div_ceil(64) via the ceil(ceil(n/8)/8) = ceil(n/64) identity, so the allocation size is unchanged. The count kernel reads only expected_bytes (≤ the whole-word allocation), so the over-allocation is safe.
  • The AllInvalid branch returning only the zeroed validity bitmap (with value buffers exported separately by the caller) matches @robert3005's point that Arrow still needs the underlying value buffer. Worth a quick confirm that the caller does in fact emit value buffers on this path for non-Null dtypes — the comment claims it but it isn't visible in this diff.

Findings

1. Missing doc comment on a new public API (count_arrow_validity_nulls)canonical.rs:837
CLAUDE.md requires "Every new public API definition must have a doc comment." repack_arrow_validity_buffer has one, but count_arrow_validity_nulls (now pub) does not. cargo clippy --all-features with missing_docs may also flag this.
Fix this →

2. Public API surface widened from pub(super) to pubcanonical.rs:837,899 + mod.rs
Both functions became fully pub so the _test-harness-gated test_harness module can pub use them. This leaks two internal GPU helpers into the permanent public API of vortex-cuda even in non-test builds. If they're only meant for the test harness/benches, consider pub(crate) plus a thin #[cfg(feature = "_test-harness")] wrapper (as the old mod.rs did for repack), rather than unconditional pub. Minor, but it's a deliberate API decision worth making explicitly.

3. len - valid_count underflow on a broken invariantcanonical.rs:890
valid_count is a device-computed u64; len - valid_count is unsigned. If the kernel ever returned valid_count > len (e.g. a masking regression), this wraps to a huge u64 and surfaces as an opaque i64::try_from failure instead of a clear error. A vortex_ensure!(valid_count <= len, ...) would make a future regression diagnosable. Defensive only — the masking currently guarantees this holds.

Nits / questions

  • Warp reduction assumes blockDim.x is a multiple of 32 (arrow_validity.cu, block_sum_to_thread_zero / warp_sum use FULL_WARP_MASK). True today (COUNT_THREADS_PER_BLOCK = 256), but __shfl_down_sync with a full mask is UB if a partial final warp ever appears. A one-line comment or static_assert(COUNT_THREADS_PER_BLOCK % WARP_SIZE == 0) on the launch side would lock the invariant.
  • Redundant device transfer: execute_validity_cuda already moves bits to the device, then the Validity::Array branch in export_arrow_validity_buffer calls into_parts + ensure_on_device(bits) again. Idempotent, so harmless, but the second ensure_on_device is dead work — the value is already device-resident.
  • Benchmarks: re @joseph-isaacs / @0ax1's thread — splitting into repack / count_nulls / export is reasonable for isolating the two new kernels; the compaction you mentioned addresses the readability concern.

Checks not run

This is a CUDA-only change; I did not build or run cargo clippy --all-targets --all-features / cargo nextest -p vortex-cuda here (no GPU/CUDA toolchain in this review environment). Please confirm clippy passes — finding #1 may trip missing_docs.

--- · ad/cuda-validity-export

@0ax1

0ax1 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@claude rebase, fix the conflicts

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @0ax1's task in 23s —— View job


Unable to rebase — read-only review workflow

Hi @0ax1 — I can't perform this request. This @claude integration runs as a read-only PR review workflow: it can review code and post feedback, but it cannot modify files, create commits, push branches, or perform git branch operations like rebasing or resolving merge conflicts.

Rebasing and conflict resolution are branch operations outside this workflow's scope (see the FAQ for details).

To rebase locally:

git fetch origin
git checkout ad/cuda-validity-export
git rebase origin/develop
# resolve conflicts in each file, then:
git add <resolved-files>
git rebase --continue
git push --force-with-lease origin ad/cuda-validity-export

If you'd instead like me to review the PR (or re-review after you rebase), or answer a question about the conflicts, just let me know and I'm happy to help with that.
· ad/cuda-validity-export

0ax1 and others added 9 commits June 17, 2026 15:16
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…t-pass

Container exports (struct/list/fixed-size-list/list-view/dict) reach
export_arrow_validity_buffer without going through execute_cuda, so a
non-canonical Validity::Array (e.g. dict-encoded, or produced by
take/scan) made the export bail. Canonicalize the validity on the GPU
inside export_arrow_validity_buffer instead, which covers every export
path uniformly.

This makes the executor's execute_canonical_validity_cuda post-pass
redundant. Removing it also restores the invariant that execute_cuda
leaves validity host-executable, fixing the unwrap_host panic on the
non-contiguous list-view rebuild path, where rebuild_primitive_list_view_child
runs execute_no_nulls on the CPU context.

Update the null_count expectations that the UNKNOWN_NULL_COUNT switch
missed, and add a regression test for non-canonical container validity.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The helper had a null_count parameter that was always UNKNOWN_NULL_COUNT
and a dead `null_count == 0` branch. Inline its move-to-device-and-align
logic into the only caller.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
repack_arrow_validity_buffer recomputed the bit-to-byte length formula
inline. Reuse the helper and derive the word count from the byte count.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-validity-export branch 2 times, most recently from 5b2f5ca to b4e5684 Compare June 17, 2026 15:32
@0ax1 0ax1 enabled auto-merge (squash) June 17, 2026 15:32
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-validity-export branch from b4e5684 to 637a07d Compare June 17, 2026 15:44
@0ax1 0ax1 merged commit cb82828 into develop Jun 17, 2026
67 of 69 checks passed
@0ax1 0ax1 deleted the ad/cuda-validity-export branch June 17, 2026 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants