perf[gpu]: export arrow device validity on the gpu by 0ax1 · Pull Request #8440 · vortex-data/vortex

0ax1 · 2026-06-16T09:52:06Z

Move canonicalization of the validity buffer from the CPU to the GPU for arrow device array. As part of that this change adds a null count kernel, as the count is required by cuDF. cuDF does not support consuming -1 (unknown true count) for passed in arrow device arrays.

codspeed-hq · 2026-06-16T09:58:43Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 8 improved benchmarks
❌ 12 regressed benchmarks
✅ 1524 untouched benchmarks
🆕 3 new benchmarks
⏩ 11 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`chunked_bool_canonical_into[(1000, 10)]`	20.6 µs	35.7 µs	-42.29%
❌	Simulation	`chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)]`	120.7 µs	182.9 µs	-34%
❌	Simulation	`encode_varbin[(1000, 2)]`	176.1 µs	236 µs	-25.4%
❌	Simulation	`chunked_varbinview_canonical_into[(1000, 10)]`	161.8 µs	198.1 µs	-18.29%
❌	Simulation	`chunked_varbinview_into_canonical[(1000, 10)]`	177.1 µs	214 µs	-17.25%
❌	Simulation	`bench_many_codes_few_values[1024]`	393.2 µs	468.7 µs	-16.1%
❌	Simulation	`decompress_rd[f64, (100000, 0.0)]`	845.5 µs	982.8 µs	-13.97%
❌	Simulation	`varbinview_large`	112.2 µs	130.3 µs	-13.89%
❌	Simulation	`bitwise_not_vortex_buffer_mut[128]`	186.1 ns	215.3 ns	-13.55%
❌	Simulation	`chunked_varbinview_canonical_into[(100, 100)]`	273.8 µs	308.8 µs	-11.33%
❌	Simulation	`bitwise_not_vortex_buffer_mut[1024]`	246.4 ns	275.6 ns	-10.58%
❌	Simulation	`chunked_varbinview_into_canonical[(100, 100)]`	326.4 µs	364.9 µs	-10.55%
⚡	Simulation	`sum_i32_nullable_all_valid`	69.2 µs	35.3 µs	+95.96%
⚡	Simulation	`null_count_run_end[(10000, 4, 0.01)]`	125.4 µs	91.6 µs	+36.92%
⚡	Simulation	`encode_varbinview[(1000, 2)]`	189 µs	156.7 µs	+20.57%
⚡	Simulation	`take_10k_contiguous`	252.8 µs	218.1 µs	+15.89%
⚡	Simulation	`and_bool_nullable`	93.7 µs	82.7 µs	+13.21%
⚡	Simulation	`baseline_lt[4, 1024]`	78.5 µs	69.6 µs	+12.76%
⚡	Simulation	`decompress_rd[f64, (100000, 0.01)]`	981.2 µs	890.4 µs	+10.2%
⚡	Simulation	`decompress_rd[f64, (100000, 0.1)]`	981.2 µs	890.4 µs	+10.19%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing ad/cuda-validity-export (637a07d) with develop (679e2c5)²}

11 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
No successful run was found on develop (8058097) during the generation of this report, so 679e2c5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

0ax1 · 2026-06-17T09:44:47Z

@claude review

github-actions · 2026-06-17T09:45:10Z

Claude finished @0ax1's task in 2m 51s —— View job

Review: GPU validity export for Arrow device arrays

I reviewed the diff against origin/develop. Overall this is a clean, well-commented change that correctly moves validity canonicalization and null counting onto the GPU. The kernel comments added in response to @joseph-isaacs's feedback are a real improvement. A few findings below.

Correctness — looks solid

arrow_validity_byte_mask correctly masks both leading Arrow-offset padding and trailing padding, including the case where both apply to the same byte (small len). Verified by hand for e.g. len=2, arrow_offset=3.
Word/byte sizing: output_bytes.div_ceil(8) == old validity_bits.div_ceil(64) via the ceil(ceil(n/8)/8) = ceil(n/64) identity, so the allocation size is unchanged. The count kernel reads only expected_bytes (≤ the whole-word allocation), so the over-allocation is safe.
The AllInvalid branch returning only the zeroed validity bitmap (with value buffers exported separately by the caller) matches @robert3005's point that Arrow still needs the underlying value buffer. Worth a quick confirm that the caller does in fact emit value buffers on this path for non-Null dtypes — the comment claims it but it isn't visible in this diff.

Findings

1. Missing doc comment on a new public API (count_arrow_validity_nulls) — canonical.rs:837
CLAUDE.md requires "Every new public API definition must have a doc comment." repack_arrow_validity_buffer has one, but count_arrow_validity_nulls (now pub) does not. cargo clippy --all-features with missing_docs may also flag this.
Fix this →

2. Public API surface widened from pub(super) to pub — canonical.rs:837,899 + mod.rs
Both functions became fully pub so the _test-harness-gated test_harness module can pub use them. This leaks two internal GPU helpers into the permanent public API of vortex-cuda even in non-test builds. If they're only meant for the test harness/benches, consider pub(crate) plus a thin #[cfg(feature = "_test-harness")] wrapper (as the old mod.rs did for repack), rather than unconditional pub. Minor, but it's a deliberate API decision worth making explicitly.

3. len - valid_count underflow on a broken invariant — canonical.rs:890
valid_count is a device-computed u64; len - valid_count is unsigned. If the kernel ever returned valid_count > len (e.g. a masking regression), this wraps to a huge u64 and surfaces as an opaque i64::try_from failure instead of a clear error. A vortex_ensure!(valid_count <= len, ...) would make a future regression diagnosable. Defensive only — the masking currently guarantees this holds.

Nits / questions

Warp reduction assumes blockDim.x is a multiple of 32 (arrow_validity.cu, block_sum_to_thread_zero / warp_sum use FULL_WARP_MASK). True today (COUNT_THREADS_PER_BLOCK = 256), but __shfl_down_sync with a full mask is UB if a partial final warp ever appears. A one-line comment or static_assert(COUNT_THREADS_PER_BLOCK % WARP_SIZE == 0) on the launch side would lock the invariant.
Redundant device transfer: execute_validity_cuda already moves bits to the device, then the Validity::Array branch in export_arrow_validity_buffer calls into_parts + ensure_on_device(bits) again. Idempotent, so harmless, but the second ensure_on_device is dead work — the value is already device-resident.
Benchmarks: re @joseph-isaacs / @0ax1's thread — splitting into repack / count_nulls / export is reasonable for isolating the two new kernels; the compaction you mentioned addresses the readability concern.

Checks not run

This is a CUDA-only change; I did not build or run cargo clippy --all-targets --all-features / cargo nextest -p vortex-cuda here (no GPU/CUDA toolchain in this review environment). Please confirm clippy passes — finding #1 may trip missing_docs.

--- · ad/cuda-validity-export

0ax1 · 2026-06-17T14:37:39Z

@claude rebase, fix the conflicts

github-actions · 2026-06-17T14:38:26Z

Claude finished @0ax1's task in 23s —— View job

Unable to rebase — read-only review workflow

Hi @0ax1 — I can't perform this request. This @claude integration runs as a read-only PR review workflow: it can review code and post feedback, but it cannot modify files, create commits, push branches, or perform git branch operations like rebasing or resolving merge conflicts.

Rebasing and conflict resolution are branch operations outside this workflow's scope (see the FAQ for details).

To rebase locally:

git fetch origin
git checkout ad/cuda-validity-export
git rebase origin/develop
# resolve conflicts in each file, then:
git add <resolved-files>
git rebase --continue
git push --force-with-lease origin ad/cuda-validity-export

If you'd instead like me to review the PR (or re-review after you rebase), or answer a question about the conflicts, just let me know and I'm happy to help with that.
· ad/cuda-validity-export

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

…t-pass Container exports (struct/list/fixed-size-list/list-view/dict) reach export_arrow_validity_buffer without going through execute_cuda, so a non-canonical Validity::Array (e.g. dict-encoded, or produced by take/scan) made the export bail. Canonicalize the validity on the GPU inside export_arrow_validity_buffer instead, which covers every export path uniformly. This makes the executor's execute_canonical_validity_cuda post-pass redundant. Removing it also restores the invariant that execute_cuda leaves validity host-executable, fixing the unwrap_host panic on the non-contiguous list-view rebuild path, where rebuild_primitive_list_view_child runs execute_no_nulls on the CPU context. Update the null_count expectations that the UNKNOWN_NULL_COUNT switch missed, and add a regression test for non-canonical container validity. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The helper had a null_count parameter that was always UNKNOWN_NULL_COUNT and a dead `null_count == 0` branch. Inline its move-to-device-and-align logic into the only caller. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

repack_arrow_validity_buffer recomputed the bit-to-byte length formula inline. Reuse the helper and derive the word count from the byte count. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

0ax1 added the changelog/performance A performance improvement label Jun 16, 2026

0ax1 requested review from onursatici and robert3005 June 16, 2026 10:46

0ax1 marked this pull request as ready for review June 16, 2026 10:46

0ax1 requested a review from a team June 16, 2026 10:46