
feat[gpu]: support mixed-width types in dynamic dispatch #7331

Open

0ax1 wants to merge 4 commits into develop from ad/cuda-flexible-ptype

Conversation

@0ax1
Contributor

@0ax1 0ax1 commented Apr 7, 2026

Add per-stage PType tracking so input stages (e.g. u8 dict codes) can differ from the output type. The kernel widens narrow loads to the output type via load_element() instead of requiring a separate widening pass.

Kernel changes:

  • Byte-addressed shared memory (smem offsets are now in bytes, not elements) so stages with different element widths can coexist
  • PTypeTag per source op drives load_element() for type-widening loads
  • Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages

Plan builder (Rust):

  • Track smem_byte_offset per stage instead of element offset
  • Emit PTypeTag alongside each source op in the packed plan
  • Plumb per-stage and per-op elem_bytes through the C ABI
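The smem_byte_offset bookkeeping above can be sketched roughly as follows; this is a hypothetical Rust reduction of the plan builder, and the `Stage` and `PTypeTag` definitions here are illustrative stand-ins, not the PR's actual types.

```rust
// Hypothetical sketch of per-stage byte-offset tracking; `Stage` and
// `PTypeTag` are illustrative stand-ins for the real plan-builder types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PTypeTag {
    U8,
    U16,
    U32,
}

impl PTypeTag {
    fn byte_width(self) -> usize {
        match self {
            PTypeTag::U8 => 1,
            PTypeTag::U16 => 2,
            PTypeTag::U32 => 4,
        }
    }
}

struct Stage {
    ptype: PTypeTag,
    num_elems: usize,
    smem_byte_offset: usize, // byte offset, not element offset
}

/// Lay stages out back to back in shared memory, in bytes, so stages
/// with different element widths can coexist in one buffer.
fn assign_smem_offsets(stages: &mut [Stage]) -> usize {
    let mut offset = 0;
    for stage in stages.iter_mut() {
        stage.smem_byte_offset = offset;
        offset += stage.num_elems * stage.ptype.byte_width();
    }
    offset // total shared-memory bytes required
}
```

With element offsets, a u8 stage followed by a u32 stage would be ambiguous about where the second stage starts; byte offsets make the layout width-independent.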
100M benchmark results (GH200 480GB, SM 9.0); the trailing percentage is the reduction in duration relative to the reference:
  for_bitpacked_6bw      170.75 µs  2181.7 GiB/s  -2.5%
  dict_256vals_bp8bw     191.90 µs  1941.3 GiB/s  -7.4%
  runend_100runs         145.63 µs  2558.1 GiB/s  -21.0%
  dict_64vals_nested     176.61 µs  2109.3 GiB/s  -6.2%
  alp_for_bp_6bw_f32     179.23 µs  2078.5 GiB/s  -4.6%

@0ax1 0ax1 added the changelog/feature A new feature label Apr 7, 2026
Add per-stage PType tracking so input stages (e.g. u8 dict codes) can
differ from the output type. The kernel widens narrow loads to the output
type via load_element() instead of requiring a separate widening pass.

Kernel changes:
- Byte-addressed shared memory (smem offsets are now in bytes, not elements)
  so stages with different element widths can coexist
- PTypeTag per source op drives load_element() for type-widening loads
- THREAD_POS macro replaces repeated if-constexpr N==1 branches
- Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages
- Add write-barrier / read-barrier comments on __syncthreads calls
- Rename variables for clarity (tile_count, tile_idx, tile_start,
  VALUES_PER_TILE, elem_idx, etc.)

Plan builder (Rust):
- Track smem_byte_offset per stage instead of element offset
- Emit PTypeTag alongside each source op in the packed plan
- Plumb per-stage and per-op elem_bytes through the C ABI

100M benchmark results (GH200 480GB, SM 9.0):
  for_bitpacked_6bw      170.75 µs  2181.7 GiB/s  -2.5%
  dict_256vals_bp8bw     191.90 µs  1941.3 GiB/s  -7.4%
  runend_100runs         145.63 µs  2558.1 GiB/s  -21.0%
  dict_64vals_nested     176.61 µs  2109.3 GiB/s  -6.2%
  alp_for_bp_6bw_f32     179.23 µs  2078.5 GiB/s  -4.6%

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-flexible-ptype branch from 3b1d40e to ad54f73 Compare April 7, 2026 22:51
@0ax1 0ax1 changed the title feat: support mixed-width types in CUDA dynamic dispatch kernel feat[gpu]: support mixed-width types in dynamic dispatch Apr 7, 2026
@codspeed-hq

codspeed-hq bot commented Apr 7, 2026

Merging this PR will not alter performance

✅ 1122 untouched benchmarks
⏩ 1530 skipped benchmarks [1]


Comparing ad/cuda-flexible-ptype (dc484fc) with develop (bbb371c)


Footnotes

  1. 1530 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

@0ax1
Contributor Author

0ax1 commented Apr 8, 2026

To fully fuse mixed types we still need to:

  • BITUNPACK at native width - unpack u8 into smem, then widen to T when reading
  • Input stages at native width - decode into smem at the source ptype, not T
  • Widening reads from smem - like load_element but for smem, in DICT gathers and RUNEND lookups
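The third bullet (widening reads from shared memory) can be illustrated with a Rust analogue of reading one narrow element out of a byte-addressed buffer and widening it; this is purely illustrative host-side code, not the kernel's smem path, and `load_widened` is a made-up name.

```rust
/// Illustrative Rust analogue of a widening read from a byte-addressed
/// buffer: read one element of `elem_bytes` width at element index `idx`
/// and widen it to u32 (unsigned types only, for brevity).
fn load_widened(buf: &[u8], elem_bytes: usize, idx: usize) -> u32 {
    let start = idx * elem_bytes;
    let bytes = &buf[start..start + elem_bytes];
    let mut v: u32 = 0;
    for (i, b) in bytes.iter().enumerate() {
        // Assemble little-endian: byte i contributes bits [8*i, 8*i+8).
        v |= (*b as u32) << (8 * i);
    }
    v
}
```

In the kernel this would be the smem counterpart of load_element(), used inside DICT gathers and RUNEND lookups so the gather index math can stay in the source ptype's width.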

@0ax1 0ax1 requested a review from robert3005 April 8, 2026 09:08
@0ax1 0ax1 force-pushed the ad/cuda-flexible-ptype branch from 74f18f0 to d1dedfc Compare April 8, 2026 13:10
@0ax1 0ax1 marked this pull request as ready for review April 8, 2026 13:11
Correctness:
- load_element<T>() now dispatches on the raw PTypeTag instead of
  routing through ptype_to_unsigned(), which was zero-extending signed
  narrow types (e.g. i8(-1) → 255 instead of → 0xFFFFFFFF).
- as_bytes() copies through a zeroed buffer so struct padding holes are
  deterministically zero, avoiding UB from uninitialised padding.
- Reject F16 early in is_dyn_dispatch_compatible() — prevents
  unreachable panic in ptype_to_tag().
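The sign-extension fix above can be demonstrated in plain Rust: widening a signed narrow value by routing through its unsigned counterpart zero-extends, while widening at the signed type first sign-extends. This mirrors the described kernel behavior but is not the actual CUDA code.

```rust
/// Widen an i8 by going through the unsigned type first, as the old
/// ptype_to_unsigned() path effectively did: the sign bit is lost and
/// the value is zero-extended.
fn widen_via_unsigned(v: i8) -> u32 {
    (v as u8) as u32
}

/// Widen at the signed type, as dispatching on the raw PTypeTag does:
/// the value is sign-extended to i32 before the final bit-preserving
/// cast to u32.
fn widen_signed(v: i8) -> u32 {
    (v as i32) as u32
}
```

For v = -1 the first path yields 255 and the second yields 0xFFFFFFFF, which is exactly the discrepancy the fix removes.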

Plan builder hardening:
- Replace unwrap_or / or_else fallbacks with map_err + ? in
  walk_bitpacked, walk_for, walk_zigzag, walk_dict, walk_runend so
  dtype errors propagate instead of silently using a default.
- Extract walk_mixed_width_child() to short-circuit Primitive children
  (grab buffer directly) vs. deferring encoded children to subtrees.
- MaterializedPlan::execute reads output_ptype from the plan header
  instead of taking it as a redundant parameter.
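The unwrap_or-to-propagation change can be sketched as follows; `PlanError`, `dtype_of`, and `elem_bytes_*` are hypothetical stand-ins for the real walk_* helpers, shown only to contrast the two error-handling styles.

```rust
// Hypothetical sketch of the unwrap_or -> map_err + ? change; the
// names here are stand-ins, not the crate's actual API.
#[derive(Debug, PartialEq)]
struct PlanError(String);

fn dtype_of(name: &str) -> Result<u8, String> {
    match name {
        "u8" => Ok(1),
        "u32" => Ok(4),
        other => Err(format!("unsupported dtype: {other}")),
    }
}

/// Before: a dtype error silently fell back to a default width,
/// masking the failure.
fn elem_bytes_lossy(name: &str) -> u8 {
    dtype_of(name).unwrap_or(4)
}

/// After: the error is mapped into the plan-builder error type and
/// propagated with `?` instead of being swallowed.
fn elem_bytes_strict(name: &str) -> Result<u8, PlanError> {
    let width = dtype_of(name).map_err(PlanError)?;
    Ok(width)
}
```

The strict version makes an unsupported dtype a hard planning failure rather than a plan built with a wrong default width.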

FFI / DRY:
- Move BLOCK_SIZE and KERNEL_STATIC_SHARED_BYTES to the C header;
  Rust consumes them via bindgen instead of duplicating magic numbers.
- Remove dead ptype_byte_width() C function; Rust uses
  tag_to_ptype().byte_width() instead.
- Add bidirectional tag_to_ptype() alongside ptype_to_tag().
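A bidirectional tag mapping of the kind described might look like this; the enum variants and signatures are assumptions for illustration, and the real ptype_to_tag()/tag_to_ptype() may differ.

```rust
// Hypothetical sketch of a bidirectional tag mapping; the real
// signatures and variant set may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PType {
    U8,
    U16,
    U32,
}

fn ptype_to_tag(p: PType) -> u8 {
    match p {
        PType::U8 => 0,
        PType::U16 => 1,
        PType::U32 => 2,
    }
}

/// Inverse of ptype_to_tag(); returns None for tags that do not map to
/// a known ptype, so callers handle bad FFI input explicitly.
fn tag_to_ptype(tag: u8) -> Option<PType> {
    match tag {
        0 => Some(PType::U8),
        1 => Some(PType::U16),
        2 => Some(PType::U32),
        _ => None,
    }
}
```

Making the inverse total over u8 (via Option) is what lets the Rust side replace the dead C helper with `tag_to_ptype().byte_width()` without a panic path on malformed tags.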

Tests:
- Add sign-extension tests (i8→u32, i16→u32) for load_element.
- Fix test_plan_structure to use byte offsets (256*4=1024).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-flexible-ptype branch from d1dedfc to f75ca61 Compare April 8, 2026 13:13
0ax1 added 2 commits April 8, 2026 13:32
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
