
feat[gpu]: support mixed-width types in dynamic dispatch #7331

Open

0ax1 wants to merge 4 commits into develop from ad/cuda-flexible-ptype

Conversation

@0ax1
Contributor

@0ax1 0ax1 commented Apr 7, 2026

Add per-stage PType tracking so input stages (e.g. u8 dict codes) can differ from the output type. The kernel widens narrow loads to the output type via load_element() instead of requiring a separate widening pass.

Kernel changes:

  • Byte-addressed shared memory (smem offsets are now in bytes, not elements) so stages with different element widths can coexist
  • PTypeTag per source op drives load_element() for type-widening loads
  • Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages

Plan builder (Rust):

  • Track smem_byte_offset per stage instead of element offset
  • Emit PTypeTag alongside each source op in the packed plan
  • Plumb per-stage and per-op elem_bytes through the C ABI
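The smem_byte_offset bookkeeping above can be sketched roughly as follows; this is a hypothetical Rust reduction of the plan builder, and the `Stage` and `PTypeTag` definitions here are illustrative stand-ins, not the PR's actual types.

```rust
// Hypothetical sketch of per-stage byte-offset tracking; `Stage` and
// `PTypeTag` are illustrative stand-ins for the real plan-builder types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PTypeTag {
    U8,
    U16,
    U32,
}

impl PTypeTag {
    fn byte_width(self) -> usize {
        match self {
            PTypeTag::U8 => 1,
            PTypeTag::U16 => 2,
            PTypeTag::U32 => 4,
        }
    }
}

struct Stage {
    ptype: PTypeTag,
    num_elems: usize,
    smem_byte_offset: usize, // byte offset, not element offset
}

/// Lay stages out back to back in shared memory, in bytes, so stages
/// with different element widths can coexist in one buffer.
fn assign_smem_offsets(stages: &mut [Stage]) -> usize {
    let mut offset = 0;
    for stage in stages.iter_mut() {
        stage.smem_byte_offset = offset;
        offset += stage.num_elems * stage.ptype.byte_width();
    }
    offset // total shared-memory bytes required
}
```

With element offsets, a u8 stage followed by a u32 stage would be ambiguous about where the second stage starts; byte offsets make the layout width-independent.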
100M benchmark results (GH200 480GB, SM 9.0); the trailing percentage is the reduction in duration relative to the reference:
  for_bitpacked_6bw      170.75 µs  2181.7 GiB/s  -2.5%
  dict_256vals_bp8bw     191.90 µs  1941.3 GiB/s  -7.4%
  runend_100runs         145.63 µs  2558.1 GiB/s  -21.0%
  dict_64vals_nested     176.61 µs  2109.3 GiB/s  -6.2%
  alp_for_bp_6bw_f32     179.23 µs  2078.5 GiB/s  -4.6%

@0ax1 0ax1 added the changelog/feature A new feature label Apr 7, 2026
Add per-stage PType tracking so input stages (e.g. u8 dict codes) can
differ from the output type. The kernel widens narrow loads to the output
type via load_element() instead of requiring a separate widening pass.

Kernel changes:
- Byte-addressed shared memory (smem offsets are now in bytes, not elements)
  so stages with different element widths can coexist
- PTypeTag per source op drives load_element() for type-widening loads
- THREAD_POS macro replaces repeated if-constexpr N==1 branches
- Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages
- Add write-barrier / read-barrier comments on __syncthreads calls
- Rename variables for clarity (tile_count, tile_idx, tile_start,
  VALUES_PER_TILE, elem_idx, etc.)

Plan builder (Rust):
- Track smem_byte_offset per stage instead of element offset
- Emit PTypeTag alongside each source op in the packed plan
- Plumb per-stage and per-op elem_bytes through the C ABI

100M benchmark results (GH200 480GB, SM 9.0):
  for_bitpacked_6bw      170.75 µs  2181.7 GiB/s  -2.5%
  dict_256vals_bp8bw     191.90 µs  1941.3 GiB/s  -7.4%
  runend_100runs         145.63 µs  2558.1 GiB/s  -21.0%
  dict_64vals_nested     176.61 µs  2109.3 GiB/s  -6.2%
  alp_for_bp_6bw_f32     179.23 µs  2078.5 GiB/s  -4.6%

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-flexible-ptype branch from 3b1d40e to ad54f73 Compare April 7, 2026 22:51
@0ax1 0ax1 changed the title feat: support mixed-width types in CUDA dynamic dispatch kernel feat[gpu]: support mixed-width types in dynamic dispatch Apr 7, 2026
@codspeed-hq

codspeed-hq bot commented Apr 7, 2026

Merging this PR will not alter performance

✅ 1122 untouched benchmarks
⏩ 1530 skipped benchmarks [1]


Comparing ad/cuda-flexible-ptype (dc484fc) with develop (bbb371c)


Footnotes

  1. 1530 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

@0ax1
Contributor Author

0ax1 commented Apr 8, 2026

To fully fuse mixed types we still need to:

  • BITUNPACK at native width - unpack u8 into smem, then widen to T when reading
  • Input stages at native width - decode into smem at the source ptype, not T
  • Widening reads from smem - like load_element but for smem, in DICT gathers and RUNEND lookups
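The third bullet (widening reads from shared memory) can be illustrated with a Rust analogue of reading one narrow element out of a byte-addressed buffer and widening it; this is purely illustrative host-side code, not the kernel's smem path, and `load_widened` is a made-up name.

```rust
/// Illustrative Rust analogue of a widening read from a byte-addressed
/// buffer: read one element of `elem_bytes` width at element index `idx`
/// and widen it to u32 (unsigned types only, for brevity).
fn load_widened(buf: &[u8], elem_bytes: usize, idx: usize) -> u32 {
    let start = idx * elem_bytes;
    let bytes = &buf[start..start + elem_bytes];
    let mut v: u32 = 0;
    for (i, b) in bytes.iter().enumerate() {
        // Assemble little-endian: byte i contributes bits [8*i, 8*i+8).
        v |= (*b as u32) << (8 * i);
    }
    v
}
```

In the kernel this would be the smem counterpart of load_element(), used inside DICT gathers and RUNEND lookups so the gather index math can stay in the source ptype's width.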

@0ax1 0ax1 requested a review from robert3005 April 8, 2026 09:08
@0ax1 0ax1 force-pushed the ad/cuda-flexible-ptype branch from 74f18f0 to d1dedfc Compare April 8, 2026 13:10
@0ax1 0ax1 marked this pull request as ready for review April 8, 2026 13:11
Correctness:
- load_element<T>() now dispatches on the raw PTypeTag instead of
  routing through ptype_to_unsigned(), which was zero-extending signed
  narrow types (e.g. i8(-1) → 255 instead of → 0xFFFFFFFF).
- as_bytes() copies through a zeroed buffer so struct padding holes are
  deterministically zero, avoiding UB from uninitialised padding.
- Reject F16 early in is_dyn_dispatch_compatible() — prevents
  unreachable panic in ptype_to_tag().
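The sign-extension fix above can be demonstrated in plain Rust: widening a signed narrow value by routing through its unsigned counterpart zero-extends, while widening at the signed type first sign-extends. This mirrors the described kernel behavior but is not the actual CUDA code.

```rust
/// Widen an i8 by going through the unsigned type first, as the old
/// ptype_to_unsigned() path effectively did: the sign bit is lost and
/// the value is zero-extended.
fn widen_via_unsigned(v: i8) -> u32 {
    (v as u8) as u32
}

/// Widen at the signed type, as dispatching on the raw PTypeTag does:
/// the value is sign-extended to i32 before the final bit-preserving
/// cast to u32.
fn widen_signed(v: i8) -> u32 {
    (v as i32) as u32
}
```

For v = -1 the first path yields 255 and the second yields 0xFFFFFFFF, which is exactly the discrepancy the fix removes.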

Plan builder hardening:
- Replace unwrap_or / or_else fallbacks with map_err + ? in
  walk_bitpacked, walk_for, walk_zigzag, walk_dict, walk_runend so
  dtype errors propagate instead of silently using a default.
- Extract walk_mixed_width_child() to short-circuit Primitive children
  (grab buffer directly) vs. deferring encoded children to subtrees.
- MaterializedPlan::execute reads output_ptype from the plan header
  instead of taking it as a redundant parameter.
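The unwrap_or-to-propagation change can be sketched as follows; `PlanError`, `dtype_of`, and `elem_bytes_*` are hypothetical stand-ins for the real walk_* helpers, shown only to contrast the two error-handling styles.

```rust
// Hypothetical sketch of the unwrap_or -> map_err + ? change; the
// names here are stand-ins, not the crate's actual API.
#[derive(Debug, PartialEq)]
struct PlanError(String);

fn dtype_of(name: &str) -> Result<u8, String> {
    match name {
        "u8" => Ok(1),
        "u32" => Ok(4),
        other => Err(format!("unsupported dtype: {other}")),
    }
}

/// Before: a dtype error silently fell back to a default width,
/// masking the failure.
fn elem_bytes_lossy(name: &str) -> u8 {
    dtype_of(name).unwrap_or(4)
}

/// After: the error is mapped into the plan-builder error type and
/// propagated with `?` instead of being swallowed.
fn elem_bytes_strict(name: &str) -> Result<u8, PlanError> {
    let width = dtype_of(name).map_err(PlanError)?;
    Ok(width)
}
```

The strict version makes an unsupported dtype a hard planning failure rather than a plan built with a wrong default width.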

FFI / DRY:
- Move BLOCK_SIZE and KERNEL_STATIC_SHARED_BYTES to the C header;
  Rust consumes them via bindgen instead of duplicating magic numbers.
- Remove dead ptype_byte_width() C function; Rust uses
  tag_to_ptype().byte_width() instead.
- Add bidirectional tag_to_ptype() alongside ptype_to_tag().
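A bidirectional tag mapping of the kind described might look like this; the enum variants and signatures are assumptions for illustration, and the real ptype_to_tag()/tag_to_ptype() may differ.

```rust
// Hypothetical sketch of a bidirectional tag mapping; the real
// signatures and variant set may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PType {
    U8,
    U16,
    U32,
}

fn ptype_to_tag(p: PType) -> u8 {
    match p {
        PType::U8 => 0,
        PType::U16 => 1,
        PType::U32 => 2,
    }
}

/// Inverse of ptype_to_tag(); returns None for tags that do not map to
/// a known ptype, so callers handle bad FFI input explicitly.
fn tag_to_ptype(tag: u8) -> Option<PType> {
    match tag {
        0 => Some(PType::U8),
        1 => Some(PType::U16),
        2 => Some(PType::U32),
        _ => None,
    }
}
```

Making the inverse total over u8 (via Option) is what lets the Rust side replace the dead C helper with `tag_to_ptype().byte_width()` without a panic path on malformed tags.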

Tests:
- Add sign-extension tests (i8→u32, i16→u32) for load_element.
- Fix test_plan_structure to use byte offsets (256*4=1024).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 force-pushed the ad/cuda-flexible-ptype branch from d1dedfc to f75ca61 Compare April 8, 2026 13:13
0ax1 added 2 commits April 8, 2026 13:32
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
