Commit f87eefa

feat[gpu]: support mixed-width types in dynamic dispatch (#7331)

Add per-stage PType tracking so input stages (e.g. u8 dict codes) can differ from the output type. The kernel widens narrow loads to the output type via load_element() instead of requiring a separate widening pass.

Kernel changes:
- Byte-addressed shared memory (smem offsets are now in bytes, not elements) so stages with different element widths can coexist
- PTypeTag per source op drives load_element() for type-widening loads
- Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages

Plan builder (Rust):
- Track smem_byte_offset per stage instead of element offset
- Emit PTypeTag alongside each source op in the packed plan
- Plumb per-stage and per-op elem_bytes through the C ABI

```
100M benchmark results (GH200 480GB, SM 9.0)
- % refers to the reduction in duration:

for_bitpacked_6bw    170.75 µs  2181.7 GiB/s   -2.5%
dict_256vals_bp8bw   191.90 µs  1941.3 GiB/s   -7.4%
runend_100runs       145.63 µs  2558.1 GiB/s  -21.0%
dict_64vals_nested   176.61 µs  2109.3 GiB/s   -6.2%
alp_for_bp_6bw_f32   179.23 µs  2078.5 GiB/s   -4.6%
```

---------

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
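To illustrate the idea in the commit message, here is a minimal host-side Rust sketch of byte-addressed stage offsets combined with a tag-driven widening load. It is not the code from this commit. The names `PTypeTag`, `load_element`, and `smem_byte_offset` come from the message, but the enum variants, the `Stage` struct, the `assign_smem_byte_offsets` and `demo` helpers, the alignment rule, and the little-endian layout are all assumptions made for illustration.

```rust
/// Hypothetical element-type tag; the variants are assumptions, not the real enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PTypeTag {
    U8,
    U16,
    U32,
}

impl PTypeTag {
    /// Width of one element in bytes.
    fn elem_bytes(self) -> usize {
        match self {
            PTypeTag::U8 => 1,
            PTypeTag::U16 => 2,
            PTypeTag::U32 => 4,
        }
    }
}

/// Hypothetical per-stage record: element type, element count, byte offset.
struct Stage {
    ptype: PTypeTag,
    len: usize,
    smem_byte_offset: usize,
}

/// Lay stages out back to back in bytes and return the total size.
/// Aligning each stage to its own element width is an assumption of this sketch.
fn assign_smem_byte_offsets(stages: &mut [Stage]) -> usize {
    let mut offset = 0usize;
    for stage in stages.iter_mut() {
        let width = stage.ptype.elem_bytes();
        // Round the running offset up to a multiple of the element width.
        offset = (offset + width - 1) / width * width;
        stage.smem_byte_offset = offset;
        offset += stage.len * width;
    }
    offset
}

/// Read element `idx` of `stage` from a byte-addressed buffer and widen it to u32.
/// Little-endian storage is assumed.
fn load_element(smem: &[u8], stage: &Stage, idx: usize) -> u32 {
    let at = stage.smem_byte_offset + idx * stage.ptype.elem_bytes();
    match stage.ptype {
        PTypeTag::U8 => smem[at] as u32,
        PTypeTag::U16 => u16::from_le_bytes([smem[at], smem[at + 1]]) as u32,
        PTypeTag::U32 => {
            u32::from_le_bytes([smem[at], smem[at + 1], smem[at + 2], smem[at + 3]])
        }
    }
}

/// Dictionary decode with u8 codes and u32 values sharing one byte buffer.
fn demo() -> Vec<u32> {
    let mut stages = vec![
        Stage { ptype: PTypeTag::U8, len: 3, smem_byte_offset: 0 },  // dict codes
        Stage { ptype: PTypeTag::U32, len: 2, smem_byte_offset: 0 }, // dict values
    ];
    let total = assign_smem_byte_offsets(&mut stages);
    let mut smem = vec![0u8; total];

    // Codes [1, 0, 1] occupy the first three bytes.
    smem[0..3].copy_from_slice(&[1, 0, 1]);
    // Values [10, 20] start at the second stage's byte offset.
    let v = stages[1].smem_byte_offset;
    smem[v..v + 4].copy_from_slice(&10u32.to_le_bytes());
    smem[v + 4..v + 8].copy_from_slice(&20u32.to_le_bytes());

    (0..3)
        .map(|i| {
            // The narrow code is widened on load, then used to index the values.
            let code = load_element(&smem, &stages[0], i) as usize;
            load_element(&smem, &stages[1], code)
        })
        .collect()
}
```

In this sketch the u8 codes and u32 values coexist in one buffer because each stage records a byte offset instead of an element offset, and the per-stage tag selects how many bytes each load reads before widening.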

1 parent bbb371c commit f87eefa

5 files changed

vortex-cuda
- kernels/src
  - dynamic_dispatch.cu
  - dynamic_dispatch.h
- src
  - dynamic_dispatch
    - mod.rs
    - plan_builder.rs
  - hybrid_dispatch
    - mod.rs
