Commit 3b1d40e
feat: support mixed-width types in CUDA dynamic dispatch kernel

Add per-stage PType tracking so input stages (e.g. u8 dict codes) can
differ from the output type. The kernel widens narrow loads to the output
type via load_element() instead of requiring a separate widening pass.

Kernel changes:
- Byte-addressed shared memory (smem offsets are now in bytes, not elements)
  so stages with different element widths can coexist
- PTypeTag per source op drives load_element() for type-widening loads
- THREAD_POS macro replaces repeated if-constexpr N==1 branches
- Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages
- Add write-barrier / read-barrier comments on __syncthreads calls
- Rename variables for clarity (tile_count, tile_idx, tile_start,
  VALUES_PER_TILE, elem_idx, etc.)
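
The byte-addressed layout and the widening load can be sketched on the host as follows. This is a minimal illustration in plain C++, not the kernel's actual code: the PTypeTag values, the load_element() signature, and the load_as() helper are assumptions made for the example, since the commit message names PTypeTag and load_element() but does not show their definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical tag values; the real PTypeTag encoding is not shown here.
enum class PTypeTag : uint8_t { U8, U16, U32 };

// Width in bytes of one element of the tagged type.
constexpr size_t elem_bytes(PTypeTag tag) {
    switch (tag) {
        case PTypeTag::U8:  return 1;
        case PTypeTag::U16: return 2;
        case PTypeTag::U32: return 4;
    }
    return 0;
}

// Read one Src value from raw bytes and widen it to T.
template <typename Src, typename T>
T load_as(const unsigned char* p) {
    Src v;
    std::memcpy(&v, p, sizeof(Src));
    return static_cast<T>(v);
}

// Load element `idx` of a stage that begins `byte_offset` bytes into `smem`,
// widening from the stage's own type to the output type T.
template <typename T>
T load_element(const unsigned char* smem, size_t byte_offset,
               PTypeTag tag, size_t idx) {
    assert(smem != nullptr);
    const unsigned char* src = smem + byte_offset + idx * elem_bytes(tag);
    switch (tag) {
        case PTypeTag::U8:  return load_as<uint8_t, T>(src);
        case PTypeTag::U16: return load_as<uint16_t, T>(src);
        case PTypeTag::U32: return load_as<uint32_t, T>(src);
    }
    return T{};
}
```

Because the stage offset is in bytes, a u8 stage and a u32 stage can sit next to each other in the same buffer, and the consumer always receives values of the output type.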

Plan builder (Rust):
- Track smem_byte_offset per stage instead of element offset
- Emit PTypeTag alongside each source op in the packed plan
- Plumb per-stage and per-op elem_bytes through the C ABI
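
The offset bookkeeping described above can be sketched as below. The actual plan builder is Rust; the sketch is written in C++ only to stay in one language with the kernel-side example, and StageDesc, StageLayout, layout_stages(), and the align-to-element-size rule are all hypothetical names and assumptions, not taken from the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one stage: element count and element width.
struct StageDesc {
    uint32_t len;
    uint32_t elem_bytes;
};

// Hypothetical per-stage record handed to the kernel.
struct StageLayout {
    uint32_t smem_byte_offset;
    uint32_t elem_bytes;
};

// Round `value` up to the next multiple of `align` (align must be > 0).
constexpr uint32_t align_up(uint32_t value, uint32_t align) {
    return (value + align - 1) / align * align;
}

// Assign each stage a byte offset into shared memory.
// Assumption: each stage is aligned to its own element size.
std::vector<StageLayout> layout_stages(const std::vector<StageDesc>& stages,
                                       uint32_t* total_bytes) {
    std::vector<StageLayout> out;
    uint32_t offset = 0;
    for (const StageDesc& s : stages) {
        assert(s.elem_bytes > 0);
        offset = align_up(offset, s.elem_bytes);
        out.push_back({offset, s.elem_bytes});
        offset += s.len * s.elem_bytes;
    }
    if (total_bytes != nullptr) {
        *total_bytes = offset;
    }
    return out;
}
```

With element offsets, a stage following three u8 values would have no well-defined position for u32 data; with byte offsets the builder can place it at byte 4 and record its width separately.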

100M benchmark results (GH200 480GB, SM 9.0):
  for_bitpacked_6bw     170.75 µs   2181.7 GiB/s    -2.5%
  dict_256vals_bp8bw    191.90 µs   1941.3 GiB/s    -7.4%
  runend_100runs        145.63 µs   2558.1 GiB/s   -21.0%
  dict_64vals_nested    176.61 µs   2109.3 GiB/s    -6.2%
  alp_for_bp_6bw_f32    179.23 µs   2078.5 GiB/s    -4.6%
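
The time and throughput columns are related by simple arithmetic, sketched here. The reported figures are consistent with 100M output elements of 4 bytes each (400,000,000 bytes); that element width is inferred from the numbers, not stated in the commit, and throughput_gib_per_s() is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Throughput in GiB/s for `bytes` processed in `micros` microseconds.
double throughput_gib_per_s(uint64_t bytes, double micros) {
    assert(micros > 0.0);
    const double seconds = micros * 1e-6;
    const double gib = static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
    return gib / seconds;
}
```

For example, throughput_gib_per_s(400000000, 170.75) comes out at roughly 2181.7, matching the for_bitpacked_6bw row.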

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

1 parent b9c47cf commit 3b1d40e
File tree
5 files changed, +1148 -429 lines changed
- vortex-cuda
- kernels/src
- src
- dynamic_dispatch
- hybrid_dispatch