feat[gpu]: support mixed-width types in dynamic dispatch #7331
Open
Add per-stage PType tracking so input stages (e.g. u8 dict codes) can differ from the output type. The kernel widens narrow loads to the output type via load_element() instead of requiring a separate widening pass.

Kernel changes:
- Byte-addressed shared memory (smem offsets are now in bytes, not elements) so stages with different element widths can coexist
- PTypeTag per source op drives load_element() for type-widening loads
- THREAD_POS macro replaces repeated if-constexpr N==1 branches
- Fuse scalar_ops into the source_op loop for non-BITUNPACK input stages
- Add write-barrier / read-barrier comments on __syncthreads calls
- Rename variables for clarity (tile_count, tile_idx, tile_start, VALUES_PER_TILE, elem_idx, etc.)

Plan builder (Rust):
- Track smem_byte_offset per stage instead of element offset
- Emit PTypeTag alongside each source op in the packed plan
- Plumb per-stage and per-op elem_bytes through the C ABI

100M benchmark results (GH200 480GB, SM 9.0):

for_bitpacked_6bw     170.75 µs   2181.7 GiB/s    -2.5%
dict_256vals_bp8bw    191.90 µs   1941.3 GiB/s    -7.4%
runend_100runs        145.63 µs   2558.1 GiB/s   -21.0%
dict_64vals_nested    176.61 µs   2109.3 GiB/s    -6.2%
alp_for_bp_6bw_f32    179.23 µs   2078.5 GiB/s    -4.6%

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
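To illustrate the byte-addressed layout described above, here is a minimal host-side sketch of how per-stage byte offsets could be assigned when stages have different element widths. It is not the PR's plan builder (which is written in Rust): `StageDesc`, `align_up`, and `assign_byte_offsets` are hypothetical names, and aligning each stage to its own element width is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one input stage (not the PR's actual struct).
struct StageDesc {
    std::size_t elem_count;  // number of elements the stage keeps in shared memory
    std::size_t elem_bytes;  // width of one element in bytes: 1, 2, 4, or 8
};

// Round `offset` up to the next multiple of `align` (a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

// Assign every stage a byte offset into a single shared buffer.
// Assumption: each stage is aligned to its own element width so that typed
// loads from the stage's region are naturally aligned.
inline std::vector<std::size_t> assign_byte_offsets(
    const std::vector<StageDesc>& stages, std::size_t* total_bytes) {
    std::vector<std::size_t> offsets;
    offsets.reserve(stages.size());
    std::size_t cursor = 0;
    for (const StageDesc& s : stages) {
        cursor = align_up(cursor, s.elem_bytes);
        offsets.push_back(cursor);
        cursor += s.elem_count * s.elem_bytes;
    }
    if (total_bytes != nullptr) {
        *total_bytes = cursor;
    }
    return offsets;
}
```

With element offsets, a u8 stage placed after a u32 stage would have no unambiguous position; with byte offsets, a 256-element u32 stage simply occupies bytes 0 to 1023 and the next stage starts at byte 1024, whatever its width.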
3b1d40e to ad54f73
Merging this PR will not alter performance
Contributor
Author
To fully fuse mixed types we still need to:
74f18f0 to d1dedfc
Correctness:
- load_element<T>() now dispatches on the raw PTypeTag instead of routing through ptype_to_unsigned(), which was zero-extending signed narrow types (e.g. i8(-1) → 255 instead of → 0xFFFFFFFF).
- as_bytes() copies through a zeroed buffer so struct padding holes are deterministically zero, avoiding UB from uninitialised padding.
- Reject F16 early in is_dyn_dispatch_compatible(); this prevents an unreachable panic in ptype_to_tag().

Plan builder hardening:
- Replace unwrap_or / or_else fallbacks with map_err + ? in walk_bitpacked, walk_for, walk_zigzag, walk_dict, walk_runend so dtype errors propagate instead of silently using a default.
- Extract walk_mixed_width_child() to short-circuit Primitive children (grab buffer directly) vs. deferring encoded children to subtrees.
- MaterializedPlan::execute reads output_ptype from the plan header instead of taking it as a redundant parameter.

FFI / DRY:
- Move BLOCK_SIZE and KERNEL_STATIC_SHARED_BYTES to the C header; Rust consumes them via bindgen instead of duplicating magic numbers.
- Remove dead ptype_byte_width() C function; Rust uses tag_to_ptype().byte_width() instead.
- Add bidirectional tag_to_ptype() alongside ptype_to_tag().

Tests:
- Add sign-extension tests (i8→u32, i16→u32) for load_element.
- Fix test_plan_structure to use byte offsets (256*4=1024).

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
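The sign-extension fix above can be sketched on the host as follows. This is a hedged illustration of dispatching on the raw tag, not the kernel code from this PR: the `PTypeTag` enumerators shown are a hypothetical subset, `load_raw` is an invented helper, and the real `load_element()` signature may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical subset of tags; the real enum lives in the PR's C header.
enum class PTypeTag : std::uint8_t { U8, I8, U16, I16, U32, I32 };

// Read the idx-th element of type Src from a byte-addressed buffer.
template <typename Src>
inline Src load_raw(const unsigned char* base, std::size_t idx) {
    Src value;
    std::memcpy(&value, base + idx * sizeof(Src), sizeof(Src));
    return value;
}

// Load a narrow element and widen it to T. Dispatching on the signed tag
// means the value is read as a signed integer first, so the conversion to T
// sign-extends. Mapping the tag to its unsigned counterpart before the load
// would zero-extend instead.
template <typename T>
inline T load_element(const unsigned char* base, std::size_t idx, PTypeTag tag) {
    switch (tag) {
        case PTypeTag::U8:  return static_cast<T>(load_raw<std::uint8_t>(base, idx));
        case PTypeTag::I8:  return static_cast<T>(load_raw<std::int8_t>(base, idx));
        case PTypeTag::U16: return static_cast<T>(load_raw<std::uint16_t>(base, idx));
        case PTypeTag::I16: return static_cast<T>(load_raw<std::int16_t>(base, idx));
        case PTypeTag::U32: return static_cast<T>(load_raw<std::uint32_t>(base, idx));
        case PTypeTag::I32: return static_cast<T>(load_raw<std::int32_t>(base, idx));
    }
    assert(false && "unsupported PTypeTag");
    return T{};
}
```

Loading the byte 0xFF with the I8 tag into a `std::uint32_t` yields 0xFFFFFFFF, while the same byte with the U8 tag yields 255, which is the distinction the new i8→u32 and i16→u32 tests are described as covering.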
d1dedfc to f75ca61
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>