permute_2D_data_kernel_vec: long4->long2 + BF16 fix#5771
Open
royren622 wants to merge 1 commit into
Open
Conversation
Summary: X-link: facebookresearch/FBGEMM#2700 `permute_2D_data_kernel_vec` (used by `permute_2D_sparse_data`) had two latent issues in its vec-width selection that this diff fixes together. First, the 8-byte branch used `long4` (32B per thread) which compiles to two `LDG.E.128` instructions with a 16B stride per 32B sector, halving sector utilization. Switching to `long2` (16B per thread) restores full sector use and lets each warp issue a single 16B-aligned load per element. Second, the width formula `(sizeof(t) == 8) ? 2 : 4` from D89161131 hardcoded `kVecWidth = 4` for non-8-byte types, but the vec type `float4` (16B) actually holds 8 elements when the element is 2 bytes (`Half`, `BFloat16`). The fix replaces the hardcoded formula with `sizeof(vec_t) / sizeof(elem_t)`, yielding the correct 8 / 4 / 2 widths for 2 / 4 / 8-byte element types from a single source of truth. Important scope note for the 2-byte fix: the corrupt-write path applied to 2-byte **weights** with `weights_columns = 1`, not 2-byte indices. The alignment guard at `sparse_permute_2d.cu:118-123` requires `sizeof(indices_t) in {4, 8}`, so 2-byte indices always route to the scalar fallback (the vec path was never reached for `Half` / `BFloat16` indices in either D89161131 or this diff). The `weights_vec_aligned` check at `sparse_permute_2d.cu:125-129` has no such sizeof restriction, so 2-byte weights ARE reachable: pre-fix the vec loop wrote `2 * segment_length` elements per segment for 2-byte weights, silently corrupting adjacent segments' weight values. Note that `BFloat16` weights are GPU-only (the CPU op uses `FBGEMM_DISPATCH_FLOAT_HALF_AND_DOUBLE` which excludes `BFloat16`), so the regression test uses `Half` (also 2-byte, identical vec-path mechanics, and CPU-validateable). On H100 PCIe with the `permute-2d-sparse-data-bench` default workload (T=40, B=128, max_seg_len=2000): the int64 path improves from 18.29 ms / 1716 GB/s to 16.31 ms / 1934 GB/s (1.12x speedup, sector utilization 50% -> 77.6% DRAM throughput, now memory-bound at ~95% of HBM3 peak). BF16 wall-clock now matches float32 (8.75 ms vs 8.18 ms), confirming full vectorization is in effect on the weights side. Stacked on D89161131. Differential Revision: D105336279
Contributor
|
@royren622 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105336279. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2700
permute_2D_data_kernel_vec(used bypermute_2D_sparse_data) had two latent issues in its vec-width selection that this diff fixes together. First, the 8-byte branch usedlong4(32B per thread) which compiles to twoLDG.E.128instructions with a 16B stride per 32B sector, halving sector utilization. Switching tolong2(16B per thread) restores full sector use and lets each warp issue a single 16B-aligned load per element. Second, the width formula(sizeof(t) == 8) ? 2 : 4from D89161131 hardcodedkVecWidth = 4for non-8-byte types, but the vec typefloat4(16B) actually holds 8 elements when the element is 2 bytes (Half,BFloat16). The fix replaces the hardcoded formula withsizeof(vec_t) / sizeof(elem_t), yielding the correct 8 / 4 / 2 widths for 2 / 4 / 8-byte element types from a single source of truth.Important scope note for the 2-byte fix: the corrupt-write path applied to 2-byte weights with
weights_columns = 1, not 2-byte indices. The alignment guard atsparse_permute_2d.cu:118-123requiressizeof(indices_t) in {4, 8}, so 2-byte indices always route to the scalar fallback (the vec path was never reached forHalf/BFloat16indices in either D89161131 or this diff). Theweights_vec_alignedcheck atsparse_permute_2d.cu:125-129has no such sizeof restriction, so 2-byte weights ARE reachable: pre-fix the vec loop wrote2 * segment_lengthelements per segment for 2-byte weights, silently corrupting adjacent segments' weight values. Note thatBFloat16weights are GPU-only (the CPU op usesFBGEMM_DISPATCH_FLOAT_HALF_AND_DOUBLEwhich excludesBFloat16), so the regression test usesHalf(also 2-byte, identical vec-path mechanics, and CPU-validateable).On H100 PCIe with the
permute-2d-sparse-data-benchdefault workload (T=40, B=128, max_seg_len=2000): the int64 path improves from 18.29 ms / 1716 GB/s to 16.31 ms / 1934 GB/s (1.12x speedup, sector utilization 50% -> 77.6% DRAM throughput, now memory-bound at ~95% of HBM3 peak). BF16 wall-clock now matches float32 (8.75 ms vs 8.18 ms), confirming full vectorization is in effect on the weights side. Stacked on D89161131.Differential Revision: D105336279