permute_2D_data_kernel_vec: long4->long2 + BF16 fix#5771

Open
royren622 wants to merge 1 commit into pytorch:main from royren622:export-D105336279

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2700

permute_2D_data_kernel_vec (used by permute_2D_sparse_data) had two latent issues in its vec-width selection that this diff fixes together. First, the 8-byte branch used long4 (32B per thread) which compiles to two LDG.E.128 instructions with a 16B stride per 32B sector, halving sector utilization. Switching to long2 (16B per thread) restores full sector use and lets each warp issue a single 16B-aligned load per element. Second, the width formula (sizeof(t) == 8) ? 2 : 4 from D89161131 hardcoded kVecWidth = 4 for non-8-byte types, but the vec type float4 (16B) actually holds 8 elements when the element is 2 bytes (Half, BFloat16). The fix replaces the hardcoded formula with sizeof(vec_t) / sizeof(elem_t), yielding the correct 8 / 4 / 2 widths for 2 / 4 / 8-byte element types from a single source of truth.
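The width derivation described above can be sketched on the host in plain C++. This is a minimal illustration, not the kernel's code: `Vec16` is a hypothetical 16-byte stand-in for CUDA's `float4` / `long2`, and `old_vec_width` / `vec_width` are illustrative names for the pre-fix and post-fix formulas quoted in the summary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte stand-in for CUDA's float4 / long2 (assumption:
// both vector types occupy 16 bytes, as stated in the summary).
struct alignas(16) Vec16 {
  unsigned char bytes[16];
};

// Pre-fix formula quoted in the summary: keyed only on the element size,
// so every non-8-byte type gets a width of 4.
template <typename elem_t>
constexpr std::size_t old_vec_width() {
  return (sizeof(elem_t) == 8) ? 2 : 4;
}

// Post-fix formula: the width is derived from the vector type itself.
template <typename vec_t, typename elem_t>
constexpr std::size_t vec_width() {
  static_assert(sizeof(vec_t) % sizeof(elem_t) == 0,
                "vector size must be a multiple of the element size");
  return sizeof(vec_t) / sizeof(elem_t);
}

// 2-, 4-, and 8-byte elements map to widths 8, 4, and 2.
static_assert(vec_width<Vec16, std::uint16_t>() == 8, "2-byte elements");
static_assert(vec_width<Vec16, float>() == 4, "4-byte elements");
static_assert(vec_width<Vec16, std::int64_t>() == 2, "8-byte elements");

// The old formula disagrees only for 2-byte elements.
static_assert(old_vec_width<std::uint16_t>() == 4, "pre-fix width mismatch");
```

Deriving the width from `sizeof(vec_t)` means a later change of vector type cannot silently drift out of sync with the loop bound.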

Scope note for the 2-byte fix: the corrupt-write path affected 2-byte weights with weights_columns = 1, not 2-byte indices. The alignment guard at sparse_permute_2d.cu:118-123 requires sizeof(indices_t) to be 4 or 8, so 2-byte indices always take the scalar fallback; the vec path was never reached for Half / BFloat16 indices in either D89161131 or this diff. The weights_vec_aligned check at sparse_permute_2d.cu:125-129 has no such sizeof restriction, so 2-byte weights can reach the vec path. Pre-fix, the vec loop wrote 2 * segment_length elements per segment for 2-byte weights, silently corrupting adjacent segments' weight values. BFloat16 weights are GPU-only (the CPU op uses FBGEMM_DISPATCH_FLOAT_HALF_AND_DOUBLE, which excludes BFloat16), so the regression test uses Half, which is also 2 bytes, exercises identical vec-path mechanics, and can be validated against the CPU op.
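The overwrite can be reproduced with a simplified host-side model. The sketch below is an assumption-laden illustration, not the kernel: `copy_segment_model` is a hypothetical helper, the indexing scheme is a guess at the general shape of a vectorized copy loop, and the bounds check exists only to keep the host simulation memory-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of a vectorized copy of one segment of 2-byte elements.
// assumed_width is the width used to compute the iteration count;
// each "vector store" really moves 8 elements (16 bytes / 2 bytes).
std::vector<std::uint16_t> copy_segment_model(
    const std::vector<std::uint16_t>& src,
    std::vector<std::uint16_t> dst,
    std::size_t segment_length,
    std::size_t assumed_width) {
  constexpr std::size_t kActualWidth = 8;
  const std::size_t iterations = segment_length / assumed_width;
  for (std::size_t i = 0; i < iterations; ++i) {
    for (std::size_t j = 0; j < kActualWidth; ++j) {
      const std::size_t idx = i * kActualWidth + j;
      // Guard added for the host model only; it is not part of the kernel.
      if (idx < src.size() && idx < dst.size()) {
        dst[idx] = src[idx];
      }
    }
  }
  return dst;
}
```

With a 16-element segment at the start of a 32-element buffer, an assumed width of 4 gives four iterations of 8 elements each, so 32 elements are written and the neighbouring segment is overwritten. An assumed width of 8 gives two iterations and leaves the neighbour untouched.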

On H100 PCIe with the permute-2d-sparse-data-bench default workload (T=40, B=128, max_seg_len=2000), the int64 path improves from 18.29 ms / 1716 GB/s to 16.31 ms / 1934 GB/s (1.12x speedup; sector utilization 50% -> 77.6% DRAM throughput; now memory-bound at ~95% of peak HBM bandwidth). BF16 wall-clock is now within about 7% of float32 (8.75 ms vs 8.18 ms), confirming full vectorization is in effect on the weights side. Stacked on D89161131.
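The reported ratios can be cross-checked with a few lines of arithmetic. This is a sketch under one assumption: that the GB/s figures are effective bandwidth (bytes moved divided by wall-clock time). The helper names are illustrative.

```cpp
#include <cassert>

// Speedup as the ratio of wall-clock times.
constexpr double speedup(double before_ms, double after_ms) {
  return before_ms / after_ms;
}

// Data volume implied by a time and an effective bandwidth
// (assumption: GB/s = bytes moved / elapsed time).
constexpr double gigabytes_moved(double elapsed_ms, double gb_per_s) {
  return gb_per_s * elapsed_ms / 1000.0;
}

// int64 path: 18.29 ms -> 16.31 ms is about a 1.12x speedup.
static_assert(speedup(18.29, 16.31) > 1.12 && speedup(18.29, 16.31) < 1.13,
              "reported 1.12x speedup");

// BF16 vs float32: 8.75 ms vs 8.18 ms is a gap of roughly 7%.
static_assert(speedup(8.75, 8.18) > 1.06 && speedup(8.75, 8.18) < 1.08,
              "BF16 within about 7% of float32");
```

Both int64 measurements imply roughly the same data volume (about 31.4 GB and 31.5 GB), which is what one would expect if the workload is unchanged and only the kernel's efficiency differs.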

Differential Revision: D105336279

@meta-cla meta-cla Bot added the cla signed label May 18, 2026

meta-codesync Bot commented May 18, 2026

@royren622 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105336279.
