permute_2D_data_kernel_vec: long4->long2 + BF16 fix by royren622 · Pull Request #5771 · pytorch/FBGEMM

royren622 · 2026-05-18T03:07:15Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2700

`permute_2D_data_kernel_vec` (used by `permute_2D_sparse_data`) had two latent issues in its vec-width selection that this diff fixes together. First, the 8-byte branch used `long4` (32B per thread), which compiles to two `LDG.E.128` instructions with a 16B stride per 32B sector, halving sector utilization. Switching to `long2` (16B per thread) restores full sector use and lets each warp issue a single 16B-aligned load per element. Second, the width formula `(sizeof(t) == 8) ? 2 : 4` from D89161131 hardcoded `kVecWidth = 4` for non-8-byte types, but the vec type `float4` (16B) actually holds 8 elements when the element is 2 bytes (`Half`, `BFloat16`). The fix replaces the hardcoded formula with `sizeof(vec_t) / sizeof(elem_t)`, yielding the correct 8 / 4 / 2 widths for 2 / 4 / 8-byte element types from a single source of truth.
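To make the width derivation concrete, here is a minimal host-side sketch of the two formulas. It is not the kernel code: `Float4Like`, `Long2Like`, `Half2ByteLike`, `VecTraits`, and `old_vec_width` are hypothetical names introduced only for illustration, and the structs stand in for CUDA's 16-byte `float4` / `long2` and for a 2-byte element such as `Half` or `BFloat16`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical host-side stand-ins for CUDA's 16-byte float4 and long2.
struct alignas(16) Float4Like { float x, y, z, w; };
struct alignas(16) Long2Like { std::int64_t x, y; };
static_assert(sizeof(Float4Like) == 16 && sizeof(Long2Like) == 16,
              "both vector types are assumed to be 16 bytes");

// Hypothetical stand-in for a 2-byte element type (Half / BFloat16).
struct Half2ByteLike { std::uint16_t bits; };

// Width derived from the vector type itself, as described in the summary:
// sizeof(vec_t) / sizeof(elem_t).
template <typename elem_t>
struct VecTraits {
  using vec_t =
      std::conditional_t<sizeof(elem_t) == 8, Long2Like, Float4Like>;
  static constexpr std::size_t kVecWidth = sizeof(vec_t) / sizeof(elem_t);
};

// The earlier hardcoded formula: (sizeof(t) == 8) ? 2 : 4.
template <typename elem_t>
constexpr std::size_t old_vec_width() {
  return sizeof(elem_t) == 8 ? 2 : 4;
}

// 2 / 4 / 8-byte elements give widths 8 / 4 / 2 with the derived formula.
static_assert(VecTraits<Half2ByteLike>::kVecWidth == 8, "2-byte element");
static_assert(VecTraits<float>::kVecWidth == 4, "4-byte element");
static_assert(VecTraits<std::int64_t>::kVecWidth == 2, "8-byte element");

// The hardcoded formula disagrees only for 2-byte elements.
static_assert(old_vec_width<Half2ByteLike>() == 4, "width mismatch case");
```

The two formulas agree for 4-byte and 8-byte elements, which is why the mismatch only surfaces for 2-byte types.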

Important scope note for the 2-byte fix: the corrupt-write path applied to 2-byte **weights** with `weights_columns = 1`, not 2-byte indices. The alignment guard at `sparse_permute_2d.cu:118-123` requires `sizeof(indices_t) in {4, 8}`, so 2-byte indices always route to the scalar fallback (the vec path was never reached for `Half` / `BFloat16` indices in either D89161131 or this diff). The `weights_vec_aligned` check at `sparse_permute_2d.cu:125-129` has no such sizeof restriction, so 2-byte weights ARE reachable: pre-fix, the vec loop wrote `2 * segment_length` elements per segment for 2-byte weights, silently corrupting adjacent segments' weight values. Note that `BFloat16` weights are GPU-only (the CPU op uses `FBGEMM_DISPATCH_FLOAT_HALF_AND_DOUBLE`, which excludes `BFloat16`), so the regression test uses `Half` (also 2-byte, identical vec-path mechanics, and CPU-validatable).
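The `2 * segment_length` figure can be reproduced with a simplified model of a vectorized copy loop. This is a sketch under stated assumptions, not the kernel's actual indexing: it assumes the loop runs `segment_length / assumed_width` iterations, that each iteration stores one full 16-byte vector, and that `segment_length` is a multiple of the width. The function name `elements_written` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed vector size: 16 bytes (float4 / long2).
constexpr std::size_t kVecBytes = 16;

// Simplified model: number of elements stored for one segment when the loop
// trip count is computed from `assumed_width` but every store moves a full
// 16-byte vector.
constexpr std::size_t elements_written(std::size_t segment_length,
                                       std::size_t elem_size,
                                       std::size_t assumed_width) {
  const std::size_t iterations = segment_length / assumed_width;
  const std::size_t elems_per_store = kVecBytes / elem_size;
  return iterations * elems_per_store;
}

// 2-byte elements with the hardcoded width 4: twice the segment is written.
static_assert(elements_written(64, 2, 4) == 128, "pre-fix overrun");
// 2-byte elements with the derived width 8: exactly the segment is written.
static_assert(elements_written(64, 2, 8) == 64, "post-fix");
// 4-byte elements were never affected: width 4 is already correct.
static_assert(elements_written(64, 4, 4) == 64, "float path");
```

Under this model the extra `segment_length` elements land past the end of the segment, which matches the description of adjacent segments' weights being overwritten.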

On H100 PCIe with the `permute-2d-sparse-data-bench` default workload (T=40, B=128, max_seg_len=2000), the int64 path improves from 18.29 ms / 1716 GB/s to 16.31 ms / 1934 GB/s (1.12x speedup, sector utilization 50% -> 77.6% DRAM throughput, now memory-bound at ~95% of HBM3 peak). BF16 wall-clock is now close to float32 (8.75 ms vs 8.18 ms), which indicates that full vectorization is in effect on the weights side. Stacked on D89161131.
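As a quick sanity check on the reported numbers, the sketch below recomputes the speedup and the implied data volume from the quoted times and bandwidths. The helper names `speedup` and `gigabytes_moved` are hypothetical and exist only for this check; the inputs are the figures quoted above.

```cpp
#include <cassert>

// Ratio of wall-clock times; values above 1 mean the new code is faster.
constexpr double speedup(double before_ms, double after_ms) {
  return before_ms / after_ms;
}

// Data volume implied by a run: time (ms) times bandwidth (GB/s), in GB.
constexpr double gigabytes_moved(double ms, double gb_per_s) {
  return ms * 1e-3 * gb_per_s;
}

// int64 path: 18.29 ms -> 16.31 ms is roughly a 1.12x speedup.
static_assert(speedup(18.29, 16.31) > 1.12 && speedup(18.29, 16.31) < 1.13,
              "reported 1.12x speedup");

// Both runs imply about the same data volume (roughly 31.4 to 31.5 GB),
// as expected when only the access pattern changes.
static_assert(gigabytes_moved(18.29, 1716.0) > 31.3 &&
                  gigabytes_moved(18.29, 1716.0) < 31.5,
              "pre-fix volume");
static_assert(gigabytes_moved(16.31, 1934.0) > 31.5 &&
                  gigabytes_moved(16.31, 1934.0) < 31.6,
              "post-fix volume");
```

By the same arithmetic, the BF16 time of 8.75 ms is about 7% above the float32 time of 8.18 ms.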

Differential Revision: D105336279


meta-codesync · 2026-05-18T03:07:23Z

@royren622 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105336279.

meta-cla Bot added the cla signed label May 18, 2026

meta-codesync Bot added fb-exported meta-exported labels May 18, 2026

royren622 wants to merge 1 commit into pytorch:main from royren622:export-D105336279
