Skip to content

Add grid-stride loop and ROCm cap to index_add_2d_with_unique_indices_kernel (#5934)#5934

Closed
q10 wants to merge 1 commit into
pytorch:mainfrom
q10:export-D105029511
Closed

Add grid-stride loop and ROCm cap to index_add_2d_with_unique_indices_kernel (#5934)#5934
q10 wants to merge 1 commit into
pytorch:mainfrom
q10:export-D105029511

Conversation

@q10

@q10 q10 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2852

Tier-2 fix for HIP grid-overflow in sparse_ops/sparse_index_add.cu.

index_add_2d_with_unique_indices_kernel previously used blockIdx.x directly to index unique indices. Capping the host-side grid without first adding a grid-stride loop would silently drop work.

Changes:

  • Add const int num_unique_indices as a new kernel parameter.
  • Convert kernel to a grid-stride loop over u = blockIdx.x; u < num_unique_indices; u += gridDim.x (Pattern C). All blockIdx.x references replaced with u. Hoist start_D and has_remainder outside the loop since they depend only on blockIdx.y / threadIdx.x.
  • RESET per-iteration register state at the top of each iteration: sum[MAX_ELEMENTS_PER_THREAD] re-zeroed and sum_remainder = 0.
  • Apply standard #ifdef USE_ROCM min(blocks_x_uncapped, get_max_thread_blocks(stream)) #else blocks_x_uncapped #endif cap to the x-dim of the launch grid. y dim is bounded by D/stride_D and needs no cap.

Stacked on top of D105029028 (Tier-2 Diff 5/7). Plan:
/home/bensonma415/.llms/plans/sparse_ops_rocm_grid_overflow_tier2_fix.plan.md (Diff 6/7).

Reviewed By: henrylhtsang

Differential Revision: D105029511

@meta-codesync

meta-codesync Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105029511.

@meta-codesync meta-codesync Bot changed the title Add grid-stride loop and ROCm cap to index_add_2d_with_unique_indices_kernel Add grid-stride loop and ROCm cap to index_add_2d_with_unique_indices_kernel (#5934) Jun 21, 2026
q10 added a commit to q10/FBGEMM that referenced this pull request Jun 21, 2026
…_kernel (pytorch#5934)

Summary:

X-link: facebookresearch/FBGEMM#2852

Tier-2 fix for HIP grid-overflow in `sparse_ops/sparse_index_add.cu`.

`index_add_2d_with_unique_indices_kernel` previously used `blockIdx.x` directly to index unique indices. Capping the host-side grid without first adding a grid-stride loop would silently drop work.

Changes:
- Add `const int num_unique_indices` as a new kernel parameter.
- Convert kernel to a grid-stride loop over `u = blockIdx.x; u < num_unique_indices; u += gridDim.x` (Pattern C). All `blockIdx.x` references replaced with `u`. Hoist `start_D` and `has_remainder` outside the loop since they depend only on `blockIdx.y` / `threadIdx.x`.
- RESET per-iteration register state at the top of each iteration: `sum[MAX_ELEMENTS_PER_THREAD]` re-zeroed and `sum_remainder = 0`.
- Apply standard `#ifdef USE_ROCM min(blocks_x_uncapped, get_max_thread_blocks(stream)) #else blocks_x_uncapped #endif` cap to the x-dim of the launch grid. y dim is bounded by D/stride_D and needs no cap.

Stacked on top of D105029028 (Tier-2 Diff 5/7). Plan:
`/home/bensonma415/.llms/plans/sparse_ops_rocm_grid_overflow_tier2_fix.plan.md` (Diff 6/7).

Reviewed By: henrylhtsang

Differential Revision: D105029511
@q10 q10 force-pushed the export-D105029511 branch from 60bd207 to 768131a Compare June 21, 2026 23:18
…_kernel (pytorch#5934)

Summary:

X-link: facebookresearch/FBGEMM#2852

Tier-2 fix for HIP grid-overflow in `sparse_ops/sparse_index_add.cu`.

`index_add_2d_with_unique_indices_kernel` previously used `blockIdx.x` directly to index unique indices. Capping the host-side grid without first adding a grid-stride loop would silently drop work.

Changes:
- Add `const int num_unique_indices` as a new kernel parameter.
- Convert kernel to a grid-stride loop over `u = blockIdx.x; u < num_unique_indices; u += gridDim.x` (Pattern C). All `blockIdx.x` references replaced with `u`. Hoist `start_D` and `has_remainder` outside the loop since they depend only on `blockIdx.y` / `threadIdx.x`.
- RESET per-iteration register state at the top of each iteration: `sum[MAX_ELEMENTS_PER_THREAD]` re-zeroed and `sum_remainder = 0`.
- Apply standard `#ifdef USE_ROCM min(blocks_x_uncapped, get_max_thread_blocks(stream)) #else blocks_x_uncapped #endif` cap to the x-dim of the launch grid. y dim is bounded by D/stride_D and needs no cap.

Stacked on top of D105029028 (Tier-2 Diff 5/7). Plan:
`/home/bensonma415/.llms/plans/sparse_ops_rocm_grid_overflow_tier2_fix.plan.md` (Diff 6/7).

Reviewed By: henrylhtsang

Differential Revision: D105029511
@q10 q10 force-pushed the export-D105029511 branch from 768131a to a9e0cf2 Compare June 22, 2026 18:09
@meta-codesync meta-codesync Bot closed this in fa211b0 Jun 23, 2026
@meta-codesync

meta-codesync Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

This pull request has been merged in fa211b0.

@meta-codesync meta-codesync Bot added the Merged label Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant