gpu - pad out elem loop for shared/gen by jeremylt · Pull Request #1950 · CEED/libCEED

jeremylt · 2026-04-09T19:09:40Z

Purpose:

Ensure all threads reach every __syncthreads() call, as needed for #1942

Closes: #N/A

LLM/GenAI Disclosure:

None

By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.

nbeams · 2026-04-09T19:51:45Z

I don't know about shared, but for hip-gen, I looked into this a while back when we first started testing chipStar (so long ago it still had a different name...). I investigated some different kernel options; IIRC none of them were identical to what you've done here, but I think one was very similar (adding checks around the element restriction to avoid reading or writing out of bounds, and removing the current loop bounds). When running the hip-gen backend on AMD hardware with hiprtc, it caused some pretty large performance losses for the Poisson operator, especially as I increased the basis function order. The compiler output showed increased register usage, which led to a drop in occupancy. I have some old notes I could dig up from a past quarterly report with the actual numbers.

Anyway, I would recommend some updated performance testing before merging this for all backends. In case it's still an issue, would there be a way to know that the kernel will be built with chipStar and only add the element check in that case?

jeremylt · 2026-04-09T19:56:13Z

I don't think these changes should have any effect on register pressure. I am keeping the same strategy we currently have, but making sure every thread is working during the last block of elements by padding with valid dummy data.
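
As a rough illustration of the padding idea in this comment, here is a host-side C++ model of one thread block. It is not libCEED code: `ScaleElements`, its parameters, and the trivial "multiply by two" body are hypothetical stand-ins. The sketch assumes a single block, so the stride equals the number of thread slots. Every slot runs every round, padded slots work on dummy values, and only real elements write back.

```cpp
#include <cassert>
#include <vector>

// Host-side model of one thread block (hypothetical names, not libCEED code).
// Thread slot t handles elements t, t + stride, t + 2 * stride, ...
// Every slot runs every round, so a barrier placed in the loop body would be
// reached by all slots, including those past the last real element.
std::vector<double> ScaleElements(const std::vector<double> &u, int num_threads, int *rounds) {
  assert(num_threads > 0 && rounds != nullptr);
  const int num_elem   = static_cast<int>(u.size());
  const int stride     = num_threads;  // assumption: a single block
  const int num_rounds = (num_elem + stride - 1) / stride;
  std::vector<double> v(u.size(), 0.0);

  for (int r = 0; r < num_rounds; r++) {
    for (int t = 0; t < num_threads; t++) {
      const int  e       = r * stride + t;
      const bool is_real = e < num_elem;
      const double r_u = is_real ? u[e] : 0.0;  // padded slots read dummy data
      const double r_v = 2.0 * r_u;             // a body containing barriers would go here
      if (is_real) v[e] = r_v;                  // only real elements write back
    }
  }
  *rounds = num_rounds;
  return v;
}
```

With three elements and two thread slots, the model runs two rounds; the second slot processes dummy data in the last round and writes nothing.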

nbeams · 2026-04-09T20:11:03Z

Oh, I guess I didn't look closely enough at the code here. I also tried a version that had any "leftover" threads doing a dummy read/write, though I think they were all reading from the same (valid) element rather than padded data (which could definitely affect things). Anyway, it also had performance drops over what we currently had in hip-gen.

Just a warning since I didn't expect the perf drops I saw before I did the tests, either. It's not exactly the same code and of course hiprtc has changed since then, so no idea if it will be a problem, but I'd still recommend checking just to be sure before merging.

jeremylt · 2026-04-09T20:16:00Z

For sure. If we see a performance difference, then I think the way to go for chipStar would be to add backends /gpu/hip/chipstar/shared and /gpu/hip/chipstar/gen that delegate back to the current shared/gen code. That code would then check the resource string for the root /gpu/hip/chipstar to determine whether it needs to pad the elements.
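
A minimal sketch of the resource-string check proposed here, assuming the resource is available as a plain C string. `IsChipStarResource` is a hypothetical helper, not an existing libCEED function, and accepting a `:` after the root is an assumption that options may follow the resource name. The commit list for this PR suggests the merged change ended up using a JIT macro instead, so this only illustrates the alternative discussed in this comment.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true when the resource string is rooted at /gpu/hip/chipstar.
bool IsChipStarResource(const char *resource) {
  assert(resource != nullptr);
  static const char root[] = "/gpu/hip/chipstar";
  const std::size_t len    = sizeof(root) - 1;

  if (std::strncmp(resource, root, len) != 0) return false;
  // Accept the root itself or a sub-resource such as /gpu/hip/chipstar/shared;
  // reject names that merely share the prefix, such as /gpu/hip/chipstarx.
  const char next = resource[len];
  return next == '\0' || next == '/' || next == ':';
}
```

The shared/gen code could call such a check once at setup and enable the padded element loop only when it returns true.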

pvelesko · 2026-04-13T10:13:22Z

The elem_loop_bound formula has a bug when stride > num_elem (last block has more threads than remaining elements):

Example (t314-basis on /gpu/hip/shared, p=8 q=10):

num_elem = 63, blockDim.z = 64, gridDim.x = 1
stride = 64
elem_loop_bound = 63 * ceil(63/64) = 63 * 1 = 63
Thread 63: e=63, 63 < 63 → false → skips loop → misses __syncthreads() inside ContractX1d → deadlock

Fix — multiply by stride, not num_elem:

const CeedInt stride          = gridDim.x * blockDim.z;
const CeedInt elem_loop_bound = stride * ((num_elem + stride - 1) / stride);

With the wrong formula, t314-basis and t316-basis on /gpu/hip/shared deadlock and return hipErrorOutOfMemory on chipStar (that error code is how chipStar signals a workgroup barrier deadlock). With the corrected formula they pass.
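
The arithmetic in the example above can be checked with a small host-side C++ sketch. The two bound formulas are taken from this comment; the helper names (`CeilDiv`, `BoundAsReported`, `BoundProposed`, `TripCount`, `AllThreadsAgree`) are hypothetical, and the loop shape `for (e = e0; e < bound; e += stride)` is an assumption about how the kernels iterate over elements. The fix that was eventually merged may use a different expression.

```cpp
#include <cassert>

// Integer ceiling division for positive divisors.
int CeilDiv(int a, int b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Bound as described in the example: num_elem * ceil(num_elem / stride).
int BoundAsReported(int num_elem, int stride) { return num_elem * CeilDiv(num_elem, stride); }

// Bound proposed in this comment: stride * ceil(num_elem / stride).
int BoundProposed(int num_elem, int stride) { return stride * CeilDiv(num_elem, stride); }

// Number of loop iterations for a thread whose first element index is e0.
int TripCount(int e0, int bound, int stride) {
  int trips = 0;
  for (int e = e0; e < bound; e += stride) trips++;
  return trips;
}

// True when every thread slot 0..stride-1 performs the same number of
// iterations, i.e. all slots would reach a barrier inside the loop equally often.
bool AllThreadsAgree(int bound, int stride) {
  const int expected = TripCount(0, bound, stride);
  for (int e0 = 1; e0 < stride; e0++) {
    if (TripCount(e0, bound, stride) != expected) return false;
  }
  return true;
}
```

For num_elem = 63 and stride = 64, `BoundAsReported` gives 63, so the slot starting at e = 63 never enters the loop, while `BoundProposed` gives 64 and every slot performs exactly one iteration.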

There is also a typo in backends/hip-gen/ceed-hip-gen-operator-build.cpp in the CEED_RESTRICTION_STRIDED case — <\n should be {\n:

code << "if (e < num_elem) <\n"   // typo: < should be {

jeremylt · 2026-04-13T11:15:56Z

That's not quite the correct fix; it's logically inconsistent with what the word stride means in the codebase. But now that I see where the issue is, I can create a fix. Thanks.

Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>

jeremylt · 2026-04-15T00:51:34Z

@pvelesko can you confirm these changes do what you need? If not then I can merge and there's just a couple of small tweaks I'd like to request for your branch

pvelesko · 2026-04-15T12:55:23Z

@pvelesko can you confirm these changes do what you need? If not then I can merge and there's just a couple of small tweaks I'd like to request for your branch

Yes, all tests are passing after rebasing my PR on top of this one.

* gpu - pad out elem loop for shared/gen
* typo - fix bad copypasta
  Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>
* cuda - don't padd threads on CUDA
* hip - fix element loop bound
* hip - set Chipstar modifications off by default
* hip - comment on logic
* hip - move chipstar jit macro definition

---------

Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>

jeremylt self-assigned this Apr 9, 2026

jeremylt added GPU CUDA 0-WIP HIP labels Apr 9, 2026

jeremylt force-pushed the jeremy/hip-all-elems branch 2 times, most recently from 319d781 to 803d69e, April 9, 2026 19:23

jeremylt mentioned this pull request Apr 9, 2026

Add chipStar (SPIR-V) support for HIP backends #1942

Merged

gpu - pad out elem loop for shared/gen

dbffd6c

jeremylt force-pushed the jeremy/hip-all-elems branch from 803d69e to dbffd6c, April 9, 2026 19:54

zatkins-dev reviewed Apr 13, 2026

Comment thread backends/hip-gen/ceed-hip-gen-operator-build.cpp Outdated

jeremylt and others added 2 commits April 13, 2026 10:18

typo - fix bad copypasta

4283196

Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>

cuda - don't padd threads on CUDA

bd3602b

jeremylt added 1-In Review and removed 0-WIP labels Apr 13, 2026

hip - fix element loop bound

446df38

jeremylt force-pushed the jeremy/hip-all-elems branch from d307568 to 446df38, April 13, 2026 16:43

hip - set Chipstar modifications off by default

f70b67d

jeremylt force-pushed the jeremy/hip-all-elems branch from 46fe3a3 to f70b67d, April 13, 2026 17:09

hip - comment on logic

7808f38

jeremylt commented Apr 13, 2026

Comment thread backends/hip-gen/ceed-hip-gen-operator-build.cpp

pvelesko reviewed Apr 15, 2026

Comment thread include/ceed/jit-source/hip/hip-shared-basis-nontensor.h

hip - move chipstar jit macro definition

132eca7

jeremylt merged commit 59b5803 into main Apr 15, 2026
30 checks passed

jeremylt deleted the jeremy/hip-all-elems branch April 15, 2026 15:29

jeremylt merged 7 commits into main from jeremy/hip-all-elems

4 participants
