Skip to content

gpu - pad out elem loop for shared/gen#1950

Merged
jeremylt merged 7 commits intomainfrom
jeremy/hip-all-elems
Apr 15, 2026
Merged

gpu - pad out elem loop for shared/gen#1950
jeremylt merged 7 commits intomainfrom
jeremy/hip-all-elems

Conversation

@jeremylt
Copy link
Copy Markdown
Member

@jeremylt jeremylt commented Apr 9, 2026

Purpose:

Ensure all threads hit all syncthreads() for #1942

Closes: #N/A

LLM/GenAI Disclosure:

None

By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.

@jeremylt jeremylt self-assigned this Apr 9, 2026
@jeremylt jeremylt force-pushed the jeremy/hip-all-elems branch 2 times, most recently from 319d781 to 803d69e Compare April 9, 2026 19:23
@nbeams
Copy link
Copy Markdown
Contributor

nbeams commented Apr 9, 2026

I don't know about shared, but for hip-gen, I looked into this awhile back when we first started testing chipStar (so long ago it still had a different name...). I investigated some different kernel options; IIRC none of them were identical to what you've done here, but I think one was very similar (adding checks around the element restriction to avoid trying to read/write out of memory and removing the current loop bounds). When running the hip-gen backend on AMD hardware/with hiprtc, it caused some pretty large performance losses for the Poisson operator, especially as I increased basis function order. The compiler output showed increased register usage which led to a drop in occupancy. I have some old notes I could dig up from a past quarterly report with the actual numbers.

Anyway, I would recommend some updated performance testing before merging this for all backends. In case it's still an issue, would there be a way to know that the kernel will be built with chipStar and only add the element check in that case?

@jeremylt jeremylt force-pushed the jeremy/hip-all-elems branch from 803d69e to dbffd6c Compare April 9, 2026 19:54
@jeremylt
Copy link
Copy Markdown
Member Author

jeremylt commented Apr 9, 2026

These changes should not have any effect on register pressure I don't think? Here I am keeping the same strategy we currently have but making sure every thread is working during the last block of elements by padding with valid dummy data

@nbeams
Copy link
Copy Markdown
Contributor

nbeams commented Apr 9, 2026

Oh, I guess I didn't look closely enough at the code here. I also tried a version that had any "leftover" threads doing a dummy read/write, though I think they were all reading from the same (valid) element rather than padded data (which could definitely affect things). Anyway, it also had performance drops over what we currently had in hip-gen.

Just a warning since I didn't expect the perf drops I saw before I did the tests, either. It's not exactly the same code and of course hiprtc has changed since then, so no idea if it will be a problem, but I'd still recommend checking just to be sure before merging.

@jeremylt
Copy link
Copy Markdown
Member Author

jeremylt commented Apr 9, 2026

For sure. If we see a performance difference, then I think the way to go for ChipStar would be to make chipstar backends /gpu/hip/chipstar/shared and /gpu/hip/chipstar/gen that delegate back to the current shared/gen code and that code would check the resource string for the root /gpu/hip/chipstar to determine if it needs to do the padding elements

@pvelesko
Copy link
Copy Markdown
Contributor

The elem_loop_bound formula has a bug when stride > num_elem (last block has more threads than remaining elements):

Example (t314-basis on /gpu/hip/shared, p=8 q=10):

num_elem = 63, blockDim.z = 64, gridDim.x = 1
stride = 64
elem_loop_bound = 63 * ceil(63/64) = 63 * 1 = 63
Thread 63: e=63, 63 < 63 → false → skips loop → misses __syncthreads() inside ContractX1d → deadlock

Fix — multiply by stride, not num_elem:

const CeedInt stride          = gridDim.x * blockDim.z;
const CeedInt elem_loop_bound = stride * ((num_elem + stride - 1) / stride);

With the wrong formula, t314-basis and t316-basis on /gpu/hip/shared deadlock and return hipErrorOutOfMemory on chipStar (that error code is how chipStar signals a workgroup barrier deadlock). With the corrected formula they pass.

There is also a typo in backends/hip-gen/ceed-hip-gen-operator-build.cpp in the CEED_RESTRICTION_STRIDED case — <\n should be {\n:

code << "if (e < num_elem) <\n"   // typo: < should be {

@jeremylt
Copy link
Copy Markdown
Member Author

jeremylt commented Apr 13, 2026

That's not quite the correct fix - it's logically inconsistent with what the word stride means in the codebase. But now that I see where the issue is I can create in the a fix. Thanks

Comment thread backends/hip-gen/ceed-hip-gen-operator-build.cpp Outdated
jeremylt and others added 2 commits April 13, 2026 10:18
Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>
@jeremylt jeremylt force-pushed the jeremy/hip-all-elems branch from d307568 to 446df38 Compare April 13, 2026 16:43
@jeremylt jeremylt force-pushed the jeremy/hip-all-elems branch from 46fe3a3 to f70b67d Compare April 13, 2026 17:09
Comment thread backends/hip-gen/ceed-hip-gen-operator-build.cpp
@jeremylt
Copy link
Copy Markdown
Member Author

@pvelesko can you confirm these changes do what you need? If not then I can merge and there's just a couple of small tweaks I'd like to request for your branch

Comment thread include/ceed/jit-source/hip/hip-shared-basis-nontensor.h
@pvelesko
Copy link
Copy Markdown
Contributor

@pvelesko can you confirm these changes do what you need? If not then I can merge and there's just a couple of small tweaks I'd like to request for your branch

Yes all tests are passing after rebasing my PR on top of this one.

@jeremylt jeremylt merged commit 59b5803 into main Apr 15, 2026
30 checks passed
@jeremylt jeremylt deleted the jeremy/hip-all-elems branch April 15, 2026 15:29
wostrie2 pushed a commit to wostrie2/libCEED that referenced this pull request Apr 19, 2026
* gpu - pad out elem loop for shared/gen

* typo - fix bad copypasta

Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>

* cuda - don't padd threads on CUDA

* hip - fix element loop bound

* hip - set Chipstar modifications off by default

* hip - comment on logic

* hip - move chipstar jit macro definition

---------

Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants