gpu - pad out elem loop for shared/gen#1950
Conversation
Force-pushed from 319d781 to 803d69e.
I don't know about shared, but for [...]. Anyway, I would recommend some updated performance testing before merging this for all backends. In case it's still an issue, would there be a way to know that the kernel will be built with chipStar, and only add the element check in that case?
Force-pushed from 803d69e to dbffd6c.
I don't think these changes should have any effect on register pressure? Here I am keeping the same strategy we currently have, but making sure every thread is working during the last block of elements by padding with valid dummy data
Oh, I guess I didn't look closely enough at the code here. I also tried a version that had any "leftover" threads doing a dummy read/write, though I think they were all reading from the same (valid) element rather than padded data (which could definitely affect things). Anyway, it also had performance drops over what we currently had in [...]. Just a warning, since I didn't expect the perf drops I saw before I did the tests, either. It's not exactly the same code, and of course [...]
For sure. If we see a performance difference, then I think the way to go for ChipStar would be to make chipstar backends |
The element loop bound computation looks incorrect. Fix — multiply by the stride after rounding up:

```cpp
const CeedInt stride          = gridDim.x * blockDim.z;
const CeedInt elem_loop_bound = stride * ((num_elem + stride - 1) / stride);
```

With the wrong formula, threads in the final partial block can exit the element loop early and miss the `syncthreads()` calls. There is also a typo in the generated code:

```cpp
code << "if (e < num_elem) <\n"  // typo: the trailing < should be {
```
That's not quite the correct fix - it's logically inconsistent with what the word stride means in the codebase. But now that I see where the issue is, I can create a fix. Thanks
Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>
Force-pushed from d307568 to 446df38.
Force-pushed from 46fe3a3 to f70b67d.
@pvelesko can you confirm these changes do what you need? If so, then I can merge, and there are just a couple of small tweaks I'd like to request for your branch
Yes all tests are passing after rebasing my PR on top of this one. |
* gpu - pad out elem loop for shared/gen
* typo - fix bad copypasta
  Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>
* cuda - don't padd threads on CUDA
* hip - fix element loop bound
* hip - set Chipstar modifications off by default
* hip - comment on logic
* hip - move chipstar jit macro definition

Co-authored-by: Zach Atkins <zach.atkins@colorado.edu>
Purpose:
Ensure all threads hit all `syncthreads()` for #1942
Closes: #N/A
LLM/GenAI Disclosure:
None
By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.