gpu - pad out elem loop for shared/gen#1950

Open
jeremylt wants to merge 1 commit into main from jeremy/hip-all-elems
Conversation

@jeremylt
Member

@jeremylt jeremylt commented Apr 9, 2026

Purpose:

Ensure all threads hit all `__syncthreads()` for #1942

Closes: #N/A

LLM/GenAI Disclosure:

None

By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.

@jeremylt jeremylt self-assigned this Apr 9, 2026
@jeremylt jeremylt force-pushed the jeremy/hip-all-elems branch 2 times, most recently from 319d781 to 803d69e Compare April 9, 2026 19:23
@nbeams
Contributor

nbeams commented Apr 9, 2026

I don't know about shared, but for hip-gen, I looked into this a while back when we first started testing chipStar (so long ago it still had a different name...). I investigated some different kernel options; IIRC none of them were identical to what you've done here, but I think one was very similar (adding checks around the element restriction to avoid trying to read/write out of memory, and removing the current loop bounds). When running the hip-gen backend on AMD hardware with hiprtc, it caused some pretty large performance losses for the Poisson operator, especially as I increased basis function order. The compiler output showed increased register usage, which led to a drop in occupancy. I have some old notes I could dig up from a past quarterly report with the actual numbers.

Anyway, I would recommend some updated performance testing before merging this for all backends. In case it's still an issue, would there be a way to know that the kernel will be built with chipStar and only add the element check in that case?

@jeremylt jeremylt force-pushed the jeremy/hip-all-elems branch from 803d69e to dbffd6c Compare April 9, 2026 19:54
@jeremylt
Member Author

jeremylt commented Apr 9, 2026

I don't think these changes should have any effect on register pressure? Here I am keeping the same strategy we currently have, but making sure every thread is working during the last block of elements by padding with valid dummy data.

@nbeams
Contributor

nbeams commented Apr 9, 2026

Oh, I guess I didn't look closely enough at the code here. I also tried a version that had any "leftover" threads doing a dummy read/write, though I think they were all reading from the same (valid) element rather than from padded data (which could definitely affect things). Anyway, it also had performance drops compared to what we currently have in hip-gen.

Just a warning since I didn't expect the perf drops I saw before I did the tests, either. It's not exactly the same code and of course hiprtc has changed since then, so no idea if it will be a problem, but I'd still recommend checking just to be sure before merging.

@jeremylt
Member Author

jeremylt commented Apr 9, 2026

For sure. If we see a performance difference, then I think the way to go for chipStar would be to make chipStar backends /gpu/hip/chipstar/shared and /gpu/hip/chipstar/gen that delegate back to the current shared/gen code, and that code would check the resource string for the root /gpu/hip/chipstar to determine whether it needs the padding elements.
