319d781 to 803d69e
|
I don't know about shared, but for Anyway, I would recommend some updated performance testing before merging this for all backends. In case it's still an issue, would there be a way to know that the kernel will be built with chipStar and only add the element check in that case? |
803d69e to dbffd6c
|
I don't think these changes should have any effect on register pressure. Here I am keeping the same strategy we currently have, but making sure every thread is working during the last block of elements by padding with valid dummy data. |
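For readers following along, here is a minimal sketch of the general idea, not the code in this PR. It assumes a hypothetical kernel `tile_neighbor_sum` in which threads past the end of the data clamp their read index to the last valid element, so every thread in a block runs the same number of loop iterations and reaches every `__syncthreads()`. Only the final store is guarded. The kernel name, the neighbor-sum operation, and the sizes are all made up for illustration.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[e] = in[e] + value loaded by the next thread in the tile.
// The shared-memory exchange needs a barrier, so all threads must take part.
__global__ void tile_neighbor_sum(const float *in, float *out, int num_elem) {
  extern __shared__ float tile[];
  const int tid = threadIdx.x;
  // Round the loop bound up to a multiple of the block size so the trip
  // count is uniform across the block, including the last partial tile.
  const int num_padded = ((num_elem + blockDim.x - 1) / blockDim.x) * blockDim.x;

  for (int base = blockIdx.x * blockDim.x; base < num_padded;
       base += gridDim.x * blockDim.x) {
    const int e = base + tid;
    // "Leftover" threads read a valid dummy element instead of exiting early.
    const int e_safe = min(e, num_elem - 1);
    tile[tid] = in[e_safe];
    __syncthreads();  // reached by every thread in the block
    const float sum = tile[tid] + tile[(tid + 1) % blockDim.x];
    __syncthreads();  // protects tile[] before the next iteration overwrites it
    if (e < num_elem) out[e] = sum;  // only real elements are written
  }
}

int main() {
  const int n = 10, block = 4, grid = 2;  // n is deliberately not a multiple of block
  std::vector<float> h_in(n), h_out(n, 0.0f);
  for (int i = 0; i < n; i++) h_in[i] = static_cast<float>(i);

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));
  cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  tile_neighbor_sum<<<grid, block, block * sizeof(float)>>>(d_in, d_out, n);
  cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

  // Host reference using the same clamping rule as the kernel.
  bool ok = true;
  for (int e = 0; e < n; e++) {
    const int base = (e / block) * block;
    const int nb = std::min(base + (e % block + 1) % block, n - 1);
    if (h_out[e] != h_in[e] + h_in[nb]) ok = false;
  }
  std::printf("%s\n", ok ? "match" : "mismatch");

  cudaFree(d_in);
  cudaFree(d_out);
  return ok ? 0 : 1;
}
```

The trade-off discussed in this thread still applies: the extra clamped reads are redundant work, so whether this costs anything measurable on a given backend would need to be checked with performance tests.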
|
Oh, I guess I didn't look closely enough at the code here. I also tried a version that had any "leftover" threads doing a dummy read/write, though I think they were all reading from the same (valid) element rather than padded data (which could definitely affect things). Anyway, it also had performance drops over what we currently had in Just a warning since I didn't expect the perf drops I saw before I did the tests, either. It's not exactly the same code and of course |
|
For sure. If we see a performance difference, then I think the way to go for chipStar would be to make chipStar backends
Purpose:
Ensure all threads hit all syncthreads() for #1942
Closes: #N/A
LLM/GenAI Disclosure:
None
By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.