This PR:
Note: I currently do not have access to the target hardware to verify correctness, so I am testing it here with CI.
Implementation Details
Instead of introducing an explicit multi_buffer_count argument, this PR leverages the existing scheduling infrastructure to achieve triple buffering natively.
By inserting an empty pipeline stage between the global loads and the compute stage, we increase the pipeline depth without increasing the initiation interval. This extends the lifetime of the loaded data across one extra iteration, which forces the compiler's buffer analysis to allocate a third buffer to prevent hazards. This approach was preferred because it aligns with the current stage-centric loop construction logic and avoids the larger changes needed to support an explicit buffer count parameter.
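To illustrate why the extra stage leads to a third buffer, here is a toy model, not the compiler's actual buffer analysis. It assumes each pipeline stage takes exactly one initiation interval and that a buffer stays live from its load stage through its compute stage. The function name `buffers_needed` and its parameters are hypothetical and do not exist in this codebase.

```python
def buffers_needed(load_stage: int, compute_stage: int, num_iterations: int = 16) -> int:
    """Return the peak number of iterations whose loaded data is live at once.

    Toy model (assumption): iteration `it` executes stage `s` at time step
    `it + s`, so its buffer is live for steps it + load_stage .. it + compute_stage.
    """
    if compute_stage < load_stage:
        raise ValueError("compute stage must not precede the load stage")

    max_live = 0
    last_step = num_iterations - 1 + compute_stage
    for step in range(last_step + 1):
        # Count iterations that have loaded their data but not yet finished computing.
        live = sum(
            1
            for it in range(num_iterations)
            if it + load_stage <= step <= it + compute_stage
        )
        max_live = max(max_live, live)
    return max_live


# Load in stage 0, compute in stage 1: classic double buffering.
double = buffers_needed(load_stage=0, compute_stage=1)

# Empty stage inserted at stage 1, so compute moves to stage 2.
triple = buffers_needed(load_stage=0, compute_stage=2)
```

Under these assumptions, moving the compute stage one stage later raises the peak number of live buffers from two to three, while the loop still starts one new iteration per step.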
loop_reconstruction.py bug and fix:
After the kernel yields its final iteration, the buffer list has been rotated, so the epilogue extracts the buffers in rotated order. The original code did not account for this rotation, causing a mismatch. Solution: apply a -1 rotation to compensate for how the buffers were ordered when yielded, so the epilogue reads them in the correct order.
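The compensation can be sketched with plain lists. This is only an illustration of the idea, not the code in loop_reconstruction.py: the `rotate` helper and the buffer names are hypothetical, and the sketch assumes the yield step rotates the list by +1 (the actual direction and amount depend on the real loop construction).

```python
from typing import List, TypeVar

T = TypeVar("T")


def rotate(items: List[T], k: int) -> List[T]:
    """Return a copy of `items` rotated left by k positions (negative k rotates right)."""
    if not items:
        return []
    k %= len(items)
    return items[k:] + items[:k]


# Buffer order as the epilogue expects it (hypothetical names).
buffers = ["buf0", "buf1", "buf2"]

# Assumption: the kernel hands the list back rotated by +1 after its final iteration.
yielded = rotate(buffers, 1)

# Reading `yielded` directly would pair each epilogue stage with the wrong buffer.
# Applying a -1 rotation restores the expected order.
restored = rotate(yielded, -1)
```

Here `restored` matches the original order, which is the property the epilogue relies on.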
Next
Add logic to verify that the pipeline stages defined in the schedule do not exceed the device's shared memory capacity
Add lit_test