When computing "large" blocks (Head_DIM=128, workgroup <128, 128, 64>), the CUTLASS FA implementation assigns the work to 16 subgroups arranged in an 8x2 shape.
In the Triton implementation, all subgroups are assigned along a single dimension, so with 16 subgroups the block is computed with a 16x1 subgroup shape.
We should therefore improve the Triton implementation and define a multi-dimensional subgroup shape when needed.
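For illustration, here is a small Python sketch (not Triton code; the block and subgroup sizes are the ones quoted above) that compares the per-subgroup tile produced by each layout for the <128, 128, 64> workgroup block:

```python
# Hypothetical illustration of the per-subgroup tile each subgroup layout
# produces for a <BLOCK_M, BLOCK_N, HEAD_DIM> = <128, 128, 64> workgroup block
# split across 16 subgroups (sizes taken from the issue description).

BLOCK_M, BLOCK_N, HEAD_DIM = 128, 128, 64
NUM_SUBGROUPS = 16

def per_subgroup_tile(layout_m: int, layout_n: int) -> tuple[int, int, int]:
    """Tile of the workgroup block handled by one subgroup in a layout_m x layout_n grid."""
    assert layout_m * layout_n == NUM_SUBGROUPS
    return BLOCK_M // layout_m, BLOCK_N // layout_n, HEAD_DIM

print("CUTLASS-style 8x2 layout: ", per_subgroup_tile(8, 2))   # (16, 64, 64)
print("Current Triton 16x1 layout:", per_subgroup_tile(16, 1))  # (8, 128, 64)
```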
The CUTLASS implementation ensures that each subgroup processes data of size <16, 64, 64>. Within a workgroup, subgroups are therefore distributed across dimensions so as to maintain this per-subgroup data shape.
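A minimal sketch of how a multi-dimensional subgroup grid could be derived from that target per-subgroup shape, assuming the workgroup block sizes divide evenly (the function and constants are illustrative, not actual CUTLASS or Triton API):

```python
# Hypothetical derivation of the subgroup grid from a target per-subgroup
# tile of <16, 64, 64>, following the invariant described in the comment above.

TARGET_TILE_M, TARGET_TILE_N = 16, 64

def subgroup_layout(block_m: int, block_n: int) -> tuple[int, int]:
    """Subgroup grid (along M, along N) that keeps each subgroup's tile at the target size."""
    assert block_m % TARGET_TILE_M == 0 and block_n % TARGET_TILE_N == 0
    return block_m // TARGET_TILE_M, block_n // TARGET_TILE_N

# A <128, 128, 64> workgroup block yields an 8x2 grid of 16 subgroups.
print(subgroup_layout(128, 128))  # (8, 2)
```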