[FA Performance] Optimize subgroup shape when `num-warps>8` #3740

mfrancepillois · 2025-03-24T12:29:33Z

When computing “large” blocks (Head_DIM=128, workgroup <128,128,64>), the CUTLASS FA implementation assigns work to 16 subgroups arranged in an 8x2 shape.

In the Triton implementation, all subgroups are assigned to the same dimension. Therefore, when the number of subgroups is 16, we compute a block with a 16x1 subgroup shape.

We should therefore try to improve the Triton implementation so that it defines a multi-dimensional subgroup shape when needed.

mfrancepillois · 2025-03-24T12:33:39Z

The CUTLASS implementation ensures that each subgroup processes data of size <16,64,64>. Within a workgroup, subgroups are therefore distributed across dimensions so that this per-subgroup data shape is maintained.
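To make the arithmetic concrete, here is a minimal sketch of how the subgroup shape determines the per-subgroup data shape. It assumes the workgroup triple is ordered <M, N, K>, that an 8x2 subgroup shape splits M by 8 and N by 2, and that K is not split across subgroups. The helper names `per_subgroup_tile` and `layout_for_target` are hypothetical and do not come from CUTLASS or Triton.

```python
def per_subgroup_tile(workgroup_tile, subgroup_shape):
    """Return the (M, N, K) tile each subgroup handles.

    Assumption: only M and N are divided among subgroups; K is left whole.
    """
    m, n, k = workgroup_tile
    sg_m, sg_n = subgroup_shape
    if m % sg_m or n % sg_n:
        raise ValueError("subgroup shape must evenly divide the workgroup tile")
    return (m // sg_m, n // sg_n, k)


def layout_for_target(workgroup_tile, target_tile):
    """Return the subgroup shape that yields the requested per-subgroup tile."""
    m, n, _ = workgroup_tile
    t_m, t_n, _ = target_tile
    if m % t_m or n % t_n:
        raise ValueError("target tile must evenly divide the workgroup tile")
    return (m // t_m, n // t_n)


workgroup = (128, 128, 64)

# 8x2 subgroup shape, as described for CUTLASS above.
print(per_subgroup_tile(workgroup, (8, 2)))        # (16, 64, 64)

# 16x1 subgroup shape, as described for Triton in this issue.
print(per_subgroup_tile(workgroup, (16, 1)))       # (8, 128, 64)

# Subgroup shape needed to keep <16,64,64> per subgroup.
print(layout_for_target(workgroup, (16, 64, 64)))  # (8, 2)
```

Under these assumptions, the 16x1 shape gives each subgroup an <8,128,64> tile instead of <16,64,64>, which is the difference this issue is about.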

mfrancepillois self-assigned this Mar 24, 2025

vlad-penkin added this to the 4. [Performance] Core milestone Mar 24, 2025

vlad-penkin added performance codegen: attention labels Mar 24, 2025
