Current non-matmul schedulers, except the inner-outer persistent scheduler, still rely on the classical multi-wave approach.
A TMA (Tensor Memory Accelerator) + Warp-Specialization variant should be developed for these schedulers, since that combination has already demonstrated substantial speedups in the inner-outer persistent scheduler.
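
For readers unfamiliar with the terminology, the sketch below contrasts the launch geometry of a multi-wave kernel with that of a persistent kernel. It is only an illustration under stated assumptions: "multi-wave" is taken to mean one thread block per tile, with blocks beyond the device's resident capacity running in later waves, and "persistent" is taken to mean a grid sized to the resident capacity, with each block looping over several tiles. The function names and the `resident_blocks` parameter are hypothetical and are not part of any scheduler API; TMA loads and warp specialization are not modeled here.

```python
from math import ceil


def multi_wave_schedule(num_tiles, resident_blocks):
    """Hypothetical model: one block per tile, executed in successive waves.

    resident_blocks is an assumed figure for how many blocks the device
    can keep resident at once (it must be positive).
    """
    return {
        "grid_size": num_tiles,
        "waves": ceil(num_tiles / resident_blocks),
    }


def persistent_schedule(num_tiles, resident_blocks):
    """Hypothetical model: launch at most resident_blocks blocks and let
    each block stride over the tiles, so the whole grid fits in one wave."""
    grid_size = min(num_tiles, resident_blocks)
    # Block b handles tiles b, b + grid_size, b + 2 * grid_size, ...
    tiles_per_block = [
        list(range(b, num_tiles, grid_size)) for b in range(grid_size)
    ]
    return {"grid_size": grid_size, "tiles_per_block": tiles_per_block}


if __name__ == "__main__":
    print(multi_wave_schedule(num_tiles=10, resident_blocks=4))
    print(persistent_schedule(num_tiles=10, resident_blocks=4))
```

In the persistent layout each block stays alive across tiles, which is what would let a dedicated load warp (for example, one issuing TMA copies) prefetch the next tile while compute warps work on the current one.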