Proposal
Support context parallelism for all linear attention models.
Rationale
One of the major advantages of linear attention is that it enables long-sequence modeling. However, during training and prefilling, a single GPU often lacks sufficient memory to process the entire input, making context parallelism essential.
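To make the idea concrete, below is a minimal sketch of one possible scheme: each rank holds a contiguous chunk of the sequence, computes chunkwise causal linear attention locally, and passes its final recurrent state to the next rank. This assumes plain (non-decayed) linear attention, an already-initialized `torch.distributed` process group, and illustrative function names; it is not an existing API in this repo.

```python
# Sketch of sequence-sequential context parallelism for causal linear attention.
# Assumes torch.distributed is initialized and each rank owns one sequence chunk.
import torch
import torch.distributed as dist


def chunk_linear_attn(q, k, v, h0):
    # q, k, v: [B, T_local, D]; h0: [B, D, D] recurrent state from earlier tokens.
    # Inter-chunk term uses the incoming state; intra-chunk term is causally masked.
    T = q.shape[1]
    mask = torch.tril(torch.ones(T, T, device=q.device, dtype=torch.bool))
    attn = (q @ k.transpose(-1, -2)).masked_fill(~mask, 0)
    o = attn @ v + q @ h0
    ht = h0 + k.transpose(-1, -2) @ v  # updated state handed to the next rank
    return o, ht


def context_parallel_forward(q, k, v):
    # q, k, v are this rank's local chunk of the full sequence.
    rank, world = dist.get_rank(), dist.get_world_size()
    B, _, D = q.shape
    h0 = torch.zeros(B, D, D, device=q.device, dtype=q.dtype)
    if rank > 0:
        dist.recv(h0, src=rank - 1)          # state summarizing all earlier tokens
    o, ht = chunk_linear_attn(q, k, v, h0)   # process the local chunk
    if rank < world - 1:
        dist.send(ht.contiguous(), dst=rank + 1)
    return o
```

This naive schedule serializes the ranks along the sequence dimension; a practical implementation would overlap the state exchange with local computation and handle decayed/gated variants, but the key point is that only the small [D, D] state crosses device boundaries, not the full KV sequence.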