Proposal
Support context parallelism for all linear attention models.
Rationale
One of the major advantages of linear attention is that it enables long-sequence modeling. However, during training and prefilling, a single GPU often lacks sufficient memory to process the entire input, making context parallelism essential.
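To make the idea concrete, below is a minimal sketch of one possible scheme: each rank holds a contiguous chunk of the sequence, computes chunkwise causal linear attention locally, and passes its final recurrent state to the next rank. This assumes plain (non-decayed) linear attention, an already-initialized `torch.distributed` process group, and illustrative function names; it is not an existing API in this repo.

```python
# Sketch of sequence-sequential context parallelism for causal linear attention.
# Assumes torch.distributed is initialized and each rank owns one sequence chunk.
import torch
import torch.distributed as dist


def chunk_linear_attn(q, k, v, h0):
    # q, k, v: [B, T_local, D]; h0: [B, D, D] recurrent state from earlier tokens.
    # Inter-chunk term uses the incoming state; intra-chunk term is causally masked.
    T = q.shape[1]
    mask = torch.tril(torch.ones(T, T, device=q.device, dtype=torch.bool))
    attn = (q @ k.transpose(-1, -2)).masked_fill(~mask, 0)
    o = attn @ v + q @ h0
    ht = h0 + k.transpose(-1, -2) @ v  # updated state handed to the next rank
    return o, ht


def context_parallel_forward(q, k, v):
    # q, k, v are this rank's local chunk of the full sequence.
    rank, world = dist.get_rank(), dist.get_world_size()
    B, _, D = q.shape
    h0 = torch.zeros(B, D, D, device=q.device, dtype=q.dtype)
    if rank > 0:
        dist.recv(h0, src=rank - 1)          # state summarizing all earlier tokens
    o, ht = chunk_linear_attn(q, k, v, h0)   # process the local chunk
    if rank < world - 1:
        dist.send(ht.contiguous(), dst=rank + 1)
    return o
```

This naive schedule serializes the ranks along the sequence dimension; a practical implementation would overlap the state exchange with local computation and handle decayed/gated variants, but the key point is that only the small [D, D] state crosses device boundaries, not the full KV sequence.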