use ck mha instead of triton unified_attention for sink and window #118
base: main
Conversation
Pull request overview
This PR transitions from Triton's unified_attention to CK (Composable Kernel) MHA for handling sink and sliding window attention patterns. The changes remove temporary workarounds that were previously needed for GPT-OSS model compatibility.
Key changes:
- Switches prefill attention to always use CK MHA (flash_attn_varlen_func) with native sink and sliding window support
- Removes the ATOM_GPT_OSS_MODEL environment variable and related temporary workarounds
- Eliminates fake block table generation logic that was needed for Triton's unified_attention
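For context, here is a minimal eager-mode sketch of the attention pattern in question: causal attention with an optional sliding window and GPT-OSS-style per-head sink logits. This is an illustrative reference in plain PyTorch, not the CK kernel or this repository's code, and the kernel's exact window and sink conventions may differ.

```python
import torch

def sdpa_with_sink_and_window(q, k, v, sinks=None, sliding_window=-1):
    """Reference semantics of the pattern the CK MHA kernel handles natively:
    causal attention with an optional sliding window and an optional per-head
    "sink" logit that joins the softmax but contributes no value.
    Shapes: q, k, v are [heads, seq, dim]; sinks is [heads] or None."""
    h, s, d = q.shape
    scores = (q @ k.transpose(-1, -2)) / d**0.5        # [h, s, s]
    i = torch.arange(s).unsqueeze(-1)                   # query positions
    j = torch.arange(s).unsqueeze(0)                    # key positions
    mask = j > i                                        # causal: no future keys
    if sliding_window != -1:                            # window: only the last W keys
        mask |= j < i - sliding_window + 1
    scores = scores.masked_fill(mask, float("-inf"))
    if sinks is not None:
        # Append one sink logit per head; it absorbs probability mass but
        # contributes nothing to the output (its value is implicitly zero).
        sink_col = sinks.view(h, 1, 1).expand(h, s, 1)
        probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)[..., :-1]
    else:
        probs = torch.softmax(scores, dim=-1)
    return probs @ v
```

With sinks set, each row of the attention weights sums to less than one because the sink column absorbs part of the softmax mass, which is the intended effect of sink attention.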
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| atom/utils/forward_context.py | Removes GPT-OSS model environment variable usage and fake block table generation logic for prefill attention |
| atom/utils/envs.py | Removes ATOM_GPT_OSS_MODEL environment variable definition |
| atom/model_ops/attentions/backends.py | Removes block table preparation for sliding window during prefill phase |
| atom/model_ops/attention_mha.py | Adds window_size and sink_ptr parameters to CK MHA call; comments out Triton prefill path |
```python
# variable lenth attention use key value as input
attn_metadata = fwd_args.attn_metadata
sliding_window = (self.sliding_window, 0, 0) if self.sliding_window is not None else (-1, -1, 0)
```
Copilot AI commented on Jan 8, 2026:
The sliding_window logic here is inconsistent with the prefill_attention_triton implementation. In prefill_attention_triton (line 373), window_size is defined as (self.sliding_window - 1, 0), but here it is defined as (self.sliding_window, 0, 0), which could lead to different attention behavior between the two paths. Additionally, the condition checks whether self.sliding_window is not None, but at line 64 in __init__ self.sliding_window is set to -1 when None is passed, so the is-not-None check is always true and the (-1, -1, 0) fallback is never taken. The condition should check self.sliding_window != -1 instead.
Suggested change:

```diff
-sliding_window = (self.sliding_window, 0, 0) if self.sliding_window is not None else (-1, -1, 0)
+sliding_window = (self.sliding_window - 1, 0) if self.sliding_window != -1 else (-1, 0)
```
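To make the point concrete, here is a small self-contained sketch; the class and method names are hypothetical stand-ins that only mirror the normalization described above (None stored as -1 in __init__):

```python
class MHAExample:
    def __init__(self, sliding_window=None):
        # Assumed normalization: a missing window is stored as -1, never as None.
        self.sliding_window = sliding_window if sliding_window is not None else -1

    def window_tuple_current(self):
        # Current check: self.sliding_window is never None here, so the
        # (-1, -1, 0) fallback branch is unreachable.
        return (self.sliding_window, 0, 0) if self.sliding_window is not None else (-1, -1, 0)

    def window_tuple_suggested(self):
        # Suggested check: compare against the -1 sentinel instead, and use the
        # (left, right) convention that prefill_attention_triton uses.
        return (self.sliding_window - 1, 0) if self.sliding_window != -1 else (-1, 0)


attn = MHAExample(sliding_window=None)
print(attn.window_tuple_current())    # (-1, 0, 0): the fallback is never taken
print(attn.window_tuple_suggested())  # (-1, 0): window disabled as intended
```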
```python
# if self.use_triton_attn:
#     return self.prefill_attention_triton
# else:
```
Copilot AI commented on Jan 8, 2026:
The commented-out code for prefill_attention_triton creates maintainability issues. If this backend is truly no longer needed, it should be removed entirely rather than commented out. If it might be needed in the future, add a clear comment explaining why it's being kept and under what conditions it might be re-enabled. This helps prevent confusion and code rot.
Suggested change:

```diff
-# if self.use_triton_attn:
-#     return self.prefill_attention_triton
-# else:
+# Prefill currently always uses the ASM backend. The Triton prefill path
+# (`prefill_attention_triton`) is kept defined but not dispatched until
+# it is fully validated and aligned with the required kv-cache layout.
```
```python
window_size=sliding_window,
sink_ptr=self.sinks,
```
Copilot AI commented on Jan 8, 2026:
The new parameters window_size and sink_ptr added to flash_attn_varlen_func lack inline comments explaining their purpose and expected values. Adding brief comments would help future maintainers understand how these parameters relate to the sink and sliding window attention features mentioned in the PR title.
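For illustration, a sketch of how such comments might read at the call site. The stub function below is a hypothetical stand-in for the CK MHA call (only the two new keyword arguments are modeled), and the comment text reflects the semantics described in this PR rather than verified kernel documentation:

```python
def ck_mha_varlen_stub(*, window_size, sink_ptr, **other_kwargs):
    """Hypothetical stand-in for the CK MHA varlen attention call; it exists
    only to show where the explanatory inline comments would go."""
    return window_size, sink_ptr


sliding_window = (128, 0, 0)  # assumed (left, right, extra) convention; (-1, -1, 0) disables the window
sinks = None                  # per-head attention-sink logits; None disables sink attention

ck_mha_varlen_stub(
    # Sliding-window extent forwarded to the CK MHA kernel; this is what lets
    # windowed attention run natively instead of via the Triton workaround.
    window_size=sliding_window,
    # Per-head attention-sink tensor (GPT-OSS-style sinks); None means no sinks.
    sink_ptr=sinks,
)
```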
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist