Commit d5503cd

hammersam and claude authored
[Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3 (tile-ai#1636)
Add a new sparse MLA forward kernel implementation using the "Seesaw" synchronization pattern, an alternative to the existing pipelined approach.

**Dual-Consumer Parallel Architecture:**
- Unlike pipelined version where WG1 depends on WG0's S_shared/alpha_shared, Seesaw allows both consumers to work independently on different KV blocks
- Consumer 0 (WG0): Processes even blocks (BI_2*i), computes O_L (left half)
- Consumer 1 (WG1): Processes odd blocks (BI_2*i+1), computes O_R (right half)
- Each consumer maintains its own softmax statistics (m_i, sumexp)

**Seesaw Synchronization Mechanism:**
- Consumers exchange local row_max via bar_stats_0/1_ready barriers
- Both compute global max by taking max of local and peer's max
- S matrices exchanged via bar_S_0/1_ready for cross-attention:
  - O_L += P0 @ V0_L (self) + P1 @ V1_L (from peer)
  - O_R += P1 @ V1_R (self) + P0 @ V0_R (from peer)

**Memory Optimizations:**
- Reuses K_tail_shared_0/1 as S_shared_0/1 to save shared memory
- Double-buffered is_kv_valid[2, BI] mask to avoid race conditions
- Index prefetching in producer to hide memory latency

| Aspect | Pipelined | Seesaw |
|--------|-----------|--------|
| Consumer dependency | WG1 waits for WG0's S/alpha | Independent parallel compute |
| S matrix | Single S_shared | Dual S_shared_0/1 (reused) |
| Softmax stats | Single m_i, sumexp | Per-consumer stats with exchange |
| KV valid mask | Single buffer [BI] | Double buffer [2, BI] |
| Index prefetch | None | Async prefetch next iteration |
| Register alloc | WG0:240, WG1:168, Prod:80 | WG0:216, WG1:216, Prod:72 |

**Prefill Benchmark** (B=2, S=4096, SKV=8192, H=128, topk=2048):
- Average time: 10.276 ms
- IO bandwidth: 1.88 TB/s
- TFLOPS: 454.76

**Decode Benchmark** (B=2048, S=2, SKV=8192, H=128, topk=2048):
- Average time: 5.554 ms
- IO bandwidth: 1.74 TB/s
- TFLOPS: 420.68

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
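The Seesaw exchange described in the commit message can be illustrated with a small numerical sketch. This is not the TileLang kernel from this commit: it is plain Python for a single query row, with no barriers, shared memory, or warp groups, and the function names `seesaw_attention_row` and `reference_attention_row` are hypothetical. It assumes each consumer takes one KV block per iteration (even block for consumer 0, odd block for consumer 1), that both rescale by the shared global max after exchanging local maxima, and that each accumulates its own half of the output from both probability blocks.

```python
import math


def reference_attention_row(scores, values):
    """Plain softmax-weighted sum over all keys, used only as a check."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) / total
            for k in range(dim)]


def seesaw_attention_row(scores, values, block):
    """Hypothetical single-row model of the Seesaw pattern.

    scores[j] is the attention logit for key j, values[j] is its value
    vector. Consumer 0 owns the left half of the output (O_L) and the even
    blocks; consumer 1 owns the right half (O_R) and the odd blocks.
    """
    n, dim = len(scores), len(values[0])
    half = dim // 2
    # Assumption of this sketch: an even number of full blocks, even dim.
    assert n % (2 * block) == 0 and dim % 2 == 0

    m = [float("-inf"), float("-inf")]  # per-consumer running max (m_i)
    sumexp = [0.0, 0.0]                 # per-consumer running denominator
    o_left = [0.0] * half               # O_L, accumulated by consumer 0
    o_right = [0.0] * half              # O_R, accumulated by consumer 1

    for start in range(0, n, 2 * block):
        blocks = [range(start, start + block),
                  range(start + block, start + 2 * block)]
        # Each consumer computes the row max of its own block.
        local_max = [max(scores[j] for j in blk) for blk in blocks]

        alpha, probs = [], []
        for c in (0, 1):
            # "Exchange": combine own running max, own local max, peer's max.
            m_new = max(m[c], local_max[c], local_max[1 - c])
            alpha.append(math.exp(m[c] - m_new))  # rescale factor for old state
            m[c] = m_new
            probs.append([math.exp(scores[j] - m_new) for j in blocks[c]])

        # Both consumers see both P blocks, so both denominators cover all keys.
        block_sum = sum(probs[0]) + sum(probs[1])
        for c in (0, 1):
            sumexp[c] = sumexp[c] * alpha[c] + block_sum

        # O_L += P0 @ V0_L + P1 @ V1_L ; O_R += P1 @ V1_R + P0 @ V0_R
        for k in range(half):
            o_left[k] *= alpha[0]
            o_right[k] *= alpha[1]
            for c in (0, 1):
                for p, j in zip(probs[c], blocks[c]):
                    o_left[k] += p * values[j][k]
                    o_right[k] += p * values[j][half + k]

    return ([x / sumexp[0] for x in o_left] +
            [x / sumexp[1] for x in o_right])
```

Because both consumers rescale against the same global max, their statistics stay identical after every exchange, which is why the two output halves can be normalized independently at the end and still match a single-pass softmax.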
1 parent b914318 commit d5503cd

1 file changed

Lines changed: 644 additions & 0 deletions
