Commit d5503cd
[Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3 (tile-ai#1636)
Add a new sparse MLA forward kernel implementation using the "Seesaw"
synchronization pattern, an alternative to the existing pipelined approach.
**Dual-Consumer Parallel Architecture:**
- Unlike the pipelined version, where WG1 depends on WG0's S_shared/alpha_shared,
  Seesaw lets both consumers work independently on different KV blocks
- Consumer 0 (WG0): Processes even blocks (BI_2*i), computes O_L (left half)
- Consumer 1 (WG1): Processes odd blocks (BI_2*i+1), computes O_R (right half)
- Each consumer maintains its own softmax statistics (m_i, sumexp)
**Seesaw Synchronization Mechanism:**
- Consumers exchange local row_max via bar_stats_0/1_ready barriers
- Both compute global max by taking max of local and peer's max
- S matrices exchanged via bar_S_0/1_ready for cross-attention:
  - O_L += P0 @ V0_L (self) + P1 @ V1_L (from peer)
  - O_R += P1 @ V1_R (self) + P0 @ V0_R (from peer)
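The exchange described above can be checked numerically with a small, pure-Python model. This is only a sketch under stated assumptions, not the kernel's code: it handles a single query row, runs both "consumers" sequentially in one loop, assumes the number of KV rows is a multiple of `2 * block`, and keeps one shared running max and sum instead of per-consumer copies (after the stats exchange both copies would hold the same values). The helper names `dot`, `weighted_cols`, `reference`, and `seesaw` are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_cols(p, V, rows, cols):
    """For each column c in cols, return sum_j p[j] * V[rows[j]][c]."""
    return [sum(pj * V[r][c] for pj, r in zip(p, rows)) for c in cols]

def reference(q, K, V):
    """Plain softmax attention for one query row."""
    s = [dot(q, k) for k in K]
    m = max(s)
    p = [math.exp(x - m) for x in s]
    denom = sum(p)
    out = weighted_cols(p, V, range(len(V)), range(len(V[0])))
    return [x / denom for x in out]

def seesaw(q, K, V, block):
    """Even blocks go to consumer 0, odd blocks to consumer 1 (hypothetical model)."""
    d_v = len(V[0])
    left, right = range(0, d_v // 2), range(d_v // 2, d_v)
    o_l = [0.0] * len(left)     # left half of the output, owned by consumer 0
    o_r = [0.0] * len(right)    # right half of the output, owned by consumer 1
    m_run, l_run = -math.inf, 0.0
    for start in range(0, len(K), 2 * block):
        rows0 = range(start, start + block)               # even block
        rows1 = range(start + block, start + 2 * block)   # odd block
        s0 = [dot(q, K[r]) for r in rows0]
        s1 = [dot(q, K[r]) for r in rows1]
        # Stats exchange: combine both local maxima with the running max.
        m_new = max(m_run, max(s0), max(s1))
        alpha = math.exp(m_run - m_new)   # rescale factor, 0.0 on the first pass
        p0 = [math.exp(x - m_new) for x in s0]
        p1 = [math.exp(x - m_new) for x in s1]
        l_run = l_run * alpha + sum(p0) + sum(p1)
        # S exchange: each half adds its own block and the peer's block.
        own_l = weighted_cols(p0, V, rows0, left)     # P0 @ V0_L (self)
        peer_l = weighted_cols(p1, V, rows1, left)    # P1 @ V1_L (from peer)
        own_r = weighted_cols(p1, V, rows1, right)    # P1 @ V1_R (self)
        peer_r = weighted_cols(p0, V, rows0, right)   # P0 @ V0_R (from peer)
        o_l = [a * alpha + b + c for a, b, c in zip(o_l, own_l, peer_l)]
        o_r = [a * alpha + b + c for a, b, c in zip(o_r, own_r, peer_r)]
        m_run = m_new
    return [x / l_run for x in o_l + o_r]

random.seed(0)
q = [random.gauss(0, 1) for _ in range(4)]
K = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
V = [[random.gauss(0, 1) for _ in range(6)] for _ in range(8)]
max_err = max(abs(a - b) for a, b in zip(seesaw(q, K, V, block=2), reference(q, K, V)))
print(max_err)
```

The two results should agree up to floating-point rounding, which illustrates why sharing the row max before computing P lets each consumer accumulate the peer's contribution without a later correction.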
**Memory Optimizations:**
- Reuses K_tail_shared_0/1 as S_shared_0/1 to save shared memory
- Double-buffered is_kv_valid[2, BI] mask to avoid race conditions
- Index prefetching in producer to hide memory latency
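Why a second mask slot helps can be shown with a sequential toy model. This is a sketch only: the function `run` and its arguments are hypothetical, the real kernel coordinates producer and consumers with barriers, and the loop below simply fixes one interleaving in which the producer writes the mask for iteration i+1 before the consumer has read the mask for iteration i.

```python
def run(num_slots, masks):
    """Return the masks the consumer observes when the producer runs one step ahead."""
    buf = [None] * num_slots
    seen = []
    buf[0] = masks[0]                        # prologue: producer fills iteration 0
    for i in range(len(masks)):
        if i + 1 < len(masks):
            # Producer prefetches the next iteration's mask ...
            buf[(i + 1) % num_slots] = masks[i + 1]
        # ... before the consumer reads the current one.
        seen.append(buf[i % num_slots])
    return seen

masks = [[True, False], [False, False], [True, True]]
print(run(2, masks) == masks)   # two slots: consumer sees every mask intact
print(run(1, masks) == masks)   # one slot: the prefetch overwrote unread data
```

With two slots the write for iteration i+1 lands in the slot the consumer is not reading, which is the property the `[2, BI]` layout provides.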
**Comparison with Pipelined Version:**
| Aspect | Pipelined | Seesaw |
|--------|-----------|--------|
| Consumer dependency | WG1 waits for WG0's S/alpha | Independent parallel compute |
| S matrix | Single S_shared | Dual S_shared_0/1 (reused) |
| Softmax stats | Single m_i, sumexp | Per-consumer stats with exchange |
| KV valid mask | Single buffer [BI] | Double buffer [2, BI] |
| Index prefetch | None | Async prefetch next iteration |
| Register alloc | WG0:240, WG1:168, Prod:80 | WG0:216, WG1:216, Prod:72 |
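The register row can be sanity-checked with a little arithmetic. The sketch below assumes values the commit message does not state: each of the three groups is one warpgroup of 128 threads, and the SM has 65,536 32-bit registers (as on Hopper-class GPUs). Under those assumptions both allocations fit, and Seesaw uses more of the register file while giving both consumers the same budget.

```python
THREADS_PER_WARPGROUP = 128   # assumption: 4 warps x 32 threads per group
REGS_PER_SM = 65536           # assumption: Hopper-class register file size

def regs_used(per_thread):
    """Total registers for one CTA given per-thread counts for each group."""
    return sum(per_thread.values()) * THREADS_PER_WARPGROUP

pipelined = {"WG0": 240, "WG1": 168, "Prod": 80}
seesaw = {"WG0": 216, "WG1": 216, "Prod": 72}

for name, alloc in (("pipelined", pipelined), ("seesaw", seesaw)):
    used = regs_used(alloc)
    print(f"{name}: {used} registers, {used / REGS_PER_SM:.1%} of the assumed budget")
```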
**Prefill Benchmark** (B=2, S=4096, SKV=8192, H=128, topk=2048):
- Average time: 10.276 ms
- IO bandwidth: 1.88 TB/s
- TFLOPS: 454.76
**Decode Benchmark** (B=2048, S=2, SKV=8192, H=128, topk=2048):
- Average time: 5.554 ms
- IO bandwidth: 1.74 TB/s
- TFLOPS: 420.68
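The reported TFLOPS and bandwidth figures can be reproduced from the shapes and timings above with one plausible accounting. This is a reconstruction, not the benchmark script: it assumes the DeepSeek-V3 MLA dimensions `D_QK = 576` and `D_V = 512`, 2-byte KV elements, FLOPs counted as `2 * B * S * H * topk * (D_QK + D_V)`, and IO counted as the gathered KV rows per query token. None of these values appear in the commit message.

```python
D_QK = 576   # assumption: 512 latent dims + 64 RoPE dims (DeepSeek-V3 MLA)
D_V = 512    # assumption: latent value dim
BYTES = 2    # assumption: 16-bit KV elements

def metrics(B, S, H, topk, ms):
    """Return (TFLOPS, TB/s) under the accounting assumed above."""
    seconds = ms * 1e-3
    flops = 2 * B * S * H * topk * (D_QK + D_V)   # QK^T and PV, 2 FLOPs per multiply-add
    io_bytes = B * S * topk * D_QK * BYTES        # top-k KV rows gathered per query token
    return flops / seconds / 1e12, io_bytes / seconds / 1e12

prefill = metrics(B=2, S=4096, H=128, topk=2048, ms=10.276)
decode = metrics(B=2048, S=2, H=128, topk=2048, ms=5.554)
print(f"prefill: {prefill[0]:.2f} TFLOPS, {prefill[1]:.2f} TB/s")
print(f"decode:  {decode[0]:.2f} TFLOPS, {decode[1]:.2f} TB/s")
```

Both results land within rounding of the reported numbers; the small gap in the prefill TFLOPS is consistent with the average time having been rounded to three decimals.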
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent b914318 commit d5503cd
1 file changed
Lines changed: 644 additions & 0 deletions