
Conversation

@justinSmileDate

Description

This PR ports the FP8 support from #82 to the latest branch and enhances it, delivering substantial performance improvements through optimized computation and memory access patterns.

Key Changes

Compared to Original FlashMLA:

  • WGMMA FP8 Integration: Leverage WGMMA instructions for efficient FP8 matrix operations
  • FP8 Data Types: Utilize FP8 data types for both Q and KV tensors (a conversion sketch follows this list)
  • Memory Optimization:
    • Save shared memory by allocating shared memory for sP1
    • Remove retrieve_rP_for_SP(sQ(8)) operations
    • Eliminate rQ(8)*sK RS WGMMA operations
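
For illustration only, here is a minimal CUDA sketch of the FP8 data path referenced above: converting bfloat16 Q/KV values to e4m3 with a per-tensor scale before they enter the WGMMA pipeline. The kernel name, per-tensor scaling scheme, and launch shape are assumptions for this sketch, not the PR's actual code.

```cuda
#include <cuda_fp8.h>    // __nv_fp8_e4m3
#include <cuda_bf16.h>   // __nv_bfloat16

// Illustrative quantization of a bf16 tensor into FP8 e4m3 with a
// per-tensor scale. Names and the scaling scheme are assumptions
// for this sketch, not the PR's implementation.
__global__ void quantize_bf16_to_e4m3(const __nv_bfloat16* __restrict__ src,
                                      __nv_fp8_e4m3* __restrict__ dst,
                                      float inv_scale, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        float v = __bfloat162float(src[i]) * inv_scale;  // rescale into e4m3 range
        dst[i] = __nv_fp8_e4m3(v);                       // cuda_fp8.h conversion constructor
    }
}
```

With Q and the KV cache stored in e4m3 like this, the WGMMA instructions can consume FP8 operands directly, which is the data path this PR targets.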

Improvements over PR #82:

  • Code Structure: Refactored for better integration with the latest branch
  • Boundary Processing Optimization (sketched in the example after this list):
    • Operate directly at block index level instead of sequence length level
    • Eliminate redundant ceil_div calculations
    • Reduce division operations for improved computational efficiency
  • Enhanced Memory Access:
    • Minimize intermediate variable calculations
    • Optimize register utilization
    • Improve performance across diverse workload patterns
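
To make the boundary-processing change concrete, here is a hedged sketch; identifiers such as kBlockN, n_block_min, and n_block_max are assumptions, not the PR's actual names. The point is that loop bounds are carried as block indices computed once up front, so the hot loop performs no ceil_div or other division.

```cuda
// Sketch of the boundary-handling idea only; not the PR's code.
constexpr int kBlockN = 64;  // KV tile size along the sequence dimension (assumed)

__host__ __device__ inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Before: boundaries are derived from the sequence length, so boundary
// checks pay for ceil_div (an integer division), sometimes more than once.
__device__ void process_at_seqlen_level(int seqlen_k) {
    int n_blocks = ceil_div(seqlen_k, kBlockN);                  // division here ...
    for (int n = 0; n < n_blocks; ++n) {
        bool is_last = (n == ceil_div(seqlen_k, kBlockN) - 1);   // ... and again for masking
        // ... load KV block n, apply the boundary mask if is_last ...
        (void)is_last;
    }
}

// After: the caller passes block indices directly, so the loop and the
// boundary test use plain comparisons with no division in the hot path.
__device__ void process_at_block_level(int n_block_min, int n_block_max) {
    for (int n = n_block_min; n < n_block_max; ++n) {
        bool is_last = (n == n_block_max - 1);
        // ... load KV block n, apply the boundary mask if is_last ...
        (void)is_last;
    }
}
```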

Performance Benchmarks

Test Configuration:

  • Batch Size: 128
  • Sequence Lengths: 4096, 8192, 16384
  • Heads: h_q=16, 32, 64, 128; h_kv=1
  • Dimensions: d=576, dv=512
  • Causal Mask: Enabled
  • Variable Length: Both True and False
  • Hardware: H20

Key Performance Highlights:

| Metric | FlashMLA (bfloat16) | PR #82 (FP8) | This PR (FP8) | Improvement vs PR #82 |
|---|---|---|---|---|
| Best Latency | 0.553 ms | 0.405 ms | 0.345 ms | +15% faster |
| Best TFLOPS | 140 TFLOPS | 214 TFLOPS | 234 TFLOPS | +9% higher |
| Best Bandwidth | 1165 GB/s | 802 GB/s | 972 GB/s | +21% higher |
| Varlen Performance | 0.577 ms | 0.579 ms | 0.353 ms | +39% faster |

Representative Performance Comparison:

Configuration: b=128, s_q=1, mean_sk=4096, h_q=16

| Implementation | varlen | Latency | TFLOPS | Bandwidth |
|---|---|---|---|---|
| FlashMLA | False | 0.553 ms | 33 | 1100 GB/s |
| PR #82 | False | 0.405 ms | 45 | 752 GB/s |
| This PR | False | 0.345 ms | 53 | 881 GB/s |
| FlashMLA | True | 0.577 ms | 32 | 1060 GB/s |
| PR #82 | True | 0.579 ms | 32 | 528 GB/s |
| This PR | True | 0.353 ms | 52 | 866 GB/s |

Configuration: b=128, s_q=2, mean_sk=4096, h_q=32

| Implementation | varlen | Latency | TFLOPS | Bandwidth |
|---|---|---|---|---|
| FlashMLA | False | 0.561 ms | 130 | 1108 GB/s |
| PR #82 | False | 0.412 ms | 177 | 755 GB/s |
| This PR | False | 0.351 ms | 208 | 885 GB/s |

Scalability Analysis:

Across Head Counts (mean_sk=4096, s_q=1):

| Heads | PR #82 Latency | This PR Latency | Improvement |
|---|---|---|---|
| 16 | 0.405 ms | 0.345 ms | +15% |
| 32 | 0.406 ms | 0.347 ms | +14% |
| 64 | 0.410 ms | 0.350 ms | +15% |
| 128 | 0.790 ms | 0.665 ms | +16% |

Across Sequence Lengths (h_q=16, s_q=1):

| SeqLen | PR #82 Latency | This PR Latency | Improvement |
|---|---|---|---|
| 4096 | 0.405 ms | 0.345 ms | +15% |
| 8192 | 0.771 ms | 0.645 ms | +16% |
| 16384 | 1.509 ms | 1.245 ms | +17% |

Performance Improvements Summary

vs Original FlashMLA (bfloat16):

  • Latency Reduction: Up to 38% (0.553ms → 0.345ms)
  • TFLOPS Improvement: Up to 67% (140 → 234 TFLOPS)
  • Consistent gains across all head counts and sequence lengths

vs PR #82 (FP8):

  • Latency Reduction: 14-17% across all configurations
  • TFLOPS Improvement: 9-18% across various workloads
  • Significant varlen improvement: 39% latency reduction
  • Better bandwidth utilization: Up to 21% improvement

Technical Advantages

  1. Computational Efficiency:
    • Reduced arithmetic operations through optimized boundary handling
    • Better utilization of GPU compute resources
  2. Memory Optimization:
    • Reduced shared memory footprint
    • Improved memory access patterns
    • Better register allocation
  3. Scalability:
    • Consistent performance improvements across different model configurations
    • Excellent scaling with increasing head counts and sequence lengths

Testing & Validation

python test/test_fp8.py --dtype float8_e4m3fn

Usage

export ENABLE_SWAPAB=1
python3 -m sglang.launch_server XXX --quantization fp8 --kv-cache-dtype fp8_e4m3

This PR delivers substantial performance improvements while maintaining full numerical correctness and compatibility with existing FP8 functionality.

@akhoroshev

Are any sglang updates needed for testing?
