
Conversation

@justinSmileDate

Description

This PR ports the FP8 support from #82 to the latest branch and enhances it, delivering substantial performance improvements through optimized computation and memory access patterns.

Key Changes

Compared to Original FlashMLA:

  • WGMMA FP8 Integration: Leverage WGMMA instructions for efficient FP8 matrix operations
  • FP8 Data Types: Utilize FP8 data types for both Q and KV tensors (a conversion sketch follows this list)
  • Memory Optimization:
    • Save shared memory by allocating shared memory for sP1
    • Remove retrieve_rP_for_SP(sQ(8)) operations
    • Eliminate rQ(8)*sK RS WGMMA operations
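
For illustration only, here is a minimal CUDA sketch of the FP8 data path referenced above: converting bfloat16 Q/KV values to e4m3 with a per-tensor scale before they enter the WGMMA pipeline. The kernel name, per-tensor scaling scheme, and launch shape are assumptions for this sketch, not the PR's actual code.

```cuda
#include <cuda_fp8.h>    // __nv_fp8_e4m3
#include <cuda_bf16.h>   // __nv_bfloat16

// Illustrative quantization of a bf16 tensor into FP8 e4m3 with a
// per-tensor scale. Names and the scaling scheme are assumptions
// for this sketch, not the PR's implementation.
__global__ void quantize_bf16_to_e4m3(const __nv_bfloat16* __restrict__ src,
                                      __nv_fp8_e4m3* __restrict__ dst,
                                      float inv_scale, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        float v = __bfloat162float(src[i]) * inv_scale;  // rescale into e4m3 range
        dst[i] = __nv_fp8_e4m3(v);                       // cuda_fp8.h conversion constructor
    }
}
```

With Q and the KV cache stored in e4m3 like this, the WGMMA instructions can consume FP8 operands directly, which is the data path this PR targets.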

Improvements over PR #82:

  • Code Structure: Refactored for better integration with the latest branch
  • Boundary Processing Optimization (sketched in the example after this list):
    • Operate directly at block index level instead of sequence length level
    • Eliminate redundant ceil_div calculations
    • Reduce division operations for improved computational efficiency
  • Enhanced Memory Access:
    • Minimize intermediate variable calculations
    • Optimize register utilization
    • Improve performance across diverse workload patterns
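
To make the boundary-processing change concrete, here is a hedged sketch; identifiers such as kBlockN, n_block_min, and n_block_max are assumptions, not the PR's actual names. The point is that loop bounds are carried as block indices computed once up front, so the hot loop performs no ceil_div or other division.

```cuda
// Sketch of the boundary-handling idea only; not the PR's code.
constexpr int kBlockN = 64;  // KV tile size along the sequence dimension (assumed)

__host__ __device__ inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Before: boundaries are derived from the sequence length, so boundary
// checks pay for ceil_div (an integer division), sometimes more than once.
__device__ void process_at_seqlen_level(int seqlen_k) {
    int n_blocks = ceil_div(seqlen_k, kBlockN);                  // division here ...
    for (int n = 0; n < n_blocks; ++n) {
        bool is_last = (n == ceil_div(seqlen_k, kBlockN) - 1);   // ... and again for masking
        // ... load KV block n, apply the boundary mask if is_last ...
        (void)is_last;
    }
}

// After: the caller passes block indices directly, so the loop and the
// boundary test use plain comparisons with no division in the hot path.
__device__ void process_at_block_level(int n_block_min, int n_block_max) {
    for (int n = n_block_min; n < n_block_max; ++n) {
        bool is_last = (n == n_block_max - 1);
        // ... load KV block n, apply the boundary mask if is_last ...
        (void)is_last;
    }
}
```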

Performance Benchmarks

Test Configuration:

  • Batch Size: 128
  • Sequence Lengths: 4096, 8192, 16384
  • Heads: h_q=16, 32, 64, 128; h_kv=1
  • Dimensions: d=576, dv=512
  • Causal Mask: Enabled
  • Variable Length: Both True and False
  • Hardware: H20

Key Performance Highlights:

| Metric | FlashMLA (bfloat16) | PR #82 (FP8) | This PR (FP8) | Improvement vs PR #82 |
|---|---|---|---|---|
| Best Latency | 0.553 ms | 0.405 ms | 0.345 ms | +15% faster |
| Best TFLOPS | 140 TFLOPS | 214 TFLOPS | 234 TFLOPS | +9% higher |
| Best Bandwidth | 1165 GB/s | 802 GB/s | 972 GB/s | +21% higher |
| Varlen Performance | 0.577 ms | 0.579 ms | 0.353 ms | +39% faster |

Representative Performance Comparison:

Configuration: b=128, s_q=1, mean_sk=4096, h_q=16

| Implementation | varlen | Latency | TFLOPS | Bandwidth |
|---|---|---|---|---|
| FlashMLA | False | 0.553 ms | 33 | 1100 GB/s |
| PR #82 | False | 0.405 ms | 45 | 752 GB/s |
| This PR | False | 0.345 ms | 53 | 881 GB/s |
| FlashMLA | True | 0.577 ms | 32 | 1060 GB/s |
| PR #82 | True | 0.579 ms | 32 | 528 GB/s |
| This PR | True | 0.353 ms | 52 | 866 GB/s |

Configuration: b=128, s_q=2, mean_sk=4096, h_q=32

| Implementation | varlen | Latency | TFLOPS | Bandwidth |
|---|---|---|---|---|
| FlashMLA | False | 0.561 ms | 130 | 1108 GB/s |
| PR #82 | False | 0.412 ms | 177 | 755 GB/s |
| This PR | False | 0.351 ms | 208 | 885 GB/s |

Scalability Analysis:

Across Head Counts (mean_sk=4096, s_q=1):

| Heads | PR #82 Latency | This PR Latency | Improvement |
|---|---|---|---|
| 16 | 0.405 ms | 0.345 ms | +15% |
| 32 | 0.406 ms | 0.347 ms | +14% |
| 64 | 0.410 ms | 0.350 ms | +15% |
| 128 | 0.790 ms | 0.665 ms | +16% |

Across Sequence Lengths (h_q=16, s_q=1):

| SeqLen | PR #82 Latency | This PR Latency | Improvement |
|---|---|---|---|
| 4096 | 0.405 ms | 0.345 ms | +15% |
| 8192 | 0.771 ms | 0.645 ms | +16% |
| 16384 | 1.509 ms | 1.245 ms | +17% |

Performance Improvements Summary

vs Original FlashMLA (bfloat16):

  • Latency Reduction: Up to 38% (0.553ms → 0.345ms)
  • TFLOPS Improvement: Up to 67% (140 → 234 TFLOPS)
  • Consistent gains across all head counts and sequence lengths

vs PR #82 (FP8):

  • Latency Reduction: 14-17% across all configurations
  • TFLOPS Improvement: 9-18% across various workloads
  • Significant varlen improvement: 39% latency reduction
  • Better bandwidth utilization: Up to 21% improvement

Technical Advantages

  1. Computational Efficiency:
    • Reduced arithmetic operations through optimized boundary handling
    • Better utilization of GPU compute resources
  2. Memory Optimization:
    • Reduced shared memory footprint
    • Improved memory access patterns
    • Better register allocation
  3. Scalability:
    • Consistent performance improvements across different model configurations
    • Excellent scaling with increasing head counts and sequence lengths

Testing & Validation

python test/test_fp8.py --dtype float8_e4m3fn

Usage

export ENABLE_SWAPAB=1
python3 -m sglang.launch_server XXX --quantization fp8 --kv-cache-dtype fp8_e4m3

This PR delivers substantial performance improvements while maintaining full numerical correctness and compatibility with existing FP8 functionality.

@akhoroshev

Are any sglang updates needed for testing?
