Clarification on fused kernel behavior for FP8 vs. BF16 #1862
Unanswered
EricJKrebs asked this question in Q&A
I've been creating profiles to compare the performance of FP8 vs. BF16 using the PyTorch Transformer Engine package on GPUs that support FP8. The GEMMs seem pretty straightforward, but Transformer Engine's dot product attention uses fused flash attention kernels that I have a few questions about, since I don't know how to profile inside them.
The code I've been testing is a simple implementation of GPT-2 training using Transformer Engine with "mixed" FP8 (E4M3 forward, E5M2 backward). During the forward pass, the GPU time I see with FP8 is dominated by an FP8 flash attention kernel, with GEMMs second.
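For reference, a minimal sketch of this kind of mixed-FP8 setup, assuming Transformer Engine's `fp8_autocast` with a `DelayedScaling` recipe in `Format.HYBRID` (E4M3 forward, E5M2 backward). The layer sizes and tensor shapes below are illustrative, not taken from the post, and whether the fused attention itself runs in FP8 additionally depends on the Transformer Engine version and backend selection:

```python
# Sketch: a Transformer Engine layer run under "mixed" FP8 (Format.HYBRID),
# i.e. E4M3 tensors in the forward pass, E5M2 gradients in the backward pass.
# Dimensions and shapes are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

hidden_size, ffn_hidden_size, num_heads = 768, 3072, 12  # GPT-2-small-like sizes
layer = te.TransformerLayer(hidden_size, ffn_hidden_size, num_heads).cuda()

# HYBRID format: E4M3 for forward-pass tensors, E5M2 for backward-pass gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

# TransformerLayer expects [sequence, batch, hidden] input by default.
x = torch.randn(2048, 2, hidden_size, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)
out.sum().backward()
```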
Running the same code in pure BF16, the dominant forward-pass time comes from GEMMs, followed by the float16 flash attention kernels.
The GEMM time is roughly halved with FP8 (which is expected), but the FP8 flash attention kernel takes longer than its BF16 counterpart. I've varied sequence length, number of attention heads, per-head dimension, and batch size, but I keep seeing similar results: flash attention time for FP8 is higher than for BF16.
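A minimal sketch of one way to get this kind of per-kernel GPU timing, using `torch.profiler` on the setup above (it reuses the hypothetical `layer`, `x`, and `fp8_recipe` from the previous sketch); switching `use_fp8` to `False` drives a BF16 run via `torch.autocast` for the other side of the comparison:

```python
# Sketch: aggregate device time per CUDA kernel with torch.profiler,
# reusing layer / x / fp8_recipe from the sketch above.
import torch
from torch.profiler import profile, ProfilerActivity
import transformer_engine.pytorch as te

def step(use_fp8: bool):
    if use_fp8:
        ctx = te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe)
    else:
        ctx = torch.autocast("cuda", dtype=torch.bfloat16)
    with ctx:
        out = layer(x)
    out.sum().backward()

# Warm up once so one-time initialization doesn't skew the profile.
step(use_fp8=True)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        step(use_fp8=True)   # set use_fp8=False for the BF16 profile
    torch.cuda.synchronize()

# The attention and GEMM kernels should appear near the top of this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```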
My questions regarding the flash attention kernels: why would the FP8 kernel be slower than the BF16 one, and how can I profile inside these fused kernels? Any advice on gaining more insight would be appreciated.
Replies: 2 comments
- Seems strange that FP8 would be slower than BF16. I'd expect the FP8 forward pass to be ~1.3x faster than BF16.