
[flash-linear-attention] Significant Compilation Time Discrepancy: Triton XPU (A770) vs. NVIDIA (RTX 4090) #3761

Open
zhiyuan1i opened this issue Mar 26, 2025 · 3 comments


@zhiyuan1i (Contributor)

zhiyuan1i commented Mar 26, 2025

Describe the issue

I am currently setting up continuous integration for fla.

Compilation is substantially slower with Triton on an Intel Arc A770 than on an NVIDIA RTX 4090 (a 10x+ difference in end-to-end compilation for some kernels). This significantly impacts development workflow and CI testing efficiency.

Reproduction Steps:

  1. Environment: Intel Arc A770 (latest drivers) vs. NVIDIA RTX 4090
  2. Clone repository: git clone https://github.com/fla-org/flash-linear-attention/
  3. Install Intel PyTorch and fla
  4. Run compilation benchmark:
export COMPILER_MODE=1 
export SKIP_TEST_CHUNK_VARLEN=1
pytest tests/ops/
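To isolate compile time from run time without running the full test suite, a minimal harness like the one below can help: it times the first call to a function (which, for a Triton kernel launch, includes JIT compilation) and compares it to a warm second call. The helper name and the stand-in workload are my own sketch, not part of fla:

```python
import time

def first_vs_warm(fn, *args, **kwargs):
    """Return (first_call_s, warm_call_s) for fn.

    For a Triton @jit kernel launch, the first call pays JIT compilation,
    so first_call_s - warm_call_s approximates compile time.
    """
    t0 = time.perf_counter()
    fn(*args, **kwargs)
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    fn(*args, **kwargs)
    warm = time.perf_counter() - t0
    return first, warm

# Stand-in workload; in practice fn would be a Triton kernel launch on XPU/CUDA.
first, warm = first_vs_warm(lambda: sum(range(100_000)))
print(f"first={first:.6f}s warm={warm:.6f}s")
```

Alternatively, `pytest --durations=25 tests/ops/` reports the slowest tests, which helps pinpoint which kernels dominate compile time.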

Observed Behavior:

  • A770 compilation takes >5-10x longer than 4090 for equivalent operations
  • Noticeable delay in development iteration cycles
  • Substantial impact on CI/CD pipeline efficiency

Expected Behavior:

  • Comparable compilation performance between Intel and NVIDIA hardware
  • Reasonable compilation times for unit testing scenarios

Impact:

  1. Developer Experience: Slow compilation disrupts the edit-compile-test workflow
  2. Testing Efficiency: Unit tests become bottlenecked by compilation rather than actual execution
  3. Adoption Barrier: Such performance gaps may discourage potential XPU adopters

Request:
Could the Triton team:

  1. Investigate this compilation performance discrepancy? Perhaps the tiling strategy is responsible for the compile-time regression.
  2. Share any known optimizations or workarounds for XPU compilation?
  3. Consider compilation performance as a priority metric for the XPU backend?

Environment details

pip list | grep triton
pytorch-triton-xpu 3.3.0
Intel Arc A770
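For reproducibility, a small version-report snippet like this can be pasted alongside the environment details. It is my own sketch; it only assumes the obvious module names and tolerates missing packages:

```python
import importlib
import platform

def env_report():
    """Collect interpreter version plus torch/triton versions if installed."""
    info = {"python": platform.python_version()}
    for mod in ("torch", "triton"):
        try:
            info[mod] = importlib.import_module(mod).__version__
        except ImportError:
            info[mod] = "not installed"
    return info

print(env_report())
```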

@zhiyuan1i (Contributor, Author)

I also encountered some compilation errors and segmentation faults. I'll organize them later and open a new issue.

@alexbaden (Contributor)

A770 lacks hardware features for MMA that newer hardware has (PVC, LNL, BMG). I suspect the compilation time is coming from inefficient (large) kernels. This is a known issue and, unfortunately, I'm not sure this is something we intend to improve. Do you have access to a BMG or even a PVC machine? Intel developer cloud could be a good place to start.

@vlad-penkin vlad-penkin changed the title Significant Compilation Time Discrepancy: Triton XPU (A770) vs. NVIDIA (RTX 4090) [flash-linear-attention] Significant Compilation Time Discrepancy: Triton XPU (A770) vs. NVIDIA (RTX 4090) Mar 26, 2025
@zhiyuan1i (Contributor, Author)

> A770 lacks hardware features for MMA that newer hardware has (PVC, LNL, BMG). I suspect the compilation time is coming from inefficient (large) kernels. This is a known issue and, unfortunately, I'm not sure this is something we intend to improve. Do you have access to a BMG or even a PVC machine? Intel developer cloud could be a good place to start.

Thank you, I will try that if possible :)
