
[flash-linear-attention] Significant Compilation Time Discrepancy: Triton XPU (A770) vs. NVIDIA (RTX 4090) #3761

Open
zhiyuan1i opened this issue Mar 26, 2025 · 3 comments


@zhiyuan1i (Contributor)

zhiyuan1i commented Mar 26, 2025

Describe the issue

I am currently setting up continuous integration for fla.

Compilation is substantially slower with Triton on an Intel Arc A770 than on an NVIDIA RTX 4090 (a 10x+ difference in end-to-end compilation for some kernels). This significantly impacts development workflow and CI testing efficiency.

Reproduction Steps:

  1. Environment: Intel Arc A770 (latest drivers) vs. NVIDIA RTX 4090
  2. Clone repository: git clone https://github.com/fla-org/flash-linear-attention/
  3. Install Intel PyTorch and fla
  4. Run compilation benchmark:
export COMPILER_MODE=1 
export SKIP_TEST_CHUNK_VARLEN=1
pytest tests/ops/
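To isolate compile time from run time without running the full test suite, a minimal harness like the one below can help: it times the first call to a function (which, for a Triton kernel launch, includes JIT compilation) and compares it to a warm second call. The helper name and the stand-in workload are my own sketch, not part of fla:

```python
import time

def first_vs_warm(fn, *args, **kwargs):
    """Return (first_call_s, warm_call_s) for fn.

    For a Triton @jit kernel launch, the first call pays JIT compilation,
    so first_call_s - warm_call_s approximates compile time.
    """
    t0 = time.perf_counter()
    fn(*args, **kwargs)
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    fn(*args, **kwargs)
    warm = time.perf_counter() - t0
    return first, warm

# Stand-in workload; in practice fn would be a Triton kernel launch on XPU/CUDA.
first, warm = first_vs_warm(lambda: sum(range(100_000)))
print(f"first={first:.6f}s warm={warm:.6f}s")
```

Alternatively, `pytest --durations=25 tests/ops/` reports the slowest tests, which helps pinpoint which kernels dominate compile time.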

Observed Behavior:

  • A770 compilation takes >5-10x longer than 4090 for equivalent operations
  • Noticeable delay in development iteration cycles
  • Substantial impact on CI/CD pipeline efficiency

Expected Behavior:

  • Comparable compilation performance between Intel and NVIDIA hardware
  • Reasonable compilation times for unit testing scenarios

Impact:

  1. Developer Experience: Slow compilation disrupts the edit-compile-test workflow
  2. Testing Efficiency: Unit tests become bottlenecked by compilation rather than actual execution
  3. Adoption Barrier: Such performance gaps may discourage potential XPU adopters

Request:
Could the Triton team:

  1. Investigate this compilation performance discrepancy? Perhaps the tiling strategy is responsible for the compile-time regression.
  2. Share any known optimizations or workarounds for XPU compilation?
  3. Consider compilation performance as a priority metric for the XPU backend?

Environment details

pip list | grep triton
pytorch-triton-xpu 3.3.0
Intel Arc A770
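For reproducibility, a small version-report snippet like this can be pasted alongside the environment details. It is my own sketch; it only assumes the obvious module names and tolerates missing packages:

```python
import importlib
import platform

def env_report():
    """Collect interpreter version plus torch/triton versions if installed."""
    info = {"python": platform.python_version()}
    for mod in ("torch", "triton"):
        try:
            info[mod] = importlib.import_module(mod).__version__
        except ImportError:
            info[mod] = "not installed"
    return info

print(env_report())
```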

@zhiyuan1i (Contributor, Author)

I also encountered some compilation errors and segmentation faults. I'll organize them later and open a new issue.

@alexbaden (Contributor)

A770 lacks hardware features for MMA that newer hardware has (PVC, LNL, BMG). I suspect the compilation time is coming from inefficient (large) kernels. This is a known issue and, unfortunately, I'm not sure this is something we intend to improve. Do you have access to a BMG or even a PVC machine? Intel developer cloud could be a good place to start.

@vlad-penkin vlad-penkin changed the title Significant Compilation Time Discrepancy: Triton XPU (A770) vs. NVIDIA (RTX 4090) [flash-linear-attention] Significant Compilation Time Discrepancy: Triton XPU (A770) vs. NVIDIA (RTX 4090) Mar 26, 2025
@zhiyuan1i (Contributor, Author)

> A770 lacks hardware features for MMA that newer hardware has (PVC, LNL, BMG). I suspect the compilation time is coming from inefficient (large) kernels. This is a known issue and, unfortunately, I'm not sure this is something we intend to improve. Do you have access to a BMG or even a PVC machine? Intel developer cloud could be a good place to start.

Thank you, I will try that if possible :)
