Open-source ONNX Runtime EP for XDNA 2 NPU without VAIP — benchmarks + source #356
iron-ep: First open-source VitisAI Execution Provider
Repo: https://github.com/Manuelreyesbravo/iron-ep
I built a fully open-source replacement for AMD's proprietary libonnxruntime_vitisai_ep.so that runs directly on the XDNA 2 NPU using IRON + MLIR-AIE + Peano + XRT — no VAIP, no closed-source runtime.
What it does
Implements the compile_onnx_model_vitisai_ep_v4 VitisAI EP interface so ONNX Runtime loads it transparently. Claims and executes MatMul, MatMulInteger, and Gemm nodes on the NPU.
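To make the node-claiming step concrete, here is a minimal sketch of how a graph partitioner like this can decide which nodes to offload. The `Node` structure and function names are illustrative assumptions, not the actual iron-ep internals; the dtype codes are the real ONNX `TensorProto` element types.

```python
# Hypothetical sketch of the EP's graph-partitioning step. The Node
# structure and claim_nodes() are illustrative, not iron-ep's real API.
from dataclasses import dataclass

SUPPORTED_OPS = {"MatMul", "MatMulInteger", "Gemm"}
# ONNX TensorProto element-type codes: FLOAT = 1, UINT8 = 2, INT8 = 3
SUPPORTED_DTYPES = {1, 2, 3}

@dataclass
class Node:
    name: str
    op_type: str
    input_dtype: int  # as returned by node_arg_get_element_type

def claim_nodes(graph_nodes):
    """Return the nodes the EP offloads to the NPU; everything else
    falls back to ONNX Runtime's CPU kernels."""
    return [n for n in graph_nodes
            if n.op_type in SUPPORTED_OPS and n.input_dtype in SUPPORTED_DTYPES]

graph = [
    Node("mm0", "MatMulInteger", 3),  # int8 matmul -> claimed
    Node("gemm0", "Gemm", 1),         # float Gemm  -> claimed
    Node("relu0", "Relu", 1),         # unsupported op -> CPU fallback
]
print([n.name for n in claim_nodes(graph)])  # -> ['mm0', 'gemm0']
```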
Benchmark results (AMD Ryzen AI 9 HX 375, Fedora 43)
INT8 MatMulInteger — whole_array backend (16 AIE cores):
| Shape | NPU (ms) | CPU (ms) | Speedup |
|---|---|---|---|
| 256×1024×1024 | 1.041 | 3.082 | 2.96× |
| 512×2048×2048 | 5.553 | 16.081 | 2.90× |
| 1024×4096×4096 | 35.836 | 119.240 | 3.33× |
The 1024×4096×4096 case corresponds to a full hidden-dimension matmul of a 7B-class LLM at batch size 1024. Results are bit-exact against a CPU int32 reference.
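For reference, the int32 CPU baseline behind the bit-exactness claim follows ONNX MatMulInteger semantics: zero points are subtracted before multiplying, and accumulation happens in int32. A minimal pure-Python version (shapes and values here are just illustrative):

```python
# Pure-Python reference for ONNX MatMulInteger semantics. Python ints
# never overflow, so this is exact by construction and serves as the
# int32 reference an NPU result can be compared against bit-for-bit.
def matmul_integer(A, B, a_zero=0, b_zero=0):
    """A: MxK int8 rows, B: KxN int8 rows; returns the MxN int32 result.
    Per the ONNX spec, zero points are subtracted before the multiply."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum((A[m][k] - a_zero) * (B[k][n] - b_zero) for k in range(K))
             for n in range(N)] for m in range(M)]

A = [[1, -2], [3, 4]]        # 2x2 int8 operand
B = [[5, 6], [-7, 8]]        # 2x2 int8 operand
print(matmul_integer(A, B))  # -> [[19, -10], [-13, 50]]
```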
How it works
- Graph analyzer claims MatMul/MatMulInteger/Gemm nodes and reads dtypes via node_arg_get_element_type (slot 43 in the VitisAI API)
- Kernel cache drives MLIR-AIE: generates MLIR → Peano compiles mm.cc → aiecc links the xclbin
- XRT runner dispatches to hardware
- Adaptive backend: single_core for M<256, whole_array (4×4 = 16 cores) for M≥256
Why I'm posting here
Would love feedback on:
- Whether the VitisAI EP slot API (node_arg_get_element_type, graph_nodes_unsafe, etc.) is stable across versions, or if there's a more stable interface
- If there's interest in upstreaming this or collaborating on a community-maintained open EP
- Any upcoming XDNA 2 features (quantized attention, Flash Attention on NPU) that could inform the roadmap
Happy to provide more details or run specific benchmarks on request.