I've run multiple inference and training benchmarks from https://github.com/pytorch/benchmark to compare the performance of PyTorch built from source in EB with a simple installation using the official wheels from pytorch.org.
Test configurations
The following tables show the results for PyTorch 2.6.0 in the 2024a toolchain generation. I made 3 different installations:
1. Standard build in EB from source
2. Official wheels from pytorch.org on top of Python/SciPy-bundle (2024a) and CUDA 12.6.0 on EB
3. Official wheels for PyTorch and CUDA on top of a minimal Python/SciPy-bundle (2024a) environment on EB
Hardware/Job
All benchmarks ran on the same system:
- 16 AMD Zen5 CPU cores
- 1 NVIDIA L40S GPU (48GB VRAM)
- 96GB of RAM
- each benchmark was run at least 5 times
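Since each benchmark is repeated several times, the per-run throughputs have to be aggregated somehow. A minimal sketch of that kind of repetition harness (the `bench` callable, iteration counts, and median aggregation here are illustrative, not taken from pytorch/benchmark's own runner):

```python
import statistics
import time

def measure_ops(bench, iters_per_run=100, repeats=5):
    """Run `bench` repeatedly and report the median throughput (OPS).

    `bench` is any zero-argument callable standing in for one benchmark
    iteration; the real suite in pytorch/benchmark ships its own runner,
    so this is only an illustration of the repeat-and-aggregate idea.
    """
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(iters_per_run):
            bench()
        elapsed = time.perf_counter() - start
        runs.append(iters_per_run / elapsed)  # operations per second
    # The median is robust against a single slow (e.g. warm-up) run.
    return statistics.median(runs)
```

For example, `measure_ops(lambda: sum(range(1000)))` returns the median OPS over 5 runs of 100 iterations each.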
Results
All values are Operations per Second (OPS) relative to configuration (3), the PyTorch and CUDA wheels.
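Concretely, each cell is a configuration's raw OPS divided by the baseline configuration's OPS, expressed as a percentage. A small sketch with made-up throughput numbers (not the measured data):

```python
def relative_ops(raw, baseline_key="(3) PyTorch/CUDA wheels"):
    """Express each configuration's raw OPS as a percentage of the baseline."""
    base = raw[baseline_key]
    return {cfg: 100.0 * ops / base for cfg, ops in raw.items()}

# Hypothetical raw throughputs for one benchmark, for illustration only:
raw = {
    "(1) Built from source": 1004.1,
    "(2) PyTorch wheels": 998.8,
    "(3) PyTorch/CUDA wheels": 1000.0,
}
rel = relative_ops(raw)  # the baseline entry is 100.0% by construction
```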
Training on GPU
|                                 | (1) Built from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| ------------------------------- | --------------------- | ------------------ | ----------------------- |
| --TOTAL--                       | 100.37%               | 99.88%             | 100.00%                 |
| BERT_pytorch                    | 100.02%               | 99.61%             | 100.00%                 |
| alexnet                         | 99.63%                | 99.92%             | 100.00%                 |
| detectron2_fasterrcnn_r_101_c4  | x                     | 97.90%             | 100.00%                 |
| detectron2_fasterrcnn_r_101_dc5 | x                     | 99.80%             | 100.00%                 |
| detectron2_fasterrcnn_r_101_fpn | x                     | 102.76%            | 100.00%                 |
| LearningToPaint                 | 101.59%               | 101.26%            | 100.00%                 |
| dlrm                            | 99.88%                | 99.91%             | 100.00%                 |
| hf_Bert_large                   | 101.30%               | 98.47%             | 100.00%                 |
| mobilenet_v3_large              | 99.98%                | 99.45%             | 100.00%                 |
| timm_regnet                     | 100.39%               | 99.98%             | 100.00%                 |
| yolov3                          | 100.14%               | 99.61%             | 100.00%                 |

Note: runs marked with an x failed with errors.
Inference on GPU
|                                 | (1) Built from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| ------------------------------- | --------------------- | ------------------ | ----------------------- |
| --TOTAL--                       | 99.80%                | 99.56%             | 100.00%                 |
| BERT_pytorch                    | 100.29%               | 97.69%             | 100.00%                 |
| alexnet                         | 100.41%               | 98.62%             | 100.00%                 |
| detectron2_fasterrcnn_r_101_c4  | x                     | 99.80%             | 100.00%                 |
| detectron2_fasterrcnn_r_101_dc5 | x                     | 101.15%            | 100.00%                 |
| detectron2_fasterrcnn_r_101_fpn | x                     | 98.73%             | 100.00%                 |
| LearningToPaint                 | 98.01%                | 100.25%            | 100.00%                 |
| dlrm                            | 99.67%                | 99.83%             | 100.00%                 |
| hf_Bert_large                   | 99.14%                | 101.19%            | 100.00%                 |
| hf_T5_large                     | 98.42%                | 96.71%             | 100.00%                 |
| mobilenet_v3_large              | 100.55%               | 100.39%            | 100.00%                 |
| timm_regnet                     | 100.53%               | 100.30%            | 100.00%                 |
| yolov3                          | 101.19%               | 100.01%            | 100.00%                 |

Note: runs marked with an x failed with errors.
Training on CPU
|                                | (1) Built from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| ------------------------------ | --------------------- | ------------------ | ----------------------- |
| --TOTAL--                      | 31.94%                | 99.91%             | 100.00%                 |
| BERT_pytorch                   | 7.16%                 | 101.44%            | 100.00%                 |
| alexnet                        | 28.28%                | 101.51%            | 100.00%                 |
| detectron2_fasterrcnn_r_101_c4 | 71.91%                | 100.92%            | 100.00%                 |
| LearningToPaint                | 29.47%                | 101.39%            | 100.00%                 |
| dlrm                           | 22.87%                | 100.61%            | 100.00%                 |
| hf_Bert_large                  | 7.29%                 | 101.23%            | 100.00%                 |
| mobilenet_v3_large             | 9.93%                 | 93.36%             | 100.00%                 |
| timm_vision_transformer_large  | 77.08%                | 100.63%            | 100.00%                 |
| timm_regnet                    | 33.44%                | 98.12%             | 100.00%                 |
Inference on CPU
|                                | (1) Built from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| ------------------------------ | --------------------- | ------------------ | ----------------------- |
| --TOTAL--                      | 35.54%                | 99.63%             | 100.00%                 |
| BERT_pytorch                   | 2.08%                 | 100.66%            | 100.00%                 |
| alexnet                        | 28.77%                | 98.30%             | 100.00%                 |
| detectron2_fasterrcnn_r_101_c4 | 68.24%                | 99.39%             | 100.00%                 |
| LearningToPaint                | 32.40%                | 100.97%            | 100.00%                 |
| dlrm                           | 8.27%                 | 96.77%             | 100.00%                 |
| hf_Bert_large                  | 25.18%                | 94.50%             | 100.00%                 |
| hf_T5_large                    | 5.22%                 | 101.43%            | 100.00%                 |
| mobilenet_v3_large             | 21.25%                | 113.94%            | 100.00%                 |
| timm_vision_transformer_large  | 67.20%                | 99.87%             | 100.00%                 |
| timm_regnet                    | 61.48%                | 98.80%             | 100.00%                 |
| yolov3                         | 70.84%                | 91.34%             | 100.00%                 |
Observations
- Performance of inference/training on GPU is practically the same between the build from source and the official wheels
- Performance of inference/training on CPU is unreasonably low for the build from source in EB compared to the official wheels
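A gap this large on CPU usually points at the backend rather than the framework itself, e.g. the source build ending up without an optimized CPU backend (oneDNN/MKL) or with a bad intra-op thread setup. A hedged diagnostic sketch for comparing the two installations (the function name is mine; it only wraps standard `torch.backends` checks and degrades gracefully if torch is not importable):

```python
def cpu_backend_report():
    """Collect CPU-backend details that commonly explain large CPU-only
    performance gaps between PyTorch builds."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch not installed in this environment
    return {
        "torch": torch.__version__,
        "mkldnn": torch.backends.mkldnn.is_available(),  # oneDNN kernels
        "mkl": torch.backends.mkl.is_available(),        # MKL BLAS
        "openmp": torch.backends.openmp.is_available(),  # intra-op threading
        "threads": torch.get_num_threads(),
    }
```

Running this (plus `torch.__config__.show()`) in both the EB source build and the wheel install and diffing the output would confirm or rule out a missing backend.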