
Performance of PyTorch in EB compared to upstream wheels #931

@lexming

Description


I've run multiple inference and training benchmarks from https://github.com/pytorch/benchmark to compare the performance of PyTorch built from source in EB against a simple installation of the official wheels from pytorch.org.
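For context, individual workloads in that repository are driven by its `run.py` script; a sketch of how a single benchmark is typically invoked (the exact flags here are my assumption based on the upstream README, not a record of the actual job scripts used):

```shell
# Hypothetical invocation of single torchbenchmark workloads,
# run from a checkout of https://github.com/pytorch/benchmark:
python run.py BERT_pytorch -d cuda -t train   # training on GPU
python run.py BERT_pytorch -d cpu -t eval     # inference on CPU
```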

Test configurations

The following tables show the results for PyTorch 2.6.0 in 2024a. I made 3 different installations:

  1. Standard build from source in EB

  2. Official wheels from pytorch.org on top of Python/SciPy-bundle (2024a) and CUDA 12.6.0 from EB

  3. Official wheels for PyTorch and CUDA on top of a minimal Python/SciPy-bundle (2024a) environment on EB
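Configuration (3) can be reproduced with something along these lines (a sketch; the index URL and pinned version are my assumption of the matching CUDA 12.6 wheel channel, not copied from the actual installation):

```shell
# Hypothetical reproduction of configuration (3): official wheels that
# bundle their own CUDA libraries, installed into a minimal EB Python
# environment. cu126 matches the CUDA 12.6.0 used in configuration (2).
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126
```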

Hardware/Job

All benchmarks ran on the same system:

  • 16 AMD Zen5 CPU cores
  • 1 NVIDIA L40S GPU (48GB VRAM)
  • 96GB of RAM
  • each benchmark ran at least 5 times

Results

All values are Operations per Second (OPS) relative to configuration (3), the PyTorch and CUDA wheels.
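That normalization can be expressed as a small helper (a sketch; the configuration keys and raw OPS numbers below are made up for illustration, they are not the measured values):

```python
# Normalize raw OPS measurements against a baseline configuration,
# as done in the tables below: every value is reported as a
# percentage of configuration (3).
def relative_ops(raw, baseline_key):
    baseline = raw[baseline_key]
    return {
        config: {bench: 100.0 * ops / baseline[bench]
                 for bench, ops in results.items()}
        for config, results in raw.items()
    }

# Illustrative (made-up) raw OPS numbers for one benchmark:
raw = {
    "source_build": {"alexnet": 1195.6},
    "wheels":       {"alexnet": 1199.0},
    "wheels_cuda":  {"alexnet": 1200.0},  # baseline, configuration (3)
}
rel = relative_ops(raw, "wheels_cuda")
print(f"{rel['source_build']['alexnet']:.2f}%")  # → 99.63%
```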

Training on GPU

| Benchmark | (1) Build from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| --- | --- | --- | --- |
| --TOTAL-- | 100.37% | 99.88% | 100.00% |
| BERT_pytorch | 100.02% | 99.61% | 100.00% |
| alexnet | 99.63% | 99.92% | 100.00% |
| detectron2_fasterrcnn_r_101_c4 | x | 97.90% | 100.00% |
| detectron2_fasterrcnn_r_101_dc5 | x | 99.80% | 100.00% |
| detectron2_fasterrcnn_r_101_fpn | x | 102.76% | 100.00% |
| LearningToPaint | 101.59% | 101.26% | 100.00% |
| dlrm | 99.88% | 99.91% | 100.00% |
| hf_Bert_large | 101.30% | 98.47% | 100.00% |
| mobilenet_v3_large | 99.98% | 99.45% | 100.00% |
| timm_regnet | 100.39% | 99.98% | 100.00% |
| yolov3 | 100.14% | 99.61% | 100.00% |

Note: runs marked with an x failed with errors.

Inference on GPU

| Benchmark | (1) Build from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| --- | --- | --- | --- |
| --TOTAL-- | 99.80% | 99.56% | 100.00% |
| BERT_pytorch | 100.29% | 97.69% | 100.00% |
| alexnet | 100.41% | 98.62% | 100.00% |
| detectron2_fasterrcnn_r_101_c4 | x | 99.80% | 100.00% |
| detectron2_fasterrcnn_r_101_dc5 | x | 101.15% | 100.00% |
| detectron2_fasterrcnn_r_101_fpn | x | 98.73% | 100.00% |
| LearningToPaint | 98.01% | 100.25% | 100.00% |
| dlrm | 99.67% | 99.83% | 100.00% |
| hf_Bert_large | 99.14% | 101.19% | 100.00% |
| hf_T5_large | 98.42% | 96.71% | 100.00% |
| mobilenet_v3_large | 100.55% | 100.39% | 100.00% |
| timm_regnet | 100.53% | 100.30% | 100.00% |
| yolov3 | 101.19% | 100.01% | 100.00% |

Note: runs marked with an x failed with errors.

Training on CPU

| Benchmark | (1) Build from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| --- | --- | --- | --- |
| --TOTAL-- | 31.94% | 99.91% | 100.00% |
| BERT_pytorch | 7.16% | 101.44% | 100.00% |
| alexnet | 28.28% | 101.51% | 100.00% |
| detectron2_fasterrcnn_r_101_c4 | 71.91% | 100.92% | 100.00% |
| LearningToPaint | 29.47% | 101.39% | 100.00% |
| dlrm | 22.87% | 100.61% | 100.00% |
| hf_Bert_large | 7.29% | 101.23% | 100.00% |
| mobilenet_v3_large | 9.93% | 93.36% | 100.00% |
| timm_vision_transformer_large | 77.08% | 100.63% | 100.00% |
| timm_regnet | 33.44% | 98.12% | 100.00% |

Inference on CPU

| Benchmark | (1) Build from source | (2) PyTorch wheels | (3) PyTorch/CUDA wheels |
| --- | --- | --- | --- |
| --TOTAL-- | 35.54% | 99.63% | 100.00% |
| BERT_pytorch | 2.08% | 100.66% | 100.00% |
| alexnet | 28.77% | 98.30% | 100.00% |
| detectron2_fasterrcnn_r_101_c4 | 68.24% | 99.39% | 100.00% |
| LearningToPaint | 32.40% | 100.97% | 100.00% |
| dlrm | 8.27% | 96.77% | 100.00% |
| hf_Bert_large | 25.18% | 94.50% | 100.00% |
| hf_T5_large | 5.22% | 101.43% | 100.00% |
| mobilenet_v3_large | 21.25% | 113.94% | 100.00% |
| timm_vision_transformer_large | 67.20% | 99.87% | 100.00% |
| timm_regnet | 61.48% | 98.80% | 100.00% |
| yolov3 | 70.84% | 91.34% | 100.00% |

Observations

  • Inference/training performance on GPU is practically identical between the builds from source and the official wheels
  • Inference/training performance on CPU of the builds from source in EB is unreasonably low compared to the official wheels
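A first thing worth checking for the slow CPU runs is the threading configuration visible inside the job, since OpenMP/BLAS thread counts and pinning dominate CPU training throughput. A standard-library sketch; the list of variables is my assumption about what is relevant, not something verified against these runs:

```python
import os

# Thread-related environment variables that commonly explain large CPU
# performance gaps between PyTorch installations. An unset value often
# means "use all visible cores", which can oversubscribe a 16-core job.
THREAD_VARS = ["OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
               "OMP_PROC_BIND", "GOMP_CPU_AFFINITY", "KMP_AFFINITY"]

def thread_env():
    """Return the thread-related settings visible to this process."""
    return {var: os.environ.get(var, "<unset>") for var in THREAD_VARS}

for var, value in thread_env().items():
    print(f"{var}={value}")
```

Comparing this output between the source build and the wheel environments would rule out (or confirm) a simple thread-oversubscription or affinity difference before digging into the build options themselves.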
