[Issue]: Very low performance when dmabuf is used #281

@IMbackK

Problem Description

I am aware that dmabuf support is currently experimental.

Using HSA_ENABLE_IPC_MODE_LEGACY=0 on a kernel with no KFD patches currently causes PyTorch DDP to underperform severely.

Every device-to-device copy is accompanied by a ~500 ms stall on all devices.

Operating System

Ubuntu 24.04

CPU

AMD EPYC 7552

GPU

3x MI100

ROCm Version

ROCm 6.3.0

ROCm Component

rccl

Steps to Reproduce

As a test case I am using the PyTorch DDP example from https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series, run with:

torchrun --nnode=1 --node_rank=0 --nproc_per_node=2 multigpu_torchrun.py --batch_size 8 100 10
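
For an A/B comparison of the two IPC modes, the command above can be wrapped in a small launcher. This is only a sketch: `build_repro` and `run_timed` are hypothetical helper names, and it assumes that HSA_ENABLE_IPC_MODE_LEGACY=1 selects the legacy (non-dmabuf) path, which this report does not confirm.

```python
import os
import shlex
import subprocess
import time

# The exact command from this report, unchanged.
REPRO_CMD = (
    "torchrun --nnode=1 --node_rank=0 --nproc_per_node=2 "
    "multigpu_torchrun.py --batch_size 8 100 10"
)


def build_repro(legacy_ipc):
    """Return (argv, env) for one run.

    legacy_ipc=False sets HSA_ENABLE_IPC_MODE_LEGACY=0 (the dmabuf path
    described in this report). legacy_ipc=True sets it to 1, which is
    ASSUMED here to select the legacy IPC path.
    """
    env = dict(os.environ)
    env["HSA_ENABLE_IPC_MODE_LEGACY"] = "1" if legacy_ipc else "0"
    return shlex.split(REPRO_CMD), env


def run_timed(argv, env):
    """Run the command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, env=env, check=True)
    return time.perf_counter() - start
```

Calling `run_timed(*build_repro(False))` and `run_timed(*build_repro(True))` from the directory containing multigpu_torchrun.py gives two wall-clock numbers to compare. A ~500 ms stall per copy should show up clearly in the total.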

I tested upstream kernels 6.6.64 and 6.12.8 with CONFIG_HSA_AMD_P2P and CONFIG_DMABUF_MOVE_NOTIFY enabled.
The ROCm bandwidth test shows good P2P performance.
The ROCm Validation Suite shows device PCIe P2P to be working.
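
For anyone trying to reproduce this, a quick way to confirm that both kernel options are set is to parse the kernel config file. This is a minimal sketch: `option_states` and `read_running_kernel_config` are hypothetical helpers, and the `/boot/config-<release>` location is an assumption that holds on typical Ubuntu installs but not necessarily for self-built kernels.

```python
import platform
from pathlib import Path

# The two options named in this report.
REQUIRED = ("CONFIG_HSA_AMD_P2P", "CONFIG_DMABUF_MOVE_NOTIFY")


def option_states(config_text, names=REQUIRED):
    """Map each option name to its value ("y", "m", ...) or None if unset."""
    states = {name: None for name in names}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key in states:
            states[key] = value
    return states


def read_running_kernel_config():
    """Read the config of the running kernel (ASSUMED path, see above)."""
    path = Path("/boot") / f"config-{platform.release()}"
    return path.read_text()
```

`option_states(read_running_kernel_config())` returns a dictionary in which a value of `None` marks an option that is not enabled.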

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
