GC less effective in AMDGPU than CUDA #683

Closed
@evelyne-ringoot

Description

Creating many small copies for benchmarking slows AMDGPU.jl down considerably, which I do not observe with CUDA.jl. For this specific code the solution is to avoid allocations altogether, but that is (maybe?) not possible for every kind of code. (I also remember having some issues with BenchmarkTools, but cannot reproduce them right now.) Sharing the code here for future reference:

using AMDGPU, BSON
n_values=(2 .^(1:14))
timings=zeros(2,length(n_values))

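# Times gemm with fresh copies of A and B allocated in every iteration.
function mybelapsed(A, B)
   AMDGPU.rocBLAS.gemm('N','N',copy(A),copy(B))  # untimed first call
   t=0.0
   k=0
   while (k<1e5 && t<1)  # stop after 1e5 runs or 1 s of measured time
       Acpy=copy(A)
       Bcpy=copy(B)
       AMDGPU.synchronize()
       t+= @elapsed (AMDGPU.@sync AMDGPU.rocBLAS.gemm('N','N',Acpy,Bcpy))
       AMDGPU.synchronize()
       k+=1
   end
   return t/k  # mean seconds per call
end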

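# Times gemm while reusing two preallocated buffers instead of copying each time.
function mybelapsed2(A, B)
   AMDGPU.rocBLAS.gemm('N','N',copy(A),copy(B))  # untimed first call
   t=0.0
   k=0
   Acpy=copy(A)
   Bcpy=copy(B)
   while (k<1e5 && t<1)  # loop like mybelapsed (the posted code had `if` here)
       AMDGPU.synchronize()
       t+= @elapsed (AMDGPU.@sync AMDGPU.rocBLAS.gemm('N','N',Acpy,Bcpy))
       AMDGPU.synchronize()
       Acpy.=A  # refresh the buffers in place, no new allocation
       Bcpy.=B
       k+=1
   end
   return t/k  # mean seconds per call
end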


for (i,n) in enumerate(n_values)
   A=ROCArray(rand(Float32,n,n));
   B=ROCArray(rand(Float32,n,n));
   println(n)
   timings[1,i]=mybelapsed(A,B)
   GC.gc()
   sleep(1)
   timings[2,i]=mybelapsed2(A,B)
   GC.gc()
   sleep(1)
   BSON.@save "AMD_matmul_bench.bson" timings
end

Calling AMDGPU.unsafe_free! in every iteration does not solve the problem, and neither does turning the GC off and manually running GC.enable(true); AMDGPU.unsafe_free!(Acpy); AMDGPU.unsafe_free!(Bcpy); GC.gc(); sleep(0.001); GC.enable(false); between iterations. The same code with AMDGPU replaced by CUDA (and the rocBLAS gemm call replaced by Acpy*Bcpy) shows barely any performance difference between the two versions (the version using copies is even slightly faster and more stable):
Image
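
For anyone trying to reproduce the manual-cleanup variant described above, here is a minimal sketch of how that loop could look. It is a reconstruction from the description, not the exact code that was run. The function name `timed_loop_with_free` and its parameters `kernel`, `sync`, `free!`, `max_iters` and `budget` are hypothetical and not part of AMDGPU.jl or CUDA.jl; they only exist so the same loop can be pointed at either backend, or at plain CPU arrays as done at the bottom.

```julia
# Hypothetical backend-agnostic harness; none of these names come from AMDGPU.jl.
#   kernel(a, b) : operation to time
#   sync()       : blocks until the device is idle
#   free!(x)     : eagerly releases one array
function timed_loop_with_free(kernel, sync, free!, A, B; max_iters=100_000, budget=1.0)
    kernel(copy(A), copy(B))            # untimed first call
    t = 0.0
    k = 0
    was_enabled = GC.enable(false)      # GC.enable returns the previous state
    try
        while k < max_iters && t < budget
            Acpy = copy(A)
            Bcpy = copy(B)
            sync()
            t += @elapsed begin
                kernel(Acpy, Bcpy)
                sync()
            end
            # Manual cleanup sequence quoted in the paragraph above.
            GC.enable(true)
            free!(Acpy)
            free!(Bcpy)
            GC.gc()
            sleep(0.001)
            GC.enable(false)
            k += 1
        end
    finally
        GC.enable(was_enabled)          # restore the caller's GC setting
    end
    return t / k                        # mean seconds per call
end

# CPU stand-in so the sketch runs without a GPU: no-op sync and free!.
A = rand(Float32, 16, 16)
B = rand(Float32, 16, 16)
t_cpu = timed_loop_with_free(*, () -> nothing, _ -> nothing, A, B; max_iters=5)
println("mean seconds per call: ", t_cpu)
```

On the GPU the arguments would presumably be `(a, b) -> AMDGPU.rocBLAS.gemm('N', 'N', a, b)`, `AMDGPU.synchronize` and `AMDGPU.unsafe_free!`, with `ROCArray` inputs; for the CUDA comparison, `*`, `CUDA.synchronize` and `CUDA.unsafe_free!`. Note that this sketch only frees the two input copies; whatever array `kernel` returns is left to the GC.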

Versions:

julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7302 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
┌───────────┬──────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                    │
├───────────┼──────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /opt/rocm/llvm/bin/ld.lld                                                               │
│     +     │ Device Libraries │ -         │ /home/eringoot/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│     +     │ HIP              │ 6.0.32831 │ /opt/rocm-6.0.2/lib/libamdhip64.so                                                      │
│     +     │ rocBLAS          │ 4.0.0     │ /opt/rocm-6.0.2/lib/librocblas.so                                                       │
│     +     │ rocSOLVER        │ 3.24.0    │ /opt/rocm-6.0.2/lib/librocsolver.so                                                     │
│     +     │ rocALUTION       │ -         │ /opt/rocm-6.0.2/lib/librocalution.so                                                    │
│     +     │ rocSPARSE        │ -         │ /opt/rocm-6.0.2/lib/librocsparse.so                                                     │
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm-6.0.2/lib/librocrand.so                                                       │
│     +     │ rocFFT           │ 1.0.27    │ /opt/rocm-6.0.2/lib/librocfft.so                                                        │
│     +     │ MIOpen           │ 3.0.0     │ /opt/rocm-6.0.2/lib/libMIOpen.so                                                        │
└───────────┴──────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬────────────────────────┬────────────────────────┬───────────┬────────────┐
│ Id │                   Name │               GCN arch │ Wavefront │     Memory │
├────┼────────────────────────┼────────────────────────┼───────────┼────────────┤
│  1 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  2 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  3 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  4 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  5 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  6 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  7 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
│  8 │ AMD Instinct MI50/MI60 │ gfx906:sramecc+:xnack- │        64 │ 31.984 GiB │
└────┴────────────────────────┴────────────────────────┴───────────┴────────────┘

@jpsamaroo @vchuravy @pxl-th
