Commit 682d91e

Add GIL Article
1 parent 5f1bf8c commit 682d91e

1 file changed

+88
-0
lines changed

_gil/htcondor-gpu-support.md

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
---
title: HTCondor support of GPUs beyond NVIDIA
layout: markdown
date: 2025-08-15
excerpt: |
    HTCondor's AMD GPU support is currently broken: it does not recognize the GPUs present on the SDSC Cosmos system. The problem appears to be in the initial GPU discovery phase.

    Moreover, Apptainer, the standard containerization tool HTCondor uses to launch user jobs, also does not work correctly with AMD GPUs and needs to be fixed before the OSPool can effectively use resources that provide AMD GPUs.
---

*By Igor Sfiligoi - University of California San Diego*
*Aug 15th, 2025*

The GPU computing landscape has long been dominated by NVIDIA hardware, which is used through the CUDA interface. HTCondor, which underpins compute scheduling in the OSPool, therefore added support for NVIDIA GPUs long ago, and many users rely on this functionality in production.

While NVIDIA still dominates the GPU landscape, AMD has recently introduced compelling GPU models that are starting to be deployed at several resource providers of interest to the OSPool. HTCondor has experimental support for AMD GPUs, which rely on the ROCm interface, and this activity was initiated to validate that support on deployed resources.

## Executive Summary

HTCondor's AMD GPU support is currently broken: it does not recognize the GPUs present on the SDSC Cosmos system. The problem appears to be in the initial GPU discovery phase.

Moreover, Apptainer, the standard containerization tool HTCondor uses to launch user jobs, also does not work correctly with AMD GPUs and needs to be fixed before the OSPool can effectively use resources that provide AMD GPUs.

## Observations and Recommendations

### Observation #1

HTCondor does not properly detect the AMD GPUs on SDSC Cosmos.

The root cause appears to be a crash of `condor_gpu_discovery`, which aborts on an out-of-bounds index into a vector of OpenCL device handles (`_cl_device_id*`):
```
/usr/libexec/condor/condor_gpu_discovery -config -extra
/opt/rh/gcc-toolset-14/root/usr/include/c++/14/bits/stl_vector.h:1130: constexpr std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = _cl_device_id*; _Alloc = std::allocator<_cl_device_id*>; reference = _cl_device_id*&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
Aborted
```
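
A site that wants to notice this failure automatically could wrap the discovery call and check how it terminated. The following is a minimal sketch, not part of HTCondor: the helper names `classify_exit` and `run_discovery` are hypothetical, and the command line is simply the one shown above. It relies on Python's `subprocess` convention of reporting a process killed by a signal as a negative return code.

```python
import signal
import subprocess

# Command line copied from the report above; adjust the path for the local install.
DISCOVERY_CMD = ["/usr/libexec/condor/condor_gpu_discovery", "-config", "-extra"]


def classify_exit(returncode: int) -> str:
    """Map a subprocess return code to 'ok', 'crashed (...)' or 'failed (...)'."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        return "crashed (" + signal.Signals(-returncode).name + ")"
    return "failed (exit code " + str(returncode) + ")"


def run_discovery(cmd=DISCOVERY_CMD) -> str:
    """Run the discovery tool (hypothetical wrapper) and report how it ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return "not installed"
    return classify_exit(proc.returncode)
```

An abort such as the one above would be reported as a crash rather than as an ordinary non-zero exit, which makes it easy to tell a tool bug from a node that simply has no GPUs.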

**Recommendation:**

The HTCondor team should fix the `condor_gpu_discovery` tool.

---

### Observation #2

Apptainer blindly imports the host's AMD ROCm libraries and tools into the container. That works only when the OS inside and outside the container is the same, which is often not the case.

In this particular test, the OS on SDSC Cosmos was SLES 15, while the OS in the container was AlmaLinux 9. The imported `libamd_comgr.so.2` requires `GLIBCXX_3.4.30`, which the container's `/lib64/libstdc++.so.6` does not provide:

```
singularity shell --rocm …
> /usr/libexec/condor/condor_gpu_discovery -diag -hip
diag: clearing environment before device enumeration
Can't open library 'libamdhip64.so.6': '/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /.singularity.d/libs/libamd_comgr.so.2)'
```
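
The mismatch can be checked ahead of time by looking for the required version tag in the container's `libstdc++`. The sketch below is illustrative only: the function names are hypothetical, and scanning the raw bytes of the library for `GLIBCXX_x.y.z` strings is a rough stand-in for what `strings libstdc++.so.6 | grep GLIBCXX` would show, not a proper ELF parse.

```python
import re

# Matches version tags such as GLIBCXX_3.4 or GLIBCXX_3.4.30 in raw bytes.
_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(blob: bytes) -> set:
    """Return every GLIBCXX version tag in blob as a tuple of ints."""
    return {
        tuple(int(part) for part in match.group(1).split(b"."))
        for match in _TAG.finditer(blob)
    }


def has_tag(blob: bytes, required: str) -> bool:
    """True if blob contains exactly the required tag, e.g. '3.4.30'."""
    want = tuple(int(part) for part in required.split("."))
    return want in glibcxx_versions(blob)


def library_provides(lib_path: str, required: str) -> bool:
    """Hypothetical helper: check a libstdc++ file on disk for a tag."""
    with open(lib_path, "rb") as handle:
        return has_tag(handle.read(), required)
```

Run inside the container, `library_provides("/lib64/libstdc++.so.6", "3.4.30")` would be expected to return `False` in the situation described above, since that is the tag the loader reports as missing.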

I got the same result using SingularityPro.

**Recommendation:**

The Apptainer team should fix the handling of the `--rocm` option.

---

### Observation #3

Container images that include the ROCm stack work correctly in Apptainer only if process isolation (the `--contain` option) is not enabled. Unfortunately, pilot systems require this isolation.

```
singularity shell <rocm-enabled image>
> /usr/libexec/condor/condor_gpu_discovery -diag -hip
diag: clearing environment before device enumeration
diag: hip_Init()
diag: hipDevice count: 4
DetectedGPUs="GPU0, GPU1, GPU2, GPU3"

singularity shell --contain <rocm-enabled image>
> /usr/libexec/condor/condor_gpu_discovery -diag -hip
diag: clearing environment before device enumeration
diag: hip_Init()
diag: hipDevice count: 0
DetectedGPUs=0
```
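
To compare the two runs in an automated test, the `DetectedGPUs` line can be reduced to a count. This is a small sketch under the assumption that the output always takes one of the two forms shown above (a quoted, comma-separated list, or a bare `0`); the function name `count_detected_gpus` is hypothetical.

```python
def count_detected_gpus(output: str) -> int:
    """Count GPUs listed on the DetectedGPUs= line of discovery output."""
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DetectedGPUs":
            value = value.strip().strip('"')
            if value in ("", "0"):
                return 0
            # Names are comma-separated, e.g. "GPU0, GPU1".
            return len([name for name in value.split(",") if name.strip()])
    raise ValueError("no DetectedGPUs line in output")
```

Applied to the two transcripts above, the first yields 4 and the second 0, so a regression test for the `--contain` case only needs to assert that both counts match.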

**Recommendation:**

The Apptainer team should investigate how to fix this. It may be possible to address it as part of fixing the `--rocm` option.
