When HSA_OVERRIDE_GFX_VERSION is set to an empty string (common in
shell scripts, K8s pod env defaults, and container layers), the ROCm runtime
treats the variable as "set" and tries to parse it. The parse fails, but instead
of falling back to the detected GPU, the runtime forces gfx000 (the no-device
sentinel), and every ROCm app fails with HSA_STATUS_ERROR_OUT_OF_RESOURCES.
Repro (30 seconds, any RDNA/CDNA GPU)
$ HSA_OVERRIDE_GFX_VERSION="" /opt/rocm/bin/rocm_agent_enumerator
Invalid HSA_OVERRIDE_GFX_VERSION format expected "1.2.3"
gfx000
gfx000
$ HSA_OVERRIDE_GFX_VERSION="" /opt/rocm/bin/rocminfo 2>&1 | head -3
ROCk module version 6.x.x is loaded
hsa api call failure at: ...
hsa api call failure: HSA_STATUS_ERROR_OUT_OF_RESOURCES
Compare with unset (works correctly):
$ unset HSA_OVERRIDE_GFX_VERSION; /opt/rocm/bin/rocm_agent_enumerator
gfx1201
gfx1201
Why this hits people
- Container init scripts that propagate env vars without filtering empties
- docker run -e HSA_OVERRIDE_GFX_VERSION= with no value → empty string
- K8s pod env from a ConfigMap key that wasn't populated → empty string
- Shell: export HSA_OVERRIDE_GFX_VERSION="${OVERRIDE:-}" with OVERRIDE unset
The diagnostic burden is high: the error message says "Invalid format", but
what users actually see is the secondary effect (gfx000 fallback → every
workload fails). People chase ulimit/permission rabbit holes for hours.
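All of the cases above leave the process in the same state: the variable exists but holds an empty string, which getenv reports as a non-NULL pointer rather than NULL. Below is a minimal sketch of that distinction using only the standard getenv call. The enum and the classify_env helper are hypothetical names for illustration, not part of ROCm.

```c
#include <stdlib.h>

/* Hypothetical helper, not ROCm code: the three states a variable can be in. */
enum env_state { ENV_UNSET, ENV_EMPTY, ENV_VALUE };

enum env_state classify_env(const char *name)
{
    const char *v = getenv(name);

    if (v == NULL)
        return ENV_UNSET;   /* variable is absent from the environment */
    if (v[0] == '\0')
        return ENV_EMPTY;   /* present but empty, e.g. VAR="" */
    return ENV_VALUE;       /* present with content worth parsing */
}
```

A check that only tests for NULL puts ENV_EMPTY in the same bucket as ENV_VALUE, which matches the behavior described in this report.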
Suggested fix
Most likely site: libhsakmt/src/topology.c, where the env var is currently read and the "Invalid format" message originates.

    const char* env = getenv("HSA_OVERRIDE_GFX_VERSION");
    if (env != NULL && strlen(env) > 0) {  // ← add `strlen > 0` check
        // ... parse and apply ...
    } else {
        // treat as unset, use detected ISA
    }

About 5 lines. Treat an empty string identically to unset.
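To make the proposed behavior concrete, here is a standalone sketch of the guard combined with a parse step. It is not the libhsakmt code: parse_gfx_override is a hypothetical name, and the "major.minor.step" format is an assumption inferred only from the `expected "1.2.3"` message in the repro output.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not the libhsakmt implementation.
 * Returns  1 and fills out[3] when the value parses as "major.minor.step",
 *          0 when the variable is unset or empty (caller keeps the detected ISA),
 *         -1 when it is non-empty but malformed (out may be partially written). */
int parse_gfx_override(const char *name, unsigned out[3])
{
    const char *env = getenv(name);
    char trailing;

    if (env == NULL || env[0] == '\0')
        return 0;   /* empty string is treated exactly like unset */

    /* The extra %c catches trailing garbage such as "1.2.3x". */
    if (sscanf(env, "%u.%u.%u%c", &out[0], &out[1], &out[2], &trailing) != 3)
        return -1;

    return 1;
}
```

A caller would keep the detected ISA on 0 and apply the override on 1. What to do on -1 (fall back or fail loudly) is a separate decision; this report only asks for the 0 case.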
Environment
- Hardware: any (hit it on Radeon AI PRO R9700 / gfx1201)
- ROCm 7.2.2 on Ubuntu 24.04 noble (rocm/dev-ubuntu-24.04 image)
- Reproduces on K8s pods with default empty env propagation