HSA_OVERRIDE_GFX_VERSION="" (empty string) silently degrades GPU detection to gfx000 #354

@yanghoeg

When HSA_OVERRIDE_GFX_VERSION is set to an empty string (common in
shell scripts, K8s pod env defaults, and container layers), the ROCm runtime
treats the variable as "set" and tries to parse it. The parse fails, but instead
of falling back to the detected GPU, the runtime forces gfx000 (the no-device
sentinel), and every ROCm app fails with HSA_STATUS_ERROR_OUT_OF_RESOURCES.

Repro (30 seconds, any RDNA/CDNA GPU)

$ HSA_OVERRIDE_GFX_VERSION="" /opt/rocm/bin/rocm_agent_enumerator
Invalid HSA_OVERRIDE_GFX_VERSION format expected "1.2.3"
gfx000
gfx000

$ HSA_OVERRIDE_GFX_VERSION="" /opt/rocm/bin/rocminfo 2>&1 | head -3
ROCk module version 6.x.x is loaded
hsa api call failure at: ...
hsa api call failure: HSA_STATUS_ERROR_OUT_OF_RESOURCES

Compare with unset (works correctly):

$ unset HSA_OVERRIDE_GFX_VERSION; /opt/rocm/bin/rocm_agent_enumerator
gfx1201
gfx1201

Why this hits people

  • Container init scripts that propagate env vars without filtering empties
  • docker run -e HSA_OVERRIDE_GFX_VERSION= (explicit empty value), or a bare -e HSA_OVERRIDE_GFX_VERSION when the host already exports it empty → empty string
  • K8s pod env from a ConfigMap key that wasn't populated → empty string
  • Shell export HSA_OVERRIDE_GFX_VERSION="${OVERRIDE:-}" with OVERRIDE unset

The diagnostic burden is high: the error message says "Invalid format", but
the secondary effect (gfx000 fallback → every workload fails) is what users
actually see. People chase ulimit/permission rabbit holes for hours.
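
Until the runtime handles this, an application that controls its own startup could drop the empty variable itself. The sketch below is one way to do that; `drop_empty_env` is a hypothetical helper name, and it only helps under the assumption that the runtime reads HSA_OVERRIDE_GFX_VERSION during HSA initialization rather than at library load time, which I have not verified.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes unsetenv() on strict compilers */
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: if `name` is set to the empty string, remove it
 * from the process environment. Returns true only when it was removed.
 * Unset variables and non-empty values are left untouched.
 */
static bool drop_empty_env(const char *name)
{
    const char *value = getenv(name);

    if (value != NULL && value[0] == '\0')
        return unsetenv(name) == 0;
    return false;
}
```

Calling `drop_empty_env("HSA_OVERRIDE_GFX_VERSION")` at the top of `main`, before the first HSA or HIP call, is the intended usage. It is a per-application workaround, not a substitute for fixing the runtime.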

Suggested fix

Most likely site: libhsakmt/src/topology.c (where the env var is currently read and the "Invalid format" message originates).

const char* env = getenv("HSA_OVERRIDE_GFX_VERSION");
if (env != NULL && strlen(env) > 0) {     // ← add `strlen > 0` check
    // ... parse and apply ...
} else {
    // treat as unset, use detected ISA
}

About 5 lines. Treat an empty string identically to unset.
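
To make the intended semantics concrete, here is a self-contained sketch of the three cases (unset or empty, valid, malformed). It is not the libhsakmt code: `parse_gfx_override`, `enum override_status`, and `struct gfx_version` are names invented for illustration, and the `sscanf`-based parsing is a stand-in for whatever parser the runtime already uses.

```c
#include <stdio.h>

/* Hypothetical result type for this sketch; not part of libhsakmt. */
enum override_status {
    OVERRIDE_NONE,    /* variable unset or empty: use the detected ISA */
    OVERRIDE_OK,      /* variable parsed as "major.minor.stepping" */
    OVERRIDE_INVALID  /* variable non-empty but malformed */
};

/* Hypothetical holder for a parsed override. */
struct gfx_version {
    unsigned int maj, min, step;
};

/*
 * Classify the raw value of HSA_OVERRIDE_GFX_VERSION.
 * NULL (unset) and "" (set but empty) are deliberately the same case.
 * *out is written only when OVERRIDE_OK is returned.
 */
static enum override_status parse_gfx_override(const char *env,
                                               struct gfx_version *out)
{
    unsigned int maj, min, step;
    char trailing;

    if (env == NULL || env[0] == '\0')
        return OVERRIDE_NONE;

    /*
     * Exactly three fields must convert. A result of 4 means there was
     * trailing text after the stepping; fewer than 3 means missing fields.
     * sscanf is lenient (it accepts leading whitespace, for example), so a
     * real fix would likely keep the runtime's existing parser.
     */
    if (sscanf(env, "%u.%u.%u%c", &maj, &min, &step, &trailing) != 3)
        return OVERRIDE_INVALID;

    out->maj = maj;
    out->min = min;
    out->step = step;
    return OVERRIDE_OK;
}
```

A caller would pass `getenv("HSA_OVERRIDE_GFX_VERSION")` and use the detected ISA on `OVERRIDE_NONE`. What to do on `OVERRIDE_INVALID` is left open here; this issue only asks that the empty string stop landing in that branch.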

Environment

  • Hardware: any (hit it on Radeon AI PRO R9700 / gfx1201)
  • ROCm 7.2.2 on Ubuntu 24.04 noble (rocm/dev-ubuntu-24.04 image)
  • Reproduces on K8s pods with default empty env propagation
