Commit 02ca70d

Fix/preflight NUMA imbalance to mean uneven GPU distribution across nodes (#554)
### Changes:
Only flag imbalance if the COUNT of GPUs on each node differs.

Example:
4 on Node 0, 4 on Node 1 -> counts=[4,4] -> set={4} -> len=1 -> NOT imbalanced.
7 on Node 0, 1 on Node 1 -> counts=[7,1] -> set={7,1} -> len=2 -> Imbalanced.

### Reason for changes:
The previous logic would issue a NUMA imbalance warning if not all GPUs were connected to the same node, resulting in a false positive when using a multi-socket CPU.

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
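The two examples in the commit message can be reproduced with a small standalone sketch. The helper name `is_numa_imbalanced` is hypothetical and not part of the patch; it only wraps the same `Counter`-based comparison so the cases can be run in isolation.

```python
from collections import Counter
from typing import Hashable, Iterable


def is_numa_imbalanced(nodes: Iterable[Hashable]) -> bool:
    """Return True when the NUMA nodes host differing numbers of GPUs.

    Hypothetical helper for illustration; `nodes` holds one NUMA node id per GPU.
    """
    counts = Counter(nodes).values()  # number of GPUs on each node that appears
    # Equal per-node counts collapse to a single value in the set.
    return len(set(counts)) > 1


# 4 on Node 0, 4 on Node 1 -> counts=[4,4] -> set={4} -> not imbalanced
print(is_numa_imbalanced([0, 0, 0, 0, 1, 1, 1, 1]))  # False
# 7 on Node 0, 1 on Node 1 -> counts=[7,1] -> set={7,1} -> imbalanced
print(is_numa_imbalanced([0] * 7 + [1]))  # True
# No NUMA information at all -> not imbalanced
print(is_numa_imbalanced([]))  # False
```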
1 parent 78633ae commit 02ca70d

1 file changed

Lines changed: 5 additions & 1 deletion

primus/tools/preflight/gpu/gpu_topology.py

@@ -7,6 +7,7 @@
 from __future__ import annotations

 import os
+from collections import Counter
 from typing import Any, Dict, List, Optional

 from .gpu_probe import probe_gpus
@@ -109,7 +110,10 @@ def run_gpu_standard_checks(*, force_topology: bool = False) -> Dict[str, Any]:
         findings.append(Finding("warn", "NUMA mapping unavailable (amd-smi not found); skipped", {}))
     else:
         nodes = [x.get("numa_node") for x in numa.get("gpus", []) if x.get("numa_node") is not None]
-        imbalance = len(set(nodes)) > 1 if nodes else False
+        imbalance = False
+        if nodes:
+            counts = Counter(nodes).values()
+            imbalance = len(set(counts)) > 1
         findings.append(
             Finding("info", "GPU↔NUMA mapping", {"mapping": numa.get("gpus", []), "imbalance": imbalance})
         )
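
To see the behavioral difference on the false-positive case, the old and new expressions from the hunk can be run side by side. This is a sketch only: the function names and the sample `numa` dictionaries are assumptions made for illustration, shaped after the `numa.get("gpus", [])` / `x.get("numa_node")` accesses in the diff.

```python
from collections import Counter
from typing import Any, Dict, List


def numa_nodes(numa: Dict[str, Any]) -> List[Any]:
    # Same filter as the diff: skip GPUs that report no NUMA node.
    return [x.get("numa_node") for x in numa.get("gpus", []) if x.get("numa_node") is not None]


def imbalance_before(nodes: List[Any]) -> bool:
    # Removed line: flags any mapping that spans more than one node.
    return len(set(nodes)) > 1 if nodes else False


def imbalance_after(nodes: List[Any]) -> bool:
    # Added lines: flags only differing per-node GPU counts.
    imbalance = False
    if nodes:
        counts = Counter(nodes).values()
        imbalance = len(set(counts)) > 1
    return imbalance


# Hypothetical sample data: an even dual-socket layout and a skewed one.
even = {"gpus": [{"numa_node": n} for n in (0, 0, 0, 0, 1, 1, 1, 1)]}
skewed = {"gpus": [{"numa_node": n} for n in (0, 0, 0, 0, 0, 0, 0, 1)] + [{"numa_node": None}]}

for name, numa in (("even", even), ("skewed", skewed)):
    nodes = numa_nodes(numa)
    print(name, imbalance_before(nodes), imbalance_after(nodes))
```

With these inputs the even layout is flagged by the old expression but not by the new one, while the skewed layout is flagged by both. Note that both versions only count nodes that appear in the mapping, so a layout with every GPU on a single node is not reported as imbalanced by either.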
