Fix/preflight distinguish single-node distributed from multi-node by alexsu52 · Pull Request #565 · AMD-AGI/Primus

alexsu52 · 2026-02-26T10:23:37Z

Changes:

Refine network_mode detection to calculate the actual number of nodes (nnodes) instead of relying solely on world_size.
Support node count detection from Slurm (SLURM_NNODES), OpenMPI (OMPI_COMM_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE), and PyTorch (WORLD_SIZE / LOCAL_WORLD_SIZE) environment variables.
Set network_mode="multi-node" only if nnodes > 1.

Reason for changes:

The previous logic incorrectly classified single-node distributed training (e.g., multi-GPU on one machine) as "multi-node" simply because the world size was greater than 1. This change ensures that network_mode accurately reflects whether training spans multiple physical nodes.

Copilot

Pull request overview

This PR fixes the network mode detection logic to accurately distinguish between single-node distributed training (multiple GPUs on one machine) and multi-node distributed training (spanning multiple physical nodes). Previously, any distributed training with world_size > 1 was incorrectly classified as "multi-node".

Changes:

Added detection of actual node count from Slurm, OpenMPI, and PyTorch environment variables
Calculate nnodes by dividing total world size by local world size
Set network_mode to "multi-node" only when nnodes > 1

primus/tools/preflight/network/network_probe.py

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Copilot · 2026-02-26T10:48:29Z

primus/tools/preflight/network/network_probe.py

 def detect_distributed_intent() -> Dict[str, Any]:
    world_size = _env_int("WORLD_SIZE", 1)
+    local_world_size = _env_int("LOCAL_WORLD_SIZE", 1)
+
+    slurm_nnodes = _env_get("SLURM_NNODES")
    slurm_ntasks = _env_get("SLURM_NTASKS")
+
    ompi_size = _env_get("OMPI_COMM_WORLD_SIZE")
+    ompi_local_size = _env_get("OMPI_COMM_WORLD_LOCAL_SIZE")

+    slurm_nnodes_i = int(slurm_nnodes) if slurm_nnodes and slurm_nnodes.isdigit() else None
    slurm_ntasks_i = int(slurm_ntasks) if slurm_ntasks and slurm_ntasks.isdigit() else None
    ompi_size_i = int(ompi_size) if ompi_size and ompi_size.isdigit() else None
+    ompi_local_size_i = int(ompi_local_size) if ompi_local_size and ompi_local_size.isdigit() else None

    is_distributed = (
-        bool(world_size and world_size > 1)
-        or bool(slurm_ntasks_i and slurm_ntasks_i > 1)
-        or bool(ompi_size_i and ompi_size_i > 1)
+        world_size > 1
+        or (slurm_ntasks_i and slurm_ntasks_i > 1)
+        or (ompi_size_i and ompi_size_i > 1)
    )
-    network_mode = "multi-node" if is_distributed else "single-node"
+
+    nnodes = 1
+    if slurm_nnodes_i is not None:
+        nnodes = slurm_nnodes_i
+    elif ompi_size_i is not None and ompi_local_size_i is not None and ompi_size_i > 1 and ompi_local_size_i > 0:
+        nnodes = ompi_size_i // ompi_local_size_i
+    elif world_size > 1 and local_world_size > 0:
+        nnodes = world_size // local_world_size
+
+    network_mode = "multi-node" if nnodes > 1 else "single-node"


The changes to detect_distributed_intent significantly modify the node count detection logic, but there are no unit tests for this function. Given that the repository has comprehensive unit tests for other modules, consider adding tests to cover the new logic, especially edge cases like uneven division of processes across nodes, zero values in environment variables, and the priority order of detection methods (Slurm > OpenMPI > PyTorch).

@Xiaoming-AMD please provide your opinion.

primus/tools/preflight/network/network_probe.py

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

fix(preflight): distinguish single-node distributed from multi-node

3a430fe

alexsu52 requested a review from Xiaoming-AMD as a code owner February 26, 2026 10:23

Copilot AI review requested due to automatic review settings February 26, 2026 10:23

alexsu52 requested review from limou102 and wenxie-amd as code owners February 26, 2026 10:23

Copilot AI reviewed Feb 26, 2026

View reviewed changes

primus/tools/preflight/network/network_probe.py Outdated Show resolved Hide resolved

primus/tools/preflight/network/network_probe.py Show resolved Hide resolved

replied to comments

0f62a45

alexsu52 requested a review from Copilot February 26, 2026 10:43

Copilot started reviewing on behalf of alexsu52 February 26, 2026 10:44 View session

Copilot AI reviewed Feb 26, 2026

View reviewed changes

replied to comments

3d306d1

alexsu52 requested a review from Copilot February 26, 2026 11:21

Copilot started reviewing on behalf of alexsu52 February 26, 2026 11:23 View session

Copilot AI reviewed Feb 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/preflight distinguish single-node distributed from multi-node#565

Fix/preflight distinguish single-node distributed from multi-node#565
alexsu52 wants to merge 3 commits intoAMD-AGI:mainfrom
alexsu52:fix/preflight/network_mode

alexsu52 commented Feb 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 26, 2026

Uh oh!

alexsu52 Feb 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alexsu52 commented Feb 26, 2026

Changes:

Reason for changes:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

alexsu52 Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants