Skip to content

Fix/preflight distinguish single-node distributed from multi-node#565

Open
alexsu52 wants to merge 3 commits intoAMD-AGI:mainfrom
alexsu52:fix/preflight/network_mode
Open

Fix/preflight distinguish single-node distributed from multi-node#565
alexsu52 wants to merge 3 commits intoAMD-AGI:mainfrom
alexsu52:fix/preflight/network_mode

Conversation

@alexsu52
Copy link
Contributor

Changes:

  • Refine network_mode detection to calculate the actual number of nodes (nnodes) instead of relying solely on world_size.
  • Support node count detection from Slurm (SLURM_NNODES), OpenMPI (OMPI_COMM_WORLD_SIZE / OMPI_COMM_WORLD_LOCAL_SIZE), and PyTorch (WORLD_SIZE / LOCAL_WORLD_SIZE) environment variables.
  • Set network_mode="multi-node" only if nnodes > 1.

Reason for changes:

The previous logic incorrectly classified single-node distributed training (e.g., multi-GPU on one machine) as "multi-node" simply because the world size was greater than 1. This change ensures that network_mode accurately reflects whether training spans multiple physical nodes.

Copilot AI review requested due to automatic review settings February 26, 2026 10:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the network mode detection logic to accurately distinguish between single-node distributed training (multiple GPUs on one machine) and multi-node distributed training (spanning multiple physical nodes). Previously, any distributed training with world_size > 1 was incorrectly classified as "multi-node".

Changes:

  • Added detection of actual node count from Slurm, OpenMPI, and PyTorch environment variables
  • Calculate nnodes by dividing total world size by local world size
  • Set network_mode to "multi-node" only when nnodes > 1

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comment on lines 33 to 62
def detect_distributed_intent() -> Dict[str, Any]:
world_size = _env_int("WORLD_SIZE", 1)
local_world_size = _env_int("LOCAL_WORLD_SIZE", 1)

slurm_nnodes = _env_get("SLURM_NNODES")
slurm_ntasks = _env_get("SLURM_NTASKS")

ompi_size = _env_get("OMPI_COMM_WORLD_SIZE")
ompi_local_size = _env_get("OMPI_COMM_WORLD_LOCAL_SIZE")

slurm_nnodes_i = int(slurm_nnodes) if slurm_nnodes and slurm_nnodes.isdigit() else None
slurm_ntasks_i = int(slurm_ntasks) if slurm_ntasks and slurm_ntasks.isdigit() else None
ompi_size_i = int(ompi_size) if ompi_size and ompi_size.isdigit() else None
ompi_local_size_i = int(ompi_local_size) if ompi_local_size and ompi_local_size.isdigit() else None

is_distributed = (
bool(world_size and world_size > 1)
or bool(slurm_ntasks_i and slurm_ntasks_i > 1)
or bool(ompi_size_i and ompi_size_i > 1)
world_size > 1
or (slurm_ntasks_i and slurm_ntasks_i > 1)
or (ompi_size_i and ompi_size_i > 1)
)
network_mode = "multi-node" if is_distributed else "single-node"

nnodes = 1
if slurm_nnodes_i is not None:
nnodes = slurm_nnodes_i
elif ompi_size_i is not None and ompi_local_size_i is not None and ompi_size_i > 1 and ompi_local_size_i > 0:
nnodes = ompi_size_i // ompi_local_size_i
elif world_size > 1 and local_world_size > 0:
nnodes = world_size // local_world_size

network_mode = "multi-node" if nnodes > 1 else "single-node"
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to detect_distributed_intent significantly modify the node count detection logic, but there are no unit tests for this function. Given that the repository has comprehensive unit tests for other modules, consider adding tests to cover the new logic, especially edge cases like uneven division of processes across nodes, zero values in environment variables, and the priority order of detection methods (Slurm > OpenMPI > PyTorch).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Xiaoming-AMD please provide your opinion.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants