Fix/preflight distinguish single-node distributed from multi-node#565
Fix/preflight distinguish single-node distributed from multi-node#565alexsu52 wants to merge 3 commits intoAMD-AGI:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes the network mode detection logic to accurately distinguish between single-node distributed training (multiple GPUs on one machine) and multi-node distributed training (spanning multiple physical nodes). Previously, any distributed training with world_size > 1 was incorrectly classified as "multi-node".
Changes:
- Added detection of actual node count from Slurm, OpenMPI, and PyTorch environment variables
- Calculate nnodes by dividing total world size by local world size
- Set network_mode to "multi-node" only when nnodes > 1
| def detect_distributed_intent() -> Dict[str, Any]: | ||
| world_size = _env_int("WORLD_SIZE", 1) | ||
| local_world_size = _env_int("LOCAL_WORLD_SIZE", 1) | ||
|
|
||
| slurm_nnodes = _env_get("SLURM_NNODES") | ||
| slurm_ntasks = _env_get("SLURM_NTASKS") | ||
|
|
||
| ompi_size = _env_get("OMPI_COMM_WORLD_SIZE") | ||
| ompi_local_size = _env_get("OMPI_COMM_WORLD_LOCAL_SIZE") | ||
|
|
||
| slurm_nnodes_i = int(slurm_nnodes) if slurm_nnodes and slurm_nnodes.isdigit() else None | ||
| slurm_ntasks_i = int(slurm_ntasks) if slurm_ntasks and slurm_ntasks.isdigit() else None | ||
| ompi_size_i = int(ompi_size) if ompi_size and ompi_size.isdigit() else None | ||
| ompi_local_size_i = int(ompi_local_size) if ompi_local_size and ompi_local_size.isdigit() else None | ||
|
|
||
| is_distributed = ( | ||
| bool(world_size and world_size > 1) | ||
| or bool(slurm_ntasks_i and slurm_ntasks_i > 1) | ||
| or bool(ompi_size_i and ompi_size_i > 1) | ||
| world_size > 1 | ||
| or (slurm_ntasks_i and slurm_ntasks_i > 1) | ||
| or (ompi_size_i and ompi_size_i > 1) | ||
| ) | ||
| network_mode = "multi-node" if is_distributed else "single-node" | ||
|
|
||
| nnodes = 1 | ||
| if slurm_nnodes_i is not None: | ||
| nnodes = slurm_nnodes_i | ||
| elif ompi_size_i is not None and ompi_local_size_i is not None and ompi_size_i > 1 and ompi_local_size_i > 0: | ||
| nnodes = ompi_size_i // ompi_local_size_i | ||
| elif world_size > 1 and local_world_size > 0: | ||
| nnodes = world_size // local_world_size | ||
|
|
||
| network_mode = "multi-node" if nnodes > 1 else "single-node" |
There was a problem hiding this comment.
The changes to detect_distributed_intent significantly modify the node count detection logic, but there are no unit tests for this function. Given that the repository has comprehensive unit tests for other modules, consider adding tests to cover the new logic, especially edge cases like uneven division of processes across nodes, zero values in environment variables, and the priority order of detection methods (Slurm > OpenMPI > PyTorch).
There was a problem hiding this comment.
@Xiaoming-AMD please provide your opinion.
Changes:
Reason for changes:
The previous logic incorrectly classified single-node distributed training (e.g., multi-GPU on one machine) as "multi-node" simply because the world size was greater than 1. This change ensures that network_mode accurately reflects whether training spans multiple physical nodes.