Open
Description
Summary
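When using ./primus-cli slurm, manually setting MASTER_ADDR can cause the job to hang if the address does not match the first node in SLURM_NODELIST. This happens because node ranks are derived from the Slurm node allocation, so a MASTER_ADDR that points to any other node no longer corresponds to the node that the rank mapping treats as the head node.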
Details
The current script respects an existing MASTER_ADDR environment variable:
Primus/runner/primus-cli-slurm-entry.sh, line 123 in 02ca70d:
if [[ -z "${MASTER_ADDR:-}" ]]; then
If a user (or a previous script) sets MASTER_ADDR incorrectly (e.g., to a node not in the current allocation or not the head node), the training hangs indefinitely.
Suggested Fix
- Remap NODE_RANK so that it is consistent with the user's MASTER_ADDR.
- Raise a fatal error if MASTER_ADDR is set and does not match SLURM_NODELIST[0], to improve UX/DX.
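Both options could be sketched roughly as follows. This is a minimal illustration, not the code in primus-cli-slurm-entry.sh: the function names check_master_addr and remap_node_rank are hypothetical, the expanded host list is passed in as arguments, and hosts are compared as literal strings (so an IP address or FQDN in MASTER_ADDR would need to be normalized first).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; none of these names come from
# primus-cli-slurm-entry.sh. Hosts are compared as literal strings.

# Fail-fast option: succeed only if MASTER_ADDR equals the first allocated node.
# Usage: check_master_addr <master_addr> <host>...
check_master_addr() {
    local master_addr="$1"
    shift
    local head_node="${1:-}"
    if [[ -z "$head_node" ]]; then
        echo "ERROR: node list is empty" >&2
        return 2
    fi
    if [[ "$master_addr" != "$head_node" ]]; then
        echo "ERROR: MASTER_ADDR=$master_addr does not match first node $head_node" >&2
        return 1
    fi
    return 0
}

# Remap option: print a rank for <this_host> such that <master_addr> gets
# rank 0 and all other nodes keep their relative order from the node list.
# Usage: remap_node_rank <master_addr> <this_host> <host>...
remap_node_rank() {
    local master_addr="$1" this_host="$2"
    shift 2
    local host next_rank=1 found_master=0 result=""
    for host in "$@"; do
        if [[ "$host" == "$master_addr" ]]; then
            found_master=1
            if [[ "$host" == "$this_host" ]]; then
                result=0
            fi
        else
            if [[ "$host" == "$this_host" ]]; then
                result=$next_rank
            fi
            next_rank=$((next_rank + 1))
        fi
    done
    if [[ "$found_master" -eq 0 ]]; then
        echo "ERROR: MASTER_ADDR=$master_addr is not in the allocation" >&2
        return 1
    fi
    if [[ -z "$result" ]]; then
        echo "ERROR: host $this_host is not in the allocation" >&2
        return 1
    fi
    echo "$result"
}
```

On a real allocation, the host list would typically come from expanding SLURM_NODELIST, for example with mapfile -t hosts < <(scontrol show hostnames "$SLURM_NODELIST"), followed by check_master_addr "$MASTER_ADDR" "${hosts[@]}" || exit 1. The fail-fast check is the simpler of the two. The remap also rejects a MASTER_ADDR that lies outside the allocation, which is one of the hang cases described above.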