[Slurm] Setting MASTER_ADDR != SLURM_NODELIST[0] causes a hang. #567

@alexsu52

Summary

When launching with ./primus-cli slurm, manually setting MASTER_ADDR can cause the job to hang if the address does not match the first node in SLURM_NODELIST. This occurs because the rank mapping is derived from the Slurm node allocation rather than from MASTER_ADDR.

Details

The current script respects an existing MASTER_ADDR environment variable:

if [[ -z "${MASTER_ADDR:-}" ]]; then

If a user (or a previous script) sets MASTER_ADDR incorrectly (e.g., to a node outside the current allocation, or to a node that is not the head node), the training job hangs indefinitely.

Suggested Fix

  • Remap NODE_RANK so that a user-provided MASTER_ADDR is handled correctly.
  • Fail with a fatal error if MASTER_ADDR is set and does not match SLURM_NODELIST[0], to improve UX/DX.
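
Both options could look roughly like the sketch below. This is only an illustration, not the actual primus-cli code: the function names check_master_addr and remap_node_rank are hypothetical, and the swap-based remapping is one possible scheme among several. It assumes the head node is resolved with scontrol show hostnames (since SLURM_NODELIST is usually a compressed hostlist such as node[01-04]) and that MASTER_ADDR is given as a hostname in the same form Slurm reports. An IP address or FQDN would need extra normalization.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of primus-cli) sketching the two suggested fixes.

# Option 2: fail fast if MASTER_ADDR is set but is not the head node.
# usage: check_master_addr <master_addr> <head_node>
check_master_addr() {
    local master_addr="$1" head_node="$2"
    if [[ -z "$master_addr" ]]; then
        return 0  # unset: the caller is free to default to the head node
    fi
    if [[ "$master_addr" != "$head_node" ]]; then
        echo "ERROR: MASTER_ADDR=$master_addr does not match the first Slurm node ($head_node)" >&2
        return 1
    fi
    return 0
}

# Option 1: remap NODE_RANK so the node at MASTER_ADDR gets rank 0.
# Assumed scheme: swap the master's index with the first node's index.
# usage: remap_node_rank <master_addr> <this_host> <host1> [<host2> ...]
remap_node_rank() {
    local master="$1" me="$2"
    shift 2
    local hosts=("$@")
    local i master_idx=-1 my_idx=-1
    for i in "${!hosts[@]}"; do
        if [[ "${hosts[$i]}" == "$master" ]]; then master_idx=$i; fi
        if [[ "${hosts[$i]}" == "$me" ]]; then my_idx=$i; fi
    done
    if (( master_idx < 0 || my_idx < 0 )); then
        echo "ERROR: $master or $me is not in the current allocation" >&2
        return 1
    fi
    if (( my_idx == master_idx )); then
        echo 0
    elif (( my_idx == 0 )); then
        echo "$master_idx"
    else
        echo "$my_idx"
    fi
}

# Example wiring for the fail-fast option; runs only inside a Slurm allocation.
if [[ -n "${SLURM_JOB_NODELIST:-}" ]] && command -v scontrol >/dev/null 2>&1; then
    head_node="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
    check_master_addr "${MASTER_ADDR:-}" "$head_node" || exit 1
    export MASTER_ADDR="${MASTER_ADDR:-$head_node}"
fi
```

The fail-fast check is the simpler of the two and turns a silent hang into an immediate, readable error. The remapping keeps a user-provided MASTER_ADDR working, but only if that address is inside the allocation.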
