Why does a SLURM cluster not allow nodesets in a different region than the controller? #4738
-
I'm trying to make a cluster with nodesets across more than one region. This fails due to the check here. Why this constraint? I'm aware of the increased latency and cost involved with this setup, but I'm willing to accept these drawbacks.
Replies: 4 comments
-
Currently, slurm-gcp does not support deploying a SLURM cluster controller with nodesets in multiple regions. Google and SchedMD recommend using reservations for VM availability and compact placement policies to achieve low network latency within a region. To ensure the lowest possible latency and reliable operation, nodesets should be in the same region as the controller (head node) (SLURM in the Clouds, GCP best practices). If you need to run workloads across multiple regions, the recommended solution is to deploy separate SLURM clusters, each with its own controller and nodesets in a single region. You can then federate the clusters if needed for workload portability.
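For reference, federation is configured through Slurm's accounting database (slurmdbd). A minimal sketch, assuming both regional clusters already report to the same slurmdbd instance; the federation and cluster names here are hypothetical placeholders:

```sh
# Create a federation and add both regional clusters to it
# (run against the shared slurmdbd; "myfed", "cluster-us" and
# "cluster-eu" are placeholder names):
sacctmgr add federation myfed clusters=cluster-us,cluster-eu

# Verify federation membership:
sacctmgr show federation

# Jobs submitted on either member cluster are then eligible to run
# federation-wide; view the combined queue with:
squeue --federation
```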
-
Hi @bo-qeye, could you please look into the above answer and let us know whether it answers your question?
-
It does answer my question, though I still slightly disagree with the choice. Multi-cluster federation is not well documented and there are no clear examples; it also adds overhead. I vote that the constraint be lifted, possibly behind an option one needs to enable, for those who are willing to accept the drawbacks.
-
I think there is another problem beyond latency: storage. Most NFS-type storage (used for /home and similar mounts) will have a hard time working across regions, and not all network configurations will allow it to work reliably. Furthermore, cross-region traffic is often charged, so users could incur hidden costs. Naturally, you could route all I/O through GCS, but even that has costs once you truly cross regions.

Now, if you really want to do it, I believe you can: create the network outside of the blueprint, making sure it is configured so that all regions can reach each other's private IPs, which is probably the default behavior if you create a network by hand in the Cloud Console. You would then need to manually define the region, zone, and subnetwork for each nodeset (you can debug this by looking at the expanded blueprint); a sketch follows below. As I warned above, NFS performance may suffer and there will be data egress costs.
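A minimal sketch of that approach, assuming the Cluster Toolkit v6 Slurm modules; every project, network, and subnetwork name below is a hypothetical placeholder, the exact module sources and setting names may differ in your toolkit version, and the region check discussed above would still have to be relaxed for this to deploy:

```yaml
# Untested sketch: point a nodeset at a pre-existing subnetwork in a
# second region while the controller stays in its home region.
blueprint_name: multi-region-slurm-sketch

vars:
  project_id: my-project          # placeholder
  deployment_name: mr-slurm
  region: us-central1             # controller's home region
  zone: us-central1-a

deployment_groups:
  - group: primary
    modules:
      # Hand-made VPC, created outside the blueprint, whose subnets in
      # both regions can already reach each other's private IPs.
      - id: net-us
        source: modules/network/pre-existing-vpc
        settings:
          network_name: shared-vpc     # placeholder
          subnetwork_name: subnet-us   # placeholder
          region: us-central1

      - id: net-eu
        source: modules/network/pre-existing-vpc
        settings:
          network_name: shared-vpc
          subnetwork_name: subnet-eu   # placeholder
          region: europe-west4

      # Nodeset pinned to the remote region's subnetwork.
      - id: nodeset-eu
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [net-eu]
        settings:
          machine_type: c2-standard-60
          node_count_dynamic_max: 20
          zone: europe-west4-a

      - id: partition-eu
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [nodeset-eu]
        settings:
          partition_name: eu

      # Controller stays on the home-region subnetwork.
      - id: controller
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        use: [net-us, partition-eu]
```

Inspecting the expanded blueprint (the toolkit can emit it before deployment) lets you verify which region and subnetwork each nodeset actually received.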