Add controller state save disk
alyssa-sm committed Feb 18, 2025
1 parent 0107923 commit 2e6fb84
Showing 14 changed files with 126 additions and 16 deletions.

@@ -29,7 +29,7 @@
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_access_config"></a> [access\_config](#input\_access\_config) | Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet. | <pre>list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))</pre> | `[]` | no |
-| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of disks. | <pre>list(object({<br/> disk_name = string<br/> device_name = string<br/> disk_type = string<br/> disk_size_gb = number<br/> disk_labels = map(string)<br/> auto_delete = bool<br/> boot = bool<br/> }))</pre> | `[]` | no |
+| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of disks. | <pre>list(object({<br/> source = optional(string)<br/> disk_name = string<br/> device_name = string<br/> disk_type = string<br/> disk_size_gb = number<br/> disk_labels = map(string)<br/> auto_delete = bool<br/> boot = bool<br/> }))</pre> | `[]` | no |
| <a name="input_additional_networks"></a> [additional\_networks](#input\_additional\_networks) | Additional network interface details for GCE, if any. | <pre>list(object({<br/> network = string<br/> subnetwork = string<br/> subnetwork_project = string<br/> network_ip = string<br/> nic_type = string<br/> access_config = list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))<br/> ipv6_access_config = list(object({<br/> network_tier = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_advanced_machine_features"></a> [advanced\_machine\_features](#input\_advanced\_machine\_features) | See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#nested_advanced_machine_features | <pre>object({<br/> enable_nested_virtualization = optional(bool)<br/> threads_per_core = optional(number)<br/> turbo_mode = optional(string)<br/> visible_core_count = optional(number)<br/> performance_monitoring_unit = optional(string)<br/> enable_uefi_networking = optional(bool)<br/> })</pre> | n/a | yes |
| <a name="input_bandwidth_tier"></a> [bandwidth\_tier](#input\_bandwidth\_tier) | Tier 1 bandwidth increases the maximum egress bandwidth for VMs.<br/>Using the `virtio_enabled` setting will only enable VirtioNet and will not enable TIER\_1.<br/>Using the `tier_1_enabled` setting will enable both gVNIC and TIER\_1 higher bandwidth networking.<br/>Using the `gvnic_enabled` setting will only enable gVNIC and will not enable TIER\_1.<br/>Note that TIER\_1 only works with specific machine families & shapes and must be using an image that supports gVNIC. See [official docs](https://cloud.google.com/compute/docs/networking/configure-vm-with-high-bandwidth-configuration) for more details. | `string` | `"platform_default"` | no |

@@ -22,6 +22,7 @@ locals {
disk_name = disk.disk_name
device_name = disk.device_name
auto_delete = disk.auto_delete
+source = disk.source
boot = disk.boot
disk_size_gb = disk.disk_size_gb
disk_type = disk.disk_type

@@ -321,6 +321,7 @@ variable "disk_auto_delete" {

variable "additional_disks" {
type = list(object({
+source = optional(string)
disk_name = string
device_name = string
disk_type = string

@@ -30,7 +30,7 @@ No modules.
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_access_config"></a> [access\_config](#input\_access\_config) | Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet. | <pre>list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))</pre> | `[]` | no |
-| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name | <pre>list(object({<br/> disk_name = string<br/> device_name = string<br/> auto_delete = bool<br/> boot = bool<br/> disk_size_gb = number<br/> disk_type = string<br/> disk_labels = map(string)<br/> }))</pre> | `[]` | no |
+| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name | <pre>list(object({<br/> source = optional(string)<br/> disk_name = string<br/> device_name = string<br/> auto_delete = bool<br/> boot = bool<br/> disk_size_gb = number<br/> disk_type = string<br/> disk_labels = map(string)<br/> }))</pre> | `[]` | no |
| <a name="input_additional_networks"></a> [additional\_networks](#input\_additional\_networks) | Additional network interface details for GCE, if any. | <pre>list(object({<br/> network = string<br/> subnetwork = string<br/> subnetwork_project = string<br/> network_ip = string<br/> nic_type = string<br/> access_config = list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))<br/> ipv6_access_config = list(object({<br/> network_tier = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_advanced_machine_features"></a> [advanced\_machine\_features](#input\_advanced\_machine\_features) | See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#nested_advanced_machine_features | <pre>object({<br/> enable_nested_virtualization = optional(bool)<br/> threads_per_core = optional(number)<br/> turbo_mode = optional(string)<br/> visible_core_count = optional(number)<br/> performance_monitoring_unit = optional(string)<br/> enable_uefi_networking = optional(bool)<br/> })</pre> | n/a | yes |
| <a name="input_alias_ip_range"></a> [alias\_ip\_range](#input\_alias\_ip\_range) | An array of alias IP ranges for this network interface. Can only be specified for network interfaces on subnet-mode networks.<br/>ip\_cidr\_range: The IP CIDR range represented by this alias IP range. This IP CIDR range must belong to the specified subnetwork and cannot contain IP addresses reserved by system or used by other network interfaces. At the time of writing only a netmask (e.g. /24) may be supplied, with a CIDR format resulting in an API error.<br/>subnetwork\_range\_name: The subnetwork secondary range name specifying the secondary range from which to allocate the IP CIDR range for this alias IP range. If left unspecified, the primary range of the subnetwork will be used. | <pre>object({<br/> ip_cidr_range = string<br/> subnetwork_range_name = string<br/> })</pre> | `null` | no |

@@ -87,15 +87,15 @@ resource "google_compute_instance_template" "tpl" {
auto_delete = lookup(disk.value, "auto_delete", null)
boot = lookup(disk.value, "boot", null)
device_name = lookup(disk.value, "device_name", null)
-disk_name = lookup(disk.value, "disk_name", null)
-disk_size_gb = lookup(disk.value, "disk_size_gb", lookup(disk.value, "disk_type", null) == "local-ssd" ? "375" : null)
-disk_type = lookup(disk.value, "disk_type", null)
+disk_name = lookup(disk.value, "source", null) != null ? null : lookup(disk.value, "disk_name", null)
+disk_size_gb = lookup(disk.value, "source", null) != null ? null : lookup(disk.value, "disk_size_gb", lookup(disk.value, "disk_type", null) == "local-ssd" ? "375" : null)
+disk_type = lookup(disk.value, "source", null) != null ? null : lookup(disk.value, "disk_type", null)
interface = lookup(disk.value, "interface", lookup(disk.value, "disk_type", null) == "local-ssd" ? "NVME" : null)
mode = lookup(disk.value, "mode", null)
source = lookup(disk.value, "source", null)
-source_image = lookup(disk.value, "source_image", null)
+source_image = lookup(disk.value, "source", null) != null ? null : lookup(disk.value, "source_image", null)
type = lookup(disk.value, "disk_type", null) == "local-ssd" ? "SCRATCH" : "PERSISTENT"
-labels = lookup(disk.value, "disk_type", null) == "local-ssd" ? null : lookup(disk.value, "disk_labels", null)
+labels = lookup(disk.value, "source", null) != null ? null : lookup(disk.value, "disk_type", null) == "local-ssd" ? null : lookup(disk.value, "disk_labels", null)

dynamic "disk_encryption_key" {
for_each = compact([var.disk_encryption_key == null ? null : 1])
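
The template hunk above makes `source` take precedence: when a disk entry names an existing disk, the attributes that describe a newly created disk (`disk_name`, `disk_size_gb`, `disk_type`, `source_image`, `labels`) are forced to null. The rule can be modelled roughly as follows. This is a Python sketch for illustration only, not part of the commit; `resolve_disk` is a hypothetical helper, and it covers only the attributes touched by this hunk plus `type`.

```python
def resolve_disk(disk: dict) -> dict:
    """Model how the template resolves one entry of the dynamic disk block."""
    source = disk.get("source")
    attached = source is not None          # an existing disk is being attached
    is_local_ssd = disk.get("disk_type") == "local-ssd"

    # Mirrors the nested lookup(): local SSDs default to "375" when no size is given.
    default_size = "375" if is_local_ssd else None

    return {
        "source": source,
        "disk_name": None if attached else disk.get("disk_name"),
        "disk_size_gb": None if attached else disk.get("disk_size_gb", default_size),
        "disk_type": None if attached else disk.get("disk_type"),
        "source_image": None if attached else disk.get("source_image"),
        "type": "SCRATCH" if is_local_ssd else "PERSISTENT",
        "labels": None if (attached or is_local_ssd) else disk.get("disk_labels"),
    }
```

Passing an entry shaped like the controller state disk (with `source` set) yields null creation attributes and `type` of `PERSISTENT`, while an entry without `source` keeps its own `disk_name`, size and labels.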

@@ -161,6 +161,7 @@ variable "auto_delete" {
variable "additional_disks" {
description = "List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name"
type = list(object({
+source = optional(string)
disk_name = string
device_name = string
auto_delete = bool

@@ -264,6 +264,7 @@ limitations under the License.

| Name | Type |
|------|------|
+| [google_compute_disk.controller_disk](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk) | resource |
| [google_compute_instance_from_template.controller](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_from_template) | resource |
| [google_secret_manager_secret.cloudsql](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/secret_manager_secret) | resource |
| [google_secret_manager_secret_iam_member.cloudsql_secret_accessor](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/secret_manager_secret_iam_member) | resource |
@@ -291,6 +292,7 @@ limitations under the License.
| <a name="input_compute_startup_scripts_timeout"></a> [compute\_startup\_scripts\_timeout](#input\_compute\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in compute\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_startup_script"></a> [controller\_startup\_script](#input\_controller\_startup\_script) | Startup script used by the controller VM. | `string` | `"# no-op"` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
+| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> device_name = string<br/> type = string<br/> size = number<br/> })</pre> | <pre>{<br/> "device_name": "controller-save-state",<br/> "size": 50,<br/> "type": "pd-ssd"<br/>}</pre> | no |
| <a name="input_create_bucket"></a> [create\_bucket](#input\_create\_bucket) | Create GCS bucket instead of using an existing one. | `bool` | `true` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | Name of the deployment. | `string` | n/a | yes |
| <a name="input_disable_controller_public_ips"></a> [disable\_controller\_public\_ips](#input\_disable\_controller\_public\_ips) | DEPRECATED: Use `enable_controller_public_ips` instead. | `bool` | `null` | no |

@@ -32,6 +32,17 @@ locals {
}
]

+state_disk = var.controller_state_disk != null ? [{
+  disk_name = var.controller_state_disk.device_name
+  disk_size_gb = var.controller_state_disk.size
+  disk_type = var.controller_state_disk.type
+  source = google_compute_disk.controller_disk[0].name
+  device_name = var.controller_state_disk.device_name
+  disk_labels = null
+  auto_delete = false
+  boot = false
+}] : []

synth_def_sa_email = "${data.google_project.this.number}[email protected]"

service_account = {
@@ -48,6 +59,15 @@ locals {
)
}

+resource "google_compute_disk" "controller_disk" {
+  count = var.controller_state_disk != null ? 1 : 0
+
+  name = var.controller_state_disk.device_name
+  type = var.controller_state_disk.type
+  size = var.controller_state_disk.size
+  zone = var.zone
+}

# INSTANCE TEMPLATE
module "slurm_controller_template" {
source = "../../internal/slurm-gcp/instance_template"
@@ -62,7 +82,7 @@ module "slurm_controller_template" {
disk_labels = merge(var.disk_labels, local.labels)
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
-additional_disks = local.additional_disks
+additional_disks = concat(local.additional_disks, local.state_disk)

bandwidth_tier = var.bandwidth_tier
slurm_bucket_path = module.slurm_files.slurm_bucket_path

@@ -72,6 +72,7 @@ No modules.
| <a name="input_compute_startup_scripts_timeout"></a> [compute\_startup\_scripts\_timeout](#input\_compute\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in compute\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_startup_scripts"></a> [controller\_startup\_scripts](#input\_controller\_startup\_scripts) | List of scripts to be ran on controller VM startup. | <pre>list(object({<br/> filename = string<br/> content = string<br/> }))</pre> | `[]` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
+| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> device_name = string<br/> })</pre> | <pre>{<br/> "device_name": "controller-save-state"<br/>}</pre> | no |
| <a name="input_disable_default_mounts"></a> [disable\_default\_mounts](#input\_disable\_default\_mounts) | Disable default global network storage from the controller<br/>- /usr/local/etc/slurm<br/>- /etc/munge<br/>- /home<br/>- /apps<br/>If these are disabled, the slurm etc and munge dirs must be added manually,<br/>or some other mechanism must be used to synchronize the slurm conf files<br/>and the munge key across the cluster. | `bool` | `false` | no |
| <a name="input_enable_bigquery_load"></a> [enable\_bigquery\_load](#input\_enable\_bigquery\_load) | Enables loading of cluster job usage into big query.<br/><br/>NOTE: Requires Google Bigquery API. | `bool` | `false` | no |
| <a name="input_enable_debug_logging"></a> [enable\_debug\_logging](#input\_enable\_debug\_logging) | Enables debug logging mode. Not for production use. | `bool` | `false` | no |

@@ -43,14 +43,15 @@ locals {
tp = "${local.bucket_dir}/" # prefix to trim from the bucket path to get a "file name"

config = {
-enable_bigquery_load = var.enable_bigquery_load
-cloudsql_secret = var.cloudsql_secret
-cluster_id = random_uuid.cluster_id.result
-project = var.project_id
-slurm_cluster_name = var.slurm_cluster_name
-bucket_path = local.bucket_path
-enable_debug_logging = var.enable_debug_logging
-extra_logging_flags = var.extra_logging_flags
+enable_bigquery_load = var.enable_bigquery_load
+cloudsql_secret = var.cloudsql_secret
+cluster_id = random_uuid.cluster_id.result
+project = var.project_id
+slurm_cluster_name = var.slurm_cluster_name
+bucket_path = local.bucket_path
+enable_debug_logging = var.enable_debug_logging
+extra_logging_flags = var.extra_logging_flags
+controller_state_disk = var.controller_state_disk

# storage
disable_default_mounts = var.disable_default_mounts

@@ -176,6 +176,47 @@ def run_custom_scripts():
log.exception(f"script {script} encountered an exception")
raise e

+def mount_save_state_disk():
+    disk_name = f"/dev/disk/by-id/google-{lookup().cfg.controller_state_disk.device_name}"
+    mount_point = util.slurmdirs.state
+    fs_type = "xfs"
+
+    rdevice = util.run(f"realpath {disk_name}").stdout.strip()
+    file_output = util.run(f"file -s {rdevice}").stdout.strip()
+    if "filesystem" not in file_output:
+        util.run(f"mkfs -t {fs_type} -q {rdevice}")
+
+    fstab_entry = f"{disk_name} {mount_point} {fs_type}"
+    with open("/etc/fstab", "r") as f:
+        fstab = f.readlines()
+    if fstab_entry not in fstab:
+        with open("/etc/fstab", "a") as f:
+            f.write(f"{fstab_entry} defaults 0 0\n")
+
+    util.run(f"systemctl daemon-reload")
+
+    os.makedirs(mount_point, exist_ok=True)
+    util.run(f"mount {mount_point}")
+
+    util.chown_slurm(mount_point)
+
+def mount_munge_key_disk():
+    state_disk_dir = "/var/spool/slurm/munge"
+    mount_point = dirs.munge
+
+    os.makedirs(state_disk_dir, exist_ok=True)
+
+    util.run(f"mount --bind {state_disk_dir} {mount_point}")
+
+    fstab_entry = f"{state_disk_dir} {mount_point}"
+    with open("/etc/fstab", "r") as f:
+        fstab = f.readlines()
+    if fstab_entry not in fstab:
+        with open("/etc/fstab", "a") as f:
+            f.write(f"{fstab_entry} none bind 0 0\n")
+
+    util.run(f"systemctl daemon-reload")
def setup_jwt_key():
jwt_key = Path(slurmdirs.state / "jwt_hs256.key")

@@ -329,6 +370,11 @@ def setup_controller():
util.chown_slurm(dirs.scripts / "config.yaml", mode=0o600)
install_custom_scripts()
conf.gen_controller_configs(lookup())

+if lookup().cfg.controller_state_disk != None:
+    mount_save_state_disk()
+    mount_munge_key_disk()

setup_jwt_key()
setup_munge_key()
setup_sudoers()

@@ -58,6 +58,22 @@ variable "slurm_cluster_name" {
}
}

+variable "controller_state_disk" {
+  description = <<EOD
+A disk that will be attached to the controller instance template to save state of slurm. The disk is created and used by default.
+To disable this feature, set this variable to null.
+NOTE: This will not save the contents at /opt/apps and /home. To preserve those, they must be saved externally.
+EOD
+  type = object({
+    device_name = string
+  })
+
+  default = {
+    device_name = "controller-save-state"
+  }
+}

variable "enable_bigquery_load" {
description = <<EOD
Enables loading of cluster job usage into big query.