Add controller save state disk #3661

Merged
merged 1 commit into from
Feb 20, 2025
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_access_config"></a> [access\_config](#input\_access\_config) | Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet. | <pre>list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))</pre> | `[]` | no |
| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of disks. | <pre>list(object({<br/> disk_name = string<br/> device_name = string<br/> disk_type = string<br/> disk_size_gb = number<br/> disk_labels = map(string)<br/> auto_delete = bool<br/> boot = bool<br/> }))</pre> | `[]` | no |
| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of disks. | <pre>list(object({<br/> source = optional(string)<br/> disk_name = optional(string)<br/> device_name = string<br/> disk_type = optional(string)<br/> disk_size_gb = optional(number)<br/> disk_labels = map(string)<br/> auto_delete = bool<br/> boot = bool<br/> }))</pre> | `[]` | no |
| <a name="input_additional_networks"></a> [additional\_networks](#input\_additional\_networks) | Additional network interface details for GCE, if any. | <pre>list(object({<br/> network = string<br/> subnetwork = string<br/> subnetwork_project = string<br/> network_ip = string<br/> nic_type = string<br/> access_config = list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))<br/> ipv6_access_config = list(object({<br/> network_tier = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_advanced_machine_features"></a> [advanced\_machine\_features](#input\_advanced\_machine\_features) | See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#nested_advanced_machine_features | <pre>object({<br/> enable_nested_virtualization = optional(bool)<br/> threads_per_core = optional(number)<br/> turbo_mode = optional(string)<br/> visible_core_count = optional(number)<br/> performance_monitoring_unit = optional(string)<br/> enable_uefi_networking = optional(bool)<br/> })</pre> | n/a | yes |
| <a name="input_bandwidth_tier"></a> [bandwidth\_tier](#input\_bandwidth\_tier) | Tier 1 bandwidth increases the maximum egress bandwidth for VMs.<br/>Using the `virtio_enabled` setting will only enable VirtioNet and will not enable TIER\_1.<br/>Using the `tier_1_enabled` setting will enable both gVNIC and TIER\_1 higher bandwidth networking.<br/>Using the `gvnic_enabled` setting will only enable gVNIC and will not enable TIER\_1.<br/>Note that TIER\_1 only works with specific machine families & shapes and must be using an image that supports gVNIC. See [official docs](https://cloud.google.com/compute/docs/networking/configure-vm-with-high-bandwidth-configuration) for more details. | `string` | `"platform_default"` | no |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ locals {
disk_name = disk.disk_name
device_name = disk.device_name
auto_delete = disk.auto_delete
source = disk.source
boot = disk.boot
disk_size_gb = disk.disk_size_gb
disk_type = disk.disk_type
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -321,10 +321,11 @@ variable "disk_auto_delete" {

variable "additional_disks" {
type = list(object({
disk_name = string
source = optional(string)
disk_name = optional(string)
device_name = string
disk_type = string
disk_size_gb = number
disk_type = optional(string)
disk_size_gb = optional(number)
disk_labels = map(string)
auto_delete = bool
boot = bool
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ No modules.
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_access_config"></a> [access\_config](#input\_access\_config) | Access configurations, i.e. IPs via which the VM instance can be accessed via the Internet. | <pre>list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))</pre> | `[]` | no |
| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name | <pre>list(object({<br/> disk_name = string<br/> device_name = string<br/> auto_delete = bool<br/> boot = bool<br/> disk_size_gb = number<br/> disk_type = string<br/> disk_labels = map(string)<br/> }))</pre> | `[]` | no |
| <a name="input_additional_disks"></a> [additional\_disks](#input\_additional\_disks) | List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name | <pre>list(object({<br/> source = optional(string)<br/> disk_name = optional(string)<br/> device_name = string<br/> auto_delete = bool<br/> boot = bool<br/> disk_size_gb = optional(number)<br/> disk_type = optional(string)<br/> disk_labels = map(string)<br/> }))</pre> | `[]` | no |
| <a name="input_additional_networks"></a> [additional\_networks](#input\_additional\_networks) | Additional network interface details for GCE, if any. | <pre>list(object({<br/> network = string<br/> subnetwork = string<br/> subnetwork_project = string<br/> network_ip = string<br/> nic_type = string<br/> access_config = list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))<br/> ipv6_access_config = list(object({<br/> network_tier = string<br/> }))<br/> }))</pre> | `[]` | no |
| <a name="input_advanced_machine_features"></a> [advanced\_machine\_features](#input\_advanced\_machine\_features) | See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#nested_advanced_machine_features | <pre>object({<br/> enable_nested_virtualization = optional(bool)<br/> threads_per_core = optional(number)<br/> turbo_mode = optional(string)<br/> visible_core_count = optional(number)<br/> performance_monitoring_unit = optional(string)<br/> enable_uefi_networking = optional(bool)<br/> })</pre> | n/a | yes |
| <a name="input_alias_ip_range"></a> [alias\_ip\_range](#input\_alias\_ip\_range) | An array of alias IP ranges for this network interface. Can only be specified for network interfaces on subnet-mode networks.<br/>ip\_cidr\_range: The IP CIDR range represented by this alias IP range. This IP CIDR range must belong to the specified subnetwork and cannot contain IP addresses reserved by system or used by other network interfaces. At the time of writing only a netmask (e.g. /24) may be supplied, with a CIDR format resulting in an API error.<br/>subnetwork\_range\_name: The subnetwork secondary range name specifying the secondary range from which to allocate the IP CIDR range for this alias IP range. If left unspecified, the primary range of the subnetwork will be used. | <pre>object({<br/> ip_cidr_range = string<br/> subnetwork_range_name = string<br/> })</pre> | `null` | no |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -95,7 +95,7 @@ resource "google_compute_instance_template" "tpl" {
source = lookup(disk.value, "source", null)
source_image = lookup(disk.value, "source_image", null)
type = lookup(disk.value, "disk_type", null) == "local-ssd" ? "SCRATCH" : "PERSISTENT"
labels = lookup(disk.value, "disk_type", null) == "local-ssd" ? null : lookup(disk.value, "disk_labels", null)
labels = (lookup(disk.value, "source", null) != null || lookup(disk.value, "disk_type", null) == "local-ssd") ? null : lookup(disk.value, "disk_labels", null)

dynamic "disk_encryption_key" {
for_each = compact([var.disk_encryption_key == null ? null : 1])
Expand Down
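The changed `labels` expression can be read as a small decision rule. The following is a minimal Python sketch of that rule, not code from this PR: `template_disk_fields` is a hypothetical helper name, and a plain dict stands in for `disk.value`, with `dict.get` playing the role of `lookup(..., null)`.

```python
def template_disk_fields(disk: dict) -> dict:
    """Model the `type` and `labels` expressions for one additional disk."""
    source = disk.get("source")                      # lookup(disk.value, "source", null)
    is_local_ssd = disk.get("disk_type") == "local-ssd"
    return {
        # Local SSDs are scratch disks; everything else is persistent.
        "type": "SCRATCH" if is_local_ssd else "PERSISTENT",
        # Labels are dropped when the disk is attached from an existing
        # source or is a local SSD; otherwise disk_labels is passed through.
        "labels": None if (source is not None or is_local_ssd) else disk.get("disk_labels"),
    }


if __name__ == "__main__":
    existing = {"source": "demo-controller-save", "disk_labels": None}
    print(template_disk_fields(existing))
```

Under this reading, a disk attached by `source` never carries labels in the template, which matches the `disk_labels = null` entry used for the controller state disk.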
Original file line number Diff line number Diff line change
Expand Up @@ -161,12 +161,13 @@ variable "auto_delete" {
variable "additional_disks" {
description = "List of maps of additional disks. See https://www.terraform.io/docs/providers/google/r/compute_instance_template#disk_name"
type = list(object({
disk_name = string
source = optional(string)
disk_name = optional(string)
device_name = string
auto_delete = bool
boot = bool
disk_size_gb = number
disk_type = string
disk_size_gb = optional(number)
disk_type = optional(string)
disk_labels = map(string)
}))
default = []
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -264,6 +264,7 @@ limitations under the License.

| Name | Type |
|------|------|
| [google_compute_disk.controller_disk](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_disk) | resource |
| [google_compute_instance_from_template.controller](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_from_template) | resource |
| [google_secret_manager_secret.cloudsql](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/secret_manager_secret) | resource |
| [google_secret_manager_secret_iam_member.cloudsql_secret_accessor](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/secret_manager_secret_iam_member) | resource |
Expand Down Expand Up @@ -292,6 +293,7 @@ limitations under the License.
| <a name="input_controller_project_id"></a> [controller\_project\_id](#input\_controller\_project\_id) | Optional. Provision the controller and config bucket in a different project. | `string` | `null` | no |
| <a name="input_controller_startup_script"></a> [controller\_startup\_script](#input\_controller\_startup\_script) | Startup script used by the controller VM. | `string` | `"# no-op"` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save Slurm state. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents of /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> type = string<br/> size = number<br/> })</pre> | <pre>{<br/> "size": 50,<br/> "type": "pd-ssd"<br/>}</pre> | no |
| <a name="input_create_bucket"></a> [create\_bucket](#input\_create\_bucket) | Create GCS bucket instead of using an existing one. | `bool` | `true` | no |
| <a name="input_deployment_name"></a> [deployment\_name](#input\_deployment\_name) | Name of the deployment. | `string` | n/a | yes |
| <a name="input_disable_controller_public_ips"></a> [disable\_controller\_public\_ips](#input\_disable\_controller\_public\_ips) | DEPRECATED: Use `enable_controller_public_ips` instead. | `bool` | `null` | no |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,14 @@ locals {
}
]

state_disk = var.controller_state_disk != null ? [{
source = google_compute_disk.controller_disk[0].name
device_name = google_compute_disk.controller_disk[0].name
disk_labels = null
auto_delete = false
boot = false
}] : []

synth_def_sa_email = "${data.google_project.controller_project.number}-compute@developer.gserviceaccount.com"

service_account = {
Expand All @@ -54,6 +62,15 @@ data "google_project" "controller_project" {
project_id = var.controller_project_id
}

resource "google_compute_disk" "controller_disk" {
count = var.controller_state_disk != null ? 1 : 0

name = "${local.slurm_cluster_name}-controller-save"
type = var.controller_state_disk.type
size = var.controller_state_disk.size
zone = var.zone
}

# INSTANCE TEMPLATE
module "slurm_controller_template" {
source = "../../internal/slurm-gcp/instance_template"
Expand All @@ -68,7 +85,7 @@ module "slurm_controller_template" {
disk_labels = merge(var.disk_labels, local.labels)
disk_size_gb = var.disk_size_gb
disk_type = var.disk_type
additional_disks = local.additional_disks
additional_disks = concat(local.additional_disks, local.state_disk)

bandwidth_tier = var.bandwidth_tier
slurm_bucket_path = module.slurm_files.slurm_bucket_path
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,7 @@ No modules.
| <a name="input_compute_startup_scripts_timeout"></a> [compute\_startup\_scripts\_timeout](#input\_compute\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in compute\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_startup_scripts"></a> [controller\_startup\_scripts](#input\_controller\_startup\_scripts) | List of scripts to be run on controller VM startup. | <pre>list(object({<br/> filename = string<br/> content = string<br/> }))</pre> | `[]` | no |
| <a name="input_controller_startup_scripts_timeout"></a> [controller\_startup\_scripts\_timeout](#input\_controller\_startup\_scripts\_timeout) | The timeout (seconds) applied to each script in controller\_startup\_scripts. If<br/>any script exceeds this timeout, then the instance setup process is considered<br/>failed and handled accordingly.<br/><br/>NOTE: When set to 0, the timeout is considered infinite and thus disabled. | `number` | `300` | no |
| <a name="input_controller_state_disk"></a> [controller\_state\_disk](#input\_controller\_state\_disk) | A disk that will be attached to the controller instance template to save Slurm state. The disk is created and used by default.<br/> To disable this feature, set this variable to null.<br/><br/> NOTE: This will not save the contents of /opt/apps and /home. To preserve those, they must be saved externally. | <pre>object({<br/> device_name = string<br/> })</pre> | <pre>{<br/> "device_name": null<br/>}</pre> | no |
| <a name="input_disable_default_mounts"></a> [disable\_default\_mounts](#input\_disable\_default\_mounts) | Disable default global network storage from the controller<br/>- /usr/local/etc/slurm<br/>- /etc/munge<br/>- /home<br/>- /apps<br/>If these are disabled, the slurm etc and munge dirs must be added manually,<br/>or some other mechanism must be used to synchronize the slurm conf files<br/>and the munge key across the cluster. | `bool` | `false` | no |
| <a name="input_enable_bigquery_load"></a> [enable\_bigquery\_load](#input\_enable\_bigquery\_load) | Enables loading of cluster job usage into BigQuery.<br/><br/>NOTE: Requires the Google BigQuery API. | `bool` | `false` | no |
| <a name="input_enable_debug_logging"></a> [enable\_debug\_logging](#input\_enable\_debug\_logging) | Enables debug logging mode. Not for production use. | `bool` | `false` | no |
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -43,14 +43,15 @@ locals {
tp = "${local.bucket_dir}/" # prefix to trim from the bucket path to get a "file name"

config = {
enable_bigquery_load = var.enable_bigquery_load
cloudsql_secret = var.cloudsql_secret
cluster_id = random_uuid.cluster_id.result
project = var.project_id
slurm_cluster_name = var.slurm_cluster_name
bucket_path = local.bucket_path
enable_debug_logging = var.enable_debug_logging
extra_logging_flags = var.extra_logging_flags
enable_bigquery_load = var.enable_bigquery_load
cloudsql_secret = var.cloudsql_secret
cluster_id = random_uuid.cluster_id.result
project = var.project_id
slurm_cluster_name = var.slurm_cluster_name
bucket_path = local.bucket_path
enable_debug_logging = var.enable_debug_logging
extra_logging_flags = var.extra_logging_flags
controller_state_disk = var.controller_state_disk

# storage
disable_default_mounts = var.disable_default_mounts
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -176,6 +176,47 @@ def run_custom_scripts():
log.exception(f"script {script} encountered an exception")
raise e

def mount_save_state_disk():
    # Stable by-id path of the attached state disk.
    disk_name = f"/dev/disk/by-id/google-{lookup().cfg.controller_state_disk.device_name}"
    mount_point = util.slurmdirs.state
    fs_type = "xfs"

    # Format the disk only if it does not already carry a filesystem.
    rdevice = util.run(f"realpath {disk_name}").stdout.strip()
    file_output = util.run(f"file -s {rdevice}").stdout.strip()
    if "filesystem" not in file_output:
        util.run(f"mkfs -t {fs_type} -q {rdevice}")

    # Persist the mount in /etc/fstab. readlines() returns whole lines, so
    # match on the entry prefix rather than testing list membership.
    fstab_entry = f"{disk_name} {mount_point} {fs_type}"
    with open("/etc/fstab", "r") as f:
        fstab = f.readlines()
    if not any(line.startswith(fstab_entry) for line in fstab):
        with open("/etc/fstab", "a") as f:
            f.write(f"{fstab_entry} defaults 0 0\n")

    util.run("systemctl daemon-reload")

    os.makedirs(mount_point, exist_ok=True)
    util.run(f"mount {mount_point}")

    util.chown_slurm(mount_point)

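def mount_munge_key_disk():
    # Keep the munge key on the state disk and bind-mount it over the munge directory.
    state_disk_dir = "/var/spool/slurm/munge"
    mount_point = dirs.munge

    os.makedirs(state_disk_dir, exist_ok=True)

    util.run(f"mount --bind {state_disk_dir} {mount_point}")

    # Persist the bind mount in /etc/fstab. readlines() returns whole lines, so
    # match on the entry prefix rather than testing list membership.
    fstab_entry = f"{state_disk_dir} {mount_point}"
    with open("/etc/fstab", "r") as f:
        fstab = f.readlines()
    if not any(line.startswith(fstab_entry) for line in fstab):
        with open("/etc/fstab", "a") as f:
            f.write(f"{fstab_entry} none bind 0 0\n")

    util.run("systemctl daemon-reload")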

def setup_jwt_key():
jwt_key = Path(slurmdirs.state / "jwt_hs256.key")

Expand Down Expand Up @@ -329,6 +370,11 @@ def setup_controller():
util.chown_slurm(dirs.scripts / "config.yaml", mode=0o600)
install_custom_scripts()
conf.gen_controller_configs(lookup())

if lookup().cfg.controller_state_disk.device_name is not None:
    mount_save_state_disk()
    mount_munge_key_disk()

setup_jwt_key()
setup_munge_key()
setup_sudoers()
Expand Down
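Both mount helpers in this file rely on two "do it only once" checks: format the disk only when it has no filesystem, and append to `/etc/fstab` only when the entry is not already present. Below is a minimal, standalone sketch of those two checks. It is an illustration rather than the PR's code: `needs_format` and `ensure_fstab_entry` are hypothetical names, the device and mount paths in the demo are placeholders, and the demo writes to a temporary file instead of the real `/etc/fstab`.

```python
import os
import tempfile


def needs_format(file_output: str) -> bool:
    """Return True when `file -s <device>` output does not mention a filesystem."""
    return "filesystem" not in file_output


def ensure_fstab_entry(fstab_path: str, entry: str, options: str) -> bool:
    """Append `entry options` unless a line already begins with the same fields.

    Fields are compared after whitespace splitting, so differences in spacing
    or the trailing newline do not cause a duplicate line.
    Returns True if a line was appended, False if the entry already existed.
    """
    wanted = entry.split()
    with open(fstab_path, "r") as f:
        for line in f:
            if line.split()[: len(wanted)] == wanted:
                return False
    with open(fstab_path, "a") as f:
        f.write(f"{entry} {options}\n")
    return True


if __name__ == "__main__":
    # Placeholder values for illustration only.
    entry = "/dev/disk/by-id/google-demo-controller-save /var/spool/slurm xfs"
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        print(ensure_fstab_entry(path, entry, "defaults 0 0"))  # first call appends
        print(ensure_fstab_entry(path, entry, "defaults 0 0"))  # second call is a no-op
    finally:
        os.remove(path)
```

Comparing parsed fields instead of raw strings keeps the check stable if the setup script runs again after a reboot or after the file has been reformatted by hand.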
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,22 @@ variable "slurm_cluster_name" {
}
}

variable "controller_state_disk" {
description = <<EOD
A disk that will be attached to the controller instance template to save Slurm state. The disk is created and used by default.
To disable this feature, set this variable to null.
NOTE: This will not save the contents of /opt/apps and /home. To preserve those, they must be saved externally.
EOD
type = object({
device_name = string
})

default = {
device_name = null
}
}

variable "enable_bigquery_load" {
description = <<EOD
Enables loading of cluster job usage into BigQuery.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,9 @@ locals {
}
ghpc_startup_script_controller = length(local.daos_ns) > 0 ? [local.daos_install_mount_script, local.ghpc_startup_controller] : [local.ghpc_startup_controller]

controller_state_disk = {
device_name = try(google_compute_disk.controller_disk[0].name, null)
}
ghpc_startup_login = {
filename = "ghpc_startup.sh"
content = var.login_startup_script
Expand Down Expand Up @@ -154,6 +157,7 @@ module "slurm_files" {
compute_startup_scripts_timeout = var.compute_startup_scripts_timeout
login_startup_scripts = local.login_startup_scripts
login_startup_scripts_timeout = var.login_startup_scripts_timeout
controller_state_disk = local.controller_state_disk

enable_debug_logging = var.enable_debug_logging
extra_logging_flags = var.extra_logging_flags
Expand Down
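The `try(...)` local and the `device_name` check in the setup script together act as the on/off switch for the state disk: when no disk resource is created, `device_name` is null and the controller skips the mount steps. A minimal Python sketch of that flow follows. It models the Terraform expression with plain lists and dicts; `controller_state_disk` and `should_mount_state_disk` are hypothetical helper names, and the disk name in the demo is a placeholder.

```python
from typing import Sequence


def controller_state_disk(created_disks: Sequence[dict]) -> dict:
    """Model of `try(google_compute_disk.controller_disk[0].name, null)`.

    `created_disks` stands in for the counted resource: empty when the
    feature is disabled (count = 0), one element when it is enabled.
    """
    try:
        return {"device_name": created_disks[0]["name"]}
    except (IndexError, KeyError):
        return {"device_name": None}


def should_mount_state_disk(cfg: dict) -> bool:
    """Mirror of the setup-time check: mount only when a device name is set."""
    return cfg["controller_state_disk"]["device_name"] is not None


if __name__ == "__main__":
    enabled = {"controller_state_disk": controller_state_disk([{"name": "demo-controller-save"}])}
    disabled = {"controller_state_disk": controller_state_disk([])}
    print(should_mount_state_disk(enabled), should_mount_state_disk(disabled))
```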
You are viewing a condensed version of this merge commit. You can view the full changes here.