-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standardize naming prefixes for kubernetes network objects #3644
base: develop
Are you sure you want to change the base?
Standardize naming prefixes for kubernetes network objects #3644
Conversation
ae76095
to
efd436d
Compare
The script path for all_gather NCCL test was broken, please fix it |
Please update version for tcpxo NCCL test/installer to match what's in https://github.com/GoogleCloudPlatform/container-engine-accelerators repo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested changes to defaults, a missing bug-fix for running all_gather test and tcpxo version update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes to defaults, a missing bug fix to run all_gather benchmark and tcpxo nccl version update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM. Let's merge after testing!
281b8e3
to
59c1cd5
Compare
All the possible cluster provisioning paths have been tested with a NCCL test:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -43,11 +43,11 @@ locals { | |||
"a3-megagpu-8g" = { | |||
# Manifest to be installed for enabling TCPXO on a3-megagpu-8g machines | |||
gpu_direct_manifests = [ | |||
"https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/b324ec8994aa98ca320438dd2d01ff6d7f9165bb/gpudirect-tcpxo/nccl-tcpxo-installer.yaml", # nccl_plugin v1.0.7 for tcpxo | |||
"https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/b324ec8994aa98ca320438dd2d01ff6d7f9165bb/nri_device_injector/nri-device-injector.yaml", # nri_plugin | |||
"https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/39308db7574925ea3c14f9113fcf87f70a6fcc26/gpudirect-tcpxo/nccl-tcpxo-installer.yaml", # nccl_plugin v1.0.7 for tcpxo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v1.0.8-1
@@ -52,6 +52,7 @@ locals { | |||
initial_node_set = try(var.initial_node_count > 0, false) | |||
|
|||
module_unique_id = replace(lower(var.internal_ghpc_module_id), "/[^a-z0-9\\-]/", "") | |||
nccl_path = var.machine_type == "a3-highgpu-8g" ? "configs" : "scripts" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is only used in outputs.tf, please put it there instead
Standardize naming convention for k8s network objects.
For A3H:
For A3M:
For A3U:
This aligns with the naming convention assumed in all Toolkit and gcloud docs, reducing friction experienced by users. This also avoids the need for users to rename networks in various kubernetes manifests, for e.g. the NCCL tests.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.