Standardize naming prefixes for kubernetes network objects #3644

Open
wants to merge 1 commit into
base: develop
Contributor

@parulbajaj01 parulbajaj01 commented Feb 6, 2025

Standardize naming convention for k8s network objects.

For A3H:

default
vpc1
...
vpc4

For A3M:

default
vpc1
...
vpc8

For A3U:

default
gvnic1
rdma0
...
rdma7

This aligns with the naming convention assumed in all Toolkit and gcloud docs, which reduces friction for users. It also removes the need for users to rename networks in various Kubernetes manifests, for example in the NCCL tests.
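
The naming scheme listed above can be written down as a small lookup. The Python sketch below is illustrative only: the function name `expected_network_names` and the family keys are hypothetical, and it assumes each `...` in the lists stands for consecutive integer suffixes (vpc1 to vpc4, vpc1 to vpc8, rdma0 to rdma7).

```python
def expected_network_names(family: str) -> list[str]:
    """Return the network object names for a machine family (hypothetical helper)."""
    if family == "A3H":
        # default plus vpc1..vpc4 (assumed consecutive)
        return ["default"] + [f"vpc{i}" for i in range(1, 5)]
    if family == "A3M":
        # default plus vpc1..vpc8 (assumed consecutive)
        return ["default"] + [f"vpc{i}" for i in range(1, 9)]
    if family == "A3U":
        # default, gvnic1, then rdma0..rdma7 (assumed consecutive)
        return ["default", "gvnic1"] + [f"rdma{i}" for i in range(8)]
    raise ValueError(f"unknown machine family: {family}")


print(expected_network_names("A3U"))
```

A manifest that hard-codes these names, such as an NCCL test, would then work across clusters without per-cluster renaming, which is the point of the change.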

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@parulbajaj01 parulbajaj01 added the release-improvements label (Added to release notes under the "Improvements" heading.) Feb 6, 2025
@parulbajaj01 parulbajaj01 marked this pull request as draft February 6, 2025 13:08
@annuay-google
Contributor

The script path for the all_gather NCCL test was broken; please fix it.

@annuay-google
Contributor

Please update the version of the tcpxo NCCL test/installer to match what's in the https://github.com/GoogleCloudPlatform/container-engine-accelerators repo.

Contributor

@annuay-google annuay-google left a comment

Requested changes to defaults, a missing bug fix for running the all_gather test, and a tcpxo version update.

Contributor

@annuay-google annuay-google left a comment

Requesting changes to defaults, a missing bug fix to run the all_gather benchmark, and a tcpxo NCCL version update.

Contributor

@annuay-google annuay-google left a comment

Changes LGTM. Let's merge after testing!

@annuay-google
Contributor

annuay-google commented Feb 20, 2025

All the possible cluster provisioning paths have been tested with an NCCL test:

A3 high - toolkit instructions - verified
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576         16384     float    none      -1   1116.2    0.94    0.88    N/A   1094.0    0.96    0.90    N/A
     2097152         32768     float    none      -1   1071.4    1.96    1.83    N/A   1072.8    1.95    1.83    N/A
     4194304         65536     float    none      -1   1123.8    3.73    3.50    N/A   1089.6    3.85    3.61    N/A
     8388608        131072     float    none      -1   1211.6    6.92    6.49    N/A   1209.0    6.94    6.50    N/A
    16777216        262144     float    none      -1   1263.4   13.28   12.45    N/A   1322.5   12.69   11.89    N/A
    33554432        524288     float    none      -1   1305.2   25.71   24.10    N/A   1304.5   25.72   24.11    N/A
    67108864       1048576     float    none      -1   1522.1   44.09   41.33    N/A   1522.6   44.07   41.32    N/A
   134217728       2097152     float    none      -1   2156.8   62.23   58.34    N/A   2217.9   60.52   56.73    N/A
   268435456       4194304     float    none      -1   3685.5   72.84   68.28    N/A   3588.1   74.81   70.14    N/A
   536870912       8388608     float    none      -1   7238.0   74.17   69.54    N/A   7018.7   76.49   71.71    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 28.7753
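
The columns in the table above are related by simple arithmetic, which makes a pasted result easy to sanity-check. The sketch below assumes the run is the all_gather test across 16 ranks (for example 2 nodes with 8 GPUs each); neither the test name nor the rank count is stated next to the table, but both are consistent with its count and busbw columns. The `(n - 1) / n` factor is the bus-bandwidth correction that nccl-tests documents for all_gather. The function names are illustrative.

```python
def alg_bandwidth_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def bus_bandwidth_gbps(algbw: float, ranks: int) -> float:
    """Bus bandwidth for all_gather: algbw scaled by (n - 1) / n."""
    return algbw * (ranks - 1) / ranks


# First row of the table above: 1048576 B in 1116.2 us (out-of-place).
algbw = alg_bandwidth_gbps(1048576, 1116.2)
busbw = bus_bandwidth_gbps(algbw, ranks=16)  # 16 ranks is an assumption
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
# prints: algbw=0.94 GB/s busbw=0.88 GB/s
```

Both values match the first row of the table (0.94 and 0.88).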

A3 high - gcloud docs instructions - verified
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576         16384     float    none      -1    935.0    1.12    1.05    N/A    954.1    1.10    1.03    N/A
     2097152         32768     float    none      -1    940.7    2.23    2.09    N/A    978.2    2.14    2.01    N/A
     4194304         65536     float    none      -1    997.9    4.20    3.94    N/A    995.3    4.21    3.95    N/A
     8388608        131072     float    none      -1   1048.8    8.00    7.50    N/A   1061.5    7.90    7.41    N/A
    16777216        262144     float    none      -1   1079.6   15.54   14.57    N/A   1148.7   14.60   13.69    N/A
    33554432        524288     float    none      -1   1172.8   28.61   26.82    N/A   1180.2   28.43   26.65    N/A
    67108864       1048576     float    none      -1   1355.5   49.51   46.41    N/A   1347.9   49.79   46.68    N/A
   134217728       2097152     float    none      -1   2044.4   65.65   61.55    N/A   2015.0   66.61   62.45    N/A
   268435456       4194304     float    none      -1   3626.5   74.02   69.39    N/A   3581.8   74.94   70.26    N/A
   536870912       8388608     float    none      -1   7058.4   76.06   71.31    N/A   7020.7   76.47   71.69    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 30.5226
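
The `Avg bus bandwidth` summary line appears to be the plain mean of every busbw value in the table, out-of-place and in-place together. The sketch below checks that reading against the ten rows above. The parser is a hypothetical helper, not part of nccl-tests, and it assumes the 13-column row layout shown here.

```python
# Data rows copied from the "A3 high - gcloud docs instructions" table above.
ROWS = """\
     1048576         16384     float    none      -1    935.0    1.12    1.05    N/A    954.1    1.10    1.03    N/A
     2097152         32768     float    none      -1    940.7    2.23    2.09    N/A    978.2    2.14    2.01    N/A
     4194304         65536     float    none      -1    997.9    4.20    3.94    N/A    995.3    4.21    3.95    N/A
     8388608        131072     float    none      -1   1048.8    8.00    7.50    N/A   1061.5    7.90    7.41    N/A
    16777216        262144     float    none      -1   1079.6   15.54   14.57    N/A   1148.7   14.60   13.69    N/A
    33554432        524288     float    none      -1   1172.8   28.61   26.82    N/A   1180.2   28.43   26.65    N/A
    67108864       1048576     float    none      -1   1355.5   49.51   46.41    N/A   1347.9   49.79   46.68    N/A
   134217728       2097152     float    none      -1   2044.4   65.65   61.55    N/A   2015.0   66.61   62.45    N/A
   268435456       4194304     float    none      -1   3626.5   74.02   69.39    N/A   3581.8   74.94   70.26    N/A
   536870912       8388608     float    none      -1   7058.4   76.06   71.31    N/A   7020.7   76.47   71.69    N/A
"""


def mean_bus_bandwidth(report: str) -> float:
    """Average the out-of-place and in-place busbw columns of all data rows."""
    values = []
    for line in report.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and '#' header/summary lines
        values.append(float(fields[7]))   # out-of-place busbw
        values.append(float(fields[11]))  # in-place busbw
    return sum(values) / len(values)


avg = mean_bus_bandwidth(ROWS)
print(f"{avg:.2f}")  # prints: 30.52
```

The result agrees with the reported 30.5226 to within the rounding of the printed columns.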

A3 mega - toolkit instructions - verified
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           0             0     float    none      -1     0.23    0.00    0.00      0     0.17    0.00    0.00      0
           0             0     float    none      -1     0.16    0.00    0.00      0     0.16    0.00    0.00      0
           0             0     float    none      -1     0.16    0.00    0.00      0     0.16    0.00    0.00      0
           0             0     float    none      -1     0.17    0.00    0.00      0     0.16    0.00    0.00      0
           0             0     float    none      -1     0.16    0.00    0.00      0     0.16    0.00    0.00      0
         256             4     float    none      -1    238.9    0.00    0.00      0    241.6    0.00    0.00      0
         512             8     float    none      -1    238.3    0.00    0.00      0    238.7    0.00    0.00      0
        1024            16     float    none      -1    238.7    0.00    0.00      0    239.5    0.00    0.00      0
        2048            32     float    none      -1    237.0    0.01    0.01      0    238.9    0.01    0.01      0
        4096            64     float    none      -1    237.3    0.02    0.02      0    236.5    0.02    0.02      0
        8192           128     float    none      -1    237.8    0.03    0.03      0    237.4    0.03    0.03      0
       16384           256     float    none      -1    242.0    0.07    0.06      0    241.6    0.07    0.06      0
       32768           512     float    none      -1    240.2    0.14    0.13      0    241.4    0.14    0.13      0
       65536          1024     float    none      -1    263.6    0.25    0.23      0    264.1    0.25    0.23      0
      131072          2048     float    none      -1    265.5    0.49    0.46      0    281.4    0.47    0.44      0
      262144          4096     float    none      -1    285.1    0.92    0.86      0    292.8    0.90    0.84      0
      524288          8192     float    none      -1    315.7    1.66    1.56      0    317.3    1.65    1.55      0
     1048576         16384     float    none      -1    310.4    3.38    3.17      0    319.2    3.29    3.08      0
     2097152         32768     float    none      -1    308.9    6.79    6.36      0    316.7    6.62    6.21      0
     4194304         65536     float    none      -1    330.9   12.68   11.88      0    325.5   12.89   12.08      0
     8388608        131072     float    none      -1    356.3   23.55   22.07      0    353.1   23.76   22.27      0
    16777216        262144     float    none      -1    411.4   40.78   38.24      0    408.8   41.04   38.47      0
    33554432        524288     float    none      -1    479.9   69.92   65.55      0    479.8   69.93   65.56      0
    67108864       1048576     float    none      -1    732.7   91.59   85.86      0    730.7   91.84   86.10      0
   134217728       2097152     float    none      -1   1107.0  121.24  113.66      0   1097.1  122.33  114.69      0
   268435456       4194304     float    none      -1   1775.2  151.22  141.77      0   1777.8  151.00  141.56      0
   536870912       8388608     float    none      -1   2967.3  180.93  169.62      0   2971.7  180.66  169.37      0
  1073741824      16777216     float    none      -1   5552.6  193.38  181.29      0   5532.8  194.07  181.94      0
  2147483648      33554432     float    none      -1    10854  197.85  185.48      0    10835  198.19  185.80      0
  4294967296      67108864     float    none      -1    21502  199.75  187.26      0    21489  199.87  187.38      0
  8589934592     134217728     float    none      -1    42809  200.66  188.12      0    42774  200.82  188.27      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.3197 
#

A3 mega - gcloud docs instructions - WIP
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           0             0     float    none      -1     0.24    0.00    0.00      0     0.16    0.00    0.00      0
           0             0     float    none      -1     0.17    0.00    0.00      0     0.15    0.00    0.00      0
           0             0     float    none      -1     0.17    0.00    0.00      0     0.15    0.00    0.00      0
           0             0     float    none      -1     0.15    0.00    0.00      0     0.15    0.00    0.00      0
           0             0     float    none      -1     0.16    0.00    0.00      0     0.17    0.00    0.00      0
         256             4     float    none      -1    234.3    0.00    0.00      0    236.8    0.00    0.00      0
         512             8     float    none      -1    233.9    0.00    0.00      0    231.2    0.00    0.00      0
        1024            16     float    none      -1    232.8    0.00    0.00      0    231.8    0.00    0.00      0
        2048            32     float    none      -1    231.8    0.01    0.01      0    233.9    0.01    0.01      0
        4096            64     float    none      -1    232.6    0.02    0.02      0    232.7    0.02    0.02      0
        8192           128     float    none      -1    233.5    0.04    0.03      0    234.4    0.03    0.03      0
       16384           256     float    none      -1    236.1    0.07    0.07      0    236.4    0.07    0.06      0
       32768           512     float    none      -1    235.3    0.14    0.13      0    236.4    0.14    0.13      0
       65536          1024     float    none      -1    261.4    0.25    0.24      0    261.9    0.25    0.23      0
      131072          2048     float    none      -1    263.6    0.50    0.47      0    275.5    0.48    0.45      0
      262144          4096     float    none      -1    274.2    0.96    0.90      0    274.7    0.95    0.89      0
      524288          8192     float    none      -1    317.5    1.65    1.55      0    310.5    1.69    1.58      0
     1048576         16384     float    none      -1    316.2    3.32    3.11      0    318.0    3.30    3.09      0
     2097152         32768     float    none      -1    320.2    6.55    6.14      0    311.7    6.73    6.31      0
     4194304         65536     float    none      -1    326.3   12.85   12.05      0    333.6   12.57   11.79      0
     8388608        131072     float    none      -1    350.2   23.95   22.46      0    347.3   24.16   22.65      0
    16777216        262144     float    none      -1    408.1   41.11   38.54      0    400.9   41.85   39.24      0
    33554432        524288     float    none      -1    475.6   70.55   66.14      0    470.7   71.28   66.82      0
    67108864       1048576     float    none      -1    722.6   92.87   87.07      0    737.4   91.01   85.32      0
   134217728       2097152     float    none      -1   1111.1  120.80  113.25      0   1107.7  121.17  113.60      0
   268435456       4194304     float    none      -1   1730.4  155.13  145.43      0   1718.6  156.20  146.44      0
   536870912       8388608     float    none      -1   2869.5  187.09  175.40      0   2876.6  186.63  174.97      0
  1073741824      16777216     float    none      -1   5512.5  194.78  182.61      0   5536.8  193.93  181.81      0
  2147483648      33554432     float    none      -1    10843  198.04  185.67      0    10822  198.45  186.04      0
  4294967296      67108864     float    none      -1    21450  200.23  187.72      0    21594  198.90  186.47      0
  8589934592     134217728     float    none      -1    42672  201.30  188.72      0    42654  201.38  188.80      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 45.7171

A3 ultra - gcloud toolkit instructions - verified
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16     float    none      -1    44.22    0.02    0.02      0    43.28    0.02    0.02      0
        2048            32     float    none      -1    43.02    0.05    0.04      0    43.17    0.05    0.04      0
        4096            64     float    none      -1    43.09    0.10    0.09      0    43.02    0.10    0.09      0
        8192           128     float    none      -1    43.38    0.19    0.18      0    43.48    0.19    0.18      0
       16384           256     float    none      -1    44.04    0.37    0.35      0    44.16    0.37    0.35      0
       32768           512     float    none      -1    45.15    0.73    0.68      0    45.02    0.73    0.68      0
       65536          1024     float    none      -1    47.33    1.38    1.30      0    47.14    1.39    1.30      0
      131072          2048     float    none      -1    47.30    2.77    2.60      0    49.76    2.63    2.47      0
      262144          4096     float    none      -1    51.89    5.05    4.74      0    50.20    5.22    4.90      0
      524288          8192     float    none      -1    54.76    9.57    8.98      0    53.83    9.74    9.13      0
     1048576         16384     float    none      -1    72.77   14.41   13.51      0    72.46   14.47   13.57      0
     2097152         32768     float    none      -1    76.52   27.40   25.69      0    74.89   28.00   26.25      0
     4194304         65536     float    none      -1    82.02   51.14   47.94      0    80.70   51.98   48.73      0
     8388608        131072     float    none      -1    91.48   91.70   85.97      0    93.68   89.54   83.95      0
    16777216        262144     float    none      -1    121.2  138.39  129.74      0    121.7  137.87  129.26      0
    33554432        524288     float    none      -1    182.4  183.98  172.48      0    178.9  187.61  175.89      0
    67108864       1048576     float    none      -1    280.4  239.32  224.36      0    279.6  239.98  224.98      0
   134217728       2097152     float    none      -1    497.3  269.90  253.03      0    490.5  273.61  256.51      0
   268435456       4194304     float    none      -1    864.0  310.70  291.28      0    853.7  314.45  294.80      0
   536870912       8388608     float    none      -1   1556.4  344.95  323.39      0   1552.0  345.92  324.30      0
  1073741824      16777216     float    none      -1   3083.6  348.21  326.45      0   3041.0  353.09  331.02      0
  2147483648      33554432     float    none      -1   6135.2  350.03  328.15      0   6035.9  355.79  333.55      0
  4294967296      67108864     float    none      -1    12239  350.92  328.99      0    11961  359.09  336.65      0
  8589934592     134217728     float    none      -1    24403  352.01  330.01      0    23913  359.21  336.76      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 121.569 
#

A3 ultra - gcloud custom cluster instructions (2 node) - verified
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024            16     float    none      -1    45.61    0.02    0.02      0    44.73    0.02    0.02      0
        2048            32     float    none      -1    44.93    0.05    0.04      0    44.73    0.05    0.04      0
        4096            64     float    none      -1    45.31    0.09    0.08      0    45.28    0.09    0.08      0
        8192           128     float    none      -1    45.25    0.18    0.17      0    45.58    0.18    0.17      0
       16384           256     float    none      -1    46.55    0.35    0.33      0    46.37    0.35    0.33      0
       32768           512     float    none      -1    48.21    0.68    0.64      0    47.47    0.69    0.65      0
       65536          1024     float    none      -1    49.49    1.32    1.24      0    50.21    1.31    1.22      0
      131072          2048     float    none      -1    49.04    2.67    2.51      0    52.31    2.51    2.35      0
      262144          4096     float    none      -1    50.85    5.16    4.83      0    51.11    5.13    4.81      0
      524288          8192     float    none      -1    55.68    9.42    8.83      0    56.90    9.21    8.64      0
     1048576         16384     float    none      -1    75.04   13.97   13.10      0    73.83   14.20   13.32      0
     2097152         32768     float    none      -1    76.15   27.54   25.82      0    76.77   27.32   25.61      0
     4194304         65536     float    none      -1    78.74   53.27   49.94      0    78.70   53.30   49.97      0
     8388608        131072     float    none      -1    94.29   88.96   83.40      0    91.44   91.73   86.00      0
    16777216        262144     float    none      -1    122.3  137.20  128.63      0    124.0  135.28  126.82      0
    33554432        524288     float    none      -1    181.4  184.93  173.37      0    181.4  185.01  173.45      0
    67108864       1048576     float    none      -1    280.6  239.20  224.25      0    276.1  243.03  227.84      0
   134217728       2097152     float    none      -1    498.5  269.22  252.40      0    494.8  271.28  254.33      0
   268435456       4194304     float    none      -1    861.7  311.50  292.03      0    857.8  312.94  293.38      0
   536870912       8388608     float    none      -1   1558.0  344.60  323.06      0   1556.1  345.02  323.45      0
  1073741824      16777216     float    none      -1   3082.3  348.36  326.59      0   3046.2  352.48  330.45      0
  2147483648      33554432     float    none      -1   6142.6  349.61  327.76      0   6037.7  355.68  333.45      0
  4294967296      67108864     float    none      -1    12235  351.05  329.11      0    12002  357.87  335.50      0
  8589934592     134217728     float    none      -1    24433  351.57  329.59      0    23902  359.38  336.92      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 121.386 
#

A3 ultra - gcloud custom cluster instructions (Jobset) - verified
Broken, but expected - these instructions use the TAS plugin

@annuay-google annuay-google changed the title from "Add static name for networks" to "Use static naming prefixes for kubernetes network objects" Feb 20, 2025
@annuay-google annuay-google changed the title from "Use static naming prefixes for kubernetes network objects" to "Standardize naming prefixes for kubernetes network objects" Feb 20, 2025
@annuay-google annuay-google marked this pull request as ready for review February 20, 2025 13:43
@annuay-google annuay-google requested review from a team as code owners February 20, 2025 13:43
Contributor

@annuay-google annuay-google left a comment

LGTM

@@ -43,11 +43,11 @@ locals {
"a3-megagpu-8g" = {
# Manifest to be installed for enabling TCPXO on a3-megagpu-8g machines
gpu_direct_manifests = [
-"https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/b324ec8994aa98ca320438dd2d01ff6d7f9165bb/gpudirect-tcpxo/nccl-tcpxo-installer.yaml", # nccl_plugin v1.0.7 for tcpxo
-"https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/b324ec8994aa98ca320438dd2d01ff6d7f9165bb/nri_device_injector/nri-device-injector.yaml", # nri_plugin
+"https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/39308db7574925ea3c14f9113fcf87f70a6fcc26/gpudirect-tcpxo/nccl-tcpxo-installer.yaml", # nccl_plugin v1.0.7 for tcpxo
Member

v1.0.8-1

@@ -52,6 +52,7 @@ locals {
initial_node_set = try(var.initial_node_count > 0, false)

module_unique_id = replace(lower(var.internal_ghpc_module_id), "/[^a-z0-9\\-]/", "")
+nccl_path = var.machine_type == "a3-highgpu-8g" ? "configs" : "scripts"
Member

If this is only used in outputs.tf, please put it there instead.
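
For readers less used to Terraform expressions, the two locals in the hunk above behave roughly like the following Python. This is an illustrative translation only, with hypothetical function names; it is not code from the PR.

```python
import re


def module_unique_id(module_id: str) -> str:
    """Lowercase the id, then drop every character outside a-z, 0-9 and '-'."""
    return re.sub(r"[^a-z0-9\-]", "", module_id.lower())


def nccl_path(machine_type: str) -> str:
    """Pick the directory name: 'configs' for a3-highgpu-8g, else 'scripts'."""
    return "configs" if machine_type == "a3-highgpu-8g" else "scripts"


print(module_unique_id("GKE_Node-Pool.1"))  # prints: gkenode-pool1
print(nccl_path("a3-megagpu-8g"))           # prints: scripts
```

The example input `GKE_Node-Pool.1` is made up to show which characters get stripped.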
