feat(tpu): add tpu vm create topology sample. #9611

Merged
TetyanaYahodska merged 29 commits into main from tpu_vm_create_topology on Nov 20, 2024

@TetyanaYahodska TetyanaYahodska commented Oct 29, 2024

Description

Documentation - https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm

Sample in Python - GoogleCloudPlatform/python-docs-samples#12719
Fixes #

Note: Before submitting a pull request, please open an issue for discussion if you are not associated with Google.

Checklist

  • I have followed Sample Format Guide
  • pom.xml parent set to latest shared-configuration
  • Appropriate changes to README are included in PR
  • These samples need a new API enabled in testing projects to pass (let us know which ones)
  • These samples need a new/updated env vars in testing projects set to pass (let us know which ones)
  • Tests pass: mvn clean verify required
  • Lint passes: mvn -P lint checkstyle:check required
  • Static Analysis: mvn -P lint clean compile pmd:cpd-check spotbugs:check advisory only
  • This sample adds a new sample directory, and I updated the CODEOWNERS file with the codeowners for this sample
  • This sample adds a new Product API, and I updated the Blunderbuss issue/PR auto-assigner with the codeowners for this sample
  • Please merge this PR for me once it is approved

minherz commented Nov 18, 2024

We have separate samples for tpu_vm_create_topology and tpu_vm_create for other languages. Let's keep it for Java as well.
Here is Python sample - https://github.com/GoogleCloudPlatform/python-docs-samples/blob/115ba78aad1db3129b2ae2d5f7289288f336a18e/tpu/create_tpu_topology.py

As far as I understand, your comment responds to my question about reusing the code sample CreateTpuWithTopologyFlag to demonstrate both creating a TPU and creating a TPU with a topology flag. Let me first address your argument regarding other languages:

Arguments of this kind (i.e., it is done this way in another code sample or in another language) do not show whether an approach is right or wrong, because we used to do many things differently than we do today.

To elaborate on my question: we use region tags to show code samples in the documentation, and nothing prevents showing the same code in two code snippets. My comment comes from the fact that these two samples are exactly the same except for a single line, so keeping both creates unnecessary maintenance overhead.
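
The region-tag idea above can be sketched roughly as follows. This is an illustration, not code from the PR: it assumes the documentation tooling accepts nested region tags around the same lines, it reuses the tag names tpu_vm_create and tpu_vm_create_topology mentioned earlier in this thread, and the class name and method body are hypothetical stand-ins for the real TPU client calls.

```java
// Hypothetical class name; the real sample in this PR is CreateTpuWithTopologyFlag.
class RegionTagSketch {
  // Both tags wrap the same lines, so two documentation snippets could pull
  // from one source file (assumption: the tooling allows nested tags).
  // [START tpu_vm_create]
  // [START tpu_vm_create_topology]
  static String nodeParent(String projectId, String zone) {
    // Stand-in body: the real sample would go on to build and send a CreateNodeRequest.
    return String.format("projects/%s/locations/%s", projectId, zone);
  }
  // [END tpu_vm_create_topology]
  // [END tpu_vm_create]
}
```

With this layout there is one copy of the code to maintain, at the cost of both documentation snippets showing identical text.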

TetyanaYahodska commented

Actually, it is not a single-line comment. We have different parameters and builder chains to create the Node. It would not be user friendly to have all these lines commented.

minherz commented Nov 19, 2024

@TetyanaYahodska I have looked at the code from CreateTpuWithTopologyFlag.java in this PR and the code from CreateTpuVm.java in the same branch. Below are excerpts from both samples:

  1. CreateTpuWithTopologyFlag.java:
    TpuSettings.Builder clientSettings =
        TpuSettings.newBuilder();
    clientSettings
        .createNodeOperationSettings()
        .setPollingAlgorithm(
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofMillis(5000L))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofMillis(45000L))
                    .setInitialRpcTimeout(Duration.ZERO)
                    .setRpcTimeoutMultiplier(1.0)
                    .setMaxRpcTimeout(Duration.ZERO)
                    .setTotalTimeout(Duration.ofHours(24L))
                    .build()));
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create(clientSettings.build())) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);
      Node tpuVm = Node.newBuilder()
              .setName(nodeName)
              .setAcceleratorConfig(Node.newBuilder()
                  .getAcceleratorConfigBuilder()
                  .setType(tpuVersion)
                  .setTopology(topology)
                  .build())
              .setRuntimeVersion(tpuSoftwareVersion)
              .build();
      CreateNodeRequest request =
          CreateNodeRequest.newBuilder()
              .setParent(parent)
              .setNodeId(nodeName)
              .setNode(tpuVm)
              .build();
      return tpuClient.createNodeAsync(request).get();
    }
  2. CreateTpuVm.java:
    TpuSettings.Builder clientSettings =
        TpuSettings.newBuilder();
    clientSettings
        .createNodeOperationSettings()
        .setPollingAlgorithm(
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofMillis(5000L))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofMillis(45000L))
                    .setInitialRpcTimeout(Duration.ZERO)
                    .setRpcTimeoutMultiplier(1.0)
                    .setMaxRpcTimeout(Duration.ZERO)
                    .setTotalTimeout(Duration.ofHours(24L))
                    .build()));
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (TpuClient tpuClient = TpuClient.create(clientSettings.build())) {
      String parent = String.format("projects/%s/locations/%s", projectId, zone);
      Node tpuVm = Node.newBuilder()
              .setName(nodeName)
              .setAcceleratorType(tpuType)
              .setRuntimeVersion(tpuSoftwareVersion)
              .build();
      CreateNodeRequest request = CreateNodeRequest.newBuilder()
              .setParent(parent)
              .setNodeId(nodeName)
              .setNode(tpuVm)
              .build();
      return tpuClient.createNodeAsync(request).get();
    }

It is easy to see that the only difference is in the code that calls setAcceleratorConfig() vs. setAcceleratorType().
This difference is too subtle for developers to recognize easily. Compared with the Python code you referenced in your previous reply, when these two excerpts are placed side by side it is very hard to find the difference.

Actually, it is not a single-line comment. We have different parameters and builder chains to create the Node. It would not be user friendly to have all these lines commented.

Taking into account what I explained above, I do not understand what you mean by "not user friendly". Note that neither of these code samples is currently used in any documentation, so arguments about readability or friendliness can be made only on the basis of the code. The fact is that it is not easy to spot the difference between these two calls (setAcceleratorConfig() and setAcceleratorType()) in the code.

I am going to approve this PR to unblock your progress. However, introducing nearly identical code samples without documentation that provides the context and, possibly, elaborates on the implementation details is not a recommended practice.

@minherz minherz left a comment

The change introduces duplicated code samples.

@TetyanaYahodska TetyanaYahodska merged commit bcecf8f into main Nov 20, 2024
10 checks passed
@TetyanaYahodska TetyanaYahodska deleted the tpu_vm_create_topology branch November 20, 2024 15:11
m-strzelczyk commented

@minherz I see your point; however, I think that in this situation a separate code sample is justified.

The difference in the middle (setting the accelerator config vs. the accelerator type), if packed inside a single method with if-else or something similar, would complicate the sample. I really want to keep those samples as simple as possible because, with Java in particular, they take up a lot of lines and might feel overwhelming. I know that the number of lines should not be a deciding factor, but introducing if-else into a sample that is supposed to do one specific thing does add a bit of friction for the user reading the sample.

I'd like to point out an example in Python, where I did decide to keep reusing a single sample and just add various if-else blocks, since each one didn't seem worthy of forking into a new sample. The VM instance creation sample, a very simple operation, grows to ~130 lines of code, handling various options and possibilities. If I were to do it again, I think I would rather duplicate some code to keep it all simpler.
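
To make the trade-off concrete, here is a rough sketch of what a single combined sample with an if-else could look like. It is not code from the PR: to stay runnable without the Cloud TPU client library, it records the fields each path would set in a plain map, and the class name, method name, parameters, and field keys are all hypothetical stand-ins for the Node builder calls shown in the excerpts above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a map of field names replaces the real Node builder.
class CombinedTpuNodeSketch {
  // Returns the fields a combined sample would set on the node.
  // A null topology selects the plain accelerator-type path.
  static Map<String, String> nodeFields(
      String nodeName,
      String runtimeVersion,
      String acceleratorType,
      String tpuVersion,
      String topology) {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("name", nodeName);
    if (topology != null) {
      // Mirrors the setAcceleratorConfig(...) chain: type plus topology.
      fields.put("acceleratorConfig.type", tpuVersion);
      fields.put("acceleratorConfig.topology", topology);
    } else {
      // Mirrors the single setAcceleratorType(...) call.
      fields.put("acceleratorType", acceleratorType);
    }
    fields.put("runtimeVersion", runtimeVersion);
    return fields;
  }
}
```

Passing null for topology takes the setAcceleratorType-style path, and passing an illustrative value such as "2x2" takes the setAcceleratorConfig-style path. The caller also has to supply parameters that only one of the two paths uses, which is the kind of friction described above.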

Both samples will be used on this page where the gcloud tab for creation provides two different commands for the regular creation and topology creation.

rohithrajan-ai pushed a commit to rohithrajan-ai/java-docs-samples that referenced this pull request Nov 26, 2024
* Changed package, added information to CODEOWNERS

* Added information to CODEOWNERS

* Added timeout

* Fixed parameters for test

* Fixed DeleteTpuVm and naming

* Added comment, created Util class

* Fixed naming

* Fixed whitespace

* Split PR into smaller, deleted redundant code

* Implemented tpu_vm_create_topology sample, created test

* Changed zone

* Fixed empty lines and tests, deleted cleanup method

* Fixed tests

* Fixed test

* Fixed imports

* Increased timeout to 10 sec

* Fixed tests

* Fixed tests

* Deleted settings

* Made ByteArrayOutputStream bout as local variable

* Changed timeout to 10 sec

Labels: api: tpu, kokoro:run, samples