Closed
36 changes: 21 additions & 15 deletions axlearn/cloud/gcp/job.py
@@ -563,23 +563,29 @@ def _build_pod(self) -> Nested[Any]:
}
)

+        pod_spec = dict(
+            terminationGracePeriodSeconds=60,
+            # Fail if any pod fails, and allow retries to happen at JobSet level.
+            restartPolicy="Never",
+            nodeSelector={
+                "cloud.google.com/gke-tpu-accelerator": system.gke_accelerator,
+                "cloud.google.com/gke-tpu-topology": system.topology,
+                **selector,
+            },
+            tolerations=tolerations,
+            containers=[self._build_container()],
+            serviceAccountName=cfg.service_account,
+            volumes=volumes,
+        )
+
+        # hostNetwork True and dnsPolicy do not work with Workload Identity and GCS Fuse.
Contributor

I suppose we should also check here that Workload Identity is not being used, not just GCS Fuse. As synced offline, I will do more testing to see whether this is necessary before merging.

Contributor

@amcw7777, Aug 12, 2024

Can I get more context on why hostNetwork and dnsPolicy do not work with Workload Identity?

Contributor Author

GKE Workload Identity relies on forcing a pod's metadata traffic to a specific pod. If you set hostNetwork: true, that is no longer possible, so Workload Identity doesn't work for that pod.

Note that your GKE clusters can keep using Workload Identity. The issue is only that it won't work for any pod that uses hostNetwork: true. All your other pods, which use hostNetwork: false, can continue to use Workload Identity just as before.
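
A minimal sketch of the gating this implies, not the PR's actual code: host networking is enabled only when neither GCS Fuse nor Workload Identity is in play. The function name `build_pod_spec` and the `uses_workload_identity` flag are hypothetical and exist only for illustration; the PR itself checks `cfg.gcsfuse_mount` alone.

```python
from typing import Any, Dict, Optional


def build_pod_spec(
    base_spec: Dict[str, Any],
    *,
    gcsfuse_mount: Optional[Any] = None,
    uses_workload_identity: bool = False,  # Hypothetical flag, not in the PR.
) -> Dict[str, Any]:
    """Returns a copy of `base_spec`, enabling host networking only when safe."""
    pod_spec = dict(base_spec)  # Shallow copy so the caller's dict is untouched.
    # Pods with hostNetwork: true cannot use Workload Identity, and the PR also
    # skips host networking when GCS Fuse is mounted, so leave both keys unset
    # in either case.
    if not gcsfuse_mount and not uses_workload_identity:
        pod_spec["hostNetwork"] = True
        pod_spec["dnsPolicy"] = "ClusterFirstWithHostNet"
    return pod_spec
```

Leaving the keys out entirely, rather than setting `hostNetwork` to `False`, keeps the spec identical to one that never mentioned host networking.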

Contributor

We can attach the right service account (the same one used in the container) to the GKE node pools instead of using the default one. Then it should work with hostNetwork=True.

+        if not cfg.gcsfuse_mount:
+            pod_spec["hostNetwork"] = True
+            pod_spec["dnsPolicy"] = "ClusterFirstWithHostNet"
+
         return dict(
             metadata=dict(annotations=annotations),
-            spec=dict(
-                # NOTE: Don't set hostNetwork or dnsPolicy for compat with Workload Identity.
-                terminationGracePeriodSeconds=60,
-                # Fail if any pod fails, and allow retries to happen at JobSet level.
-                restartPolicy="Never",
-                nodeSelector={
-                    "cloud.google.com/gke-tpu-accelerator": system.gke_accelerator,
-                    "cloud.google.com/gke-tpu-topology": system.topology,
-                    **selector,
-                },
-                tolerations=tolerations,
-                containers=[self._build_container()],
-                serviceAccountName=cfg.service_account,
-                volumes=volumes,
-            ),
+            spec=pod_spec,
         )
 
     def _build_job(self) -> Nested[Any]: