This document explains how to run a DLRover elastic job using torchrun on a public cloud, namely Alibaba Cloud Container Service for Kubernetes (ACK).
- Create a Kubernetes cluster on ACK.
- Configure cluster credentials on your local computer.
- Create NAS storage and mount it to the cluster.
- Deploy the controller on the cluster.
make deploy IMG=easydl/elasticjob-controller:master
- Grant the DLRover master permission to access CRDs.
kubectl -n dlrover apply -f dlrover/go/operator/config/rbac/default_role.yaml
- Submit a job to train a CNN model on the MNIST dataset.
kubectl -n dlrover apply -f dlrover/examples/torch_mnist.yaml
- Check the job status.
kubectl -n dlrover get elasticjob torch-mnist
NAME PHASE AGE
torch-mnist Running 19h
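
When scripting this check, it can help to extract only the phase instead of reading the table by eye. The following is a minimal sketch, not part of DLRover: `job_phase` is a hypothetical helper name, and it assumes the output keeps the `NAME PHASE AGE` column order shown above.

```shell
# Hypothetical helper: read `kubectl get elasticjob` output on stdin and
# print the PHASE column for the job whose NAME matches the first argument.
# Assumes the column order NAME PHASE AGE shown in the sample output.
job_phase() {
  awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, `kubectl -n dlrover get elasticjob torch-mnist | job_phase torch-mnist` would print the phase of the job submitted above.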
- Check the Pod status.
kubectl -n dlrover get pods -l elasticjob-name=torch-mnist
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 26s
torch-mnist-edljob-worker-0 1/1 Running 0 29s
torch-mnist-edljob-worker-1 1/1 Running 0 32s
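
To focus on the workers and skip the master Pod, the listing can be filtered by name. This is a small sketch under two assumptions taken from the sample output above: worker Pod names end in `-worker-<index>`, and the third column is STATUS. The helper name `worker_status` is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# "<pod name> <status>" for each worker Pod.
# Assumes worker names end in "-worker-<index>" and STATUS is column 3,
# as in the sample output.
worker_status() {
  awk '$1 ~ /-worker-[0-9]+$/ { print $1, $3 }'
}
```

It could be used as `kubectl -n dlrover get pods -l elasticjob-name=torch-mnist | worker_status`.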
- Delete a worker.
kubectl -n dlrover delete pod torch-mnist-edljob-worker-1
Then only one worker remains.
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 1m12s
torch-mnist-edljob-worker-0 1/1 Running 0 1m15s
After a while, DLRover restores the deleted worker.
NAME READY STATUS RESTARTS AGE
elasticjob-torch-mnist-dlrover-master 1/1 Running 0 1m52s
torch-mnist-edljob-worker-0 1/1 Running 0 1m55s
torch-mnist-edljob-worker-1 1/1 Running 0 32s
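
Rather than re-running the listing by hand until the worker comes back, the recovery can be polled. The sketch below is one possible approach, not a DLRover feature: `wait_for_workers` is a hypothetical helper, the namespace and label are the ones used in this walkthrough, and the worker naming pattern and column order are assumed from the sample output. It relies on the standard `--no-headers` option of `kubectl get`.

```shell
# Hypothetical helper: poll until at least `expected` worker Pods are Running.
# Usage: wait_for_workers <expected> [timeout_seconds] [interval_seconds]
# Assumes worker names end in "-worker-<index>" and STATUS is column 3.
wait_for_workers() {
  expected="$1"
  timeout="${2:-300}"
  interval="${3:-5}"   # must be greater than 0, or the loop never times out
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    # Count worker Pods whose STATUS column reads "Running".
    running="$(kubectl -n dlrover get pods -l elasticjob-name=torch-mnist --no-headers |
      awk '$1 ~ /-worker-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }')"
    if [ "$running" -ge "$expected" ]; then
      echo "workers running: $running"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out waiting for $expected workers" >&2
  return 1
}
```

After deleting a worker, `wait_for_workers 2` would return once both workers are Running again, or fail after the default 300 seconds.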