AllReduce Training Using DLRover on Public Cloud

This document explains how to run a DLRover elastic job with torchrun on a public cloud, using Alibaba Cloud Container Service for Kubernetes (ACK) as the example.

Prerequisites

Create a Kubernetes cluster on ACK.
Configure cluster credentials on your local computer.
Create a NAS file system and mount it to the cluster.
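
Before continuing, it can help to confirm that the local tools and credentials are in place. The following is a minimal sketch, not part of DLRover: `require_cmds` and `preflight` are hypothetical helper names, and the sketch assumes that `kubectl` and `make` (the two tools used by the commands in this document) are all you need locally.

```shell
# Hypothetical helper: return 0 if every named command is on PATH;
# otherwise report each missing command on stderr and return 1.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: check the tools, then show which cluster the
# local credentials point at and list that cluster's nodes.
preflight() {
  require_cmds kubectl make || return 1
  kubectl config current-context
  kubectl get nodes
}
```

Run `preflight` after sourcing the snippet. If the printed context is not the ACK cluster you created, switch contexts before deploying anything.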

Deploy the ElasticJob CRD on ACK

Deploy the controller on the cluster.

make deploy IMG=easydl/elasticjob-controller:master

Grant the DLRover master permission to access CRDs.

kubectl -n dlrover apply -f dlrover/go/operator/config/rbac/default_role.yaml

Submit a Job

Submit a job to train a CNN model on the MNIST dataset.

kubectl -n dlrover apply -f dlrover/examples/torch_mnist.yaml

Check the job status

kubectl -n dlrover get elasticjob torch-mnist

NAME          PHASE     AGE
torch-mnist   Running   19h
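
For scripting, the phase can be read from the same table output. This is a small sketch that assumes the two leading columns are NAME and PHASE, as in the listing above; `job_phase` is a hypothetical helper name, not a DLRover or kubectl command.

```shell
# Hypothetical helper: read `kubectl get elasticjob` output on stdin and
# print the PHASE column of the row whose NAME matches the first argument.
job_phase() {
  awk -v name="$1" '$1 == name { print $2 }'
}
```

For example, `kubectl -n dlrover get elasticjob torch-mnist | job_phase torch-mnist` prints the current phase of the job, or nothing if the job is not listed.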

Check the Pod status

kubectl -n dlrover get pods -l elasticjob-name=torch-mnist

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          26s
torch-mnist-edljob-worker-0             1/1     Running   0          29s
torch-mnist-edljob-worker-1             1/1     Running   0          32s

Test Fault-tolerance

Delete a worker.

kubectl -n dlrover delete pod torch-mnist-edljob-worker-1

Then, we can see there is only one worker.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m12s
torch-mnist-edljob-worker-0             1/1     Running   0          1m15s

After a while, DLRover restores the deleted worker.

NAME                                    READY   STATUS    RESTARTS   AGE
elasticjob-torch-mnist-dlrover-master   1/1     Running   0          1m52s
torch-mnist-edljob-worker-0             1/1     Running   0          1m55s
torch-mnist-edljob-worker-1             1/1     Running   0          32s
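
To watch the recovery without reading the table by eye, the Running workers can be counted from the Pod listing. The sketch below assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE) and that worker Pod names contain `-worker-`, as in the listings above; `count_running_workers` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# how many Pods whose name contains "-worker-" have STATUS "Running".
# The header row and the master Pod do not match the name pattern.
count_running_workers() {
  awk '$1 ~ /-worker-/ && $3 == "Running" { n++ } END { print n + 0 }'
}
```

For example, `kubectl -n dlrover get pods -l elasticjob-name=torch-mnist | count_running_workers` should print 1 right after the deletion and 2 once the worker has been restored.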
