Commit 7881eaa

Remove deprecated rcp-caas-test cluster (#11)
* Remove deprecated rcp-caas-test cluster
* Change references to rcp-prod to simply rcp
1 parent 6d517f1 commit 7881eaa

4 files changed: +14 -37 lines changed

Diff for: README.md (+7 -14)
@@ -43,7 +43,7 @@ The step-by-step instructions for first time users to quickly get a job running.
 
 > [!TIP]
 > After completing the setup, the **TL;DR** of the interaction with the cluster (using the scripts in this repo) is:
-> * Choose a cluster and just run the command to set it up: `ic-cluster`, `rcp-cluster`, or `rcp-cluster-prod`
+> * Choose a cluster and just run the command to set it up: `ic-cluster` or `rcp-cluster`
 >
 > * Get a running job with one GPU that is reserved for you: `python csub.py -n sandbox`
 >
@@ -96,18 +96,12 @@ curl -o ~/.kube/config https://raw.githubusercontent.com/epfml/getting-started/
 # Sketch for macOS with Apple Silicon
 # Download the CLI from the link shown in the help section.
 # for Linux: replace `darwin` with `linux`
-wget --content-disposition https://rcp-caas-test.rcp.epfl.ch/cli/darwin
+wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin
 # Give it the right permissions and move it.
 chmod +x ./runai
 sudo mv ./runai /usr/local/bin/runai-rcp
 sudo chown root: /usr/local/bin/runai-rcp
 
-# Repeat for RCP Prod Cluster
-wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin
-chmod +x ./runai
-sudo mv ./runai /usr/local/bin/runai-rcp-prod
-sudo chown root: /usr/local/bin/runai-rcp-prod
-
 # Repeat for IC Cluster
 # for Linux: replace `macos` with `linux`
 wget --content-disposition https://go.epfl.ch/iccluster-runai-macos
@@ -128,7 +122,7 @@ runai-ic list projects
 # Put default project
 runai-ic config project mlo-$GASPAR_USERNAME
 # Repeat for the RCP cluster
-runai-rcp config cluster rcp-caas-test
+runai-rcp config cluster rcp-caas
 runai-rcp login
 runai-rcp list projects
 runai-rcp config project mlo-$GASPAR_USERNAME
@@ -151,11 +145,10 @@ source ~/.zshrc
 # Let's use the normal RCP cluster
 rcp-cluster
 # Try to submit a job that mounts our shared storage and see its content.
-# (side note: on the new rcp-prod, the pvc is called mlo-scratch, so the arg below has to be changed)
 runai submit \
 --name setup-test-storage \
 --image ubuntu \
---pvc runai-mlo-$GASPAR_USERNAME-scratch:/mloscratch \
+--pvc mlo-scratch:/mloscratch \
 -- ls -la /mloscratch/homes
 # Check the status of the job
 runai describe job setup-test-storage
@@ -235,7 +228,7 @@ For remote development (changing code, debugging, etc.), we recommend using VSCo
 >
 > To have a job that can run in the background, do `python csub.py -n sandbox --train --command "cd /mloscratch/homes/<your username>/<your code>; python main.py "`
 >
-> There are differences between the clusters of IC and RCP, which require different tool versions (`runai-ic`, `runai-rcp`, ...). Since this is a bit of a hassle, we made it easy to switch between the clusters via the commands `ic-cluster`, `rcp-cluster` and `rcp-cluster-prod`. To make sure you're aware of the cluster you're using, the `csub` script asks you to set the cluster to use before submitting a job: `python csub.py -n sandbox --cluster ic-caas` (choosing between `["rcp-caas-test", "ic-caas", "rcp-caas-prod"]`). It only works when the cluster argument matches your currently chosen cluster.
+> There are differences between the clusters of IC and RCP, which require different tool versions (`runai-ic`, `runai-rcp`, ...). Since this is a bit of a hassle, we made it easy to switch between the clusters via the commands `ic-cluster` and `rcp-cluster`. To make sure you're aware of the cluster you're using, the `csub` script asks you to set the cluster to use before submitting a job: `python csub.py -n sandbox --cluster ic-caas` (choosing between `["ic-caas", "rcp-caas"]`). It only works when the cluster argument matches your currently chosen cluster.
 
 You're good to go now! :) It's up to you to customize your environment and install the packages you need. Read up on the rest of this README to learn more about the cluster and the scripts.
 
@@ -338,7 +331,7 @@ The python script `csub.py` is a wrapper around the run:ai CLI that makes it eas
 General usage:
 
 ```bash
-python csub.py --n <job_name> -g <number of GPUs> -t <time> --cluster rcp-caas-test -i ic-registry.epfl.ch/mlo/mlo:v1 --command <cmd> [--train]
+python csub.py --n <job_name> -g <number of GPUs> -t <time> --cluster rcp-caas -i ic-registry.epfl.ch/mlo/mlo:v1 --command <cmd> [--train]
 ```
 Check the arguments for the script to see what they do.
 
@@ -382,7 +375,7 @@ kubectl port-forward <pod_name> 8888:8888
 ```
 
 ## Distributed training
-Newer versions of runai support distributed training, meaning the ability to use run accross multiple compute nodes, even beyond the several GPUs available on one node. This is currently set up on the new RCP Prod cluster (rcp-caas-prod).
+Newer versions of runai support distributed training, meaning the ability to use run accross multiple compute nodes, even beyond the several GPUs available on one node. This is currently set up on the new RCP Prod cluster (rcp-caas).
 A nice [documentation to get started with distributed jobs is available here](docs/multinode.md).
 
 # File overview of this repository

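The README text in this diff states that `csub.py` only works when the `--cluster` argument matches the currently chosen cluster. A minimal sketch of such a guard follows. The function name `ensure_cluster_matches` and its parameters are hypothetical, chosen for illustration; the actual check in `csub.py` is not visible in this commit.

```python
def ensure_cluster_matches(requested: str, current: str) -> None:
    """Raise if the requested cluster differs from the active one.

    Hypothetical helper: `requested` stands for the value passed via
    --cluster, `current` for the cluster the CLI is currently pointed at.
    """
    if requested != current:
        raise ValueError(
            f"Requested cluster {requested!r} does not match the current "
            f"cluster {current!r}; run ic-cluster or rcp-cluster first."
        )


if __name__ == "__main__":
    # Matching names pass silently.
    ensure_cluster_matches("rcp-caas", "rcp-caas")
    # A mismatch is reported instead of submitting to the wrong cluster.
    try:
        ensure_cluster_matches("ic-caas", "rcp-caas")
    except ValueError as err:
        print(err)
```

Failing early like this keeps a job from being submitted to a cluster other than the one the user intended.
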
Diff for: csub.py (+4 -5)
@@ -21,8 +21,8 @@
 "-cl",
 "--cluster",
 type=str,
-default="rcp-caas-test",
-choices=["rcp-caas-test", "ic-caas", "rcp-caas-prod"],
+default="rcp-caas",
+choices=["ic-caas", "rcp-caas"],
 )
 parser.add_argument(
 "-c",
@@ -147,11 +147,10 @@
 text=True,
 ).stdout.strip()
 
-if current_cluster == "rcp-caas-prod":
+if current_cluster == "rcp-caas":
+# the latest version can be found on https://wiki.rcp.epfl.ch/home/CaaS/FAQ/how-to-prepare-environment
 runai_cli_version = "2.16.70"
 scratch_name = "mlo-scratch"
-elif current_cluster == "rcp-caas-test":
-runai_cli_version = "2.9.25"
 elif current_cluster == "ic-caas":
 runai_cli_version = "2.16.52"
 assert (

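After this commit, `--cluster` accepts only the two remaining clusters and defaults to `rcp-caas`. The sketch below restates that behaviour in a self-contained form. The option strings, default, choices and version numbers are copied from the diff above; `build_parser` and `RUNAI_CLI_VERSIONS` are hypothetical names, and the dictionary is an illustrative restructuring of the `if`/`elif` chain rather than code from the repository.

```python
import argparse

# Version strings copied from the csub.py diff. The dictionary itself is a
# hypothetical restructuring, not how csub.py stores these values.
RUNAI_CLI_VERSIONS = {
    "rcp-caas": "2.16.70",
    "ic-caas": "2.16.52",
}


def build_parser() -> argparse.ArgumentParser:
    """Return a parser exposing only the --cluster option shown in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-cl",
        "--cluster",
        type=str,
        default="rcp-caas",
        choices=["ic-caas", "rcp-caas"],
    )
    return parser


if __name__ == "__main__":
    # With no arguments the parser falls back to the new default cluster.
    args = build_parser().parse_args([])
    print(args.cluster, RUNAI_CLI_VERSIONS[args.cluster])
```

Because `rcp-caas-test` and `rcp-caas-prod` are no longer in `choices`, argparse rejects them with a usage error before any job is submitted.
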
Diff for: kubeconfig.yaml (+1 -10)
@@ -8,11 +8,6 @@ clusters:
88
cluster:
99
server: https://caas-prod.rcp.epfl.ch:443
1010
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURCVENDQWUyZ0F3SUJBZ0lJRHdwSElpTmQrVUV3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBMk1UUXdPVFU1TWpkYUZ3MHpOREEyTVRJeE1EQTBNamRhTUJVeApFekFSQmdOVkJBTVRDbXQxWW1WeWJtVjBaWE13Z2dFaU1BMEdDU3FHU0liM0RRRUJBUVVBQTRJQkR3QXdnZ0VLCkFvSUJBUURvOHJDRjNjeXdRRTlxTVpEOHNGTXo2K0FzSEpnWi81WVNwMGNhWHNKd0JWUERneGdwRGZKY0hnYXYKS2tOdVhTNGpBN1VrZkg1amZXQitvdytpamN3OUR4cjV6STB2TUNReWtzYk9kMVFFMis0Q0J1U0JXU01Gc1pYZQp2T01SanltN056SytxWkVldHpxR0M0bU5LdU9qbC92cGd4ZDNuM2Y2L3loRHhockp2bkVWKzZlUE5icWpDZURZCld1VWFZdUYxRmM4QnZHN0hma3FYRlRWWVdlNkpNa3JSbDQxOVo5a2diNnIvUFNZVzZqdDhhNThTSGNHSVhnTFcKOTBta3BFb1JCMENOSG0wQllEQjdjNFJxMmdyaWtZTUlldGM0eXk2L3NSdFp6NzFiTUQrM2ZDNk92NDdvOXUzWgpld0VWeEJ4dG11ZkVvVGduVEVyNXFYMlhxWFZMQWdNQkFBR2pXVEJYTUE0R0ExVWREd0VCL3dRRUF3SUNwREFQCkJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJSazdCMm84a3cxcyt0Ny9ZaGxmV1h1MnR6TkdEQVYKQmdOVkhSRUVEakFNZ2dwcmRXSmxjbTVsZEdWek1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQXFOdnQrR01lTwp6QnZZZEQ2SExCakFVeWc1czd0TDgzOVltd0RhRXBseG45ZlBRdUV6UW14cnEwUEoxcnVZNnRvRks1SEN4RFVzCmJDN3R3WlMzaVdNNXQ5NEJveHJGVC92c3QrQmtzbWdvTGM2T0N1MitYcngyMUg3UnFLTnNVR01LN2tFdGN6cHgKeXUrYTB6T0tISEUxNWFSVENPbklzQ1pXaTRhVFhIZ00zQ2U4VEhBMXRxaW9pREFHMVFUQXNhNXhTeVM3RWlUSQpDYi9xbktPRlVvM3V3bkRocWljRTU3dE1LTjliRE8rV3hNMzVxT2lBZXVXOUVnc2JlOFA5aDY2NG1tK1QzbjY0ClJNL1l1NHhmcDZwMHMvdGZyZTVjaUFvT0dGekYyRmVKek5PYm1vRkVseUtKc0RwbEorcWFTVXlaL2NtNWRIYUUKQVUxOVMrUWpFc1cvCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
11-
# Cluster RCP Test
12-
- name: caas-test.rcp.epfl.ch
13-
cluster:
14-
server: https://caas-test.rcp.epfl.ch:443
15-
certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1EUXlOakE0TVRRME5sb1hEVE16TURReU16QTRNVFEwTmxvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTFAxCmtTZ2E4NWRWU0p0VUxGQ1g5VWo1K1lTT2dCbG9MZGVxZVgrM1ByVGtQZkptWFBxeXlsVVBLN0tJUWlvSUplNm8KRTBaS2JZbU03SnEvL0lPaHF4R0VraUNrTHJCamJrYXF5M3NibkNhWGFMa1pQYkhNWjgwdmlMMGNFZHNJTWN4WgozdHpMTzFNTldwZW9mZlJ6L1NvbXpqSTVDQldJbUptTmhvZXpJQUVNOGJuaDJKeFBFNzRwWThTS1BTRk5YVzN0CjgxNmM5cXRvc1lJQjVrTnh1UjRGWVh5bGloZHZ3UmVqVW9wajA2ME1rSkl3QmpXM01YTFUrdkVyandKeFc5Q1cKZ2plUndzOG5kdW5VVHREcy9CVjhGbW5JZy81VVNhZTBzUE5FQWxvZC9TbGhrMnNuWTJvUXZlTHpFNkhrMnluRgpHNXd1VGVXRDZGY2Erd1pNMjM4Q0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZNVVhkVWVnK2xMdTlHWElMQ2VlOVJzOENmUXpNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBR051a2ZUR3E0RTlrckkreVZQbApaem1reSszaUNTMnYvTU9OU3h0S01idWZ2V0ROZFM3QzZaK1RDQTJSd0c1Y2gzZUh5UW9oTSs0K2wrSTJxMTFwCjNJVGRxYVI4RDhpQkFCbXV6Yzl2a3BKanZTTzZ4VVpnTFJZMHRDTUxXZ3g2b2tBcWhxZDV3YTZIYmN6Z1QrSUcKQlVGbERtR0R4K0MxTnFIYVFKUVN1bENqL1ZyS1RROVFlY1NoZGZqVDgvS1NVUjQ4VTlEdlA3dnU0YkRnWW5DKwpoOXEwUlFpUGR4TEtlL2Q5aGd0UnM5TjFQdGRYZXAxdHB3NCs3Y3N4TE1DSXNmYTBwaW8yb3lEems0bTNjSWRNCi9iNElHUEZaM2hYZktOVGtybnUrWmdCUms5Yjk3emNKZVdhendxTXUyd1dkV2JiQjdpaU5ZK2xtWkl1S0dUeFQKWWpRPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
1611
# Cluster IC
1712
- name: ic-caas
1813
cluster:
@@ -44,14 +39,10 @@ users:
 name: oidc
 contexts:
 # Contexts (a context a cluster associated with a user)
-- name: rcp-caas-prod
+- name: rcp-caas
 context:
 cluster: caas-prod.rcp.epfl.ch
 user: runai-rcp-authenticated-user
-- name: rcp-caas-test
-context:
-cluster: caas-test.rcp.epfl.ch
-user: runai-rcp-authenticated-user
 - name: ic-caas
 context:
 cluster: ic-caas

Diff for: template/cluster_switch.sh (+2 -8)
@@ -4,16 +4,10 @@ function ic-cluster {
 runai config cluster ic-caas
 runai config project mlo-$GASPAR_USERNAME
 }
+
 function rcp-cluster {
 alias runai=runai-rcp
 # This is actually changing the context not the cluster ...
-runai config cluster rcp-caas-test
-runai config project mlo-$GASPAR_USERNAME
-}
-
-function rcp-prod-cluster {
-alias runai=runai-rcp-prod
-# This is actually changing the context not the cluster ...
-runai config cluster rcp-caas-prod
+runai config cluster rcp-caas
 runai config project mlo-$GASPAR_USERNAME
 }
