Commit 355d7f9

Yadan-Wei and Yadan Wei authored
[PyTorch][Training][EC2][SageMaker]PyTorch 2.6.0 Currency Release (#4556)
* change test-ecr-scan image data storage
* add init files
* revert ecr change
* fix torch link
* add --no-build-isolation for TE
* skip OSS compliance to build image for OSS test
* add 2.6 ec2 test
* add 2.6 ec2 tests
* build with OSS, enable telemetry test, add venv in ec2 test
* add allowlist
* remove conda from ec2 tests
* change setup file conda
* skip dgl test
* enable all ec2 tests
* modify te tests
* build SM image and modify tests
* fix some tests
* not build
* add fastai comment
* rebuild sm without fastai and change telemetry tests
* fix sm local tests
* fix skip dict and run all tests
* rebuilt sm with cuda compat and run all tests
* build ec2 image with cuda patch
* rebuild sm image and run all tests
* build 2.5 to verify telemetry tests
* add smppy and build
* run all tests
* revert toml
* address comments and build ec2
* build sm image
* run all tests
* fix te tests
* ec2 images run all tests
* backward compatible test with 2.5
* disable autopatch
* do build
* build autopatch 2.5
* fix autopatch
* revert toml
* add reminder about telemetry

---------

Co-authored-by: Yadan Wei <[email protected]>
1 parent a184c91 commit 355d7f9

17 files changed (+1281, -32 lines)
+72
@@ -0,0 +1,72 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK pytorch
+version: &VERSION 2.6.0
+short_version: &SHORT_VERSION "2.6"
+arch_type: x86
+# autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd_entrypoint:
+      source: docker/build_artifacts/dockerd_entrypoint.sh
+      target: dockerd_entrypoint.sh
+    changehostname:
+      source: docker/build_artifacts/changehostname.c
+      target: changehostname.c
+    start_with_right_hostname:
+      source: docker/build_artifacts/start_with_right_hostname.sh
+      target: start_with_right_hostname.sh
+    example_mnist_file:
+      source: docker/build_artifacts/mnist.py
+      target: mnist.py
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+
+images:
+  BuildEC2CPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_CPU_TRAINING_PY3 false
+    image_size_baseline: 6500
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-ec2" ]
+    # build_tag_override: "beta:2.6.0-cpu-py311-ubuntu22.04-ec2"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildEC2GPUPTTrainPy3cu126DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 19700
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu126
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    # build_tag_override: "beta:2.6.0-gpu-py311-cu121-ubuntu22.04-ec2"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+                         *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
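
The `!join` tag used throughout this buildspec is a custom YAML tag; its loader is not part of this diff. Assuming it simply concatenates its list items with no separator (which the field names suggest), the GPU entry's `tag` and `docker_file` would resolve roughly as in the sketch below. The `join` helper is a hypothetical stand-in, not the repository's implementation.

```python
# Hypothetical stand-in for the buildspec's custom `!join` tag,
# assumed here to concatenate its items with no separator.
def join(*parts):
    return "".join(str(p) for p in parts)

# Anchor values copied from the GPU image entry of the EC2 buildspec.
VERSION = "2.6.0"
SHORT_VERSION = "2.6"
DEVICE_TYPE = "gpu"            # the anchor is redefined per image; GPU value used here
DOCKER_PYTHON_VERSION = "py3"
TAG_PYTHON_VERSION = "py312"
CUDA_VERSION = "cu126"
OS_VERSION = "ubuntu22.04"

tag = join(VERSION, "-", DEVICE_TYPE, "-", TAG_PYTHON_VERSION, "-",
           CUDA_VERSION, "-", OS_VERSION, "-ec2")
docker_file = join("docker/", SHORT_VERSION, "/", DOCKER_PYTHON_VERSION, "/",
                   CUDA_VERSION, "/Dockerfile.", DEVICE_TYPE)

print(tag)
print(docker_file)
```

Note that `&DEVICE_TYPE` is defined once in the CPU entry and again in the GPU entry. In YAML an alias refers to the most recent preceding anchor of that name, so each image's `*DEVICE_TYPE` picks up its own value.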

pytorch/training/buildspec-2-6-sm.yml

+72
@@ -0,0 +1,72 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK pytorch
+version: &VERSION 2.6.0
+short_version: &SHORT_VERSION "2.6"
+arch_type: x86
+# autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd_entrypoint:
+      source: docker/build_artifacts/dockerd_entrypoint.sh
+      target: dockerd_entrypoint.sh
+    changehostname:
+      source: docker/build_artifacts/changehostname.c
+      target: changehostname.c
+    start_with_right_hostname:
+      source: docker/build_artifacts/start_with_right_hostname.sh
+      target: start_with_right_hostname.sh
+    example_mnist_file:
+      source: docker/build_artifacts/mnist.py
+      target: mnist.py
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+
+images:
+  BuildSageMakerCPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_CPU_TRAINING_PY3 false
+    image_size_baseline: 6200
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    # build_tag_override: "beta:2.6.0-cpu-py311-ubuntu22.04-sagemaker"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    target: sagemaker
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildSageMakerGPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 21500
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu126
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    # build_tag_override: "beta:2.6.0-gpu-py311-cu124-ubuntu22.04-sagemaker"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+                         *DEVICE_TYPE ]
+    target: sagemaker
+    context:
+      <<: *TRAINING_CONTEXT
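
The `repository` and `release_repository` fields in this buildspec assemble ECR registry URIs from an account ID, a region, and a repository name. The sketch below shows what those joins would produce. The account ID `123456789012` and region `us-west-2` are placeholders standing in for the `<set-$ACCOUNT_ID-in-environment>` and `<set-$REGION-in-environment>` values, `ecr_repository` is a hypothetical helper, and pairing the release repository with the tag as `repository:tag` is an assumption based on the usual Docker image reference form rather than something shown in the diff.

```python
# Hypothetical helper mirroring the `repository` / `release_repository` joins.
def ecr_repository(account_id, region, repository_name):
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}"

# Values taken from the buildspec.
FRAMEWORK = "pytorch"
IMAGE_TYPE = "training"
PROD_ACCOUNT_ID = "763104351884"

# Placeholders (assumptions): the real values come from the environment.
account_id = "123456789012"
region = "us-west-2"

# PR builds go to a "pr-" prefixed repository in the builder's own account.
pr_repository = ecr_repository(account_id, region, f"pr-{FRAMEWORK}-{IMAGE_TYPE}")
# Released images live in the production account without the prefix.
release_repository = ecr_repository(PROD_ACCOUNT_ID, region, f"{FRAMEWORK}-{IMAGE_TYPE}")

# Tag as the SageMaker GPU entry's `tag` join would resolve it.
sm_gpu_tag = "2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker"
image_uri = f"{release_repository}:{sm_gpu_tag}"

print(pr_repository)
print(image_uri)
```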

pytorch/training/buildspec.yml

+1-1
@@ -1 +1 @@
-buildspec_pointer: buildspec-2-5-sm.yml
+buildspec_pointer: buildspec-2-6-sm.yml
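
This one-line change repoints the default `buildspec.yml` at the new 2.6 SageMaker buildspec. The code that follows the pointer is not part of this diff; the sketch below only illustrates one plausible way such an indirection could be resolved, assuming the pointer is a file name relative to the directory of `buildspec.yml`. `resolve_buildspec` is a hypothetical function, and the line-based parsing is a simplification that avoids a YAML dependency.

```python
import os
import tempfile

def resolve_buildspec(path):
    """Return the file named by `buildspec_pointer` in `path`, or `path` itself
    if no pointer is present. Hypothetical helper; follows one level only."""
    with open(path) as f:
        for line in f:
            key, sep, value = line.partition(":")
            if sep and key.strip() == "buildspec_pointer":
                # Assumption: the pointer is relative to the pointing file.
                return os.path.join(os.path.dirname(path), value.strip())
    return path

# Usage example against a temporary copy of the new pointer file.
with tempfile.TemporaryDirectory() as d:
    pointer = os.path.join(d, "buildspec.yml")
    with open(pointer, "w") as f:
        f.write("buildspec_pointer: buildspec-2-6-sm.yml\n")
    resolved = resolve_buildspec(pointer)
    resolved_name = os.path.basename(resolved)
    print(resolved_name)
```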