[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script#36949
[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script#36949AndreasKaratzas wants to merge 21 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces significant optimizations to the ROCm Docker build process by leveraging docker bake, multi-stage builds, and caching mechanisms like ccache. The new ci-bake.sh script centralizes and improves the CI build logic, enhancing build times and reliability. The changes are well-structured and thoughtful. I've identified a couple of critical issues related to missing runtime dependencies in the Dockerfile and a high-severity issue regarding configuration consistency in the new bake script.
| RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \ | ||
| uv pip install --system /deep_install/*.whl |
There was a problem hiding this comment.
The rocshmem library appears to be a runtime dependency for deepep. This test stage installs the deepep wheel but no longer copies the rocshmem installation from the build stage. This could lead to runtime errors if the deepep wheel does not bundle the rocshmem shared libraries. Please restore the copy of the rocshmem directory from the build_rocshmem stage to ensure deepep can function correctly.
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
uv pip install --system /deep_install/*.whl
# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem
| RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \ | ||
| uv pip install --system /deep_install/*.whl |
There was a problem hiding this comment.
Similar to the test stage, the final stage now installs the deepep wheel but is missing the rocshmem runtime libraries which are likely a runtime dependency. This is likely to cause runtime failures. Please add a COPY instruction to include the rocshmem installation from the build_rocshmem stage.
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
uv pip install --system /deep_install/*.whl
# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem
| # Check if baked-vllm-builder already exists and is using the socket | ||
| if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then | ||
| echo "Using existing baked-vllm-builder" | ||
| docker buildx use baked-vllm-builder | ||
| else | ||
| echo "Creating baked-vllm-builder with remote driver" | ||
| docker buildx create \ | ||
| --name baked-vllm-builder \ | ||
| --driver remote \ | ||
| --use \ | ||
| "unix://${BUILDKIT_SOCKET}" | ||
| fi |
There was a problem hiding this comment.
There's an inconsistency in the buildx builder naming. The script accepts a BUILDER_NAME environment variable (defaulting to vllm-builder), but when a local buildkitd socket is detected, it hardcodes the builder name to baked-vllm-builder. This could lead to confusion and incorrect builder usage if BUILDER_NAME is customized. For consistency, please use the ${BUILDER_NAME} variable throughout the script.
| # Check if baked-vllm-builder already exists and is using the socket | |
| if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then | |
| echo "Using existing baked-vllm-builder" | |
| docker buildx use baked-vllm-builder | |
| else | |
| echo "Creating baked-vllm-builder with remote driver" | |
| docker buildx create \ | |
| --name baked-vllm-builder \ | |
| --driver remote \ | |
| --use \ | |
| "unix://${BUILDKIT_SOCKET}" | |
| fi | |
| # Check if ${BUILDER_NAME} already exists and is using the socket | |
| if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then | |
| echo "Using existing builder: ${BUILDER_NAME}" | |
| docker buildx use "${BUILDER_NAME}" | |
| else | |
| echo "Creating builder '${BUILDER_NAME}' with remote driver" | |
| docker buildx create \ | |
| --name "${BUILDER_NAME}" \ | |
| --driver remote \ | |
| --use \ | |
| "unix://${BUILDKIT_SOCKET}" | |
| fi |
| apt-transport-https ca-certificates wget curl | ||
| apt-transport-https ca-certificates wget curl \ | ||
| ccache mold \ | ||
| && update-alternatives --install /usr/bin/ld ld /usr/bin/mold 100 |
There was a problem hiding this comment.
Changing the system loader hardly falls under "Install some basic utilities"
Could you at least provide the motivation for this in the PR description?
There was a problem hiding this comment.
Correct, mb, I updated the comment there as well.
| RUN --mount=type=cache,target=/root/.cache/ccache \ | ||
| --mount=type=cache,target=/root/.cache/uv \ | ||
| cd vllm \ | ||
| && uv pip install --system -r requirements/rocm-build.txt \ |
There was a problem hiding this comment.
Why is rocm-build.txt being used in the docker build?
There was a problem hiding this comment.
That's an oversight on my part, thought it was just rocm.txt. I updated that as well.
| COPY requirements/rocm-build.txt requirements/rocm-build.txt | ||
| COPY pyproject.toml setup.py CMakeLists.txt ./ | ||
| COPY cmake cmake/ | ||
| COPY csrc csrc/ |
There was a problem hiding this comment.
Are you copying host files here? The point of REMOTE_VLLM is exactly to not do this
There was a problem hiding this comment.
I refactored based on offline conversation to bring back the old way of doing this and avoid any trouble. I also integrated another recommended point which is a per-arch build so that we then use a specific docker dependency and not an all-arch dependency. Hope it looks better now :)
|
First successful build with vllm-project/ci-infra#307: https://buildkite.com/vllm/amd-ci/builds/6536/steps/canvas |
|
This pull request has merge conflicts that must be resolved before it can be |
|
@mawong-amd Let's check if |
|
This pull request has merge conflicts that must be resolved before it can be |
…er changes with caching (vllm-project#36949) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
4b4f357 to
012f864
Compare
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…ker_build Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mawong-amd
left a comment
There was a problem hiding this comment.
Two comments to start with:
-
Does it make sense to also add the CI base content SHA to the image tag, rather than using
rocm/vllm-dev:ci_basegenerically? The situation I have in mind is a PR where we're updating say rocSHMEM.
Each commit pushed to this PR branch would trigger a fresh CI run and the "ensure CI base" stage would rebuild and pushrocm/vllm-dev:ci_baseto a version consistent with this PR branch, but inconsistent withmain. And vice-versa. So we might see ping-pong invalidation ofrocm/vllm-dev:ci_basebetweenmainand the PR branch.
This may not be a major issue if caching somehow means that subsequent rebuilds of CI base images are fast, but I have not yet reasoned as to whether this should be the case. Have we tested it? -
Do we need all the Docker builder related additions?
| setup_builder() { | ||
| echo "--- :buildkite: Setting up buildx builder" | ||
|
|
||
| if [[ -S "${BUILDKIT_SOCKET}" ]]; then | ||
| echo "Found local buildkitd socket at ${BUILDKIT_SOCKET}" | ||
| echo "Using remote driver to connect to buildkitd" | ||
|
|
||
| if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then | ||
| use_existing_builder | ||
| else | ||
| create_and_bootstrap_builder remote "unix://${BUILDKIT_SOCKET}" | ||
| fi | ||
| elif docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then | ||
| use_existing_builder | ||
| else | ||
| echo "No local buildkitd found, using docker-container driver" | ||
| create_and_bootstrap_builder docker-container | ||
| fi | ||
|
|
||
| echo "Active builder:" | ||
| docker buildx ls | grep -E '^\*|^NAME' || docker buildx ls | ||
| } |
There was a problem hiding this comment.
What do we need the Docker builder changes here and elsewhere for?
…ker_build Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
This pull request has merge conflicts that must be resolved before it can be |
…ker_build Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
This pull request has merge conflicts that must be resolved before it can be |
…ker_build Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…argets. Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
Latest CI run with vllm-project/ci-infra#307 integrated: https://buildkite.com/vllm/ci/builds/69415/canvas |
Summary
Implements the three-tier Docker build for ROCm CI. Every PR currently rebuilds RIXL, DeepEP, rocshmem, torchcodec, and RDMA libraries from scratch, costing a total of 26 minutes on average per build. This PR introduces a pre-built Tier-1
ci_baseimage that absorbs those stable layers. Per-PR builds then only rebuild the thin vLLM wheel + workspace layer.Image registry layout after this PR:
rocm/vllm-dev:baseDockerfile.rocm_baserocm/vllm-dev:ci_baseci_basestage (this PR)rocm/vllm-ci:$COMMITteststage (this PR)This PR is connected to: vllm-project/ci-infra#307
These two PRs should likely be merged simultaneously.
cc @kenroche @okakarpa @tjtanaa @gshtras @khluu