[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script #36949

Open: AndreasKaratzas wants to merge 21 commits into vllm-project:main from ROCm:akaratza_optimize_docker_build

AndreasKaratzas (Member) commented Mar 13, 2026

Summary

Implements the three-tier Docker build for ROCm CI. Every PR currently rebuilds RIXL, DeepEP, rocshmem, torchcodec, and the RDMA libraries from scratch, which costs about 26 minutes per build on average. This PR introduces a pre-built Tier-1 ci_base image that absorbs those stable layers, so per-PR builds rebuild only the thin vLLM wheel + workspace layer.

Image registry layout after this PR:

| Tag | Built by | Frequency |
| --- | --- | --- |
| rocm/vllm-dev:base | Dockerfile.rocm_base | Monthly |
| rocm/vllm-dev:ci_base | ci_base stage (this PR) | Weekly |
| rocm/vllm-ci:$COMMIT | test stage (this PR) | Per PR |

This PR is connected to: vllm-project/ci-infra#307
These two PRs should likely be merged simultaneously.

cc @kenroche @okakarpa @tjtanaa @gshtras @khluu

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces significant optimizations to the ROCm Docker build process by leveraging docker bake, multi-stage builds, and caching mechanisms like ccache. The new ci-bake.sh script centralizes and improves the CI build logic, enhancing build times and reliability. The changes are well-structured and thoughtful. I've identified a couple of critical issues related to missing runtime dependencies in the Dockerfile and a high-severity issue regarding configuration consistency in the new bake script.

Comment thread docker/Dockerfile.rocm Outdated
Comment on lines 406 to 407
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Contributor

critical

The rocshmem library appears to be a runtime dependency for deepep. This test stage installs the deepep wheel but no longer copies the rocshmem installation from the build stage. This could lead to runtime errors if the deepep wheel does not bundle the rocshmem shared libraries. Please restore the copy of the rocshmem directory from the build_rocshmem stage to ensure deepep can function correctly.

RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem

Member Author

Done :)

Comment thread docker/Dockerfile.rocm Outdated
Comment on lines +491 to +492
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Contributor

critical

Similar to the test stage, the final stage now installs the deepep wheel but is missing the rocshmem runtime libraries which are likely a runtime dependency. This is likely to cause runtime failures. Please add a COPY instruction to include the rocshmem installation from the build_rocshmem stage.

RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem

Member Author

Done :)

Comment thread .buildkite/scripts/ci-bake.sh Outdated
Comment on lines +85 to +96
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
    echo "Using existing baked-vllm-builder"
    docker buildx use baked-vllm-builder
else
    echo "Creating baked-vllm-builder with remote driver"
    docker buildx create \
        --name baked-vllm-builder \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
Contributor

high

There's an inconsistency in the buildx builder naming. The script accepts a BUILDER_NAME environment variable (defaulting to vllm-builder), but when a local buildkitd socket is detected, it hardcodes the builder name to baked-vllm-builder. This could lead to confusion and incorrect builder usage if BUILDER_NAME is customized. For consistency, please use the ${BUILDER_NAME} variable throughout the script.

Suggested change
- # Check if baked-vllm-builder already exists and is using the socket
- if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
-     echo "Using existing baked-vllm-builder"
-     docker buildx use baked-vllm-builder
- else
-     echo "Creating baked-vllm-builder with remote driver"
-     docker buildx create \
-         --name baked-vllm-builder \
-         --driver remote \
-         --use \
-         "unix://${BUILDKIT_SOCKET}"
- fi
+ # Check if ${BUILDER_NAME} already exists and is using the socket
+ if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
+     echo "Using existing builder: ${BUILDER_NAME}"
+     docker buildx use "${BUILDER_NAME}"
+ else
+     echo "Creating builder '${BUILDER_NAME}' with remote driver"
+     docker buildx create \
+         --name "${BUILDER_NAME}" \
+         --driver remote \
+         --use \
+         "unix://${BUILDKIT_SOCKET}"
+ fi

Member Author

Done :)

Comment thread docker/Dockerfile.rocm Outdated
-    apt-transport-https ca-certificates wget curl
+    apt-transport-https ca-certificates wget curl \
+    ccache mold \
+    && update-alternatives --install /usr/bin/ld ld /usr/bin/mold 100
Collaborator

Changing the system linker hardly falls under "Install some basic utilities".
Could you at least provide the motivation for this in the PR description?

Member Author

Correct, mb, I updated the comment there as well.
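
As a point of comparison for the linker discussion above, one way to keep mold scoped to a single build, instead of repointing /usr/bin/ld for the whole image, is to pass the linker choice through compiler flags. This is only a sketch and is not taken from the PR: the helper name `fuse_ld_flag` is hypothetical, and it assumes a compiler driver that accepts `-fuse-ld=mold` (recent clang, or GCC 12.1 and later).

```shell
#!/usr/bin/env bash
# Illustrative only: fuse_ld_flag is a hypothetical helper, not part of this PR.

# Print "-fuse-ld=<linker>" if <linker> is on PATH, otherwise print nothing,
# so callers silently fall back to the default linker.
fuse_ld_flag() {
    local linker="$1"
    if command -v "${linker}" >/dev/null 2>&1; then
        printf -- '-fuse-ld=%s\n' "${linker}"
    fi
}

# Apply the flag to builds started from this shell only;
# /usr/bin/ld and the alternatives database are left untouched.
LDFLAGS="${LDFLAGS:-} $(fuse_ld_flag mold)"
export LDFLAGS
```

Whether a per-build flag or a system-wide alternative fits better depends on how many build steps in the image need the faster linker.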

Comment thread docker/Dockerfile.rocm Outdated
RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    cd vllm \
    && uv pip install --system -r requirements/rocm-build.txt \
Collaborator

Why is rocm-build.txt being used in the docker build?

Member Author

That's an oversight on my part; I thought it was just rocm.txt. I updated that as well.

Comment thread docker/Dockerfile.rocm
COPY requirements/rocm-build.txt requirements/rocm-build.txt
COPY pyproject.toml setup.py CMakeLists.txt ./
COPY cmake cmake/
COPY csrc csrc/
Collaborator

Are you copying host files here? The point of REMOTE_VLLM is exactly to not do this

Member Author

Based on an offline conversation, I refactored this to bring back the old way of doing it and avoid any trouble. I also integrated another recommendation: a per-arch build, so that we use an arch-specific Docker dependency rather than an all-arch one. Hope it looks better now :)

AndreasKaratzas marked this pull request as draft March 13, 2026 18:27
AndreasKaratzas marked this pull request as ready for review March 18, 2026 05:38
mergify Bot (Contributor) commented Mar 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label Mar 19, 2026

AndreasKaratzas (Member Author)

@mawong-amd Let's check if Kernels Core Operation Test passes as well. We may need to bring back the compilation of triton_kernels. Not sure yet.

mergify Bot (Contributor) commented Mar 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label Mar 30, 2026
AndreasKaratzas added a commit to ROCm/vllm that referenced this pull request Mar 31, 2026
…er changes with caching (vllm-project#36949)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify Bot removed the needs-rebase label Mar 31, 2026
AndreasKaratzas force-pushed the akaratza_optimize_docker_build branch from 4b4f357 to 012f864 April 3, 2026 08:36
mergify Bot added the nvidia label Apr 3, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…ker_build

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mawong-amd (Contributor) left a comment

Two comments to start with:

  1. Does it make sense to also add the CI base content SHA to the image tag, rather than using rocm/vllm-dev:ci_base generically? The situation I have in mind is a PR that updates, say, rocSHMEM.
    Each commit pushed to that PR branch would trigger a fresh CI run, and the "ensure CI base" stage would rebuild and push rocm/vllm-dev:ci_base at a version consistent with the PR branch but inconsistent with main, and vice versa. So we might see ping-pong invalidation of rocm/vllm-dev:ci_base between main and the PR branch.
    This may not be a major issue if caching makes subsequent rebuilds of the CI base image fast, but I have not yet worked out whether that should be the case. Have we tested it?

  2. Do we need all the Docker builder related additions?
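
To make point 1 concrete, a content-addressed tag could be derived roughly as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: the function names, the `ci_base-<sha>` tag format, and the choice of input files are all hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: function names, tag format, and input list are assumptions.

# Print a 12-character SHA-256 prefix over the concatenated contents of the
# given files. The result depends only on file contents and argument order.
ci_base_content_sha() {
    cat -- "$@" | sha256sum | cut -c1-12
}

# Print "<repo>:ci_base-<sha>" for the given repository and input files.
ci_base_tag() {
    local repo="$1"
    shift
    printf '%s:ci_base-%s\n' "${repo}" "$(ci_base_content_sha "$@")"
}
```

A call such as `ci_base_tag rocm/vllm-dev docker/Dockerfile.rocm` would then give main and a rocSHMEM-updating PR branch different tags whenever the hashed inputs differ, so neither overwrites the other. Hashing a whole Dockerfile also changes the tag on edits unrelated to the base layers, so the input list would need to be chosen with care.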

Comment on lines +539 to +560
setup_builder() {
    echo "--- :buildkite: Setting up buildx builder"

    if [[ -S "${BUILDKIT_SOCKET}" ]]; then
        echo "Found local buildkitd socket at ${BUILDKIT_SOCKET}"
        echo "Using remote driver to connect to buildkitd"

        if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
            use_existing_builder
        else
            create_and_bootstrap_builder remote "unix://${BUILDKIT_SOCKET}"
        fi
    elif docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
        use_existing_builder
    else
        echo "No local buildkitd found, using docker-container driver"
        create_and_bootstrap_builder docker-container
    fi

    echo "Active builder:"
    docker buildx ls | grep -E '^\*|^NAME' || docker buildx ls
}
Contributor

What do we need the Docker builder changes here and elsewhere for?
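
As a reading aid for the quoted function, the driver choice it makes can be isolated into a small helper. This is a sketch only: `select_buildx_driver` is a hypothetical name that does not appear in the PR, and it covers just the socket check, not the reuse of an existing builder.

```shell
#!/usr/bin/env bash
# Illustrative only: select_buildx_driver is a hypothetical helper.

# Print "remote" when a Unix socket exists at the given path (an already
# running buildkitd), otherwise "docker-container" (buildx starts its own
# BuildKit container).
select_buildx_driver() {
    local socket="$1"
    if [ -S "${socket}" ]; then
        echo "remote"
    else
        echo "docker-container"
    fi
}
```

The `remote` driver attaches to an existing buildkitd, while `docker-container` lets buildx manage a BuildKit container itself.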

…ker_build

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify Bot (Contributor) commented May 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label May 22, 2026
…ker_build

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify Bot removed the needs-rebase label May 22, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Comment thread .buildkite/scripts/ci-bake-rocm.sh Outdated
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Comment thread docker/Dockerfile.rocm
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify Bot (Contributor) commented May 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label May 28, 2026
…ker_build

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
AndreasKaratzas requested a review from khluu as a code owner June 1, 2026 18:04
mergify Bot removed the needs-rebase label Jun 1, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…argets.

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
AndreasKaratzas (Member Author) commented Jun 2, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jun 2, 2026
Labels: ci/build, nvidia, rocm (Related to AMD ROCm)

Projects

Status: Todo
Status: Ready

4 participants