Use pip-compile to help with consistent Python dependency resolution (#371)
# Summary
- All Python packages, except for a few build dependencies, are now
installed using **pip-tools**.
- The JAX and upstream T5X/PAX containers are now built in a two-stage
procedure:
1. The **'meal kit'** stage: source packages are downloaded, wheels
built if necessary (for TE, tensorflow-text, lingvo, etc.), but **no**
package is installed. Instead, manifest files are created in the
`/opt/pip-tools.d` folder to instruct which packages shall be installed
by pip-tools. The stage is named after meal kits, in which the
ingredients are prepared in advance and the final cooking step is
deferred.
2. The **'final'** (cooking🔥) stage: pip-tools compiles the manifests
from the various container layers together and then sync-installs
everything to exactly match the resolved versions.
- Note that downstream containers **build on top of the meal kit
image of their base container**, so all packages and dependencies are
installed exactly once, which avoids conflicts and image bloat.
- The meal kit and final images are published as
- mealkit: `ghcr.io/nvidia/image:mealkit` and
`ghcr.io/nvidia/image:mealkit-YYYY-MM-DD`
- final: `ghcr.io/nvidia/image:latest` and
`ghcr.io/nvidia/image:nightly-YYYY-MM-DD`
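
The two-stage procedure above can be sketched roughly as follows. This is a minimal illustration, not the actual build scripts in this PR: the `*.in` manifest naming, the lockfile name, and the helper functions `collect_manifests` and `build_commands` are assumptions made for the example. Only the `pip-compile` and `pip-sync` commands and the `--output-file` option come from pip-tools itself. The sketch only constructs the commands and does not run them.

```python
from pathlib import Path
import tempfile


def collect_manifests(manifest_dir, pattern="*.in"):
    """Return the manifest files left behind by the meal kit stages.

    The "*.in" pattern is an assumption for this sketch. Sorting keeps
    the order deterministic regardless of filesystem listing order.
    """
    return sorted(Path(manifest_dir).glob(pattern))


def build_commands(manifests, lockfile):
    """Build the pip-tools invocations for the final stage.

    pip-compile resolves all manifests together into one lockfile,
    and pip-sync then installs exactly what the lockfile pins.
    """
    compile_cmd = ["pip-compile", "--output-file", str(lockfile)]
    compile_cmd += [str(m) for m in manifests]
    sync_cmd = ["pip-sync", str(lockfile)]
    return compile_cmd, sync_cmd


if __name__ == "__main__":
    # Demo with a temporary directory standing in for /opt/pip-tools.d.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "jax.in").write_text("jax\n")       # hypothetical manifest
        (root / "t5x.in").write_text("t5x\n")       # hypothetical manifest
        manifests = collect_manifests(root)
        compile_cmd, sync_cmd = build_commands(manifests, root / "requirements.txt")
        print(" ".join(compile_cmd))
        print(" ".join(sync_cmd))
```

Resolving every manifest in a single `pip-compile` call is what lets packages from different container layers agree on one set of versions before anything is installed.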
# Additional changes to the workflows
- `/opt/jax-source` is renamed to `/opt/jax`. The `-source` suffix is
only added to packages that need compilation, e.g. XLA and TE.
- The CI workflow now runs as a matrix over CPU architectures.
- The reusable `_build_*.yaml` workflows are simplified to build only
one image for a single architecture at a time. The logic for creating
multi-arch images is relocated into the `_publish_container.yaml`
workflows and invoked during the nightly runs only.
- TE is now built as a wheel and shipped in the JAX core meal kit image.
- TE unit tests will be performed using the upstream-pax image due to
the dependency on praxis.
- Build workflows now produce sitreps following the paradigm of #229.
- Removed the various one-off workflows for pinned CUDA/JAX versions.
- Refactored the PAX arm64 Dockerfile in preparation for #338
# What remains to be done
- [ ] Update the Rosetta container build + test process to use the
upstream T5X/PAX mealkit containers (`ghcr.io/nvidia/upstream-t5x:mealkit`,
`ghcr.io/nvidia/upstream-pax:mealkit`)
# Reviewing tips
This PR requires many reviewers due to its size and scope. I'd
truly appreciate it if code owners would review any changes related to
their previous contributions. An incomplete list of reviewer scopes:
- @terrykong, @ashors1, @sharathts, @maanug-nv: Rosetta, TE, T5X and PAX
MGMN tests
- @nouiz: JAX, TE and T5X build
- @joker-eph: PAX arm64 build
- @nluehr: Base image, NCCL, PAX
- @DwarKapex: base/JAX/XLA build, workflow logic
Closes #223, Closes #230, Closes #231, Closes #232, Closes #233, Closes #271, Fixes #328, Fixes #337
Co-authored-by: Terry Kong <[email protected]>
---------
Co-authored-by: Vladislav Kozlov <[email protected]>