Releases: NVIDIA/cloudai
v1.2.beta2
What's Changed
- Introduce 'cmd' field for SlurmContainer jobs by @amaslenn in #362
- Initial dse parameters for llama_8b by @srivatsankrishnan in #359
- Small housekeeping updates by @amaslenn in #363
- Base Config for NemoRun LLama3-8b by @srivatsankrishnan in #366
- Rework reporting logic by @amaslenn in #360
- Llama and Nemotron Configs by @srivatsankrishnan in #365
- Enable and configure Nsys tracing via test config by @amaslenn in #364
Full Changelog: v1.2.beta1...v1.2.beta2
v1.2.beta1
Highlights
Changes in mounts for Slurm runs
Documentation is available in the User Guide.
Default mount
The test output directory <output_path>/<scenario_name_with_timestamp>/<test_name>/<iteration> (for example, results/scenario_2024-06-18_17-40-13/Tests.1/0) is mounted as /cloudai_run_results.
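As a rough illustration of the layout described above, the following sketch assembles the host-side directory from its four components and pairs it with the container path. The helper name `default_results_mount` and the `host:container` string form are assumptions made for this example, not CloudAI's actual API.

```python
from pathlib import PurePosixPath

# Container-side path named in the release notes.
CONTAINER_RESULTS_DIR = "/cloudai_run_results"


def default_results_mount(output_path: str, scenario_dir: str, test_name: str, iteration: int) -> str:
    """Build a 'host:container' mount string for one test iteration.

    Hypothetical helper: it only mirrors the documented directory layout
    <output_path>/<scenario_name_with_timestamp>/<test_name>/<iteration>.
    """
    host_dir = PurePosixPath(output_path) / scenario_dir / test_name / str(iteration)
    return f"{host_dir}:{CONTAINER_RESULTS_DIR}"


# Values taken from the example path in the release notes.
mount = default_results_mount("results", "scenario_2024-06-18_17-40-13", "Tests.1", 0)
print(mount)
```

A job running inside the container can then write to /cloudai_run_results without knowing the host-side path.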
Custom mounts
Users can now specify custom mounts via Test configuration:
extra_container_mounts = [
"/path/to/mount1:/path/in/container1",
"/path/to/mount2:/path/in/container2"
]
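Each entry follows a `host_path:container_path` pattern. A minimal sketch of how such entries could be checked and combined is shown below. The function `split_mounts` is hypothetical, and joining the entries into a single comma-separated Pyxis-style `--container-mounts` value is an assumption about how they might be passed to Slurm, not a description of CloudAI's implementation.

```python
from typing import List, Tuple


def split_mounts(mounts: List[str]) -> List[Tuple[str, str]]:
    """Split 'host:container' entries into pairs, rejecting malformed ones.

    Hypothetical helper. Anything after the first ':' is treated as the
    container side, so optional trailing mount flags stay attached to it.
    """
    pairs = []
    for entry in mounts:
        src, sep, dst = entry.partition(":")
        if not sep or not src or not dst:
            raise ValueError(f"expected 'host:container', got {entry!r}")
        pairs.append((src, dst))
    return pairs


extra_container_mounts = [
    "/path/to/mount1:/path/in/container1",
    "/path/to/mount2:/path/in/container2",
]

mount_pairs = split_mounts(extra_container_mounts)
# Assumption: the entries end up in one comma-separated option value.
flag = "--container-mounts=" + ",".join(f"{src}:{dst}" for src, dst in mount_pairs)
print(flag)
```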
Git repo mounts
Any number of Git repositories can be cloned as part of cloudai install and then mounted into containers.
[[git_repos]]
url = "https://github.com/NVIDIA/cloudai"
commit = "sha1"
mount_as = "/work"
[[git_repos]]
url = "https://github.com/NVIDIA/cloudai-new"
commit = "sha1"
mount_as = "/opt/new"
Configuration is done via the Test TOML file.
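To make the relationship between the three fields concrete, here is a small sketch that models one `[[git_repos]]` entry and derives a mount string from it. The `GitRepo` dataclass, the `/install` directory, and the `<repo>__<commit>` checkout directory naming are all assumptions for illustration; the release notes do not state where clones are placed.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class GitRepo:
    """One [[git_repos]] entry: url, commit, and mount_as come from the Test TOML."""

    url: str
    commit: str
    mount_as: str

    def checkout_dir(self, install_dir: str) -> PurePosixPath:
        # Assumed layout: <install_dir>/<repo name>__<commit>.
        name = self.url.rstrip("/").rsplit("/", 1)[-1]
        if name.endswith(".git"):
            name = name[: -len(".git")]
        return PurePosixPath(install_dir) / f"{name}__{self.commit}"

    def mount(self, install_dir: str) -> str:
        # Pair the host-side checkout with the requested container path.
        return f"{self.checkout_dir(install_dir)}:{self.mount_as}"


repos = [
    GitRepo("https://github.com/NVIDIA/cloudai", "sha1", "/work"),
    GitRepo("https://github.com/NVIDIA/cloudai-new", "sha1", "/opt/new"),
]

# "/install" is a placeholder install directory.
repo_mounts = [repo.mount("/install") for repo in repos]
print(repo_mounts)
```

Keying the checkout directory on the commit would let two tests pin different revisions of the same repository without colliding, which is one reason such a layout might be chosen.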
Sbatch custom arguments
Users can now specify custom sbatch arguments via System configuration:
extra_sbatch_args = [
"--section=4",
"--other-arg val"
]
The snippet above adds the following sbatch directives alongside the ones CloudAI already generates:
#SBATCH --section=4
#SBATCH --other-arg val
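The mapping from configuration entries to directives shown above is one line per argument. A minimal sketch of that transformation, using a hypothetical `render_sbatch_directives` function rather than CloudAI's own code, could look like this:

```python
from typing import List


def render_sbatch_directives(extra_sbatch_args: List[str]) -> List[str]:
    """Prefix each configured argument with '#SBATCH ' (hypothetical helper)."""
    return [f"#SBATCH {arg}" for arg in extra_sbatch_args]


# Same values as the System configuration snippet above.
directives = render_sbatch_directives(["--section=4", "--other-arg val"])
print("\n".join(directives))
```

Each argument is passed through verbatim, so both `--flag=value` and `--flag value` forms are preserved as written.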
What's Changed
- Move conf/staging/nemo/ to CloudAI by @TaekyungHeo in #349
- Implement report generation for NeMoRun, summarizing train step timing by @TaekyungHeo in #354
- Always mount current output dir as /cloudai_run_results by @amaslenn in #355
- Support custom mounts for slurm container jobs by @amaslenn in #356
- Manage fields' serialization of SlurmSystem by @amaslenn in #341
- NemoRun DSE PoC by @srivatsankrishnan in #353
- Add support for extra sbatch args via system model by @amaslenn in #357
- Introduce configurable mounts for git repos by @amaslenn in #358
Full Changelog: v1.1.0...v1.2.beta1
v1.1.0
CloudAI v1.1 (GA) release notes
Compatibility
CloudAI v1.1 has been tested with: PyTorch/JAX NGC Container 24.05, NCCL 2.19/2.21, and SPC-X 1.1.
Key Features and Enhancements:
- First GA release with verification and QA testing
- Verifiable test schemas using Pydantic
- Use subcommands for command line options for better user experience
What’s next
- Support for GB200 and GB300 systems
- General availability - CloudAI Configurator and Gym
- Support a wide range of NeMo 2.0 models
- Deprecate PAXML JAXToolbox and replace it with MaxText JAXToolbox
v1.1.rc1
What's Changed
- Docs Review - Suggested Changes by @RulaHallak in #344
- Step integration with Runner by @srivatsankrishnan in #343
- Integrate report generation into Gym by @srivatsankrishnan in #348
New Contributors
- @RulaHallak made their first contribution in #344
Full Changelog: v1.1.beta21...v1.1.rc1
v1.1.beta21
What's Changed
- Reflect Andrei's comments for PR345 by @TaekyungHeo in #347
- fix the docker url for dse [for QA] by @srivatsankrishnan in #346
Full Changelog: v1.1.beta20...v1.1.beta21
v1.1.beta20
What's Changed
- Safety valve to separate the DSE execution from benchmarking execution by @srivatsankrishnan in #334
- Results directory for DSE job by @srivatsankrishnan in #333
- Fix time limit calculation to include pre and post test hook durations by @TaekyungHeo in #345
Full Changelog: v1.1.beta19...v1.1.beta20
v1.1.beta19
What's Changed
Full Changelog: v1.1.beta18...v1.1.beta19
v1.1.beta18
v1.1.beta17
What's Changed
- Bug fix in job completion checks by @TaekyungHeo in #340
Full Changelog: v1.1.beta16...v1.1.beta17
v1.1.beta16
What's Changed
- Replace NCCL_TEST_SPLIT_MASK with NCCL_TESTS_SPLIT_MASK by @TaekyungHeo in #339
- Configurable CloudAIGym from TestRun by @srivatsankrishnan in #327
- Check RW permissions for install and results folders by @amaslenn in #338
- Configurable agents interface by @srivatsankrishnan in #329
Full Changelog: v1.1.beta15...v1.1.beta16