Releases: NVIDIA/Model-Optimizer
0.43.0rc1
Install the 0.43.0rc1 pre-release version using
pip install nvidia-modelopt==0.43.0rc1 --extra-index-url https://pypi.nvidia.com
0.43.0rc0
Install the 0.43.0rc0 pre-release version using
pip install nvidia-modelopt[all]==0.43.0rc0 --extra-index-url https://pypi.nvidia.com
ModelOpt 0.42.0 Release
Bug Fixes
- Fix calibration data generation with multiple samples in the ONNX workflow.
New Features
- Added a standalone type inference option (--use_standalone_type_inference) to ONNX AutoCast as an experimental alternative to ONNX's infer_shapes. This option performs type-only inference without shape inference, which can help when shape inference fails or when you want to avoid extra shape inference overhead.
- Added quantization support for the Kimi K2 Thinking model from the original int4 checkpoint.
- Introduced support for params constraint-based automatic neural architecture search in Minitron pruning (mcore_minitron) as an alternative to manual pruning with export_config. See examples/pruning/README.md for more details.
- Example added for Minitron pruning using the Megatron-Bridge framework, including advanced pruning usage with params-constraint-based pruning and a new distillation example. See examples/megatron_bridge/README.md.
- Supported calibration data with multiple samples in .npz format in the ONNX AutoCast workflow.
- Added the --opset option to the ONNX quantization CLI to specify the target opset version for the quantized model.
- Enabled support for context parallelism in Eagle speculative decoding for both HuggingFace and Megatron Core models.
- Added unified Hugging Face export support for diffusers pipelines/components.
- Added support for LTX-2 and Wan2.2 (T2V) in the diffusers quantization workflow.
- Provided PTQ support for GLM-4.7, including loading MTP layer weights from a separate mtp.safetensors file and supporting export as-is.
- Added support for image-text data calibration in PTQ for Nemotron VL models.
- Enabled advanced weight scale search for NVFP4 quantization and its export pathway.
- Provided PTQ support for Nemotron Parse.
- Added distillation support for LTX-2. See examples/diffusers/distillation/README.md for more details.
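As an illustration of the multi-sample .npz calibration data mentioned in the list above, here is a minimal sketch of how such a file could be assembled with NumPy. The input name "input", the sample shape, and the file name are hypothetical. The sketch also assumes that samples are stacked along the first axis under a key matching the ONNX model's input name. Check that layout against the ONNX example documentation before relying on it.

```python
import os
import tempfile

import numpy as np


def build_calibration_npz(path, samples_by_input):
    """Stack per-sample arrays along axis 0 and save one array per model input."""
    stacked = {
        name: np.stack(samples, axis=0)
        for name, samples in samples_by_input.items()
    }
    np.savez(path, **stacked)
    return stacked


# Hypothetical input name and shape; replace with those of the actual ONNX model.
rng = np.random.default_rng(0)
samples = [rng.standard_normal((3, 224, 224)).astype(np.float32) for _ in range(4)]

calib_path = os.path.join(tempfile.mkdtemp(), "calib.npz")
build_calibration_npz(calib_path, {"input": samples})

# Reload to confirm the layout: one key per input, samples on the first axis.
with np.load(calib_path) as data:
    loaded_shapes = {name: data[name].shape for name in data.files}
print(loaded_shapes)
```

For a model with several inputs, the same helper takes one list of samples per input name, and every list should have the same length.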
0.42.0rc2
Install the 0.42.0rc2 pre-release version using
pip install nvidia-modelopt[all]==0.42.0rc2 --extra-index-url https://pypi.nvidia.com
0.42.0rc1
Install the 0.42.0rc1 pre-release version using
pip install nvidia-modelopt==0.42.0rc1 --extra-index-url https://pypi.nvidia.com
0.42.0rc0
Install the 0.42.0rc0 pre-release version using
pip install nvidia-modelopt==0.42.0rc0 --extra-index-url https://pypi.nvidia.com
ModelOpt 0.41.0 Release
Bug Fixes
- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
New Features
- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
- Add support for subgraphs in ONNX AutoCast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support to enable custom emulated quantization backend. See register_quant_backend for more details. See an example in tests/unit/torch/quantization/test_custom_backend.py.
- Add examples/llm_qad for QAD training with Megatron-LM.
Deprecations
- Deprecate num_query_groups parameter in Minitron pruning (mcore_minitron). You can use ModelOpt 0.40.0 or earlier instead if you need to prune it.
Backward Breaking Changes
- Remove torchprofile as a default dependency from ModelOpt as it's used only for flops-based FastNAS pruning (computer vision models). It can be installed separately if needed.
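Because torchprofile is no longer installed by default, code that depends on it may want to check for it before use. The following is a minimal user-side sketch, not ModelOpt's own handling. The helper names are hypothetical, and it assumes the package is installed from PyPI under the name torchprofile.

```python
import importlib.util


def has_optional_dependency(module_name):
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def require_torchprofile():
    """Import torchprofile lazily, raising an actionable error if it is missing."""
    if not has_optional_dependency("torchprofile"):
        raise ImportError(
            "torchprofile is needed for flops-based FastNAS pruning; "
            "install it separately, e.g. `pip install torchprofile`."
        )
    import torchprofile  # imported only when the caller actually needs it

    return torchprofile
```

Calling require_torchprofile() just before the flops-based pruning step keeps the dependency optional for every other workflow.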
0.41.0rc3
0.41.0rc2
0.41.0rc1