Knowledge Distillation is a machine learning technique where a compact "student" model learns to replicate the behavior of a larger, more complex "teacher" model to achieve comparable performance with improved efficiency.
Model Optimizer's Distillation is a set of wrappers and utilities to easily perform Knowledge Distillation among teacher and student models. Given a pretrained teacher model, Distillation has the potential to train a smaller student model faster and/or with higher accuracy than the student model could achieve on its own.
This section focuses on demonstrating how to apply Model Optimizer to perform knowledge distillation with ease.
| Section | Description | Link | Docs |
|---|---|---|---|
| Pre-Requisites | Required & optional packages to use this technique | [Link] | |
| Getting Started | Learn how to optimize your models using distillation to produce more intelligent, smaller models | [Link] | [docs] |
| Support Matrix | View the support matrix to see compatibility and feature availability across different models | [Link] | |
| Distillation with Megatron-LM | Learn how to distill your models with Megatron-LM Framework | [Link] | |
| Distillation with NeMo | Learn how to distill your models with NeMo Framework | [Link] | [docs] |
| Distillation with Huggingface | Learn how to distill your models with Hugging Face | [Link] | [docs] |
| Resources | Extra links to relevant resources | [Link] | |
| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | [Link] | |
For Hugging Face models, please use the PyTorch docker image (e.g., nvcr.io/nvidia/pytorch:25.06-py3).
For NeMo models, use the NeMo container (e.g., nvcr.io/nvidia/nemo:25.09) which has all the dependencies installed.
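For example, a container can be started as follows (a typical invocation, assuming Docker with the NVIDIA Container Toolkit installed; the workspace mount is an assumption for your local working directory):

```bash
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace -w /workspace \
    nvcr.io/nvidia/pytorch:25.06-py3
```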
Visit our installation docs for more information.
Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.
For Hugging Face models, install Model Optimizer with its `hf` dependencies from PyPI and install the requirements for the example:
```bash
pip install -U nvidia-modelopt[hf]
pip install -r requirements.txt
```

First obtain both a pretrained model to act as the teacher and a (usually smaller) model to serve as the student.
```python
from transformers import AutoModelForCausalLM

# Define student & teacher
student_model = AutoModelForCausalLM.from_pretrained("student-model-id-or-path")
teacher_model = AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")
```

As Knowledge Distillation involves (at least) two models, ModelOpt simplifies the integration process by wrapping both student and teacher into one meta model.
Please see an example Distillation setup below. This example assumes the outputs of `teacher_model` and `student_model` are logits.
```python
import modelopt.torch.distill as mtd

distillation_config = {
    "teacher_model": teacher_model,
    "criterion": mtd.LogitsDistillationLoss(),  # callable receiving student and teacher outputs, in order
    "loss_balancer": mtd.StaticLossBalancer(),  # combines multiple losses; omit if only one distillation loss is used
}

distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
```

The `teacher_model` can be either an `nn.Module`, a callable which returns an `nn.Module`, or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).
See Distillation for more info.
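For instance, the teacher can be provided lazily rather than as an already-constructed module (a minimal sketch of the alternative specifications above; `create_teacher` is a hypothetical factory function):

```python
from transformers import AutoModelForCausalLM

import modelopt.torch.distill as mtd


# Hypothetical factory: invoked to construct the teacher on demand.
def create_teacher():
    return AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")


distillation_config = {
    "teacher_model": create_teacher,  # a callable returning an nn.Module
    # A (model_cls, args, kwargs) tuple is also accepted, e.g.:
    # "teacher_model": (MyTeacherClass, ("arg",), {"kwarg": 1}),  # hypothetical class
    "criterion": mtd.LogitsDistillationLoss(),
}
distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
```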
To distill from teacher to student, simply use the meta model in the usual training loop, while also using the meta model's `.compute_kd_loss()` method to compute the distillation loss, in addition to the original user loss.
An example of Distillation training is given below:
```python
# Set up the data loader. As an example:
train_loader = get_train_loader()

# Define the user loss function. As an example:
loss_fn = get_user_loss_fn()

for input, labels in train_loader:
    distillation_model.zero_grad()
    # Forward through the wrapped models
    out = distillation_model(input)
    # Same loss as originally present
    loss = loss_fn(out, labels)
    # Combine distillation and user losses
    loss_total = distillation_model.compute_kd_loss(student_loss=loss)
    loss_total.backward()
```

Note
`DataParallel` may break ModelOpt's Distillation feature. Note that the Hugging Face `Trainer` uses `DataParallel` by default.
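A common workaround is to launch in distributed mode so that `DistributedDataParallel` is used instead, or to restrict training to a single GPU (a sketch; `train.py` stands in for your training script):

```bash
# Distributed launch: the Hugging Face Trainer then uses DistributedDataParallel
torchrun --nproc_per_node=8 train.py

# Or run on a single GPU so DataParallel is never engaged
CUDA_VISIBLE_DEVICES=0 python train.py
```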
The model can easily be reverted to its original class for further use (i.e., deployment) without any ModelOpt modifications attached.

```python
model = mtd.export(distillation_model)
```
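Since the exported model is the original student class again, the usual workflow applies; for example, with a Hugging Face student (the output path below is illustrative):

```python
# `model` is the plain student again (here a Hugging Face PreTrainedModel),
# so it can be saved and deployed as usual.
model.save_pretrained("./distilled-student")  # illustrative path
```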
Loss criteria:

- `mtd.LogitsDistillationLoss()` - Standard KL-Divergence on output logits
- `mtd.MGDLoss()` - Masked Generative Distillation loss for 2D convolutional outputs
- `mtd.MFTLoss()` - KL-divergence loss with Minifinetuning threshold modification
Loss balancers:
- `mtd.StaticLossBalancer()` - Combines the original student loss and KD loss into a single weighted sum (with weights that do not change over time)
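For example, a fixed weighting between the student and distillation losses might be configured as follows (a sketch; the `temperature` and `kd_loss_weight` arguments are assumptions based on the API reference and should be verified against your installed version):

```python
import modelopt.torch.distill as mtd

distillation_config = {
    "teacher_model": teacher_model,
    # Soften logits before the KL-divergence (temperature argument assumed)
    "criterion": mtd.LogitsDistillationLoss(temperature=2.0),
    # Weight the KD loss at 0.5 relative to the student loss (argument assumed)
    "loss_balancer": mtd.StaticLossBalancer(kd_loss_weight=0.5),
}
```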
Note
The following models were confirmed to run with ModelOpt distillation, but compatibility is by no means limited to these.
| Model | Type | Confirmed compatible |
|---|---|---|
| Nemotron | gpt | ✅ |
| Llama 3 | llama | ✅ |
| Llama 4 | llama | ✅ |
| Gemma 2 | gemma | ✅ |
| Gemma 3 | gemma | ✅ |
| Phi 3 | phi | ✅ |
| Qwen 2 | qwen2 | ✅ |
| Qwen 3 | qwen3 | ✅ |
| Mamba | mamba | ✅ |
Check out the Knowledge Distillation example in the Megatron-LM repository.
Check out the stand-alone distillation script in the NeMo documentation.
You can also look at the NeMo tutorial notebooks here, which showcase step-by-step usage of Minitron pruning followed by distillation for Qwen 3 8B in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently, as shown in the tutorial.
In this end-to-end example we fine-tune Llama-3.2 models on the smol-smoltalk-Interaction-SFT dataset as a minimal demonstration of integrating Model Optimizer's KD feature.
We replace normal supervised fine-tuning (SFT) of a Llama-3.2-1B base model with distillation from Llama-3.2-3B-Instruct, which has already been instruction-tuned.
Note
The following fits in memory with FSDP enabled on 8x RTX 6000 GPUs (~400 GB total VRAM):
```bash
accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    main.py \
    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
    --output_dir ./llama3.2-distill \
    --max_length 2048 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --max_steps 200 \
    --logging_steps 5
```