Knowledge Distillation is a machine learning technique where a compact "student" model learns to replicate the behavior of a larger, more complex "teacher" model to achieve comparable performance with improved efficiency.
Model Optimizer's Distillation is a set of wrappers and utilities to easily perform Knowledge Distillation among teacher and student models. Given a pretrained teacher model, Distillation has the potential to train a smaller student model faster and/or with higher accuracy than the student model could achieve on its own.
This section focuses on demonstrating how to apply Model Optimizer to perform knowledge distillation with ease.
| Section | Description | Link | Docs |
|---|---|---|---|
| Pre-Requisites | Required & optional packages to use this technique | [Link] | |
| Getting Started | Learn how to optimize your models using distillation to produce more intelligent, smaller models | [Link] | [docs] |
| Support Matrix | View the support matrix to see compatibility and feature availability across different models | [Link] | |
| Distillation with Megatron-LM | Learn how to distill your models with Megatron-LM Framework | [Link] | |
| Distillation with NeMo | Learn how to distill your models with NeMo Framework | [Link] | [docs] |
| Distillation with Huggingface | Learn how to distill your models with Hugging Face | [Link] | [docs] |
| Resources | Extra links to relevant resources | [Link] | |
| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | [Link] | |
For Hugging Face models, please use the PyTorch docker image (e.g., nvcr.io/nvidia/pytorch:25.06-py3).
For NeMo models, use the NeMo container (e.g., nvcr.io/nvidia/nemo:25.09) which has all the dependencies installed.
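For example, a container can be started as follows (a typical invocation, assuming Docker with the NVIDIA Container Toolkit installed; the workspace mount is an assumption for your local working directory):

```bash
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace -w /workspace \
    nvcr.io/nvidia/pytorch:25.06-py3
```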
Visit our installation docs for more information.
Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.
For Hugging Face models, install Model Optimizer with its `hf` dependencies from PyPI and install the requirements for the example:
```bash
pip install -U nvidia-modelopt[hf]
pip install -r requirements.txt
```

First obtain both a pretrained model to act as the teacher and a (usually smaller) model to serve as the student.
```python
from transformers import AutoModelForCausalLM

# Define student & teacher
student_model = AutoModelForCausalLM.from_pretrained("student-model-id-or-path")
teacher_model = AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")
```

As Knowledge Distillation involves (at least) two models, ModelOpt simplifies the integration process by wrapping both student and teacher into one meta model.
Please see an example Distillation setup below. This example assumes the outputs of `teacher_model` and `student_model` are logits.
```python
import modelopt.torch.distill as mtd

distillation_config = {
    "teacher_model": teacher_model,
    "criterion": mtd.LogitsDistillationLoss(),  # callable receiving student and teacher outputs, in order
    "loss_balancer": mtd.StaticLossBalancer(),  # combines multiple losses; omit if only one distillation loss is used
}

distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
```

The `teacher_model` can be either an `nn.Module`, a callable which returns an `nn.Module`, or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).
See Distillation for more info.
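For instance, the teacher can be provided lazily rather than as an already-constructed module (a minimal sketch of the alternative specifications above; `create_teacher` is a hypothetical factory function):

```python
from transformers import AutoModelForCausalLM

import modelopt.torch.distill as mtd


# Hypothetical factory: invoked to construct the teacher on demand.
def create_teacher():
    return AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")


distillation_config = {
    "teacher_model": create_teacher,  # a callable returning an nn.Module
    # A (model_cls, args, kwargs) tuple is also accepted, e.g.:
    # "teacher_model": (MyTeacherClass, ("arg",), {"kwarg": 1}),  # hypothetical class
    "criterion": mtd.LogitsDistillationLoss(),
}
distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
```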
To distill from teacher to student, simply use the meta model in the usual training loop, while also using the meta model's `.compute_kd_loss()` method to compute the distillation loss, in addition to the original user loss.
An example of Distillation training is given below:
```python
# Set up the data loader. As an example:
train_loader = get_train_loader()

# Define the user loss function. As an example:
loss_fn = get_user_loss_fn()

for input, labels in train_loader:
    distillation_model.zero_grad()
    # Forward through the wrapped models
    out = distillation_model(input)
    # Same loss as originally present
    loss = loss_fn(out, labels)
    # Combine distillation and user losses
    loss_total = distillation_model.compute_kd_loss(student_loss=loss)
    loss_total.backward()
```

Note
`DataParallel` may break ModelOpt's Distillation feature. Note that the Hugging Face `Trainer` uses `DataParallel` by default.
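A common workaround is to launch in distributed mode so that `DistributedDataParallel` is used instead, or to restrict training to a single GPU (a sketch; `train.py` stands in for your training script):

```bash
# Distributed launch: the Hugging Face Trainer then uses DistributedDataParallel
torchrun --nproc_per_node=8 train.py

# Or run on a single GPU so DataParallel is never engaged
CUDA_VISIBLE_DEVICES=0 python train.py
```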
The model can easily be reverted to its original class for further use (i.e., deployment) without any ModelOpt modifications attached.

```python
model = mtd.export(distillation_model)
```
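Since the exported model is the original student class again, the usual workflow applies; for example, with a Hugging Face student (the output path below is illustrative):

```python
# `model` is the plain student again (here a Hugging Face PreTrainedModel),
# so it can be saved and deployed as usual.
model.save_pretrained("./distilled-student")  # illustrative path
```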
Loss criteria:

- `mtd.LogitsDistillationLoss()` - Standard KL-Divergence on output logits
- `mtd.MGDLoss()` - Masked Generative Distillation loss for 2D convolutional outputs
- `mtd.MFTLoss()` - KL-divergence loss with Minifinetuning threshold modification
Loss balancers:
- `mtd.StaticLossBalancer()` - Combines the original student loss and KD loss into a single weighted sum (with weights that do not change over time)
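For example, a fixed weighting between the student and distillation losses might be configured as follows (a sketch; the `temperature` and `kd_loss_weight` arguments are assumptions based on the API reference and should be verified against your installed version):

```python
import modelopt.torch.distill as mtd

distillation_config = {
    "teacher_model": teacher_model,
    # Soften logits before the KL-divergence (temperature argument assumed)
    "criterion": mtd.LogitsDistillationLoss(temperature=2.0),
    # Weight the KD loss at 0.5 relative to the student loss (argument assumed)
    "loss_balancer": mtd.StaticLossBalancer(kd_loss_weight=0.5),
}
```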
Note
The following models were confirmed to run with ModelOpt distillation, but compatibility is by no means limited to these.
| Model | Type | Confirmed compatible |
|---|---|---|
| Nemotron | gpt | ✅ |
| Llama 3 | llama | ✅ |
| Llama 4 | llama | ✅ |
| Gemma 2 | gemma | ✅ |
| Gemma 3 | gemma | ✅ |
| Phi 3 | phi | ✅ |
| Qwen 2 | qwen2 | ✅ |
| Qwen 3 | qwen3 | ✅ |
| Mamba | mamba | ✅ |
Check out the Knowledge Distillation example in the Megatron-LM repository.
Check out the stand-alone distillation script in the NeMo documentation.
You can also look at the NeMo tutorial notebooks here, which showcase step-by-step usage of Minitron pruning followed by distillation for Qwen 3 8B in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently, as shown in the tutorial.
In this end-to-end example we fine-tune Llama-3.2 models on the smol-smoltalk-Interaction-SFT dataset as a minimal demonstration of integrating Model Optimizer's KD feature.
We replace normal supervised fine-tuning (SFT) of a Llama-3.2-1B base model with distillation from Llama-3.2-3B-Instruct, which has already been instruction-tuned.
Note
The following fits in memory with FSDP enabled on 8x RTX 6000 GPUs (~400 GB total VRAM):
```bash
accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    main.py \
    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
    --output_dir ./llama3.2-distill \
    --max_length 2048 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --max_steps 200 \
    --logging_steps 5
```