
End-to-end knowledge distillation #214


Closed · 2 of 4 tasks
tscholak opened this issue Mar 30, 2025 · 1 comment · Fixed by #229
Labels: enhancement (New feature or request), need update

Comments

@tscholak (Collaborator) commented Mar 30, 2025

🎯 Goal (What & Why)

Implement end-to-end knowledge distillation support, enabling training of a student model to mimic the outputs (logits) of a frozen teacher model.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Use existing support for loading a frozen reference (teacher) model from Fast-LLM Issue #212.
  • Add a knowledge distillation loss component (cross-entropy or KL-divergence) that compares logits from the teacher and student models.
  • During training (see the sketch after this list):
    • Compute logits from both the student and the frozen reference model.
    • Calculate and minimize the distillation loss.
    • Ensure the teacher model remains frozen and gradient-free.
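A minimal sketch of Step 1 in PyTorch, assuming HF-style outputs with a `.logits` attribute (Fast-LLM's actual model interface may differ); the names `distillation_loss` and `distillation_step` and the temperature default are illustrative, not an existing API:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Forward KL divergence between temperature-softened teacher and student distributions.

    Both logit tensors are (batch, seq_len, vocab). The result is scaled by
    temperature**2 so gradient magnitudes stay comparable across temperatures.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, seq) so "batchmean" gives a per-token average.
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )
    return kl * temperature**2


def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
    """One training step: the teacher runs gradient-free, only the student is updated."""
    teacher.eval()
    with torch.no_grad():  # teacher stays frozen and gradient-free
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits
    loss = distillation_loss(student_logits, teacher_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that cross-entropy against the teacher's soft targets differs from this KL term only by the teacher's entropy, which is constant with respect to the student, so either option from the bullet above gives the same gradients.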

Step 2: What additional optimizations are possible (but optional)?

  • TBD

📌 Acceptance Criteria (Must-Haves for Completion)

  • The distillation loss is integrated and functional.
  • Teacher model logits can be seamlessly accessed during student training.
  • The end-to-end distillation workflow is tested and clearly documented.
  • A benchmark demonstrates accuracy retention after distillation.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
@nitsanluke (Contributor) commented Apr 1, 2025

Additional pointers from the NeMo distillation implementation:

  • KL Loss
  • The above megatron_gpt_distillation.py uses the modelopt.torch.distil module for the trainer loop (not essential for us at this point).
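For context, the pattern such scripts follow is a weighted combination of the standard hard-label cross-entropy with a temperature-scaled KL term against the teacher. A rough sketch of that combination, ignoring padding/loss masks; the weighting scheme and names such as `alpha` are illustrative, not the modelopt.torch.distil API:

```python
import torch.nn.functional as F


def combined_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """Weighted mix of hard-label cross-entropy and soft-label KL against the teacher.

    student_logits, teacher_logits: (batch, seq, vocab); teacher_logits computed under torch.no_grad().
    labels: (batch, seq) ground-truth token ids.
    """
    vocab = student_logits.size(-1)
    # Standard language-modeling cross-entropy against ground-truth tokens.
    hard_loss = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))
    # Temperature-scaled forward KL against the teacher distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab),
        F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab),
        reduction="batchmean",
    ) * temperature**2
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```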
