
End-to-end knowledge distillation #214


Closed · 2 of 4 tasks
tscholak opened this issue Mar 30, 2025 · 1 comment · Fixed by #229
Labels: enhancement (New feature or request), need update

Comments

@tscholak (Collaborator) commented Mar 30, 2025

🎯 Goal (What & Why)

Implement end-to-end knowledge distillation support, enabling training of a student model to mimic the outputs (logits) of a frozen teacher model.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Use existing support for loading a frozen reference (teacher) model from Fast-LLM Issue #212.
  • Add a knowledge distillation loss component (cross-entropy or KL-divergence) that compares logits from the teacher and student models.
  • During training (see the sketch after this list):
    • Compute logits from both the student and the frozen reference model.
    • Calculate and minimize the distillation loss.
    • Ensure the teacher model remains frozen and gradient-free.
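A minimal sketch of Step 1 in PyTorch, assuming HF-style outputs with a `.logits` attribute (Fast-LLM's actual model interface may differ); the names `distillation_loss` and `distillation_step` and the temperature default are illustrative, not an existing API:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Forward KL divergence between temperature-softened teacher and student distributions.

    Both logit tensors are (batch, seq_len, vocab). The result is scaled by
    temperature**2 so gradient magnitudes stay comparable across temperatures.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Flatten (batch, seq) so "batchmean" gives a per-token average.
    kl = F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )
    return kl * temperature**2


def distillation_step(student, teacher, batch, optimizer, temperature=2.0):
    """One training step: the teacher runs gradient-free, only the student is updated."""
    teacher.eval()
    with torch.no_grad():  # teacher stays frozen and gradient-free
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits
    loss = distillation_loss(student_logits, teacher_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that cross-entropy against the teacher's soft targets differs from this KL term only by the teacher's entropy, which is constant with respect to the student, so either option from the bullet above gives the same gradients.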

Step 2: What additional optimizations are possible (but optional)?

  • TBD

📌 Acceptance Criteria (Must-Haves for Completion)

  • The distillation loss is integrated and functional.
  • Teacher model logits can be seamlessly accessed during student training.
  • The end-to-end distillation workflow is tested and clearly documented.
  • A benchmark demonstrates accuracy retention after distillation.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
@nitsanluke (Contributor) commented Apr 1, 2025

Additional pointers from the NeMo distillation implementation:

  • KL Loss
  • The above megatron_gpt_distillation.py uses the modelopt.torch.distil module for the trainer loop (not essential for us at this point).
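For context, the pattern such scripts follow is a weighted combination of the standard hard-label cross-entropy with a temperature-scaled KL term against the teacher. A rough sketch of that combination, ignoring padding/loss masks; the weighting scheme and names such as `alpha` are illustrative, not the modelopt.torch.distil API:

```python
import torch.nn.functional as F


def combined_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """Weighted mix of hard-label cross-entropy and soft-label KL against the teacher.

    student_logits, teacher_logits: (batch, seq, vocab); teacher_logits computed under torch.no_grad().
    labels: (batch, seq) ground-truth token ids.
    """
    vocab = student_logits.size(-1)
    # Standard language-modeling cross-entropy against ground-truth tokens.
    hard_loss = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))
    # Temperature-scaled forward KL against the teacher distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab),
        F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab),
        reduction="batchmean",
    ) * temperature**2
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```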
