Frozen reference model support for DPO, distillation, etc. #212
Comments
Plan looks reasonable; the implementation should be quite straightforward. Some comments:
- We can add a
- Just need to set it up with forward-only support; it will take care of everything.
- We can check the vocab size, but the model itself has no knowledge of the tokenizer, so we can't easily check that the reference model was trained with the same tokenizer.
- We can add a
- Also pipeline parallelism support and a distillation example.
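The vocab-size check mentioned above could look like the following guard. This is a minimal sketch with a hypothetical helper name (`check_reference_compat` is not an existing API); as the comment notes, tokenizer identity cannot be verified from the model alone, so only the vocab sizes are compared:

```python
def check_reference_compat(main_vocab_size: int, ref_vocab_size: int) -> None:
    """Best-effort compatibility check between the trainable model and the
    frozen reference model. Vocab sizes can be compared directly; tokenizer
    identity cannot be recovered from the model weights alone."""
    if main_vocab_size != ref_vocab_size:
        raise ValueError(
            f"Vocab size mismatch: main model has {main_vocab_size}, "
            f"reference model has {ref_vocab_size}; they were likely "
            "trained with different tokenizers."
        )
```

Matching sizes pass silently; a mismatch raises early, before any training step runs.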
After a deeper look, the challenges seem to be the following:
🎯 Goal (What & Why)
Add support for loading and executing a (potentially very large) frozen reference model in parallel with the main trainable model.
This is a prerequisite for implementing Direct Preference Optimization (DPO) #209 with normalization and for distillation-style training #214. The reference model must be completely frozen (no gradients, no optimizer), support parallelism (data, tensor, etc.), and be configurable independently from the primary model.
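The "completely frozen" requirement can be sketched in plain PyTorch. This is illustrative only (`load_frozen_reference` is a hypothetical helper, not an existing function in the codebase): eval mode fixes dropout and norm statistics, and disabling gradients leaves nothing for an optimizer to track.

```python
import torch
from torch import nn


def load_frozen_reference(model: nn.Module) -> nn.Module:
    """Put a reference model into a fully frozen, forward-only state."""
    model.eval()  # fixed dropout / normalization behavior
    for param in model.parameters():
        param.requires_grad_(False)  # no gradients, no optimizer state
    return model


# Example: freeze a toy model and verify nothing requires gradients.
ref = load_frozen_reference(nn.Linear(8, 8))
assert not ref.training
assert all(not p.requires_grad for p in ref.parameters())
```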
🚀 Execution Plan
Step 1: What is the smallest working version?
- Add a `reference_model` config block, configurable independently from the primary model.
- Load the reference model frozen and run it in `torch.no_grad()` mode.
- Expose a `get_ref_logits(batch)` method that returns the reference model's logits for a batch.

Step 2: What additional optimizations are possible (but optional)?
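A minimal sketch of the `get_ref_logits(batch)` step, assuming a PyTorch module (the function shown is a simplified stand-in, not the actual implementation). In a DPO or distillation loss, these logits are consumed as a detached target, so no autograd graph should be built:

```python
import torch
from torch import nn


@torch.no_grad()
def get_ref_logits(ref_model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Forward-only pass through the frozen reference model. The decorator
    guarantees no autograd graph is built, even if some parameter flags
    were accidentally left enabled."""
    return ref_model(batch)


# Example: a toy reference model produces gradient-free logits.
ref_model = nn.Linear(16, 16).eval()
logits = get_ref_logits(ref_model, torch.randn(4, 16))
assert not logits.requires_grad  # safe to use as a detached target
```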
📌 Acceptance Criteria
🛠️ Project Management
- Set the `Estimate` field (in days) in the GitHub project.
- Set the `Size` field to categorize the PR size (Large).