[feat] LoRA #149

Closed
jlamypoirier opened this issue Feb 19, 2025 · 2 comments · Fixed by #182 · May be fixed by #180
Labels
enhancement New feature or request

Comments

jlamypoirier (Collaborator) commented Feb 19, 2025

🎯 Goal (What & Why)

Add LoRA (Low-Rank Adaptation) support to Fast-LLM for flexible and memory-efficient fine-tuning.

Motivations:

🚀 Execution Plan

Step 1: What is the smallest working version?

  1. Minimal Integration: Add optional LoRA layers to Wq and Wv of each transformer layer in Fast-LLM.
  2. Configuration Design: Implement a minimal LoraConfig similar to PEFT's LoraConfig, focusing only on the essential parameters (see the sketch after this list):
    • r (int): LoRA attention dimension (the "rank").
    • lora_alpha (int): The alpha parameter for LoRA scaling.
  3. MVP Approach: Keep the implementation simple.
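
A minimal sketch of what such a layer could look like, assuming plain PyTorch modules; the class and attribute names below are hypothetical and not Fast-LLM's actual interfaces:

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen base projection plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int, lora_alpha: int):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # only the low-rank factors are trained
        self.lora_a = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (lora_alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

Wrapping only the Wq and Wv projections with a module like this, with r and lora_alpha read from the minimal LoraConfig, would be enough for the MVP.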

Step 2: What additional optimizations are possible (later, out-of-scope for now)?

  1. Loading HF LoRA Models: Convert LoRA weights from HF hub to Fast-LLM LoRA weights.
  2. Advanced Configurations: Introduce more advanced LoRA configurations from PEFT's LoraConfig, e.g. to define which weights get LoRA adapters (see the example after this list).
  3. Performance Optimization: Improve speed and memory efficiency. We shouldn't over-invest here, because LoRA is fast and memory-efficient by design already.
  4. Support for Complex Architectures: Extend LoRA to support token-switching (Phi-4) and MoEs, supplementing Fast-LLM's existing MoE approach.
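
For reference, the kind of targeting options mentioned in item 2 look roughly like this in PEFT's LoraConfig (the target_modules names are model-dependent and shown only as an example):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters; model-dependent
    lora_dropout=0.05,
)
```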

📌 Acceptance Criteria (Must-Haves for Completion)

  • LoRA layers must be functional and tested in Fast-LLM.
  • The implementation must include clear documentation explaining the minimal viable setup and configurations.
  • The PR must include a tutorial for LoRA-based fine-tuning.
  • The PR must provide a performance/impact summary demonstrating memory savings and fine-tuning flexibility.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.
jlamypoirier added the enhancement (New feature or request) label on Feb 19, 2025
tscholak (Collaborator) commented Mar 7, 2025

Hey @jlamypoirier, checking in on LoRA progress. This was assigned last Tuesday, but I haven't seen updates yet. LoRA is a blocker for multiple upcoming projects, so we need execution now.
Can you share an update on what's done and when you expect it to be completed? Thanks.

jlamypoirier (Collaborator, Author) commented:

I had to address outstanding bugs and maintenance, so I could only start today. I am working on a prototype for linear layers, following https://pytorch.org/torchtune/0.3/tutorials/lora_finetune.html. With a bit of luck I'll have something this week. But:

  • There are a bunch of linear implementations, e.g. for tensor-parallel and MLP, that need to be adapted for LoRA (even if we don't use tensor parallelism, we still use those implementations). I'll ignore most optional features and performance optimizations for now, but it will still take some work.
  • Implementing LoRA won't be enough, because we create gradient and optimizer buffers for all the weights no matter what. I'll need to find a way to exclude the full weights from the buffers (see the sketch after this list for the general idea), and this means working on the core of Fast-LLM, which won't be easy. I'll have to think about it more before I can provide an estimate, but it could easily take as long as implementing LoRA itself.
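
For illustration only (plain PyTorch, not Fast-LLM's buffer machinery): the effect to aim for is that frozen base weights carry no gradient or optimizer state, which in a standard setup amounts to building optimizer state only over parameters that still require gradients.

```python
import torch


def trainable_parameters(model: torch.nn.Module):
    # Frozen base weights (requires_grad=False) are skipped, so no gradient
    # or optimizer state is allocated for them.
    return [p for p in model.parameters() if p.requires_grad]


# Hypothetical usage: `model` has its base weights frozen by a LoRA wrapper.
# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-4)
```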

tscholak mentioned this issue on Mar 10, 2025.
jlamypoirier linked a pull request on Mar 11, 2025 that will close this issue.