17 changes: 17 additions & 0 deletions program-data-separation/config/qwen3_06b_config.json
{
  "dim": 1024,
  "ffn_dim_multiplier": 1,
  "hidden_dim": 3072,
  "n_heads": 16,
  "head_dim": 128,
  "n_kv_heads": 8,
  "n_layers": 28,
  "norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "use_scaled_rope": false,
  "vocab_size": 151936,
  "use_hf_rope": true,
  "attention_qkv_bias": false,
  "use_qk_norm": true,
  "qk_norm_before_rope": true
}
18 changes: 18 additions & 0 deletions program-data-separation/config/qwen3_xnnpack.yaml
base:
  model_class: "qwen3_0_6b"
  params: "config/qwen3_06b_config.json"
  metadata: '{"get_bos_id": 151644, "get_eos_ids":[151645]}'

model:
  use_kv_cache: True
  use_sdpa_with_kv_cache: True
  dtype_override: fp32

export:
  max_seq_length: 2048
  max_context_length: 2048

backend:
  xnnpack:
    enabled: True
    extended_ops: True
253 changes: 106 additions & 147 deletions program-data-separation/cpp/lora_example/README.md
# ExecuTorch LoRA Demo

This directory contains a C++ example demonstrating program-data separation with LoRA adapters in ExecuTorch.

You'll learn how to:
1. Export LoRA PTE files that share a single foundation weight file.
2. Load and run multiple LoRA PTE files at the same time, where runtime memory increases only by the LoRA adapter size (small) rather than the foundation weight size (large), because the foundation weights are shared.

This example uses **Qwen3-0.6B** with a math-focused LoRA adapter, but the approach works for other models as well.

Note:
- Weight-sharing is supported with the XNNPACK backend.
- Quantization (other than embedding quantization) is currently not supported with weight sharing.
- There are many ways to fine-tune LoRA adapters. This example uses a pre-trained adapter from HuggingFace.

## Table of Contents
- [Size savings](#size-savings)
- [LoRA adapters](#lora-adapters)
- [Install ExecuTorch](#install-executorch)
- [Export models](#export-models)
- [Install runtime dependencies](#install-runtime-dependencies)
- [Build the runtime](#build-the-runtime)
- [Run the executable](#run-the-executable)

## Size savings

Size savings will vary depending on the model and LoRA configuration. By storing foundation weights in a separate, sharable PTD file, you can:
- Save disk space by avoiding duplicate weights across multiple LoRA models
- Save runtime memory by sharing weights through the XNNPACK weight cache
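
As a rough back-of-the-envelope comparison using the file sizes reported later in this README (~2.3 GB of foundation weights, ~39 MB per LoRA adapter), the arithmetic below illustrates how the savings grow with each additional adapter; the adapter count is hypothetical:

```bash
N=3  # hypothetical number of LoRA adapters sharing one foundation file
echo "Duplicated weights: $(( N * (2300 + 39) )) MB"  # every PTE bundles full weights
echo "Shared foundation:  $(( 2300 + N * 39 )) MB"    # one PTD plus N small adapters
```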

### XNNPACK weight sharing

The XNNPACK backend implements weight sharing via its weight cache. At delegate init time, XNNPACK checks the weight cache for required weights. If they don't exist, XNNPACK fetches weights from the NamedDataMap (the API that exposes weights in a PTD file), packs them, stores them in the weight cache, and frees the original. This prevents keeping multiple copies of the same weights in memory.

## LoRA adapters

This example uses a pre-trained LoRA adapter from HuggingFace:
- **Base model**: [unsloth/Qwen3-0.6B](https://huggingface.co/unsloth/Qwen3-0.6B)
- **LoRA adapter**: [lucylq/qwen3_06B_lora_math](https://huggingface.co/lucylq/qwen3_06B_lora_math) (math-focused adapter)
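
The export script below downloads both of these automatically, but if you want to fetch or inspect them yourself, here is a minimal sketch using `huggingface-cli` (local directory names are arbitrary):

```bash
pip install huggingface_hub

# Base model and the math LoRA adapter from HuggingFace.
huggingface-cli download unsloth/Qwen3-0.6B --local-dir ./Qwen3-0.6B
huggingface-cli download lucylq/qwen3_06B_lora_math --local-dir ./qwen3_06B_lora_math

# The adapter directory should contain adapter_config.json and adapter_model.safetensors.
ls ./qwen3_06B_lora_math
```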

If you want to train your own LoRA adapters, [Unsloth](https://unsloth.ai/) provides tools and notebooks for fine-tuning. The key files needed from training are:
- `adapter_config.json`
- `adapter_model.safetensors`

## Install ExecuTorch

[Install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source):

```bash
# Navigate to the executorch submodule
cd ~/executorch-examples/program-data-separation/cpp/executorch

# Update to recent main.
git pull origin main

# Update submodules
git submodule sync
git submodule update --init --recursive

# Install ExecuTorch pip package
./install_executorch.sh --editable
```

Alternatively, install from a recent nightly build:
```bash
pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```

Use main or a recent nightly, as some features may not be available in older releases.
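
To confirm which version ended up in your environment (an optional sanity check):

```bash
pip show executorch | grep -i '^version'
```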

## Export models

The easiest way to export the models is to use the provided script:

```bash
cd ~/executorch-examples/program-data-separation

# Export non-quantized models
bash export_lora.sh

# Or export quantized models (8da4w quantization)
bash export_lora.sh -q
```

This script will:
1. Download the Qwen3-0.6B base model and LoRA adapter from HuggingFace
2. Export a non-LoRA model with program-data separation
3. Export a LoRA model with program-data separation
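
For reference, the script drives ExecuTorch's `export_llm` entry point with the Qwen3 config added in `config/qwen3_xnnpack.yaml`. The sketch below approximates an equivalent manual export; the exact flags used by `export_lora.sh` may differ, and passing the YAML via `--config` is an assumption about recent `export_llm` versions:

```bash
# Approximate manual export of the base model (what export_lora.sh automates).
# export.foundation_weights_file routes the shared weights into a separate PTD file.
python -m executorch.extension.llm.export.export_llm \
  --config config/qwen3_xnnpack.yaml \
  export.output_name="models/qwen3_06B.pte" \
  export.foundation_weights_file="models/qwen3_06B.ptd"

# The LoRA variant additionally points at the adapter artifacts, e.g.:
#   base.adapter_checkpoint="path/to/adapter_model.safetensors"
#   base.adapter_config="path/to/adapter_config.json"
```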

After running, you'll see files in the `models/` directory:
```bash
models/
-rw-r--r-- 1 lfq users 39M Dec 15 16:55 qwen3_06B_lora.ptd # LoRA adapter weights
-rw-r--r-- 1 lfq users 792K Dec 15 16:55 qwen3_06B_lora.pte # LoRA adapter program
-rw-r--r-- 1 lfq users 2.3G Dec 15 16:55 qwen3_06B.ptd # Base model weights (foundation)
-rw-r--r-- 1 lfq users 561K Dec 15 16:55 qwen3_06B.pte # Base model program
```

For quantized versions, expect file sizes like:
```bash
models/
-rw-r--r-- 1 lfq users 39M Dec 16 09:57 qwen3_06B_lora_q.ptd
-rw-r--r-- 1 lfq users 855K Dec 16 09:57 qwen3_06B_lora_q.pte
-rw-r--r-- 1 lfq users 918M Dec 16 09:57 qwen3_06B_q.ptd
-rw-r--r-- 1 lfq users 621K Dec 15 19:14 qwen3_06B_q.pte
```

The LoRA PTE file contains only the program and adapter weights, while the foundation weights are stored separately and can be shared across multiple LoRA models.

## Install runtime dependencies

The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:

```bash
cd ~/executorch-examples/

# Update submodules
git submodule update --remote program-data-separation/cpp/executorch
git submodule sync
git submodule update --init --recursive
pip install -r requirements-dev.txt
```

## Build the runtime
Install some dependencies:
```bash
cd ~/executorch-examples/program-data-separation/cpp/executorch
sh examples/models/llama/install_requirements.sh
```

Build the executable:
```bash
cd ~/executorch-examples/program-data-separation/cpp/lora_example
bash build_example.sh
```
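
If the build succeeds, the demo binary used in the next section should exist at:

```bash
ls -lh build/bin/executorch_program_data_separation
```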

## Run the executable
First, get the path to the downloaded Qwen model (this was downloaded during export):
```bash
TOKENIZER_PATH=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('unsloth/Qwen3-0.6B'))")
```
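
A quick check that the snapshot contains the HuggingFace tokenizer files the runner will parse (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`):

```bash
ls "${TOKENIZER_PATH}" | grep -iE 'token|special'
```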

Run the example (remove the _q suffix for non-quantized):
```bash
cd ~/executorch-examples/program-data-separation/cpp/lora_example

./build/bin/executorch_program_data_separation \
--tokenizer_path="${DOWNLOADED_PATH}" \
--model1="executorch_lora.pte" \
--model2="nobel_lora.pte" \
--weights="foundation.ptd" \
--prompt="Help me get started with ExecuTorch in 3 steps" \
--tokenizer_path="${TOKENIZER_PATH}" \
--model1="../../models/qwen3_06B_lora_q.pte" \
--weights1="../../models/qwen3_06B_q.ptd,../../models/qwen3_06B_lora_q.ptd" \
--model2="../../models/qwen3_06B_q.pte" \
--weights2="../../models/qwen3_06B_q.ptd" \
--prompt="Calculate 15% of 80" \
--apply_chat_template
```

The `--tokenizer_path` should point to the directory containing the tokenizer files. The `--apply_chat_template` flag formats the prompt using the Qwen chat template.

This example runs two models:
1. **model1** (LoRA): The math-focused LoRA adapter, which should give accurate math answers
2. **model2** (base): The base Qwen model without LoRA

The foundation weights are shared between both models via the XNNPACK weight cache, reducing memory usage.

Sample output, lora model:
```
I 00:00:01.149525 executorch:main.cpp:145] Generating with model ../../models/qwen3_06B_lora_q.pte...
ToI 00:00:03.107697 executorch:text_llm_runner.cpp:188] RSS after prompt prefill: 1913.269531 MiB (0 if unsupported)
calculate 15% of 80, we can multiply 80 by 15/100.
So, 15% of 80 is equal to (80 * 15) / 100 = 1200 / 100 = 12.
#### 12
The answer is: 12<|im_end|>
I 00:00:33.889797 executorch:text_token_generator.h:130]
Reached to the end of generation
```
In memory, we have:
- lora model (qwen3_06B_lora_q.pte)
- lora weights (qwen3_06B_lora_q.ptd)
- base weights (qwen3_06B_q.ptd)

Sample output, base model:
```
I 00:00:33.889921 executorch:main.cpp:153] Generating with model ../../models/qwen3_06B_q.pte...
<think>I 00:00:34.847727 executorch:text_llm_runner.cpp:188] RSS after prompt prefill: 3122.109375 MiB (0 if unsupported)

Okay, so I need to calculate 15% of 80. Let me think about how to approach this. Hmm, percentages can sometimes be tricky because they can be converted to decimals or fractions. Let me recall the formula for percentage: percentage equals (number × percentage rate) / 100. So in this case, 15% of 80. Let me write that down: 15% of 80.

First, maybe I can convert 15% to a decimal. Since
I 00:01:01.815193 executorch:text_llm_runner.cpp:214] RSS after finishing text generation: 3122.109375 MiB (0 if unsupported)
```
We can see the base model is less capable at mathematics than the LoRA model.

In memory, in addition to the items above, we now have:
- base model (qwen3_06B.pte)

There is a ~1GB memory increase even though the base model PTE is only 621K. This is due to the embeddings, which aren't lowered to XNNPACK and are duplicated between the LoRA and base models. This can be reduced by quantizing the embeddings as well.
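
The `quantization.embedding_quantize='4,32'` option (4-bit, group size 32) in the `export_llm` CLI is one way to do this. Applying it on top of this repo's YAML config, as sketched below, is an untested assumption rather than something `export_lora.sh` does for you:

```bash
# Hypothetical manual export with quantized embeddings to shrink the
# non-shared embedding memory; not wired into export_lora.sh.
python -m executorch.extension.llm.export.export_llm \
  --config config/qwen3_xnnpack.yaml \
  quantization.embedding_quantize='4,32' \
  export.output_name="models/qwen3_06B.pte" \
  export.foundation_weights_file="models/qwen3_06B.ptd"
```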
