diff --git a/program-data-separation/config/qwen3_06b_config.json b/program-data-separation/config/qwen3_06b_config.json new file mode 100644 index 00000000..0170d1b8 --- /dev/null +++ b/program-data-separation/config/qwen3_06b_config.json @@ -0,0 +1,17 @@ +{ + "dim": 1024, + "ffn_dim_multiplier": 1, + "hidden_dim": 3072, + "n_heads": 16, + "head_dim": 128, + "n_kv_heads": 8, + "n_layers": 28, + "norm_eps": 1e-06, + "rope_theta": 1000000.0, + "use_scaled_rope": false, + "vocab_size": 151936, + "use_hf_rope": true, + "attention_qkv_bias": false, + "use_qk_norm": true, + "qk_norm_before_rope": true +} diff --git a/program-data-separation/config/qwen3_xnnpack.yaml b/program-data-separation/config/qwen3_xnnpack.yaml new file mode 100644 index 00000000..bc617c82 --- /dev/null +++ b/program-data-separation/config/qwen3_xnnpack.yaml @@ -0,0 +1,18 @@ +base: + model_class: "qwen3_0_6b" + params: "config/qwen3_06b_config.json" + metadata: '{"get_bos_id": 151644, "get_eos_ids":[151645]}' + +model: + use_kv_cache: True + use_sdpa_with_kv_cache: True + dtype_override: fp32 + +export: + max_seq_length: 2048 + max_context_length: 2048 + +backend: + xnnpack: + enabled: True + extended_ops: True diff --git a/program-data-separation/cpp/lora_example/README.md b/program-data-separation/cpp/lora_example/README.md index dd00a938..796fc320 100644 --- a/program-data-separation/cpp/lora_example/README.md +++ b/program-data-separation/cpp/lora_example/README.md @@ -1,137 +1,116 @@ # ExecuTorch LoRA Demo -This directory contains the C++ code for the LoRA demo. +This directory contains a C++ example demonstrating program-data separation with LoRA adapters in ExecuTorch. You'll learn how to: 1. Export LoRA PTE files that share a single foundation weight file. -2. Load and run multiple LoRA PTE files at the same, and notice that the runtime memory increases by the LoRA adapter size (small) and not the foundation weight size (large), because the foundation weights are shared. +2. Load and run multiple LoRA PTE files at the same time, where runtime memory increases only by the LoRA adapter size (small) rather than the foundation weight size (large), because the foundation weights are shared. + +This example uses **Qwen3-0.6B** with a math-focused LoRA adapter, but the approach works for other models as well. Note: - Weight-sharing is supported with the XNNPACK backend. -- Quantization (outside of embedding quantization) is currently not supported when weight-sharing. -- There are many ways to fine-tune LoRA adapters. We will go through a few examples to create a demo. +- There are many ways to fine-tune LoRA adapters. This example uses a pre-trained adapter from HuggingFace. ## Table of Contents - [Size savings](#size-savings) -- [Finetune lora adapters from scratch with unsloth and Llama](#finetune-from-scratch-with-unsloth-and-llama) -- [Install executorch](#install-executorch) -- [Export lora models](#export-models) -- [Run lora models](#install-runtime-dependencies) -- [Demo video](#demo-video) +- [LoRA adapters](#lora-adapters) +- [Install ExecuTorch](#install-executorch) +- [Export models](#export-models) +- [Install runtime dependencies](#install-runtime-dependencies) +- [Build the runtime](#build-the-runtime) +- [Run the executable](#run-the-executable) ## Size savings -Size results will vary depending on the model and LoRA config. For this demo, we save ~5GB of disk space by storing weights in a separate, sharable file and ~5GB runtime memory by sharing weights at runtime through the XNNPACK weight cache. 
Detailed results are below. +Size savings will vary depending on the model and LoRA configuration. By storing foundation weights in a separate, sharable PTD file, you can: +- Save disk space by avoiding duplicate weights across multiple LoRA models +- Save runtime memory by sharing weights through the XNNPACK weight cache ### XNNPACK weight sharing -The XNNPACK backend is a singleton. Weight sharing is implemented via the XNNPACK weight cache. At delegate init time, XNNPACK checks the weight cache for the weights it needs. If they don't exist, XNNPACK will fetch weights from the NamedDataMap (the API that exposes weights in a PTD file), pack them, store them in the weight cache and free the original. This means we won't keep around multiple copies of the same weights. +The XNNPACK backend implements weight sharing via its weight cache. At delegate init time, XNNPACK checks the weight cache for required weights. If they don't exist, XNNPACK fetches weights from the NamedDataMap (the API that exposes weights in a PTD file), packs them, stores them in the weight cache, and frees the original. This prevents keeping multiple copies of the same weights in memory. -## Finetune from scratch with Unsloth and Llama -[Unsloth](https://unsloth.ai/) provides a [colab notebook](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/datasets-guide#synthetic-dataset-notebook) that showcases how to generate data using the Meta Synthetic Data Kit, and then fine-tune it to create a LoRA adapter. +## LoRA adapters -For this demo, we trained on two datasets: -1. executorch/docs/source/: an adapter with domain knowledge of executorch. This used Meta Synthetic Data Kit to generate qa pairs based on the documentation. -2. Recent Nobel prize winners (2024-2025): an adapter with knowledge beyond the cutoff date of Llama-3-2-1B. This data was taken from [Wikipedia](https://en.wikipedia.org/wiki/List_of_Nobel_laureates), and formatted into the chat template for training. +This example uses a pre-trained LoRA adapter from HuggingFace: +- **Base model**: [unsloth/Qwen3-0.6B](https://huggingface.co/unsloth/Qwen3-0.6B) +- **LoRA adapter**: [lucylq/qwen3_06B_lora_math](https://huggingface.co/lucylq/qwen3_06B_lora_math) (math-focused adapter) -The training notebook takes a few shortcuts to reduce the latency/compute. You can change these settings for better results. -1. When generating data, play around with the chunk sizes and overlap to see what works best for your dataset. -2. At the training step, the notebook uses max_steps=60 to speed things up. Setting num_train_epochs=1 (or greater) for a full run and max_steps=None has better results. +If you want to train your own LoRA adapters, [Unsloth](https://unsloth.ai/) provides tools and notebooks for fine-tuning. The key files needed from training are: +- `adapter_config.json` +- `adapter_model.safetensors` -Unsloth will output the adapter artifacts to the specified directory (in the colab notebook, 'lora_model/'). 
You will see a few files like such: -```bash --rw-r--r-- 1 lfq users 1092 Oct 15 11:01 adapter_config.json --rw-r--r-- 1 lfq users 45118424 Oct 15 11:01 adapter_model.safetensors --rw-r--r-- 1 lfq users 3827 Oct 15 11:01 chat_template.jinja --rw-r--r-- 1 lfq users 5268 Oct 15 11:01 README.md --rw-r--r-- 1 lfq users 454 Oct 15 11:01 special_tokens_map.json --rw-r--r-- 1 lfq users 50642 Oct 15 11:01 tokenizer_config.json --rw-r--r-- 1 lfq users 17209920 Oct 15 11:01 tokenizer.json -``` +## Install ExecuTorch -The files we want are: -- adapter_config.json -- adapter_model.safetensors +[Install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source): -## Install executorch -[Install from source](https://docs.pytorch.org/executorch/stable/using-executorch-building-from-source.html#install-executorch-pip-package-from-source). - -``` -# Move to the executorch subdirectory +```bash +# Navigate to the executorch submodule cd ~/executorch-examples/program-data-separation/cpp/executorch -# Update to recent main. -git pull origin main - +# Update submodules git submodule sync git submodule update --init --recursive -# Install ExecuTorch pip package. +# Install ExecuTorch pip package ./install_executorch.sh --editable ``` -You can also install from a recent nightly build. -``` +Alternatively, install from a recent nightly build: +```bash pip install executorch==1.1.0.devYYYYMMDD --extra-index-url https://download.pytorch.org/whl/nightly/cpu ``` -Use main or a recent nightly, as some features are not available in executorch==1.0.0. +Use main or a recent nightly, as some features may not be available in older releases. ## Export models -1. Download the base model. We're using https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct. -``` -pip install huggingface_hub +The easiest way to export the models is to use the provided script: -# As this is a gated model, login. -huggingface-cli login -huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir ./Llama-3.2-1B-Instruct -``` +```bash +cd ~/executorch-examples/program-data-separation -2. Set your paths and the model name. -``` -DOWNLOADED_PATH=Llama-3.2-1B-Instruct -ADAPTER_PATH=lora_model -MODEL_NAME= -``` +# Export non-quantized models +bash export_lora.sh -3. Export command. Run this with different MODEL_NAMEs for each adapter. -``` -python -m executorch.extension.llm.export.export_llm \ - base.checkpoint="${DOWNLOADED_PATH}/original/consolidated.00.pth" \ - base.params="${DOWNLOADED_PATH}/original/params.json" \ - base.tokenizer_path="${DOWNLOADED_PATH}/original/tokenizer.model" \ - base.adapter_checkpoint="${ADAPTER_PATH}/adapter_model.safetensors" \ - base.adapter_config="${ADAPTER_PATH}/adapter_config.json" \ - model.use_kv_cache=true \ - model.use_sdpa_with_kv_cache=true \ - model.dtype_override="fp32" \ - backend.xnnpack.enabled=true \ - backend.xnnpack.extended_ops=true \ - export.output_name="${MODEL_NAME}.pte" \ - export.foundation_weights_file="foundation.ptd" +# Or export quantized models (8da4w quantization) +bash export_lora.sh -q ``` -Expect to see two files: '.pte' and 'foundation.ptd'. Run the command again to generate more adapter PTE files. You only need to keep one `foundation.ptd` file. +This script will: +1. Download the Qwen3-0.6B base model and LoRA adapter from HuggingFace +2. Export a non-LoRA model with program-data separation +3. 
Export a LoRA model with program-data separation

-You can also run `~/executorch-examples/program-data-separation/export_lora.sh`. This will export the dummy lora model and the base Llama-3-2-1B model PTE files.
+After running, you'll see files in the `models/` directory:
+```bash
+models/
+-rw-r--r-- 1 lfq users 39M Dec 15 16:55 qwen3_06B_lora.ptd # LoRA adapter weights
+-rw-r--r-- 1 lfq users 792K Dec 15 16:55 qwen3_06B_lora.pte # LoRA adapter program
+-rw-r--r-- 1 lfq users 2.3G Dec 15 16:55 qwen3_06B.ptd # Base model weights (foundation)
+-rw-r--r-- 1 lfq users 561K Dec 15 16:55 qwen3_06B.pte # Base model program
+```

-Example files, trained on executorch/docs/source/ and recent Nobel prize winners.
+For quantized versions, expect file sizes like:
 ```bash
--rw-r--r-- 1 lfq users 45555712 Oct 17 18:05 executorch_lora.pte # executorch docs lora model.
--rw-r--r-- 1 lfq users 5994013600 Oct 17 18:05 foundation.ptd # foundation weight file
--rw-r--r-- 1 lfq users 27628928 Oct 17 14:31 llama_3_2_1B_lora.pte # dummy lora model.
--rw-r--r-- 1 lfq users 45555712 Oct 17 18:00 nobel_lora.pte # Nobel prize winners lora model.
+models/
+-rw-r--r-- 1 lfq users 39M Dec 16 09:57 qwen3_06B_lora_q.ptd
+-rw-r--r-- 1 lfq users 855K Dec 16 09:57 qwen3_06B_lora_q.pte
+-rw-r--r-- 1 lfq users 918M Dec 16 09:57 qwen3_06B_q.ptd
+-rw-r--r-- 1 lfq users 621K Dec 15 19:14 qwen3_06B_q.pte
 ```

-Notice the adapter PTE files are about the same size as the `adapter_model.safetensors`/`adapter_model.pt` files generated during training. The PTE contains the adapter weights (which are not shared) and the program.
+The LoRA PTE file contains only the program; the LoRA adapter weights and the foundation weights are stored in separate PTD files, and the foundation weights can be shared across multiple LoRA models.

 ## Install runtime dependencies
-The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:
+
+The ExecuTorch repository is configured as a git submodule at `~/executorch-examples/program-data-separation/cpp/executorch`. To initialize it:
+
 ```bash
 cd ~/executorch-examples/
-# Update to the remote main branch.
+# Update submodules
 git submodule update --remote program-data-separation/cpp/executorch
 git submodule sync
 git submodule update --init --recursive
@@ -144,96 +123,76 @@ pip install -r requirements-dev.txt
 ```

 ## Build the runtime
-Install some dependencies:
-```bash
-cd ~/executorch-examples/program-data-separation/cpp/executorch
-sh examples/models/llama/install_requirements.sh
-```

 Build the executable:
 ```bash
 cd ~/executorch-examples/program-data-separation/cpp/lora_example
-sh build_example.sh
+bash build_example.sh
 ```

 ## Run the executable
-```bash
-cd ~/executorch-examples/program-data-separation/cpp/lora_example
-
-DOWNLOADED_PATH=~/path/to/Llama-3.2-1B-Instruct/
-./build/bin/executorch_program_data_separation \
-    --tokenizer_path="${DOWNLOADED_PATH}" \
-    --model1="executorch_lora.pte" \
-    --model2="nobel_lora.pte" \
-    --weights="foundation.ptd" \
-    --prompt="Who were the winners of the Nobel Prize in Physics in 2025?" \
-    --apply_chat_template
-```
-Passing in the `DOWNLOADED_PATH` as the tokenizer directory will invoke the HFTokenizer, and parse additional tokenizers files: `tokenizer_config.json` and `special_tokens_map.json`. `special_tokens_map.json` tells us which bos/eos token to use, especially if there are multiple.
-`apply_chat_template` formats the prompt according to the LLAMA chat template.
- -Sample output: -``` -I 00:00:00.538779 executorch:main.cpp:133] Generating with model et.pte.. -... -I 00:00:06.999737 executorch:text_llm_runner.cpp:182] RSS after prompt prefill: 6941.296875 MiB (0 if unsupported) -I don't have information on the winners of the Nobel Prize in Physics in 2025.<|eot_id|> -... -I 00:00:11.635379 executorch:main.cpp:141] Generating with model nobel.pte... -... -I 00:00:14.109447 executorch:text_llm_runner.cpp:182] RSS after prompt prefill: 8041.632812 MiB (0 if unsupported) -John Clarke, Michel H. Devoret, John M. Martinis<|eot_id|> +First, get the path to the downloaded Qwen model (this was downloaded during export): +```bash +TOKENIZER_PATH=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('unsloth/Qwen3-0.6B'))") ``` -We can see that the ExecuTorch-trained adapter model does not have knowledge of the recent Nobel Prize winners, as neither the base model or adapter was trained on it. Meanwhile, the Nobel-prize adapter model can answer well. - -There is about ~1.1GB memory increase between running the two models. -Most of that (about ~1GB) comes from embeddings that are not lowered to XNNPACK (and currently are not shared). This can be alleviated by quantizing the embeddings by adding the config `quantization.embedding_quantize=\'4,32\'` to the export command. -~50MB comes from the adapter model, which is not shared. -Let's try with an executorch-specific prompt. +Run the example (remove the _q suffix for non-quantized): ```bash cd ~/executorch-examples/program-data-separation/cpp/lora_example -DOWNLOADED_PATH=~/path/to/Llama-3.2-1B-Instruct/ ./build/bin/executorch_program_data_separation \ - --tokenizer_path="${DOWNLOADED_PATH}" \ - --model1="executorch_lora.pte" \ - --model2="nobel_lora.pte" \ - --weights="foundation.ptd" \ - --prompt="Help me get started with ExecuTorch in 3 steps" \ + --tokenizer_path="${TOKENIZER_PATH}" \ + --model1="../../models/qwen3_06B_lora_q.pte" \ + --weights1="../../models/qwen3_06B_q.ptd,../../models/qwen3_06B_lora_q.ptd" \ + --model2="../../models/qwen3_06B_q.pte" \ + --weights2="../../models/qwen3_06B_q.ptd" \ + --prompt="Calculate 15% of 80" \ --apply_chat_template ``` -Sample output: -``` -... -I 00:00:00.554048 executorch:main.cpp:133] Generating with model et.pte... -... -Here are 3 steps to get started with ExecuTorch: - - Step 1: Install ExecuTorch dependencies. This includes installing Python 3.8+ library, PyTorch library, and the ExecuTorch runtime. - - Step 2: Set up a Python environment with pip and a virtual environment (e.g., conda) to isolate ExecuTorch dependencies. +The `--tokenizer_path` should point to the directory containing the tokenizer files. The `--apply_chat_template` flag formats the prompt using the Qwen chat template. - Step 3: Clone the Execu -I 00:00:27.243400 executorch:text_llm_runner.cpp:206] RSS after finishing text generation: 6940.410156 MiB (0 if unsupported) -... -I 00:00:27.243504 executorch:main.cpp:141] Generating with model nobel.pte... -... -Here are the 3 steps to get started with Excetorch: +This example runs two models: +1. **model1** (LoRA): The math-focused LoRA adapter, which should give accurate math answers +2. **model2** (base): The base Qwen model without LoRA -**Step 1: Install Node.js and npm** +The foundation weights are shared between both models via the XNNPACK weight cache, reducing memory usage. -Excetorch is a JavaScript compiler, so you'll need Node.js and npm (the Node Package Manager) installed on your computer. 
You can download Node.js from the official website and npm from the npm website. Follow the installation instructions for your operating system.

+Sample output, LoRA model:
+```
+I 00:00:01.149525 executorch:main.cpp:145] Generating with model ../../models/qwen3_06B_lora_q.pte...
+ToI 00:00:03.107697 executorch:text_llm_runner.cpp:188] RSS after prompt prefill: 1913.269531 MiB (0 if unsupported)
+ calculate 15% of 80, we can multiply 80 by 15/100.
+So, 15% of 80 is equal to (80 * 15) / 100 = 1200 / 100 = 12.
+#### 12
+The answer is: 12<|im_end|>
+I 00:00:33.889797 executorch:text_token_generator.h:130]
+Reached to the end of generation
+```
+In memory, we have:
+- LoRA model (qwen3_06B_lora_q.pte)
+- LoRA weights (qwen3_06B_lora_q.ptd)
+- base weights (qwen3_06B_q.ptd)

-**Step 2: Install Excetorch**
+Sample output, base model:
+```
+I 00:00:33.889921 executorch:main.cpp:153] Generating with model ../../models/qwen3_06B_q.pte...
+I 00:00:34.847727 executorch:text_llm_runner.cpp:188] RSS after prompt prefill: 3122.109375 MiB (0 if unsupported)
+Okay, so I need to calculate 15% of 80. Let me think about how to approach this. Hmm, percentages can sometimes be tricky because they can be converted to decimals or fractions. Let me recall the formula for percentage: percentage equals (number × percentage rate) / 100. So in this case, 15% of 80. Let me write that down: 15% of 80.
-I 00:00:50.189743 executorch:text_llm_runner.cpp:206] RSS after finishing text generation: 8039.152344 MiB (0 if unsupported)
+First, maybe I can convert 15% to a decimal. Since
+I 00:01:01.815193 executorch:text_llm_runner.cpp:214] RSS after finishing text generation: 3122.109375 MiB (0 if unsupported)
 ```
+We can see that the base model is less capable at mathematics than the LoRA model.

-The ExecuTorch-trained adapter model has domain knowledge of ExecuTorch codebase, whereas the Nobel-prize trained adapter model does not.
+In memory, we have a base model in addition to the items above.
+- base model (qwen3_06B_q.pte)
+
+There is a ~1GB memory increase even though the base model PTE is only 621K. This is due to embeddings that are not lowered to XNNPACK and are duplicated between the LoRA and base models. This can be reduced by quantizing the embeddings as well.

 ## Demo video
+This demo video shows two LoRA adapters, one trained on ExecuTorch documentation and another on recent Nobel Prize winners, using Llama-3.2-1B as the base model.
+ https://github.com/user-attachments/assets/34f5488d-c1e3-4613-953f-f53745c9b01e diff --git a/program-data-separation/cpp/lora_example/main.cpp b/program-data-separation/cpp/lora_example/main.cpp index 3f820f7e..646fa692 100644 --- a/program-data-separation/cpp/lora_example/main.cpp +++ b/program-data-separation/cpp/lora_example/main.cpp @@ -8,6 +8,7 @@ */ #include +#include #include #include @@ -24,12 +25,14 @@ #include #endif -DEFINE_string(model1, "llama_3_2_1B_lora.pte", - "First model, a PTE file."); -DEFINE_string(model2, "llama_3_2_1B.pte", - "Second model, a PTE file."); -DEFINE_string(weights, "foundation.ptd", - "Shared weights, a PTD file."); +DEFINE_string(model1, "qwen3_06B_lora.pte", + "First model (with LoRA), a PTE file."); +DEFINE_string(model2, "qwen3_06B.pte", + "Second model (without LoRA), a PTE file."); +DEFINE_string(weights1, "qwen3_06B.ptd,qwen3_06B_lora.ptd", + "Comma-separated weights for model1 (foundation + LoRA PTD files)."); +DEFINE_string(weights2, "qwen3_06B.ptd", + "Weights for model2 (foundation PTD file)."); DEFINE_string(tokenizer_path, "tokenizer.model", "Tokenizer."); @@ -59,12 +62,9 @@ static inline std::unique_ptr> _get_default_special_tokens() { auto special_tokens = std::make_unique>(std::vector{ - "<|begin_of_text|>", "<|end_of_text|>", - "<|reserved_special_token_0|>", "<|reserved_special_token_1|>", - "<|finetune_right_pad_id|>", "<|step_id|>", "<|start_header_id|>", - "<|end_header_id|>", "<|eom_id|>", "<|eot_id|>", "<|python_tag|>"}); + "<|endoftext|>", "<|im_start|>", "<|im_end|>"}); // pad the rest of the special tokens with reserved tokens - ssize_t reserved_special_token_num = 2; + ssize_t reserved_special_token_num = 0; while (special_tokens->size() < kSpecialTokensSize) { special_tokens->emplace_back("<|reserved_special_token_" + std::to_string(reserved_special_token_num++) + @@ -72,6 +72,19 @@ _get_default_special_tokens() { } return special_tokens; } + +// Parse comma-separated string into vector of strings +static std::vector parse_data_paths(const std::string& paths) { + std::vector result; + std::stringstream ss(paths); + std::string item; + while (std::getline(ss, item, ',')) { + if (!item.empty()) { + result.push_back(item); + } + } + return result; +} } // namespace int main(int argc, char *argv[]) { @@ -81,7 +94,6 @@ int main(int argc, char *argv[]) { const char *model1 = FLAGS_model1.c_str(); const char *model2 = FLAGS_model2.c_str(); - const char *weights = FLAGS_weights.c_str(); const char *tokenizer_path = FLAGS_tokenizer_path.c_str(); const char *prompt = FLAGS_prompt.c_str(); @@ -103,13 +115,16 @@ int main(int argc, char *argv[]) { return 1; } - // Create runners. + // Create runners with parsed data paths. + std::vector data_files1 = parse_data_paths(FLAGS_weights1); + std::vector data_files2 = parse_data_paths(FLAGS_weights2); + std::unique_ptr runner1 = llm::create_text_llm_runner(model1, std::move(tokenizer1), - weights, temperature); + data_files1, temperature); std::unique_ptr runner2 = llm::create_text_llm_runner(model2, std::move(tokenizer2), - weights, temperature); + data_files2, temperature); llm::GenerationConfig config{ .echo = false, @@ -119,13 +134,10 @@ int main(int argc, char *argv[]) { std::string formatted_prompt = std::string(); if (FLAGS_apply_chat_template) { ET_LOG(Info, "Applying chat template..."); - // System Prompt. - formatted_prompt += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"; - // User Prompt. 
- formatted_prompt += "You are a helpful assistant.<|eot_id|>"; - formatted_prompt += "<|start_header_id|>user<|end_header_id|>\n"; + // Qwen3 chat template format + formatted_prompt += "<|im_start|>user\n"; formatted_prompt += prompt; - formatted_prompt += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"; + formatted_prompt += "<|im_end|>\n<|im_start|>assistant\n"; } else { formatted_prompt += prompt; } diff --git a/program-data-separation/export_lora.sh b/program-data-separation/export_lora.sh index 8f06cd83..ff019323 100644 --- a/program-data-separation/export_lora.sh +++ b/program-data-separation/export_lora.sh @@ -7,48 +7,65 @@ set -exu -python -m pip install torchtune==0.7.0.dev20250730 --extra-index-url https://download.pytorch.org/whl/nightly/cpu - -# Download model artifacts from HF. -DOWNLOADED_PATH=$(python -c " -from huggingface_hub import snapshot_download -path=snapshot_download( - repo_id=\"lucylq/llama3_1B_lora\", -) -import os -print(path) -") - -# Copy over tokenizer, for use at runtime. -cp "${DOWNLOADED_PATH}/tokenizer.model" . - -# Export a non-LoRA model with program-data separated. -DIR="models/" -MODEL="llama_3_2_1B" +# Parse command line arguments. +QUANTIZE=false +while getopts "q" opt; do + case ${opt} in + q) + QUANTIZE=true + ;; + *) + echo "Usage: $0 [-q]" + echo " -q Enable quantization (8da4w, group_size=32)" + exit 1 + ;; + esac +done + +# Install huggingface_hub for downloading model artifacts. +python -m pip install -q huggingface_hub + +# Download LoRA adapter and config. +HF_ADAPTER_REPO="lucylq/qwen3_06B_lora_math" +HF_ADAPTER_PATH=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('${HF_ADAPTER_REPO}'))") +echo "LoRA adapter downloaded to: $HF_ADAPTER_PATH" + +# Download Qwen3-0.6B model. +HF_QWEN_PATH=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('unsloth/Qwen3-0.6B'))") +echo "Model downloaded to: $HF_QWEN_PATH" + +# Output directory. +DIR="models" +mkdir -p "${DIR}" + +# Set model names and quantization args based on -q flag. +SCRIPT_DIR="$(dirname "${BASH_SOURCE[0]}")" +CONFIG="${SCRIPT_DIR}/config/qwen3_xnnpack.yaml" +if [ "$QUANTIZE" = true ]; then + MODEL="qwen3_06B_q" + LORA_MODEL="qwen3_06B_lora_q" + FOUNDATION_WEIGHTS="qwen3_06B_q" + QUANT_ARGS=("+quantization.qmode=8da4w" "+quantization.group_size=32") +else + MODEL="qwen3_06B" + LORA_MODEL="qwen3_06B_lora" + FOUNDATION_WEIGHTS="qwen3_06B" + QUANT_ARGS=() +fi + +# Export a non-LoRA Qwen model with program-data separated. python -m executorch.extension.llm.export.export_llm \ - base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \ - base.params="${DOWNLOADED_PATH}/params.json" \ - base.tokenizer_path="${DOWNLOADED_PATH}/tokenizer.model" \ - model.use_kv_cache=true \ - model.use_sdpa_with_kv_cache=true \ - model.dtype_override="fp32" \ - backend.xnnpack.enabled=true \ - backend.xnnpack.extended_ops=true \ - export.output_name="${DIR}/${MODEL}.pte" \ - export.foundation_weights_file="${DIR}/${MODEL}.ptd" - -# Export a LoRA model, with program and data separated. -LORA_MODEL="llama_3_2_1B_lora" + --config "${CONFIG}" \ + +export.output_name="${DIR}/${MODEL}.pte" \ + +export.foundation_weights_file="${DIR}/${FOUNDATION_WEIGHTS}.ptd" \ + "${QUANT_ARGS[@]}" + +# Export a LoRA Qwen model with program-data separated. 
python -m executorch.extension.llm.export.export_llm \ - base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \ - base.params="${DOWNLOADED_PATH}/params.json" \ - base.adapter_checkpoint="${DOWNLOADED_PATH}/adapter_model.pt" \ - base.adapter_config="${DOWNLOADED_PATH}/adapter_config.json" \ - base.tokenizer_path="${DOWNLOADED_PATH}/tokenizer.model" \ - model.use_kv_cache=true \ - model.use_sdpa_with_kv_cache=true \ - model.dtype_override="fp32" \ - backend.xnnpack.enabled=true \ - backend.xnnpack.extended_ops=true \ - export.output_name="${DIR}/${LORA_MODEL}.pte" \ - export.foundation_weights_file="${DIR}/foundation.ptd" + --config "${CONFIG}" \ + +base.adapter_checkpoint="${HF_ADAPTER_PATH}/adapter_model.safetensors" \ + +base.adapter_config="${HF_ADAPTER_PATH}/adapter_config.json" \ + +export.output_name="${DIR}/${LORA_MODEL}.pte" \ + +export.foundation_weights_file="${DIR}/${FOUNDATION_WEIGHTS}.ptd" \ + +export.lora_weights_file="${DIR}/${LORA_MODEL}.ptd" \ + "${QUANT_ARGS[@]}"
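
A note related to the memory observation in the README above: embedding weights are not lowered to XNNPACK and are duplicated between the LoRA and base models. Below is a minimal sketch of how one might also quantize the embeddings by extending `QUANT_ARGS`. The override name and value are reused from the earlier Llama version of this example; whether it applies unchanged to the Qwen3 export is an assumption to verify.

```bash
# Sketch: also quantize embeddings (4-bit, group size 32) to reduce the
# duplicated-embedding memory. The override is taken from the previous Llama
# example in this README; confirm it is supported for the Qwen3 export and
# whether the leading '+' (hydra append syntax) is needed with this config.
if [ "$QUANTIZE" = true ]; then
  QUANT_ARGS+=("+quantization.embedding_quantize='4,32'")
fi
```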
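
For reference, here is the end-to-end non-quantized flow from the README collected in one place. It only recombines commands that already appear above, with the `_q` suffix dropped from the file names:

```bash
# Export the base and LoRA models (non-quantized), then build and run the demo.
cd ~/executorch-examples/program-data-separation
bash export_lora.sh                # writes the PTE/PTD files into models/

cd cpp/lora_example
bash build_example.sh

# Tokenizer directory downloaded by the export script.
TOKENIZER_PATH=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('unsloth/Qwen3-0.6B'))")

./build/bin/executorch_program_data_separation \
    --tokenizer_path="${TOKENIZER_PATH}" \
    --model1="../../models/qwen3_06B_lora.pte" \
    --weights1="../../models/qwen3_06B.ptd,../../models/qwen3_06B_lora.ptd" \
    --model2="../../models/qwen3_06B.pte" \
    --weights2="../../models/qwen3_06B.ptd" \
    --prompt="Calculate 15% of 80" \
    --apply_chat_template
```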