
Commit ad8df24

Merge pull request #3 from software-mansion/fix/export-script
Fix export script
2 parents ff66622 + 02b1ca8 · commit ad8df24


5 files changed: +50 -29 lines changed


README.md (+1 -1)

````diff
@@ -17,7 +17,7 @@ https://docs.swmansion.com/react-native-executorch
 
 ## Examples 📲
 
-We currently host a single example demonstrating a chat app built with the latest **LLaMa 3.2 1B/3B** model. If you'd like to run it, navigate to `examples/llama` from the repository root and install the dependencies with:
+We currently host a single example demonstrating a chat app built with the latest **Llama 3.2 1B/3B** model. If you'd like to run it, navigate to `examples/llama` from the repository root and install the dependencies with:
 
 ```bash
 yarn
````
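
The changed README line boils down to two shell steps. As a minimal sketch (only the `examples/llama` path and the `yarn` command come from the README excerpt above; nothing else is implied):

```bash
# From the repository root, as described in the README excerpt above
cd examples/llama
# Install the example app's dependencies
yarn
```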

docs/docs/guides/exporting-llama.mdx (+29 -13)

````diff
@@ -1,23 +1,39 @@
 ---
-title: Exporting LLaMa
+title: Exporting Llama
 sidebar_position: 2
 ---
 
 In order to make the process of export as simple as possible for you, we created a script that runs a Docker container and exports the model.
 
-1. Get a [HuggingFace](https://huggingface.co/) account. This will allow you to download needed files. You can also use the [official LLaMa website](https://www.llama.com/llama-downloads/).
-2. Pick the model that suits your needs. Before you download it, you'll need to accept a license. For best performance, we recommend using Spin-Quant or QLoRA versions of the model:
-- [LLaMa 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/tree/main/original)
-- [LLaMa 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main/original)
-- [LLaMa 3.2 3B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8/tree/main)
-- [LLaMa 3.2 1B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/tree/main)
-- [LLaMa 3.2 3B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8/tree/main)
-- [LLaMa 3.2 1B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8/tree/main)
-3. Download the `consolidated.00.pth`, `params.json` and `tokenizer.model` files. If you can't see them, make sure to check the `original` directory. Sometimes the files might have other names, for example `original_params.json`.
-4. Run `mv tokenizer.model tokenizer.bin`. The library expects the tokenizers to have .bin extension.
-5. Run `./build_llama_binary.sh --model-path /path/to/consolidated.00.pth --params-path /path/to/params.json script that's located in the `llama-export` directory.
-6. The script will pull a Docker image from docker hub, and then run it to export the model. By default the output (llama3_2.pte file) will be saved in the `llama-export/outputs` directory. However, you can override that behavior with the `--output-path [path]` flag.
+## Steps to export Llama
+### 1. Create an Account:
+Get a [HuggingFace](https://huggingface.co/) account. This will allow you to download needed files. You can also use the [official Llama website](https://www.llama.com/llama-downloads/).
 
+### 2. Select a Model:
+Pick the model that suits your needs. Before you download it, you'll need to accept a license. For best performance, we recommend using Spin-Quant or QLoRA versions of the model:
+- [Llama 3.2 3B](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/tree/main/original)
+- [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/tree/main/original)
+- [Llama 3.2 3B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8/tree/main)
+- [Llama 3.2 1B Spin-Quant](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/tree/main)
+- [Llama 3.2 3B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8/tree/main)
+- [Llama 3.2 1B QLoRA](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8/tree/main)
+
+### 3. Download Files:
+Download the `consolidated.00.pth`, `params.json` and `tokenizer.model` files. If you can't see them, make sure to check the `original` directory.
+
+### 4. Rename the Tokenizer File:
+Rename the `tokenizer.model` file to `tokenizer.bin` as required by the library:
+```bash
+mv tokenizer.model tokenizer.bin
+```
+
+### 5. Run the Export Script:
+Navigate to the `llama_export` directory and run the following command:
+```bash
+./build_llama_binary.sh --model-path /path/to/consolidated.00.pth --params-path /path/to/params.json
+```
+
+The script will pull a Docker image from docker hub, and then run it to export the model. By default the output (llama3_2.pte file) will be saved in the `llama-export/outputs` directory. However, you can override that behavior with the `--output-path [path]` flag.
 
 :::note[Note]
 This Docker image was tested on MacOS with ARM chip. This might not work in other environments.
````
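
The last added paragraph mentions that the default output location can be overridden with `--output-path`. A minimal sketch of such an invocation, using the same script as step 5 with placeholder paths (the actual paths depend on where you stored the downloaded files):

```bash
# Placeholder paths: substitute the real locations of your downloaded files.
# --output-path overrides the default llama-export/outputs directory mentioned above.
./build_llama_binary.sh \
  --model-path /path/to/consolidated.00.pth \
  --params-path /path/to/params.json \
  --output-path /path/to/custom-outputs
```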

docs/docs/guides/running-llms.md (+3 -3)

```diff
@@ -3,11 +3,11 @@ title: Running LLMs
 sidebar_position: 1
 ---
 
-React Native ExecuTorch supports LLaMa 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:
+React Native ExecuTorch supports Llama 3.2 models, including quantized versions. Before getting started, you’ll need to obtain the .pte binary—a serialized model—and the tokenizer. There are various ways to accomplish this:
 
 - For your convienience, it's best if you use models exported by us, you can get them from our hugging face repository. You can also use [constants](https://github.com/software-mansion/react-native-executorch/tree/main/src/modelUrls.ts) shipped with our library.
-- If you want to export model by yourself,you can use a Docker image that we've prepared. To see how it works, check out [exporting LLaMa](./exporting-llama.mdx)
-- Follow the official [tutorial](https://github.com/pytorch/executorch/blob/cbfdf78f8/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md) made by ExecuTorch team to build the model and tokenizer yourself
+- If you want to export model by yourself,you can use a Docker image that we've prepared. To see how it works, check out [exporting Llama](./exporting-llama.mdx)
+- Follow the official [tutorial](https://github.com/pytorch/executorch/blob/fe20be98c/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md) made by ExecuTorch team to build the model and tokenizer yourself
 
 ## Initializing
 
```
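
The first bullet points at pre-exported models hosted on the project's Hugging Face repository. A minimal sketch of fetching them with `huggingface-cli`, assuming that tool is installed; the repository id and file names below are placeholders rather than real paths, since the doc itself does not spell them out (the constants in `src/modelUrls.ts` are the authoritative source):

```bash
# Placeholder repository id and file names: check the Hugging Face repository
# referenced above (or the constants in src/modelUrls.ts) for the real values.
huggingface-cli download <org>/<repo> \
  llama3_2.pte tokenizer.bin \
  --local-dir ./models
```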

llama_export/Dockerfile (+1 -1)

```diff
@@ -50,7 +50,7 @@ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 &
 # Install ExecuTorch
 RUN git clone https://github.com/pytorch/executorch.git
 WORKDIR /executorch
-RUN git checkout cbfdf78f8
+RUN git checkout fe20be98c
 RUN git submodule sync
 RUN git submodule update --init
 
```

llama_export/scripts/export_llama.sh (+16 -11)

```diff
@@ -2,29 +2,34 @@
 
 set -eu
 
-# The quantized versions of LLaMa should cointain a quantization_args key in params.json
-QUANTIZED=$(grep "lora_args" /model/params.json)
-
 export_cmd="python -m examples.models.llama.export_llama \
 --checkpoint /model/consolidated.00.pth \
 --params /model/params.json \
 -kv \
 --use_sdpa_with_kv_cache \
 -X \
 -d bf16 \
+--max_seq_length 2048 \
 --metadata '{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}' \
 --output_name=/outputs/llama3_2.pte"
 
-if [ -n "$QUANTIZED" ]; then
+# The quantized versions of Llama should cointain a quantization_args key in params.json
+if grep -q "quantization_args" /model/params.json; then
 export_cmd="${export_cmd//-d bf16/-d fp32}"
 export_cmd+=" \
--qat \
--lora 16 \
---preq_mode 8da4w_output_8da8w \
---preq_group_size 32 \
---max_seq_length 2048 \
---xnnpack-extended-ops \
---preq_embedding_quantize 8,0"
+--preq_mode 8da4w_output_8da8w \
+--preq_group_size 32 \
+--xnnpack-extended-ops \
+--preq_embedding_quantize 8,0"
+
+if grep -q "lora_args" /model/params.json; then
+export_cmd+=" \
+-qat \
+-lora 16"
+else # SpinQuant
+export_cmd+=" \
+--use_spin_quant native"
+fi
 fi
 
 if ! eval "$export_cmd"; then
```
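
To make the behavioral change easier to follow, here is the detection logic from the updated script pulled out on its own. This is a standalone sketch, not part of the commit: a quantized checkpoint's `params.json` is expected to contain `quantization_args`, and the QLoRA variants additionally contain `lora_args`, which is how the script now tells them apart from SpinQuant checkpoints.

```bash
#!/bin/bash
# Standalone sketch of the branching introduced by this commit (not part of the commit itself).
if grep -q "quantization_args" /model/params.json; then
  if grep -q "lora_args" /model/params.json; then
    echo "QLoRA checkpoint: exported with -qat -lora 16"
  else
    echo "SpinQuant checkpoint: exported with --use_spin_quant native"
  fi
else
  echo "Unquantized checkpoint: exported with -d bf16"
fi
```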
