
Commit 9807821

Refine LLM runtime readme (#395)

* Refine LLM runtime readme

Signed-off-by: hshen14 <[email protected]>

1 parent e4e5f05

File tree: 1 file changed, +31 −22 lines

  • intel_extension_for_transformers/llm/runtime/graph

intel_extension_for_transformers/llm/runtime/graph/README.md
Lines changed: 31 additions & 22 deletions
@@ -12,8 +12,8 @@ LLM Runtime is designed to provide the efficient inference of large language mod

## Supported Models

-We support the following models:
-### Text generation models
+LLM Runtime supports the following models:
+### Text Generation
| model name | INT8 | INT4|
|---|:---:|:---:|
|[LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)|||
@@ -27,26 +27,34 @@ We support the following models:
|[OPT-125m](https://huggingface.co/facebook/opt-125m), [OPT-350m](https://huggingface.co/facebook/opt-350m), [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b), [OPT-13B](https://huggingface.co/facebook/opt-13b)|||
|[ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b), [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b)|||

-### Code generation models
+### Code Generation
| model name | INT8 | INT4|
|---|:---:|:---:|
|[Code-LLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf), [Code-LLaMA-13B](https://huggingface.co/codellama/CodeLlama-13b-hf)|||
|[StarCoder-1B](https://huggingface.co/bigcode/starcoderbase-1b), [StarCoder-3B](https://huggingface.co/bigcode/starcoderbase-3b), [StarCoder-15.5B](https://huggingface.co/bigcode/starcoder)|||


-## How to use
+## How to Use

-### 1. Build LLM Runtime
-Linux
+### 1. Install LLM Runtime
+Install from binary:
```shell
+pip install intel-extension-for-transformers
+```
+
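As a quick check that the binary install succeeded, a minimal import test can be run from the shell; this is a sketch, with the import path taken from the Python API example later in this README.

```shell
# Minimal sketch: confirm the package imports after `pip install`
# (import path taken from the Python API example below)
python -c "from intel_extension_for_transformers.transformers import AutoModel; print('LLM Runtime import OK')"
```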
+Build from source:
+```shell
+# Linux
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```
-Windows: install VisualStudio 2022(a validated veresion), search 'Developer PowerShell for VS 2022' and open it, then run the following cmds.
+
```powershell
+# Windows
+# Install Visual Studio 2022 (a validated version) and open 'Developer PowerShell for VS 2022'
mkdir build
cd build
cmake ..
@@ -55,24 +63,24 @@ cmake --build . -j

### 2. Run LLM with Python API

-You can use the python api to simplely run HF model.
+You can simply run a Hugging Face model with the Python API. Here is the sample code:
```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
-model_name = "EleutherAI/gpt-j-6b" # support model id of HF or local PATH to model
+model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model id or local model path
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained(model_name, quantization_config=woq_config)
prompt = "Once upon a time, a little girl"
output = model.generate(prompt, max_new_tokens=30)
```
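To try the snippet above end to end from a shell, a minimal sketch follows; it reuses only the calls shown in the example and assumes the package was installed as in step 1.

```shell
# Minimal sketch: run the Python API example end to end
# (model id, quantization config, and generate() call are copied from the snippet above)
python - <<'EOF'
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained("Intel/neural-chat-7b-v1-1", quantization_config=woq_config)
output = model.generate("Once upon a time, a little girl", max_new_tokens=30)
print(output)  # inspect the generated result
EOF
```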

-### 3. Run LLM with Script
-You can use the following script to run, including convertion, quantization and inference.
+### 3. Run LLM with Python Script
+You can run an LLM with the one-click Python script, which covers conversion, quantization, and inference.
```
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```
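As a variant of the command above, the same one-click flow with int8 weights might look like the sketch below; the local model directory is a placeholder, and both flags appear in the argument table that follows.

```shell
# Sketch: one-click run with int8 weights instead of the int4 default
# (./models/llama2-7b is a placeholder path; flags come from the table below)
python scripts/run.py ./models/llama2-7b --weight_dtype int8 -p "She opened the door and see"
```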

-LLM one-click running script args explanations:
-| arg | explanation |
+Argument description of run.py:
+| Argument | Description |
| -------------- | ----------------------------------------------------------------------- |
| model | directory containing model file or model id |
| --weight_dtype | data type of quantized weight (default: int4) |
@@ -92,10 +100,11 @@ LLM one-click running script args explanations:
| --keep | number of tokens to keep from the initial prompt (default: 0, -1 = all) |


-## Advanced use
+## Advanced Usage
+Besides the one-click script, LLM Runtime also offers separate scripts for each step: 1) convert and quantize, and 2) inference.

-### 1. Convert and Quantize LLM model
-LLM Runtime assumes the same model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps:
+### 1. Convert and Quantize LLM
+LLM Runtime uses a model format compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can convert a model by following the steps below:

```bash

@@ -117,8 +126,8 @@ python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_fil
python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 32 --compute_dtype int8

```
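As a variant of the quantization command above, enabling ggml compatibility might look like the sketch below; the output file name is a placeholder, and `--use_ggml` is documented in the argument table that follows.

```shell
# Sketch: quantize with ggml-compatible output
# (ne-q4_ggml.bin is a placeholder name; --use_ggml is listed in the table below)
python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_ggml.bin --weight_dtype int4 --use_ggml
```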
-quantization args explanations:
-| arg | explanation |
+Argument description of quantize.py:
+| Argument | Description |
| -------------- | ----------------------------------------------------------- |
| --model_file | path to the fp32 model |
| --out_file | path to the quantized model |
@@ -132,9 +141,9 @@ quantization args explanations:
| --use_ggml | enable ggml for quantization and inference |


-### 2. Inference model with C++ script API
+### 2. Inference LLM

-We supply LLM running script to run supported models with c++ api conveniently.
+We provide an inference script to run the quantized model. Please reach out to [us](mailto:[email protected]) if you want to use the C++ API directly.
```bash
# recommend using numactl to bind cores on Intel CPUs for better performance
# if you use a different number of cores, please also change the -t arg value
@@ -148,8 +157,8 @@ OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t 56 --color -p "She opened the door and see" --repeat_penalty 1.2
```
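To run the same inference on a smaller core count, the thread count and core range should be adjusted together, as the comments in the block above advise; the sketch below shows a 32-core variant with all other flags taken from the example.

```shell
# Sketch: same inference bound to 32 cores (adjust -t together with the numactl core range)
OMP_NUM_THREADS=32 numactl -m 0 -C 0-31 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t 32 --color -p "She opened the door and see" --repeat_penalty 1.2
```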

-LLM running script args explanations:
-| arg | explanation |
+Argument description of inference.py:
+| Argument | Description |
| -------------- | ----------------------------------------------------------------------- |
| --model_name | model name |
| -m / --model | path to the executed model |
