@@ -12,8 +12,8 @@ LLM Runtime is designed to provide the efficient inference of large language mod

## Supported Models

- We support the following models:
- ### Text generation models
+ LLM Runtime supports the following models:
+ ### Text Generation
| model name | INT8 | INT4 |
| --- | :---: | :---: |
| [LLaMA2-7B](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [LLaMA2-13B](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf), [LLaMA2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | ✅ | ✅ |
@@ -27,26 +27,34 @@ We support the following models:
| [OPT-125m](https://huggingface.co/facebook/opt-125m), [OPT-350m](https://huggingface.co/facebook/opt-350m), [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b), [OPT-13B](https://huggingface.co/facebook/opt-13b) | ✅ | ✅ |
| [ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b), [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b) | ✅ | ✅ |

- ### Code generation models
+ ### Code Generation
| model name | INT8 | INT4 |
| --- | :---: | :---: |
| [Code-LLaMA-7B](https://huggingface.co/codellama/CodeLlama-7b-hf), [Code-LLaMA-13B](https://huggingface.co/codellama/CodeLlama-13b-hf) | ✅ | ✅ |
| [StarCoder-1B](https://huggingface.co/bigcode/starcoderbase-1b), [StarCoder-3B](https://huggingface.co/bigcode/starcoderbase-3b), [StarCoder-15.5B](https://huggingface.co/bigcode/starcoder) | ✅ | ✅ |


- ## How to use
+ ## How to Use

- ### 1. Build LLM Runtime
- Linux
+ ### 1. Install LLM Runtime
+ Install from binary
```shell
+ pip install intel-extension-for-transformers
+ ```
+
+ Build from source
+ ```shell
+ # Linux
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```
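
The Linux build uses the Ninja generator, so both `cmake` and `ninja` need to be on `PATH` before running the commands above.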
- Windows: install VisualStudio 2022(a validated veresion), search 'Developer PowerShell for VS 2022' and open it, then run the following cmds.
+
```powershell
+ # Windows
+ # Install Visual Studio 2022 and open 'Developer PowerShell for VS 2022'
mkdir build
cd build
cmake ..
@@ -55,24 +63,24 @@ cmake --build . -j

### 2. Run LLM with Python API

- You can use the python api to simplely run HF model.
+ You can simply use the Python API to run a Hugging Face model. Here is the sample code:
```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig
- model_name = "EleutherAI/gpt-j-6b"  # support model id of HF or local PATH to model
+ model_name = "Intel/neural-chat-7b-v1-1"  # Hugging Face model_id or local model path
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
model = AutoModel.from_pretrained(model_name, quantization_config=woq_config)
prompt = "Once upon a time, a little girl"
output = model.generate(prompt, max_new_tokens=30)
```
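
The same API accepts other dtype combinations. As a minimal sketch (assuming `WeightOnlyQuantConfig` takes the same `weight_dtype`, `compute_dtype`, and `group_size` values as `scripts/quantize.py` documented below), an INT8-weight variant would look like:

```python
from intel_extension_for_transformers.transformers import AutoModel, WeightOnlyQuantConfig

# Assumption: these arguments mirror the quantize.py flags documented below;
# int8 weights keep more accuracy than int4 at a larger model size.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int8", group_size=32)

model = AutoModel.from_pretrained("Intel/neural-chat-7b-v1-1", quantization_config=woq_config)
output = model.generate("She opened the door and see", max_new_tokens=30)
```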

- ### 3. Run LLM with Script
- You can use the following script to run, including convertion, quantization and inference.
+ ### 3. Run LLM with Python Script
+ You can run an LLM with the one-click Python script, which covers conversion, quantization, and inference.
```
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```

- LLM one-click running script args explanations:
- | arg | explanation |
+ Argument description of run.py:
+ | Argument | Description |
| -------------- | ------------------------------------------------------------------------- |
| model | directory containing model file or model id |
| --weight_dtype | data type of quantized weight (default: int4) |
@@ -92,10 +100,11 @@ LLM one-click running script args explanations:
| --keep | number of tokens to keep from the initial prompt (default: 0, -1 = all) |
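
Both INT4 and INT8 are supported for every model listed above; for an INT8 run, simply pass `--weight_dtype int8`, e.g. `python scripts/run.py model-path --weight_dtype int8 -p "She opened the door and see"`.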


- ## Advanced use
+ ## Advanced Usage
+ Besides the one-click script, LLM Runtime also provides step-by-step scripts for 1) conversion and quantization and 2) inference.

- ### 1. Convert and Quantize LLM model
- LLM Runtime assumes the same model format as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can also convert the model by following the below steps:
+ ### 1. Convert and Quantize LLM
+ LLM Runtime assumes a model format compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp) and [ggml](https://github.com/ggerganov/ggml). You can convert a model by following the steps below:

```bash

@@ -117,8 +126,8 @@ python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_fil
python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 32 --compute_dtype int8

```
- quantization args explanations:
- | arg | explanation |
+ Argument description of quantize.py:
+ | Argument | Description |
| -------------- | ------------------------------------------------------------ |
| --model_file | path to the fp32 model |
| --out_file | path to the quantized model |
@@ -132,9 +141,9 @@ quantization args explanations:
| --use_ggml | enable ggml for quantization and inference |
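
For compatibility with the llama.cpp/ggml ecosystem, quantization can fall back to the ggml format via `--use_ggml`, e.g. `python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_0.bin --weight_dtype int4 --use_ggml` (the output filename here is only illustrative).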


- ### 2. Inference model with C++ script API
+ ### 2. Inference LLM

- We supply LLM running script to run supported models with c++ api conveniently.
+ We provide an LLM inference script to run the quantized model. Please reach [us](mailto:[email protected]) if you want to run with the C++ API directly.
```bash
# recommend to use numactl to bind cores on Intel CPUs for better performance
# if you use different core numbers, please also change -t arg value
@@ -148,8 +157,8 @@ OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t 56 --color -p "She opened the door and see" --repeat_penalty 1.2
```
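
Note that `OMP_NUM_THREADS`, the `numactl -C` core list, and the `-t` value should all describe the same set of physical cores (56 cores on socket 0 in the example above); change the three together when running on a different core count.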

- LLM running script args explanations:
- | arg | explanation |
+ Argument description of inference.py:
+ | Argument | Description |
| -------------- | ------------------------------------------------------------------------- |
| --model_name | model name |
| -m / --model | path to the executed model |