Commit 9f3e9b2

Fixed HuggingFace pre-trained model compression deployment example (#1860)
1 parent cbc4d1d commit 9f3e9b2

13 files changed, +608 -23 lines changed

example/auto_compression/pytorch_huggingface/README.md

+63-18
Original file line number | Diff line number | Diff line change
@@ -26,35 +26,36 @@
2626
| bert-base-cased | Base model | 60.06 | 84.31 | 90.68 | 90.84 | 63.53 | 91.63 | 88.46 | 81.35 |
2727
| bert-base-cased | Pruning + distillation + quantization-aware training | 58.69 | 85.05 | 90.74 | 90.42 | 65.34 | 92.08 | 88.22 | 81.51 |
2828

29-
Average accuracy across tasks and the resulting speedup:
30-
| Model | Strategy | Accuracy(avg) | Latency<sup><small>FP32</small><sup><br><sup> | Latency<sup><small>FP16</small><sup><br><sup> | Latency<sup><small>INT8</small><sup><br><sup> | Speedup |
31-
|:-------:|:----------:|:------------:|:------:|:------:|:------:|:------:|
32-
| bert-base-cased | Base model | 81.35 | 195ms | 51.58ms | - | - |
33-
| bert-base-cased | Pruning + quantization-aware training | 81.51 | - | - | 31.33ms | 6.22x |
29+
Accuracy of the bert-base-uncased model before and after compression:
30+
| Model | Strategy | Accuracy(avg) | trt<sup><small>FP32</small><sup><br><sup> | trt<sup><small>FP16</small><sup><br><sup> | trt<sup><small>INT8</small><sup><br><sup> | Speedup | Model download |
31+
|:-------:|:----------:|:------------:|:------:|:------:|:------:|:------:|:------:|
32+
| bert-base-uncased | Base model | 92.66 | 173.00ms | 38.42ms | - | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/infer_model.zip) |
33+
| bert-base-uncased | Pruning + quantization-aware training | 92.31 | - | - | 33.24ms | 5.20x | [Model](https://paddle-slim-models.bj.bcebos.com/act/unsst2.zip) |
3434

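The speedup column in the table above appears to be the base model's FP32 latency divided by the compressed model's INT8 latency. The README does not state the formula, so that reading is an assumption. A minimal check that uses only the latencies from the table:

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


# Latencies in milliseconds, copied from the bert-base-uncased rows.
fp32_base_ms = 173.00        # base model, TensorRT FP32
int8_compressed_ms = 33.24   # pruned + quantized model, TensorRT INT8

print(f"{speedup(fp32_base_ms, int8_compressed_ms):.2f}x")  # prints 5.20x
```

The same division applied to the removed bert-base-cased rows (195 ms and 31.33 ms) gives 6.22x, which matches the old table as well.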
3535
- Nvidia GPU test environment:
3636
- Hardware: a single NVIDIA Tesla T4
37-
- Software: CUDA 11.2, cuDNN 8.1, TensorRT 8.4
37+
- Software: CUDA 11.2, cuDNN 8.1, TensorRT 8.6.1.6
3838
- Test configuration: batch_size: 32, sequence length: 128
39+
- Inference script: [paddle_inference_eval.py](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/example/auto_compression/pytorch_huggingface/paddle_inference_eval.py)
3940

4041
## 3. Auto-compression workflow
4142
#### 3.1 Environment setup
4243
- python >= 3.6
43-
- PaddlePaddle >= 2.3 (install from the [Paddle website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html))
44-
- PaddleSlim >= 2.3
44+
- PaddlePaddle ==2.6 (install from the [Paddle website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html))
45+
- PaddleSlim ==2.6
4546
- X2Paddle develop版本
4647
- transformers >= 4.18.0
47-
- PaddleNLP >= 2.3
48+
- PaddleNLP 2.7.2
4849
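- tensorflow == 1.14 (only needed to compress TensorFlow models)
49-
- onnx >= 1.6.0 (only needed to compress ONNX models)
50-
- torch >= 1.5.0 (only needed to compress PyTorch models)
50+
- onnx 1.15.0 (only needed to compress ONNX models)
51+
- torch 1.13.1 (only needed to compress PyTorch models)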
5152

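Because this commit replaces the open-ended `>=` requirements with exact pins, it can help to verify an environment before running the example. The sketch below is not part of the repository. It checks only two of the pins, and the prefix-matching rule (`2.6` accepts `2.6.0`) is an assumption about how the pins are meant to be read. It relies on `importlib.metadata`, so it needs Python 3.8 or later even though the list above allows 3.6.

```python
from importlib import metadata

# Pins copied from the updated requirement list. PaddlePaddle itself is left
# out because its distribution name differs between CPU and GPU builds.
PINNED = {
    "paddleslim": "2.6",
    "paddlenlp": "2.7.2",
}


def matches_pin(installed: str, pinned: str) -> bool:
    """True when `installed` starts with every dotted component of `pinned`."""
    pin_parts = pinned.split(".")
    return installed.split(".")[:len(pin_parts)] == pin_parts


def check_environment(pins=PINNED):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, matches_pin(installed, pinned))
    return report


for name, (installed, ok) in check_environment().items():
    status = "ok" if ok else "mismatch or missing"
    print(f"{name}: installed={installed} pinned={PINNED[name]} -> {status}")
```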
5253
Install paddlepaddle:
5354
```shell
5455
# CPU
55-
pip install paddlepaddle
56-
# GPU
57-
pip install paddlepaddle-gpu
56+
python -m pip install paddlepaddle==2.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
57+
# GPU (CUDA 11.2 as an example)
58+
python -m pip install paddlepaddle-gpu==2.6.0.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
5859
```
5960

6061
Install paddleslim:
@@ -72,7 +73,7 @@ python setup.py install
7273

7374
Install paddlenlp:
7475
```shell
75-
pip install paddlenlp
76+
pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html
7677
```
7778

7879
Note: PaddleNLP is installed only so that its datasets can be downloaded.
@@ -99,7 +100,7 @@ attention_msk = torch.zeros([batch_size, max_length]).long()
99100
from x2paddle.convert import pytorch2paddle
100101
pytorch2paddle(torch_model,
101102
save_dir='./x2paddle_cola/',
102-
jit_type="trace",
103+
jit_type="trace",
103104
input_examples=[input_ids, attention_msk, token_type_ids])
104105
```
105106
@@ -186,13 +187,36 @@ python run.py --config_path=./configs/cola.yaml --save_dir='./output/cola/'
186187
export CUDA_VISIBLE_DEVICES=0
187188
python run.py --config_path=./configs/cola.yaml --eval True
188189
```
190+
[bert-base-uncased model](https://paddle-slim-models.bj.bcebos.com/act/infer_model.zip)
191+
```shell
192+
export CUDA_VISIBLE_DEVICES=0
193+
python run-uncased.py --config_path=./configs/cola-unsst2.yaml --save_dir='./output/unsst2'
194+
```
189195
190196
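## 4. Inference deployment
191197
192198
Quantized models can be accelerated with TensorRT on GPU and with MKLDNN on CPU.
193199
194-
195-
- TensorRT inference:
200+
The following fields configure the inference parameters: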
201+
202+
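| Parameter | Description |
203+
|:------:|:------:|
204+
| model_path | Directory containing the inference model; it must contain both model.pdmodel and model.pdiparams |
205+
| model_filename | Name of the model file; defaults to model.pdmodel |
206+
| params_filename | Name of the parameters file; defaults to model.pdiparams |
207+
| task_name | Task to run; defaults to cola. It must be one of the tasks in the "METRIC_CLASSES" dictionary |
208+
| model_type | Model type; defaults to bert-base-cased. Specifies the type or architecture of the pre-trained model |
209+
| model_name_or_path | Model directory or name; defaults to bert-base-cased. Accepts a local pre-trained model directory or the name of a HuggingFace pre-trained model |
210+
| device | Inference device; defaults to gpu. Either gpu or cpu |
211+
| batch_size | Inference batch size; defaults to 32 |
212+
| max_seq_length | Maximum input sequence length; defaults to 128. Longer sequences are truncated and shorter ones are padded |
213+
| perf_warmup_steps | Number of warm-up steps for performance testing; defaults to 20. These iterations run before latency is measured so that the timings are stable |
214+
| use_trt | Whether to run inference with TensorRT |
215+
| precision | Inference precision; defaults to fp32. Can be set to fp16 or int8 |
216+
| use_mkldnn | Whether to run inference with MKLDNN; defaults to False. Enables MKL-DNN acceleration for CPU inference |
217+
| cpu_threads | Number of CPU threads; defaults to 10. Applies to CPU inference |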
218+
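To make the defaults in the parameter table concrete, here is a minimal `argparse` sketch that mirrors them. It is an illustration only, not the source of `paddle_inference_eval.py`: marking `model_path` as required and restricting `device` and `precision` with `choices` are assumptions drawn from the table's wording.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose defaults follow the parameter table above."""
    parser = argparse.ArgumentParser(description="Inference parameter sketch")
    parser.add_argument("--model_path", type=str, required=True)  # assumed required
    parser.add_argument("--model_filename", type=str, default="model.pdmodel")
    parser.add_argument("--params_filename", type=str, default="model.pdiparams")
    parser.add_argument("--task_name", type=str, default="cola")
    parser.add_argument("--model_type", type=str, default="bert-base-cased")
    parser.add_argument("--model_name_or_path", type=str, default="bert-base-cased")
    parser.add_argument("--device", choices=["gpu", "cpu"], default="gpu")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--perf_warmup_steps", type=int, default=20)
    parser.add_argument("--use_trt", action="store_true")
    parser.add_argument("--precision", choices=["fp32", "fp16", "int8"], default="fp32")
    parser.add_argument("--use_mkldnn", action="store_true")
    parser.add_argument("--cpu_threads", type=int, default=10)
    return parser


# Parse flags similar to the INT8 command shown later in this README diff.
args = build_parser().parse_args(
    ["--model_path=output/unsst2", "--use_trt", "--precision=int8", "--task_name=sst-2"]
)
print(args.batch_size, args.precision, args.use_trt)  # prints: 32 int8 True
```

Flags that are omitted fall back to the table defaults, which is why `batch_size` is 32 here even though the command line does not set it.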
219+
- Paddle-TensorRT inference:
196220
197221
Environment setup: to use the TensorRT inference engine, install a Paddle build compiled with ```WITH_TRT=ON```. Download: [Python inference library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python)
198222
@@ -202,6 +226,11 @@ wget https://bj.bcebos.com/v1/paddle-slim-models/act/x2paddle_cola_new_calib.tar
202226
tar -xf x2paddle_cola_new_calib.tar
203227
```
204228
229+
```shell
230+
wget https://paddle-slim-models.bj.bcebos.com/act/unsst2.zip
231+
unzip unsst2.zip
232+
```
233+
205234
```shell
206235
python paddle_inference_eval.py \
207236
--model_path=x2paddle_cola_new_calib \
@@ -221,6 +250,22 @@ python paddle_inference_eval.py \
221250
--batch_size=1 \
222251
--precision=int8
223252
```
253+
bert-base-uncased model
254+
```shell
255+
python paddle_inference_eval.py \
256+
--model_path=infer_model \
257+
--use_trt \
258+
--precision=fp32 \
259+
--batch_size=1
260+
```
261+
```shell
262+
python paddle_inference_eval.py \
263+
--model_path=output/unsst2 \
264+
--use_trt \
265+
--precision=int8 \
266+
--batch_size=32 \
267+
--task_name=sst-2
268+
```
224269
225270
226271
@@ -0,0 +1,36 @@
1+
Global:
2+
input_names: ['input_ids', 'token_type_ids']
3+
model_dir: ./infer_model
4+
model_filename: model.pdmodel
5+
params_filename: model.pdiparams
6+
model_type: bert-base-uncased
7+
task_name: sst-2
8+
dataset: glue
9+
batch_size: 1
10+
max_seq_length: 128
11+
padding: max_length
12+
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
TrainConfig:
28+
epochs: 3
29+
eval_iter: 855
30+
learning_rate: 1.0e-6
31+
optimizer_builder:
32+
optimizer:
33+
type: AdamW
34+
weight_decay: 0.01
35+
origin_metric: 0.9266
36+

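The `QuantAware` block in the config above requests 8-bit weights and activations with `moving_average_abs_max` activation scales. The sketch below illustrates the general idea in plain Python. The bias-corrected moving average and the `rate=0.9` value are assumptions chosen for illustration, not a statement of PaddleSlim's exact implementation, and PACT (`use_pact: true`) is not modeled.

```python
class MovingAverageAbsMax:
    """Track a smoothed absolute-maximum scale across batches (illustrative)."""

    def __init__(self, rate: float = 0.9):  # rate is an assumed value
        self.rate = rate
        self.accum = 0.0
        self.state = 0.0

    def update(self, values):
        """Fold one batch into the running scale and return the new scale."""
        current = max(abs(v) for v in values)
        self.accum = self.rate * self.accum + current
        self.state = self.rate * self.state + 1.0
        return self.accum / self.state


def fake_quantize(values, scale: float, bits: int = 8):
    """Quantize to signed `bits`-bit integers, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    out = []
    for v in values:
        q = round(v / scale * qmax)
        q = max(-qmax, min(qmax, q))  # clip values outside the tracked scale
        out.append(q * scale / qmax)
    return out


tracker = MovingAverageAbsMax()
for batch in ([1.0, -2.0], [0.5, 4.0]):
    scale = tracker.update(batch)
print(fake_quantize([0.3, -1.7, 3.9], scale))
```

Values larger than the tracked scale are clipped to it, which is why a smoothed scale is usually preferred over the maximum of a single batch.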
example/auto_compression/pytorch_huggingface/configs/cola.yaml

+14
@@ -10,6 +10,20 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
1327
TrainConfig:
1428
epochs: 3
1529
eval_iter: 855

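The `Distillation` section added in this hunk sets `loss: soft_label` with `alpha: 1.0`. As a rough illustration, a soft-label loss can be written as the cross-entropy between the teacher's softmax output and the student's log-softmax output, with `alpha` acting as the weight on that term. Both the exact formula and the role of `alpha` are assumptions here; the sketch does not reproduce PaddleSlim's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_sum for x in logits]


def soft_label_loss(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against teacher probabilities."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def distillation_term(student_logits, teacher_logits, alpha: float = 1.0):
    """Weight the soft-label loss by alpha (assumed meaning of the config key)."""
    return alpha * soft_label_loss(student_logits, teacher_logits)


teacher = [2.0, 0.0]
print(distillation_term([2.0, 0.0], teacher))  # student agrees with teacher
print(distillation_term([0.0, 2.0], teacher))  # student disagrees: larger loss
```

The loss is smallest when the student reproduces the teacher's distribution, in which case it equals the teacher's entropy.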
example/auto_compression/pytorch_huggingface/configs/mnli.yaml

+15
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 1710

example/auto_compression/pytorch_huggingface/configs/mrpc.yaml

+15
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 915

example/auto_compression/pytorch_huggingface/configs/qnli.yaml

+16-1
@@ -1,6 +1,6 @@
11
Global:
22
input_names: ['x0', 'x1', 'x2']
3-
model_dir: ./x2paddle_qnli
3+
model_dir: xqnli
44
model_filename: model.pdmodel
55
params_filename: model.pdiparams
66
model_type: bert-base-cased
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 855

example/auto_compression/pytorch_huggingface/configs/qqp.yaml

+15
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 855

example/auto_compression/pytorch_huggingface/configs/rte.yaml

+15
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 1240

example/auto_compression/pytorch_huggingface/configs/sst2.yaml

+15
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 3367

example/auto_compression/pytorch_huggingface/configs/stsb.yaml

+15
@@ -10,6 +10,21 @@ Global:
1010
max_seq_length: 128
1111
padding: max_length
1212
return_attention_mask: True
13+
14+
Distillation:
15+
alpha: 1.0
16+
loss: soft_label
17+
18+
QuantAware:
19+
use_pact: true
20+
activation_quantize_type: 'moving_average_abs_max'
21+
weight_bits: 8
22+
activation_bits: 8
23+
quantize_op_types:
24+
- matmul
25+
- matmul_v2
26+
27+
1328
TrainConfig:
1429
epochs: 3
1530
eval_iter: 1710
