
Commit 9fe0e73

[fix] Fix the NLP model auto-compression examples (#1835)
1 parent 9dba994 commit 9fe0e73

File tree: 7 files changed, +102 -72 lines

example/auto_compression/nlp/README.md (+30 -18)
@@ -56,16 +56,16 @@
 
 #### 3.1 Prepare the Environment
 - python >= 3.6
-- PaddlePaddle >= 2.4 (can be downloaded and installed from the [Paddle website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html))
-- PaddleSlim >= 2.4
-- PaddleNLP >= 2.3
+- PaddlePaddle ==2.5 (can be downloaded and installed from the [Paddle website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html))
+- PaddleSlim ==2.5
+- PaddleNLP ==2.6
 
 Install paddlepaddle:
 ```shell
 # CPU
-pip install paddlepaddle==2.4.1
+pip install paddlepaddle==2.5.0
-# GPU: for example, Ubuntu with CUDA 11.2
+# GPU: for example, Ubuntu with CUDA 11.6
-python -m pip install paddlepaddle-gpu==2.4.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
+python -m pip install paddlepaddle-gpu==2.5.0.post116 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
 ```
 
 Install paddleslim:
@@ -95,7 +95,6 @@ pip install paddlenlp
 |:------:|:------:|:------:|:------:|:------:|:-----------:|:------:|:------:|
 | PP-MiniLM | [afqmc](https://bj.bcebos.com/v1/paddle-slim-models/act/afqmc.tar) | [tnews](https://bj.bcebos.com/v1/paddle-slim-models/act/tnews.tar) | [iflytek](https://bj.bcebos.com/v1/paddle-slim-models/act/iflytek.tar) | [cmnli](https://bj.bcebos.com/v1/paddle-slim-models/act/cmnli.tar) | [ocnli](https://bj.bcebos.com/v1/paddle-slim-models/act/ocnli.tar) | [cluewsc2020](https://bj.bcebos.com/v1/paddle-slim-models/act/cluewsc.tar) | [csl](https://bj.bcebos.com/v1/paddle-slim-models/act/csl.tar) |
 | ERNIE 3.0-Medium | [afqmc](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/AFQMC.tar) | [tnews](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/TNEWS.tar) | [iflytek](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/IFLYTEK.tar) | [cmnli](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/CMNLI.tar) | [ocnli](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/OCNLI.tar) | [cluewsc2020](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/CLUEWSC2020.tar) | [csl](https://bj.bcebos.com/v1/paddle-slim-models/act/NLP/ernie3.0-medium/fp32_models/CSL.tar) |
-| UIE-base | [reimbursement work orders](https://bj.bcebos.com/v1/paddle-slim-models/act/uie_base.tar) |
 
 Use the model links in the table above, then download the inference model files with the following command:
 
@@ -119,11 +118,6 @@ export CUDA_VISIBLE_DEVICES=0
 python run.py --config_path='./configs/pp-minilm/auto/afqmc.yaml' --save_dir='./save_afqmc_pruned/'
 ```
 
-Auto-compression of the UIE model series is launched with the run_uie.py script, which uses the ```paddleslim.auto_compression.AutoCompression``` interface to compress the model automatically. Set the training parameters in the config file, passing in the task name, model type, dataset name, and compression parameters; once configured, the model can be trained with distillation and quantization.
-```shell
-export CUDA_VISIBLE_DEVICES=0
-python run_uie.py --config_path='./configs/uie/uie_base.yaml' --save_dir='./save_uie_qat/'
-```
 
 To evaluate only the accuracy of the original model, or of the model after compression, set the model folder ```model_dir``` in the config file to the post-compression save path ```./save_afqmc_pruned``` and add ```--eval True``` when launching the ```run.py``` script:
 ```shell
@@ -217,8 +211,6 @@ QuantPost:
 
 - TensorRT inference:
 
-Environment setup: to use the TensorRT inference engine, install a Paddle build compiled with ```WITH_TRT=ON```; download it from the [Python inference library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) page.
-
 First download the quantized model:
 ```shell
 wget https://bj.bcebos.com/v1/paddle-slim-models/act/save_ppminilm_afqmc_new_calib.tar
@@ -227,10 +219,30 @@ tar -xf save_ppminilm_afqmc_new_calib.tar
 
 ```shell
 python paddle_inference_eval.py \
-    --model_path=save_ernie3_afqmc_new_cablib \
+    --model_path=save_ppminilm_afqmc_new_calib \
+    --model_filename=inference.pdmodel \
+    --params_filename=inference.pdiparams \
+    --task_name='afqmc' \
+    --use_trt \
+    --precision=int8
+```
+
+- ERNIE 3.0-Medium:
+```shell
+python paddle_inference_eval.py \
+    --model_path=TNEWS \
     --model_filename=infer.pdmodel \
     --params_filename=infer.pdiparams \
-    --task_name='afqmc' \
+    --task_name='tnews' \
+    --use_trt \
+    --precision=fp32
+```
+```shell
+python paddle_inference_eval.py \
+    --model_path=save_tnews_pruned \
+    --model_filename=infer.pdmodel \
+    --params_filename=infer.pdiparams \
+    --task_name='tnews' \
     --use_trt \
     --precision=int8
 ```
@@ -239,9 +251,9 @@ python paddle_inference_eval.py \
 
 ```shell
 python paddle_inference_eval.py \
-    --model_path=save_ernie3_afqmc_new_cablib \
-    --model_filename=infer.pdmodel \
-    --params_filename=infer.pdiparams \
+    --model_path=save_ppminilm_afqmc_new_calib \
+    --model_filename=inference.pdmodel \
+    --params_filename=inference.pdiparams \
     --task_name='afqmc' \
     --device=cpu \
    --use_mkldnn=True \
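
Note: ```run.py``` (and the removed ```run_uie.py```) are thin wrappers around the ```paddleslim.auto_compression.AutoCompression``` interface mentioned in this README. Below is a minimal sketch of that call; ```load_config``` follows the ACT example scripts, while ```train_dataloader``` and ```eval_function``` are hypothetical placeholders you would build from the CLUE data.

```python
# Sketch only: roughly how run.py drives ACT on the AFQMC config.
# train_dataloader and eval_function are hypothetical placeholders.
from paddleslim.common import load_config
from paddleslim.auto_compression import AutoCompression


def compress(train_dataloader, eval_function):
    ac = AutoCompression(
        model_dir="./afqmc",                   # unpacked inference model
        model_filename="inference.pdmodel",
        params_filename="inference.pdiparams",
        save_dir="./save_afqmc_pruned",
        config=load_config("./configs/pp-minilm/auto/afqmc.yaml"),
        train_dataloader=train_dataloader,     # calibration/training batches
        eval_callback=eval_function)           # reports dev-set accuracy
    ac.compress()
```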

example/auto_compression/nlp/configs/ernie3.0/tnews.yaml (+14 -9)
@@ -6,12 +6,17 @@ Global:
   dataset: clue
   batch_size: 16
   max_seq_length: 128
-TrainConfig:
-  epochs: 6
-  eval_iter: 1110
-  learning_rate: 2.0e-5
-  optimizer_builder:
-    optimizer:
-      type: AdamW
-      weight_decay: 0.01
-  origin_metric: 0.5700
+
+# Pruning
+Prune:
+  prune_algo: transformer_pruner
+  pruned_ratio: 0.25
+
+# Post-training quantization
+QuantPost:
+  activation_bits: 8
+  quantize_op_types:
+  - depthwise_conv2d
+  - conv2d
+  weight_bits: 8
+
example/auto_compression/nlp/configs/pp-minilm/auto/afqmc.yaml (+7 -13)
@@ -6,17 +6,11 @@ Global:
   dataset: clue
   batch_size: 16
   max_seq_length: 128
-TransformerPrune:
-  pruned_ratio: 0.25
-HyperParameterOptimization:
-Distillation:
+
+# Post-training quantization
 QuantPost:
-TrainConfig:
-  epochs: 6
-  eval_iter: 1070
-  learning_rate: 2.0e-5
-  optimizer_builder:
-    optimizer:
-      type: AdamW
-      weight_decay: 0.01
-  origin_metric: 0.7403
+  activation_bits: 8
+  quantize_op_types:
+  - conv2d
+  - depthwise_conv2d
+  weight_bits: 8
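
Note: both YAML rewrites above drop the ```TrainConfig```-driven setup in favor of explicit strategy sections; ACT builds its pipeline from whichever sections are present (here ```Prune``` and/or ```QuantPost```). A quick way to see which strategy sections a config exposes, sketched with plain PyYAML (ACT ships its own config loader; this is illustration only):

```python
# Sketch: list the compression-strategy sections in a rewritten config.
# Uses plain PyYAML for illustration; ACT's own loader does more validation.
import yaml

with open("configs/pp-minilm/auto/afqmc.yaml") as f:
    cfg = yaml.safe_load(f)

# Everything except the Global section names a compression strategy.
strategies = {k: v for k, v in cfg.items() if k != "Global"}
print(strategies)
# Expected for the config above:
# {'QuantPost': {'activation_bits': 8,
#                'quantize_op_types': ['conv2d', 'depthwise_conv2d'],
#                'weight_bits': 8}}
```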

example/auto_compression/nlp/paddle_inference_eval.py (+14 -7)
@@ -91,7 +91,8 @@ def parse_args():
         "--max_seq_length",
         default=128,
         type=int,
-        help="The maximum total input sequence length after tokenization. Sequences longer "
+        help=
+        "The maximum total input sequence length after tokenization. Sequences longer "
         "than this will be truncated, sequences shorter will be padded.", )
     parser.add_argument(
         "--perf_warmup_steps",
@@ -107,7 +108,8 @@ def parse_args():
         type=str,
         default="fp32",
         choices=["fp32", "fp16", "int8"],
-        help="The precision of inference. It can be 'fp32', 'fp16' or 'int8'. Default is 'fp16'.",
+        help=
+        "The precision of inference. It can be 'fp32', 'fp16' or 'int8'. Default is 'fp32'.",
     )
     parser.add_argument(
         "--use_mkldnn",
@@ -156,8 +158,7 @@ def _convert_example(example,
         }
     elif "target" in example:  # wsc
         text, query, pronoun, query_idx, pronoun_idx = (
-            example["text"],
-            example["target"]["span1_text"],
+            example["text"], example["target"]["span1_text"],
             example["target"]["span2_text"],
             example["target"]["span1_index"],
             example["target"]["span2_index"], )
@@ -209,6 +210,12 @@ def create_predictor(cls, args):
         config = paddle.inference.Config(
             os.path.join(args.model_path, args.model_filename),
             os.path.join(args.model_path, args.params_filename))
+        config.switch_ir_debug(True)
+        # For the ERNIE 3.0-Medium model
+        # config.exp_disable_tensorrt_ops(["elementwise_add"])
+        # config.exp_disable_tensorrt_ops(["fused_embedding_eltwise_layernorm"])
+        # config.exp_disable_tensorrt_ops(["tmp_3"])
+
         if args.device == "gpu":
             # set GPU configs accordingly
             config.enable_use_gpu(100, 0)
@@ -239,8 +246,8 @@ def create_predictor(cls, args):
             dynamic_shape_file = os.path.join(args.model_path,
                                               "dynamic_shape.txt")
             if os.path.exists(dynamic_shape_file):
-                config.enable_tuned_tensorrt_dynamic_shape(dynamic_shape_file,
-                                                           True)
+                config.enable_tuned_tensorrt_dynamic_shape(
+                    dynamic_shape_file, True)
                 print("trt set dynamic shape done!")
             else:
                 config.collect_shape_range_info(dynamic_shape_file)
@@ -365,4 +372,4 @@ def main():
 
 if __name__ == "__main__":
     paddle.set_device("cpu")
-    main()
\ No newline at end of file
+    main()
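
Note: the ```create_predictor``` changes above exercise Paddle Inference's two-pass dynamic-shape flow for TensorRT: the first run collects real shape ranges into ```dynamic_shape.txt```, and later runs feed the tuned file back to TensorRT. A self-contained sketch of that flow follows; the model path, batch size, and int8 precision are illustrative assumptions.

```python
# Sketch of the two-pass TensorRT dynamic-shape flow used above.
# First run: record real shape ranges; later runs: reuse the tuned file.
import os
import paddle.inference as paddle_infer

model_dir = "save_ppminilm_afqmc_new_calib"  # from the README example
config = paddle_infer.Config(
    os.path.join(model_dir, "inference.pdmodel"),
    os.path.join(model_dir, "inference.pdiparams"))
config.enable_use_gpu(100, 0)
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=4,
    min_subgraph_size=5,
    precision_mode=paddle_infer.PrecisionType.Int8,
    use_static=True,
    use_calib_mode=False)

shape_file = os.path.join(model_dir, "dynamic_shape.txt")
if os.path.exists(shape_file):
    # Second pass: give TensorRT the tuned min/max/opt shapes.
    config.enable_tuned_tensorrt_dynamic_shape(shape_file, True)
else:
    # First pass: run once on real data to collect shape ranges.
    config.collect_shape_range_info(shape_file)

predictor = paddle_infer.create_predictor(config)
```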

paddleslim/quant/advanced/auto_clip.py (+36 -22)
@@ -21,14 +21,15 @@
 from .metrics import mse_loss
 from paddle.distributed.fleet.meta_parallel import (
     ColumnParallelLinear,
-    RowParallelLinear,
-)
+    RowParallelLinear, )
 __all__ = ['AutoClip']
 
+
 class AutoClip(nn.Layer):
     """
     AutoClip from AWQ[https://arxiv.org/abs/2306.00978]
     """
+
     def __init__(
             self,
             model,
@@ -39,8 +40,7 @@ def __init__(
             n_grid=20,
             max_shrink=0.5,
             n_sample_token=512,
-            group_size=128,
-    ):
+            group_size=128, ):
         super(AutoClip, self).__init__()
         self.model = model
         self.weight_bits = weight_bits
@@ -59,15 +59,17 @@ def __init__(
     def _apply_hook(self):
         self._forward_hook_list = []
         for _, sub_layer in self.model.named_sublayers():
-            if type(sub_layer) in [ColumnParallelLinear, RowParallelLinear, paddle.nn.Linear]:
+            if type(sub_layer) in [
+                    ColumnParallelLinear, RowParallelLinear, paddle.nn.Linear
+            ]:
                 forward_pre_hook_handle = sub_layer.register_forward_pre_hook(
                     self._forward_pre_hook)
                 self._forward_hook_list.append(forward_pre_hook_handle)
 
     def _forward_pre_hook(self, layer, input):
         self._sample_scale(input, layer.full_name())
         return input
-
+
     def _sample_scale(self, input, name):
         input = input[0] if type(input) == tuple else input
         input.stop_gradient = True
@@ -80,7 +82,6 @@ def _sample_scale(self, input, name):
         else:
             self.sampled_inputs[name] = input
 
-
     def auto_clip(self, group_size=128, oc_batch_size=256):
         """
         search clip scale for each layer and update the layer's weight
@@ -89,7 +90,7 @@ def auto_clip(self, group_size=128, oc_batch_size=256):
             name = sub_layer.full_name()
             if name not in self.sampled_inputs or 'out_linear' in sub_name:
                 continue
-
+
             weight = sub_layer.weight.cast('float16')
             weight_t = paddle.transpose(weight, perm=[1, 0])
             x = self.sampled_inputs[name].cast('float16')
@@ -98,33 +99,41 @@ def auto_clip(self, group_size=128, oc_batch_size=256):
             x = x.reshape([1, x.shape[0], -1, group_size])
             x = x[:, 0::x.shape[1] // self.n_sample_token]
             weight_t = weight_t.reshape([weight_t.shape[0], 1, -1, group_size])
-            oc_batch_size = oc_batch_size if weight_t.shape[0] % oc_batch_size == 0 else 128  # prevent OOM
+            oc_batch_size = oc_batch_size if weight_t.shape[
+                0] % oc_batch_size == 0 else 128  # prevent OOM
             assert weight_t.shape[0] % oc_batch_size == 0
 
             w_all = weight_t
             best_max_val_all = []
 
             for i_b in range(weight_t.shape[0] // oc_batch_size):
-                w = w_all[i_b * oc_batch_size: (i_b + 1) * oc_batch_size]
+                w = w_all[i_b * oc_batch_size:(i_b + 1) * oc_batch_size]
 
-                org_max_val = w.abs().max(axis=-1, keepdim=True)  # co, 1, n_group, 1
+                org_max_val = w.abs().max(
+                    axis=-1, keepdim=True)  # co, 1, n_group, 1
                 best_max_val = org_max_val.clone()
                 min_errs = paddle.ones_like(org_max_val, dtype='float16') * 1e9
                 org_out = (x * w).sum(axis=-1)  # co, n_token, n_group
                 for i_s in range(int(self.max_shrink * self.n_grid)):
                     max_val = org_max_val * (1 - i_s / self.n_grid)
                     max_val_tmp = max_val
                     cur_w = paddle.where(w > max_val_tmp, max_val_tmp, w)
-                    cur_w = paddle.where(cur_w < - max_val_tmp, - max_val_tmp, cur_w)
+                    cur_w = paddle.where(cur_w < -max_val_tmp, -max_val_tmp,
+                                         cur_w)
                     org_w_shape = cur_w.shape
-                    cur_w_r = cur_w.reshape([-1, self.group_size]).transpose([1, 0])
-                    quant_dequant_weight = fake_quant(cur_w_r, method='abs_max_channel_wise', weight_bits=4)
-                    quant_dequant_weight = quant_dequant_weight.transpose([1, 0]).reshape(org_w_shape)
+                    cur_w_r = cur_w.reshape([-1,
+                                             self.group_size]).transpose([1, 0])
+                    quant_dequant_weight = fake_quant(
+                        cur_w_r, method='abs_max_channel_wise', weight_bits=4)
+                    quant_dequant_weight = quant_dequant_weight.transpose(
+                        [1, 0]).reshape(org_w_shape)
                     cur_out = (x * quant_dequant_weight).sum(axis=-1)
                     # co, 1, n_group, 1
                     tmp = (cur_out - org_out).detach().clone()
-                    err = paddle.pow(tmp, 2).mean(axis=1).reshape(min_errs.shape)
-                    print('block {} search s {} err {}'.format(i_b, i_s, err.mean().item()))
+                    err = paddle.pow(tmp,
+                                     2).mean(axis=1).reshape(min_errs.shape)
+                    print('block {} search s {} err {}'.format(
+                        i_b, i_s, err.mean().item()))
                     del cur_w, cur_out, quant_dequant_weight, tmp, cur_w_r
                     paddle.device.cuda.empty_cache()
 
@@ -143,16 +152,21 @@ def auto_clip(self, group_size=128, oc_batch_size=256):
                 if 'w_0' in param.name:
                     param_tmp = param.transpose(perm=[1, 0]).cast('float16')
                     tmp_shape = param_tmp.shape
-                    param_tmp = param_tmp.reshape([best_max_val.shape[0], best_max_val.shape[1], -1])
-                    best_max_val = paddle.tile(best_max_val, repeat_times=(1, 1, param_tmp.shape[-1]))
-                    param_tmp = paddle.where(param_tmp > best_max_val, best_max_val, param_tmp)
-                    param_tmp = paddle.where(param_tmp < - best_max_val, - best_max_val, param_tmp)
+                    param_tmp = param_tmp.reshape(
+                        [best_max_val.shape[0], best_max_val.shape[1], -1])
+                    best_max_val = paddle.tile(
+                        best_max_val, repeat_times=(1, 1, param_tmp.shape[-1]))
+                    param_tmp = paddle.where(param_tmp > best_max_val,
+                                             best_max_val, param_tmp)
+                    param_tmp = paddle.where(param_tmp < -best_max_val,
+                                             -best_max_val, param_tmp)
                     param_tmp = param_tmp.reshape(tmp_shape).cast(param.dtype)
                     param_tmp = param_tmp.transpose(perm=[1, 0])
                     paddle.assign(param_tmp, output=param)
                     del param_tmp
                     paddle.device.cuda.empty_cache()
                     break
 
-            del best_max_val, weight_t, x, weight, self.sampled_inputs[name], w_all, best_max_val_all
+            del best_max_val, weight_t, x, weight, self.sampled_inputs[
+                name], w_all, best_max_val_all
             paddle.device.cuda.empty_cache()
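
Note: for orientation, here is a minimal usage sketch of the reformatted ```AutoClip``` pass: construct it over a float16 model, run a few calibration batches so the forward pre-hooks in ```_apply_hook``` can sample per-layer inputs, then call ```auto_clip()``` to search and write back clipped weights. The import path and the calibration loop are assumptions based on this module, not a documented recipe.

```python
# Usage sketch for AutoClip (assumptions: the package re-exports the class,
# the constructor registers the sampling hooks, and `calib_loader` yields
# keyword-arg batches the model accepts).
import paddle
from paddleslim.quant.advanced import AutoClip


def clip_model_weights(model, calib_loader, num_batches=8):
    auto_clip = AutoClip(
        model,
        weight_bits=4,        # matches the 4-bit fake_quant call above
        n_grid=20,
        max_shrink=0.5,
        n_sample_token=512,
        group_size=128)

    # Calibration: the registered forward pre-hooks sample each layer's input.
    with paddle.no_grad():
        for i, batch in enumerate(calib_loader):
            model(**batch)
            if i + 1 >= num_batches:
                break

    # Grid-search clip thresholds per output-channel group, then clip weights.
    auto_clip.auto_clip(group_size=128, oc_batch_size=256)
```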

paddleslim/quant/advanced/gptq.py (+1 -1)
@@ -189,4 +189,4 @@ def fasterquant(self,
 
         self.quantized = True
         del H, Q, Hinv, W, Losses
-        paddle.device.cuda.empty_cache()
\ No newline at end of file
+        paddle.device.cuda.empty_cache()

paddleslim/quant/advanced/piecewise_search.py (+0 -2)
@@ -71,7 +71,6 @@ def search(self, layer_name, sampled_input, act_abs_max, weight):
         origin_out = paddle.matmul(act, weight)
         w_abs_max = weight.abs().max(axis=-1, keepdim=True)
         rw_abs_max = w_abs_max.reshape(act_abs_max.shape)
-
         smooth_scale_out = None
         global_loss = float('inf')
         best_scale = None
@@ -184,5 +183,4 @@ def search(self, layer_name, sampled_input, act_abs_max, weight):
             print('Find Better K-Piece {}'.format(k_piece))
             if not self.search_piece:
                 break
-
         return best_scale
