
bert model split into many layers after int8 quantization #4397

Open · DamonsJ opened this issue Mar 24, 2025 · 8 comments

DamonsJ commented Mar 24, 2025

I first posted this issue at NVIDIA/TensorRT-Model-Optimizer#159.

I quantized a PyTorch BERT model using TensorRT-Model-Optimizer.

Before quantization, I exported the model to TensorRT and the engine contains only one layer:

[Screenshot: engine graph with a single layer]

But after quantization there are many layers. Why is that, and can it be fixed?

[Screenshot: quantized engine graph split into many layers (only part of the layers shown)]

@lix19937

> I export this model to tensorrt and there is only one layer

What is the command you used?

DamonsJ (Author) commented Mar 24, 2025

First I export the PyTorch model to ONNX with torch.onnx.export.

Then I build the TensorRT engine from the ONNX file with the Python API: network_from_onnx_path followed by engine_from_network (roughly as in the sketch below).
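
A minimal sketch of that pipeline, assuming the functional loaders come from Polygraphy; the checkpoint name, input shape, and file names are placeholders rather than details from this issue:

```python
# Sketch of the export path described above. The checkpoint name, input
# shape, and file names are placeholders; network_from_onnx_path and
# engine_from_network are assumed to be Polygraphy's functional loaders.
import torch
from transformers import BertModel
from polygraphy.backend.trt import (
    CreateConfig,
    engine_from_network,
    network_from_onnx_path,
)

model = BertModel.from_pretrained("bert-base-uncased").eval()
input_ids = torch.randint(0, model.config.vocab_size, (1, 128))

# Step 1: PyTorch -> ONNX
torch.onnx.export(model, (input_ids,), "bert.onnx", opset_version=17)

# Step 2: ONNX -> TensorRT engine (FP16 build)
engine = engine_from_network(
    network_from_onnx_path("bert.onnx"),
    config=CreateConfig(fp16=True),
)
```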

@lix19937

> then I export onnx to trt engine using python trt

What flags did you set with config.set_flag()?

DamonsJ (Author) commented Mar 24, 2025

@lix19937

self.config.set_flag(trt.BuilderFlag.FP16)
self.config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
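
For context, these two flags sit in a plain TensorRT Python build roughly like this (a sketch assuming TensorRT 10 and an already-exported bert.onnx; the file paths are placeholders, not taken from this issue):

```python
# Sketch of a plain TensorRT Python build using the two flags above.
# "bert.onnx" and "bert_fp16.engine" are placeholder paths.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()  # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)

with open("bert.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

# Build and serialize the engine
serialized = builder.build_serialized_network(network, config)
with open("bert_fp16.engine", "wb") as f:
    f.write(serialized)
```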

lix19937 commented Mar 24, 2025

Because it is FP16 (not INT8 quantization), the Myelin compiler fuses the shape ops into one node.

DamonsJ (Author) commented Mar 24, 2025

So why can't it compile the INT8 quantized model into one node?

@lix19937

There are two situations:

  • If you use a float32 ONNX model with the FP16 flag (or FP16 + INT8 flags), it is still compiled into a single Myelin node.
  • If you use moq.quantize for INT8 PTQ, the ONNX structure has actually changed (explicit Q/DQ nodes are inserted), so it is not merged into one layer.
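
One way to see that structural change is to count the Q/DQ nodes in the quantized ONNX file (a quick sketch; the file name is a placeholder):

```python
# Sketch: inspect a quantized ONNX file for the explicit QuantizeLinear /
# DequantizeLinear (Q/DQ) nodes that INT8 PTQ inserts. These precision
# boundaries are why TensorRT no longer folds the whole graph into a single
# Myelin node. "bert_int8.onnx" is a placeholder file name.
from collections import Counter

import onnx

graph = onnx.load("bert_int8.onnx").graph
op_counts = Counter(node.op_type for node in graph.node)

print("QuantizeLinear nodes:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear nodes:", op_counts.get("DequantizeLinear", 0))
```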

DamonsJ (Author) commented Mar 25, 2025

OK, I used moq.quantize for INT8 PTQ.
Is there any way to merge the quantized ONNX into one layer?
