
bert model split into many layers after int8 quantization #4397

Open · DamonsJ opened this issue Mar 24, 2025 · 8 comments

DamonsJ commented Mar 24, 2025

I first posted this issue at NVIDIA/TensorRT-Model-Optimizer#159.

I quantized a PyTorch BERT model using TensorRT-Model-Optimizer.

Before quantization, I exported the model to TensorRT and the engine contains only one layer:

[Screenshot: engine graph with a single layer]

But after quantization there are many layers. Why is that, and can it be fixed?

[Screenshot: quantized engine graph split into many layers (only part of the layers shown)]

@lix19937

> I export this model to tensorrt and there is only one layer

What is the command you used?

DamonsJ (Author) commented Mar 24, 2025

First I export the PyTorch model to ONNX with torch.onnx.export.

Then I build the TensorRT engine from the ONNX file with the Python API: network_from_onnx_path followed by engine_from_network (roughly as in the sketch below).
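
A minimal sketch of that pipeline, assuming the functional loaders come from Polygraphy; the checkpoint name, input shape, and file names are placeholders rather than details from this issue:

```python
# Sketch of the export path described above. The checkpoint name, input
# shape, and file names are placeholders; network_from_onnx_path and
# engine_from_network are assumed to be Polygraphy's functional loaders.
import torch
from transformers import BertModel
from polygraphy.backend.trt import (
    CreateConfig,
    engine_from_network,
    network_from_onnx_path,
)

model = BertModel.from_pretrained("bert-base-uncased").eval()
input_ids = torch.randint(0, model.config.vocab_size, (1, 128))

# Step 1: PyTorch -> ONNX
torch.onnx.export(model, (input_ids,), "bert.onnx", opset_version=17)

# Step 2: ONNX -> TensorRT engine (FP16 build)
engine = engine_from_network(
    network_from_onnx_path("bert.onnx"),
    config=CreateConfig(fp16=True),
)
```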

@lix19937

> then I export onnx to trt engine using python trt

What flags did you set with config.set_flag()?

DamonsJ (Author) commented Mar 24, 2025

@lix19937

self.config.set_flag(trt.BuilderFlag.FP16)
self.config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
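
For context, these two flags sit in a plain TensorRT Python build roughly like this (a sketch assuming TensorRT 10 and an already-exported bert.onnx; the file paths are placeholders, not taken from this issue):

```python
# Sketch of a plain TensorRT Python build using the two flags above.
# "bert.onnx" and "bert_fp16.engine" are placeholder paths.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()  # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)

with open("bert.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

# Build and serialize the engine
serialized = builder.build_serialized_network(network, config)
with open("bert_fp16.engine", "wb") as f:
    f.write(serialized)
```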

lix19937 commented Mar 24, 2025

Because it is FP16 (not INT8 quantization), the Myelin compiler fuses the shape ops into one node.

DamonsJ (Author) commented Mar 24, 2025

So why can't it compile the INT8 quantized model into one node?

@lix19937

There are two situations:

  • If you use a float32 ONNX model with the FP16 flag (or FP16 + INT8 flags), it is still compiled into a single Myelin node.
  • If you use moq.quantize for INT8 PTQ, the ONNX structure has actually changed (explicit Q/DQ nodes are inserted), so it is not merged into one layer.
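
One way to see that structural change is to count the Q/DQ nodes in the quantized ONNX file (a quick sketch; the file name is a placeholder):

```python
# Sketch: inspect a quantized ONNX file for the explicit QuantizeLinear /
# DequantizeLinear (Q/DQ) nodes that INT8 PTQ inserts. These precision
# boundaries are why TensorRT no longer folds the whole graph into a single
# Myelin node. "bert_int8.onnx" is a placeholder file name.
from collections import Counter

import onnx

graph = onnx.load("bert_int8.onnx").graph
op_counts = Counter(node.op_type for node in graph.node)

print("QuantizeLinear nodes:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear nodes:", op_counts.get("DequantizeLinear", 0))
```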

DamonsJ (Author) commented Mar 25, 2025

OK, I used moq.quantize for INT8 PTQ.
Is there any way to merge the quantized ONNX into one layer?
