Commit 11ed993 (parent f0b9446)

update QUANTIZED_OP.md
1 file changed: QUANTIZED_OP.md (+28 −15)
## Quick Look-Up for Implementations in SNPS Caffe
We support implementations from different frameworks, selected via the layer parameter `quantize_method` when their results are not bit-exact across frameworks. You can also refer to [FEATURES.md](https://github.com/foss-for-synopsys-dwc-arc-processors/synopsys-caffe/blob/development/FEATURES.md#custom-quantization-related) for other quantization-related parameters.
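As a sketch, a layer might select the TFLite-style implementation like this in a prototxt fragment (the field placement and the value string are assumptions, not verified against `caffe.proto` in synopsys-caffe):

```
layer {
  name: "conv1"
  type: "Convolution"
  # Hypothetical value; assumed to select the TFLite-style implementation.
  quantize_method: "TFLite"
}
```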

| `operator` \ `quantize_method` | TFLite | ONNX | Caffe2 |
| ----------- | ------ | ----- | ----- |
| AveragePool | **t** | **o** | **c** |
| BiasAdd | | **o** | |
| Concat | **~** | | |
| Convolution | **t** | **o** | **c** |
| Deconvolution | | | **c** |
| EltwiseSum | **t** | **c** | **c** |
| InnerProduct | **t** | **t** | |
| LeakyReLU | **t** | | |
| Power* | **t** | **o** | **c** |
| ReLU | **~** | **~** | **~** |
| ResizeBilinear | **~** | | |
| Sigmoid | **~** | | |
| Softmax | **~** | | |

We denote the TFLite/ONNXruntime/Caffe2 implementations by **t**/**o**/**c**; a **`~`** entry indicates that the Caffe implementation computes in floating-point representation, for example:

```cpp
// A Dequantize-Op-Quantize procedure, taking ReLU as an example.
float_in = Dequantize(int_in, input_scale, input_zero_point);
float_out = ReLU(float_in);
int_out = Quantize(float_out, output_scale, output_zero_point);
```
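The `Dequantize`/`Quantize` calls in the snippet above are affine transforms between integer codes and real values. A minimal sketch, assuming int8 codes and a per-tensor scale/zero-point (the saturation bounds are my assumption, not taken from the source):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Dequantize: map an integer code back to a real value.
float Dequantize(int32_t q, float scale, int32_t zero_point) {
    return scale * static_cast<float>(q - zero_point);
}

// Quantize: round to the nearest code and saturate to the int8 range.
int32_t Quantize(float x, float scale, int32_t zero_point) {
    int32_t q = static_cast<int32_t>(std::round(x / scale)) + zero_point;
    return std::clamp(q, -128, 127);
}
```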

#### Notes
1. Our model zoo doesn't cover all quantized operators across the frameworks. An entry is left empty if the `(framework, operator)` combination has not been seen yet.
   * A quantized `bias_layer` only occurs in ONNX (we do not support `FC+Bias` fusion yet).
2. Only `Quantize` and `Dequantize` operators are mapped to `Power_layer`.
3. Since some quantized operators have bit-exact results between frameworks, for such entries we adopt the implementation from another framework.
4. `MaxPool` and `ArgMax` are seen, but they behave identically for quantized and floating-point numbers.
5. `Convolution` includes a number of variations; please see the following section.

## Quantized Convolutions
```
out_acc = (scaled_acc + (1 << (31+shift-1))) >> (31+shift-1)
```
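The rounding-then-shift step above is a rounding divide by a power of two: adding half of the divisor before shifting rounds to nearest instead of truncating. A minimal sketch (the helper name is mine; the exact shift amount in SNPS Caffe follows the formula above):

```cpp
#include <cstdint>

// Rounding divide by 2^exponent: add half of the divisor before
// shifting so the quotient is rounded to nearest, not truncated.
int32_t RoundingDivideByPOT(int64_t x, int exponent) {
    const int64_t half = int64_t{1} << (exponent - 1);
    return static_cast<int32_t>((x + half) >> exponent);
}
```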

#### **Pointwise Convolution***
When I tried to match bit-exact results, the combination of `PerTensor-F2` and `PerChannel-D1` was found by brute force.

### ONNX runtime
It casts `<int>acc` to `<float>`, multiplies by `<float>output_multiplier`, then requantizes the result.
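A minimal sketch of this float-rescale requantization, assuming an int8 output range (the helper name and the clamping bounds are my assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// ONNXruntime-style requantization sketch: rescale the integer
// accumulator in float, round to nearest, add the output zero point,
// and saturate to the assumed int8 range.
int32_t RequantizeFloat(int32_t acc, float output_multiplier,
                        int32_t output_zero_point) {
    float scaled = static_cast<float>(acc) * output_multiplier;
    int32_t q = static_cast<int32_t>(std::nearbyint(scaled)) + output_zero_point;
    return std::clamp(q, -128, 127);
}
```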
