Introduce QuantizedType #1352
Conversation
docs/spec.md (outdated):
@@ -507,23 +560,35 @@ Performs element-wise addition of two tensors `lhs` and `rhs` and produces a
* For integers: integer addition.
* For floats: `addition` from IEEE-754.
* For complex numbers: complex addition.
* For quantized types:
Hi. Relating to the discussion in PR#1149, defining the operation in terms of a floating-point scale parameter can be problematic for integer-only processing. An approach that could be consistent with PR#1149 is to define the addition with the restriction that the lhs, rhs and result scales are equal. This would remove the need for scaling in the add. It would lead to separate scaling operations (before or after) to change the scale if this is not the case. These scales could be defined using an integer scaling parameter.
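For illustration only, a minimal sketch of what the add looks like under that restriction (hypothetical helper, per-tensor quantization, names not taken from the spec): when lhs, rhs and result share the same scale and zero point, the identity `(q_lhs - zp) + (q_rhs - zp) = q_out - zp` means no rescaling is needed inside the op.

```python
import numpy as np

def same_scale_add(lhs_q, rhs_q, zero_point, storage_min, storage_max):
    # Assumes lhs, rhs and result all use the same scale and zero point,
    # so the sum only needs a zero-point correction and a clamp.
    acc = lhs_q.astype(np.int32) + rhs_q.astype(np.int32) - zero_point
    return np.clip(acc, storage_min, storage_max).astype(lhs_q.dtype)
```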
In the current version of the PR, we started with the TFLite quantization spec and were looking to get feedback from users on what they would be looking for. Thank you for providing yours! We've also pinged a few other groups to share their thoughts, so how about we keep this thread open for now and then decide?
There are some implementations with an efficient different-scale quantized add as well. Imposing a same-scale requirement and adding extra rescale ops would actually mean slower code in those implementations. Also, such a rescale op either requires a larger intermediate type or risks losing some LSBs for the operand with the smaller scale (i.e. rounding happens twice).
It would be preferable if the semantics of the computation here are compatible with TFLite, since that is widely used across mobile hardware. The existing floating-point type allows for that compatibility. Breaking out the rescaling limits optimizations.
Also, some of our implementations do use integer-based computation for the scale as an implementation detail to avoid floating-point computations, but I don't think that needs to show up in the spec.
Thanks for the comments. A different-scale quantized add implementation could still be used by combining the rescale operations with the add. The rescale would need a larger type to avoid loss of precision, but quantized add implementations will generally need larger precision in the calculation for the same reason. With reference to TFLite, one issue is that since a TFLite flatbuffer contains floating-point scale values, it is hard to read such a flatbuffer on integer-only devices. Defining the operation at the floating-point level does allow for more optimizations, but it also means less consistency of results.
A way forward could be the additional quantized type suggested by @sngyhan and @burmako in the thread below. This would keep the floating-point-scale quantized type, allowing wider optimizations, but also provide a mapping to pure integer behavior. This could ensure that at least a base behavior is defined consistently in a pure-integer way when this quantization type is used. As discussed in the thread below, there can be a pass to convert to this fully integer type, and implementations that benefit from consistency can use it.
Thank you for the feedback! It looks like there's broad consensus among reviewers that exploring an integer-based QuantizedTensorType makes sense, so we opened #1404 and will follow up on it shortly in a separate PR.
docs/spec.md (outdated):
* `storage_min = 0`, `storage_max = 2^d - 1`.
* (C5) For all `i`, `type(scales[i]) = expressed_type`.
* (C6) For all `i`, `scales[i] > 0`.
* (C7) For all `i`, `type(zero_points[i]) = i64`.
Hi. `i64` seems like a large type for the zero point. Is there a separate constraint limiting the zero-point range, for example to the quantized data type, to restrict the range of (value - zero_point) operations? Additionally, I think zero points are mostly used for small data types, such as 8-bit and below. Can zero-point usage be restricted to 8-bit and lower?
Thanks @dominicsymes for your valuable feedback.
Since `zero_point` can fall outside the storage value range, the proposal uses `i64`, a type wider than the storage type (whose width is capped at 32 bits).

> Additionally, I think zero points are most used for small data types – such as 8-bit and below.

That is pretty useful information and can help in restricting the `zero_point` type. I will explore it further and welcome feedback from other reviewers.
I agree that adding smaller types like 8-bit can be useful for optimized kernels. Maybe the constraint can be modified to `type(zero_points[i]) = i64 or type(zero_points[i]) = storage_type`?
Thanks @sdasgup3 and @sngyhan for your reply.
I think the first point is resolved by the comment from @loganchien, who points out that the zero-point range must be restricted to the storage range so the value 0 can be represented. From this I understand that the zero point will be limited to the storage range?
On the second point, I'd like to expand a bit more on the problems of having zero points on larger data types (greater than 8 bits). For example, an si16 data type with an si16 zero point can give a data range of -65535 to +65535. Multiplying two such values will overflow a 32-bit result type, which then has an implementation cost. Since zero_point seems most effective on small data types, which have limited range and precision, I'd suggest not allowing it on larger data types unless there is a clear use case. I think in TFLite the zero-point value for the integer 16-bit quantized type is also set to zero in usage.
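For illustration, the arithmetic behind the overflow concern (not from the PR, just spelling out the numbers):

```python
# With si16 storage and an si16 zero point, (value - zero_point) can span
# [-65535, 65535]; the product of two such terms exceeds a signed 32-bit type.
max_diff = 32767 - (-32768)      # 65535
product = max_diff * max_diff    # 4294836225
print(product > 2**31 - 1)       # True: a 32-bit accumulator overflows
```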
Thanks for the inputs!
In sync with @loganchien's comment that `zero_point` should lie within `storage_min` and `storage_max`, I proposed the following edit in the specification:

* (C7) For all `i`, `storage_min <= zero_points[i] <= storage_max`.
* (C8) For all `i`, `type(zero_points[i]) = storage_type`.

cc @sngyhan
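A small illustrative check of what the proposed (C7)/(C8) imply (hypothetical helper, not spec text): with the zero point stored in the storage type, it necessarily lies within the storage type's value range.

```python
import numpy as np

def zero_points_in_storage_range(zero_points, storage_dtype):
    # (C7): every zero point must lie within the storage type's value range.
    info = np.iinfo(storage_dtype)
    return all(info.min <= zp <= info.max for zp in zero_points)

print(zero_points_in_storage_range([0, -5, 127], np.int8))  # True
print(zero_points_in_storage_range([200], np.int8))         # False
```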
I can't agree that we should treat StableHLO as a compiler IR and optimize for this use-case. StableHLO is a model representation schema that is to be shared by many consumers and producers, most of them not compilers (e.g. NN authoring frameworks, visualization and analysis tools, on-device interpreters). Thus, we should optimize StableHLO for the semantics of the models, and strive to make it impossible to have inconsistent or invalid data in the schema.
In the TFLite quantization schema, we have the following zero-point cases:

- Unsigned 8-bit quantization, both weights and activations: zero point is in the [0, 255] range.
- Signed 8-bit quantization for activations: zero point is in the [-128, 127] range.
- Signed 8-bit quantization for weights: zero point is 0.
- Signed 16-bit quantization for activations: zero point is 0.
- Signed 32-bit quantization for biases: zero point is 0.
- Signed 64-bit quantization for biases: zero point is 0.

Thus, the maximum range for the zero point is [-128, 255], and it could fit into a 16-bit signed integer. I'd be ok with extending it to a 32-bit signed integer, but no further (a 64-bit signed integer has no use and makes no sense). Alternatively, we can constrain the zero point based on the storage type, similarly to how @sdasgup3 suggested:

- For unsigned 8-bit quantized tensors, zero point is unsigned 8-bit.
- For signed 8-bit quantized tensors, zero point is signed 8-bit (must be 0 for Convolution/FullyConnected weights, but it is up to the operators to validate it).
- For signed 16-/32-/40-/48-/64-bit quantized tensors, zero point is 0 (implicit, not stored in the schema).
IMO, an authoring tool/framework doesn't have to use a feature just because the spec can support/express such a feature. The authoring tool (producer) can still limit itself to zero points in [-128, 127] for `i8`/`i16` if the narrower zero-point range can accommodate its needs. Interpreters/validators may have a valid case because they are on the implementor side (consumer). But, like the other parts of the spec, I don't think an interpreter can efficiently cover all features. And we will still have per-op constraints which can be used to narrow down the scope as needed (all multiplicative ops may need a narrower range, which I agree with). For visualization, I don't think it will add constraints to the value range.
My point is not to enact a constraint too early. Maybe someday we will evolve to a stage with a real need in a real model.
My proposed wording is to limit the zero point to the storage type, i.e. `i8` gets [-128, 127], `u8` gets [0, 255], `i16` gets [-32768, 32767], etc. Thus, only the `i64` storage type can get an `i64` zero point.
But I am fine with adding constraints (albeit not preferring them). Actually, the implementation I am working on probably can't support all zero points in all operations either (in case the product overflows, as pointed out by @dominicsymes earlier). My comments are more about design symmetry and my personal view on aesthetics.
Thanks everybody for the comments!
@sngyhan Can you please add your opinion as well?
I am also not in favor of specific constraints based on what we've observed from the TFLite product. IMO, it potentially might limit the adoption of new quantization schemes/kernels in the future, e.g., experimental quantization schemes.
Numerically, to cover all the possible cases, "original value range ⊂ zp range" might be the right constraint.
This is a very lively discussion! Given the variety of proposals, I think that this conversation would benefit from a video meeting. I'll work with @theadactyl to set up a community meeting about quantization, and this will be one of the topics. We've opened #1405 to keep track.
docs/spec.md (outdated):
QuantizationScalesAndZeroPoints ::= (QuantizationScaleAndZero
  | '{' QuantizationScaleAndZero {',' QuantizationScaleAndZero} '}')
QuantizationScaleAndZero ::= QuantizationScale ':' QuantizationZeroPoint
QuantizationScale ::= FloatConstant
@dominicsymes Following up on integer-only processing, what if we had two different versions of QuantizedType - one as proposed in the PR, and another one where the floating-point scale is replaced with an integer multiplier and shift. Would that work for you with respect to #1149?
This idea was proposed by @sngyhan in an internal discussion, and I think that it's pretty neat. I think that it allows implementations to more easily detect different styles of quantization, which is nice. Also, through FP8, we already have a precedent for providing multiple ways of doing similar things through similar types, and this proposal goes along the same lines. What do you think?
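For context, the correspondence usually assumed between the two representations (a convention similar in spirit to TFLite/TOSA-style fixed-point rescaling, not something this PR specifies): the effective scale is the integer multiplier scaled down by a power of two.

```python
# effective_scale ≈ multiplier * 2**(-shift), with the multiplier normalized,
# e.g. into [2**30, 2**31) when it is a 32-bit integer.
multiplier, shift = 1518500250, 31
effective_scale = multiplier * 2.0 ** (-shift)   # ≈ 0.7071
```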
@dominicsymes is out of the office at the moment. I won't speak for him, but I'll add a bit here, and he can add more when he's back.
In my view, it's not a completely bad option. If we implemented this, there should be a pass that was able to convert between the two formats with a fully defined result. If there is agreement on that pass, then all integer based implementations would agree, and an integer based quantized tensor could be converted to a floating-point version.
Hi. Maybe in future we can extend QuantizationScale to accept a type for multiplier/shift to support the case with the defined conversion format between the two.
Thanks for your comments. Having a quantization type with integer multiplier and shift does get part of the way there provided (as @eric-k256 mentions) that there is a defined conversion between the two formats.
For implementations to agree it also needs to be defined how to convert these into actual scales to be applied.
In an element-wise multiply operation, the scale applied after the multiply is the ratio of the product of the input scales to the output scale. Similarly, for the element-wise add in this PR it needs to be defined how the integer scales apply as input and output scales.
An advantage of separating out these scale changes is that they can then be handled in a standard way rather than being defined as part of each operation.
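For readers following along, the standard uniform-quantization algebra being referred to (writing each tensor element as value ≈ scale · (q − zero_point); background, not text from the PR):

```latex
% Element-wise multiply: the rescale applied after the integer multiply is the
% ratio of the product of the input scales to the output scale.
q_z - zp_z \approx \frac{s_x \, s_y}{s_z}\,(q_x - zp_x)(q_y - zp_y)

% Element-wise add: each operand is rescaled by the ratio of its scale to the
% output scale before the sum.
q_{out} - zp_{out} \approx \frac{s_{lhs}}{s_{out}}(q_{lhs} - zp_{lhs})
                          + \frac{s_{rhs}}{s_{out}}(q_{rhs} - zp_{rhs})
```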
I agree that the spec should facilitate both implementations using floating-point and implementations using integer multiplier + shift representations of the scales. IMO, it is easier to achieve this using floating-point scales, because integer multiplier + shift may represent values outside of the single-precision floating-point range (e.g. an integer multiplier has 31/32 bits of precision vs 24 in single-precision floating-point) and can be non-unique, while floating-point values are normalized. That said, we need to impose additional constraints on the floating-point scales:

- They must be strictly positive (> 0.0).
- Scales can't take infinity or NaN values.
- Scales can't be subnormal floating-point numbers. There are two motivations for this constraint: it simplifies conversion between the floating-point and multiplier + shift representations, and it assures compatibility with systems which don't support subnormal numbers, where they would be treated as 0.
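Under the three constraints listed above, a conversion from a floating-point scale to an integer multiplier + shift could look roughly like the following (a sketch in the spirit of existing TFLite-style kernels, not an API from this PR):

```python
import math

def scale_to_multiplier_shift(scale: float):
    """Return (multiplier, shift) such that scale ≈ multiplier * 2**(shift - 31),
    with multiplier a non-negative 32-bit integer. Assumes the scale is strictly
    positive, finite and not subnormal, per the constraints above."""
    assert scale > 0.0 and math.isfinite(scale)
    mantissa, exponent = math.frexp(scale)     # scale = mantissa * 2**exponent, mantissa in [0.5, 1)
    multiplier = round(mantissa * (1 << 31))   # fixed-point mantissa, at most 2**31
    if multiplier == (1 << 31):                # rounding can bump it up to 2**31
        multiplier //= 2
        exponent += 1
    return multiplier, exponent

m, e = scale_to_multiplier_shift(0.0037)
print(m * 2.0 ** (e - 31))                     # ≈ 0.0037
```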
Hi @Maratyszcza,
Thanks for the comments!
We already have (1). I have added (2):

* (C6) For all `i`, `scales[i] > 0`.
* (C7) For all `i`, `is_finite(scales[i]) = true`.

About (3):

> ... assures compatibility with systems which don't support subnormal numbers where they will be treated as 0.

Maybe I did not get this. Won't the constraint be restrictive for systems that do support subnormals?
If this check is relevant when converting a double multiplier (e.g. `S1*S2/S3` in eq. (5) of https://arxiv.org/pdf/1712.05877.pdf) to an integer multiplier/shift, then does it matter if we add constraints to the individual scales? That is, can't the double multiplier still be subnormal even if the individual scales are not?
On a side note for (3): can we explore this constraint when we deal with the "integer based quantized type" in a separate PR (as suggested in #1352 (review))?
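A contrived example of the point in that last question: three individually normal float32 scales whose composite multiplier S1*S2/S3 lands in the subnormal range (illustration only):

```python
import numpy as np

s1 = np.float32(1e-20)    # normal: float32 min normal ≈ 1.18e-38
s2 = np.float32(1e-20)    # normal
s3 = np.float32(1.0)
composite = np.float32(s1 * s2 / s3)            # ≈ 1e-40
print(composite < np.finfo(np.float32).tiny)    # True: the composite multiplier is subnormal
```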
+1 for opening the possibility of both scaling mechanisms.
As discussed above, let's move the work on integer-based QuantizedTensorType to a separate PR, given that there's broad interest in exploring this. We opened #1404 to track this.
LGTM overall.
LGTM, thanks.
Thank you, everyone, for the discussion! Looks like we have alignment on quite a few topics already, so I think it would be beneficial to merge this pull request and continue working out the rest in follow-up PRs.
More specifically, let's keep the definition of QuantizedTensorType, remove the definition of quantized AddOp for now and open tickets to follow up on the topics where we haven't reached consensus yet - #1404, #1405, #1406 and #1407.
From here, we'll be able to explore multiple directions simultaneously, including speccing integer-based QuantizedType, as well as discussing the semantics of quantized elementwise ops and convolutions.
Following up on #1352, this pull request documents the unresolved conversations from that PR review.
## Summary

The PR proposes the specification for the quantized add op.

## A few details

At some point we [decided](#1352 (comment)) to drop the introduction of the specification of this op, mainly because we were unsure about the fate of #1406. Please have a look at my revised proposal on #1406 and let me know if I am missing something. Otherwise, let us review this op and let me know your feedback.

Side note: For those who are already aware of the context of the prior introduction of this op, please note that the current proposal is almost the same as before, except that it does not have any additional constraint imposed by the op's semantics on `storage_min` or `storage_max`.
StableHLO dialect currently supports quantization via:

* `quant.uniform` element types.
* `uniform_quantize`/`uniform_dequantize`.
* `add`/`convolution` to take quantized tensors.

This support was inherited from MHLO when StableHLO was bootstrapped, and
MHLO support was motivated by mobile use cases and inherited from TFLite.
As pointed out in #1149, the StableHLO specification doesn't support quantization
at the moment, and this is an important gap that we would like to fix before
StableHLO v1.0 (see #588).
To continue the discussion started in #1149 and to make progress towards v1.0,
this pull request:
A) Adds QuantizedType to the StableHLO specification, modelled after the TFLite quantization spec.
B) To start a conversation about the applications of QuantizedType and the semantics of quantized ops, proposes semantics for quantized `add`.
The TFLite quantization spec doesn't cover everything. It specifies constraints on types (which we captured accordingly in this pull request), but it doesn't go into describing the semantics of quantized ops.
As a result, the proposed semantics for quantized `add` is intentionally naive, as compared with the much more involved implementations in the TensorFlow repository.
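As a rough reading of "intentionally naive", the proposed semantics amount to dequantize, add in the expressed type, requantize; a sketch with hypothetical helper names, assuming per-tensor si8 quantization:

```python
import numpy as np

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

def quantize(x, scale, zero_point, storage_min=-128, storage_max=127):
    q = np.round(x / scale) + zero_point
    return np.clip(q, storage_min, storage_max).astype(np.int8)

def naive_quantized_add(lhs_q, rhs_q, lhs_qparams, rhs_qparams, out_qparams):
    # qparams = (scale, zero_point); the computation happens in the expressed type.
    result = dequantize(lhs_q, *lhs_qparams) + dequantize(rhs_q, *rhs_qparams)
    return quantize(result, *out_qparams)
```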
In this pull request, we are looking for feedback from the community to
determine the right level of detail for speccing quantized types and quantized
ops. Please let us know!