
Introduce QuantizedType #1352

Merged (16 commits, Apr 14, 2023)
Conversation

@sdasgup3 sdasgup3 (Member) commented Mar 24, 2023

StableHLO dialect currently supports quantization via:

  1. Supporting quant.uniform element types.
  2. Having dedicated ops like uniform_quantize / uniform_dequantize.
  3. Allowing regular ops like add / convolution to take quantized tensors.

This support was inherited from MHLO when StableHLO was bootstrapped, and
MHLO support was motivated by mobile use cases and inherited from TFLite.
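
For readers new to the topic, here is a minimal Python sketch of the uniform quantization relationship that quant.uniform element types encode (the helper names are illustrative, not StableHLO ops):

```python
# Illustrative sketch of uniform quantization, not normative spec text.
# A quantized value stores round(x / scale) + zero_point, clamped to the
# storage range; dequantization recovers an approximation of x.

def quantize(x: float, scale: float, zero_point: int,
             storage_min: int = -128, storage_max: int = 127) -> int:
    q = round(x / scale) + zero_point
    return max(storage_min, min(storage_max, q))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

# Example: si8 storage, f32 expressed type, scale 0.1, zero point 3.
q = quantize(2.5, scale=0.1, zero_point=3)   # 25 + 3 = 28
x = dequantize(q, scale=0.1, zero_point=3)   # (28 - 3) * 0.1 ≈ 2.5
```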

As pointed out in #1149, the StableHLO specification doesn't support quantization
at the moment, and this is an important gap that we would like to fix before
StableHLO v1.0 (see #588).

To continue the discussion started in #1149 and to make progress towards v1.0,
this pull request:
A) Adds QuantizedType to the StableHLO specification, modelled after the
TFLite quantization spec.
B) Proposes semantics for quantized add, to start a conversation about the
applications of QuantizedType and the semantics of quantized ops.

The TFLite quantization spec doesn't cover everything. It specifies constraints on
types (which we have captured accordingly in this pull request), but it doesn't
describe the semantics of quantized ops.

As a result, the proposed semantics for quantized add is intentionally naive
compared with the much more involved implementations in the TensorFlow
repository.
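
To make "naive" concrete, the general shape of such a semantics is dequantize → add in the expressed type → requantize. The sketch below is an illustration of that idea in Python, not the exact wording proposed in spec.md:

```python
# Naive quantized add: dequantize both operands to the expressed type, add
# in floating point, then requantize into the result's parameters.
# Parameter names mirror QuantizedType, but this is illustrative code only.

def naive_quantized_add(lhs_q, rhs_q, lhs, rhs, res):
    """lhs/rhs/res are dicts like {"scale": 0.5, "zp": 0, "min": -128, "max": 127}."""
    out = []
    for ql, qr in zip(lhs_q, rhs_q):
        x = (ql - lhs["zp"]) * lhs["scale"]            # dequantize lhs element
        y = (qr - rhs["zp"]) * rhs["scale"]            # dequantize rhs element
        z = round((x + y) / res["scale"]) + res["zp"]  # requantize the sum
        out.append(max(res["min"], min(res["max"], z)))
    return out

p = {"scale": 0.5, "zp": 0, "min": -128, "max": 127}
print(naive_quantized_add([10, 20], [4, 6], p, p, p))  # [14, 26]
```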

In this pull request, we are looking for feedback from the community to
determine the right level of detail for speccing quantized types and quantized
ops. Please let us know!

@sdasgup3 sdasgup3 added the Spec label Mar 24, 2023
@burmako burmako self-requested a review March 24, 2023 02:34
@burmako burmako self-assigned this Mar 24, 2023
docs/spec.md Outdated
@@ -507,23 +560,35 @@ Performs element-wise addition of two tensors `lhs` and `rhs` and produces a
* For integers: integer addition.
* For floats: `addition` from IEEE-754.
* For complex numbers: complex addition.
* For quantized types:

Hi. Relating to the discussion in PR #1149, defining the operation in terms of a floating-point scale parameter can be problematic for integer-only processing. An approach that would be consistent with PR #1149 is to define the addition with the restriction that the lhs, rhs and result scales are equal. This would remove the need for scaling in the add. If the scales are not equal, separate scaling operations (before or after the add) would change the scale, and these could be defined using an integer scaling parameter.
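
For illustration, a sketch of how such an equal-scale add could be written (my reading of the suggestion, not part of the PR):

```python
def same_scale_add(q_lhs, zp_lhs, q_rhs, zp_rhs, zp_res,
                   storage_min=-128, storage_max=127):
    # With equal scales, (q_lhs - zp_lhs)*s + (q_rhs - zp_rhs)*s = (q_res - zp_res)*s,
    # so the scale cancels and only integer arithmetic plus a clamp remain.
    q_res = (q_lhs - zp_lhs) + (q_rhs - zp_rhs) + zp_res
    return max(storage_min, min(storage_max, q_res))

print(same_scale_add(10, 0, 4, 0, 0))  # 14
```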

Contributor

In the current version of the PR, we started with the TFLite quantization spec and are looking for feedback from users on what they need. Thank you for providing yours! We've also pinged a few other groups to share their thoughts, so how about we keep this thread open for now and then decide?

Comment

There are some implementations with efficient different-scale quantized add as well. Imposing a same-scale requirement and adding extra rescale ops would actually mean slower code in those implementations. Also, such a rescale op either requires a larger intermediate type or risks losing some LSBs for the operand with the smaller scale (i.e. rounding happens twice).

Contributor

It would be preferable if the semantics of the computation here are compatible with TFLite, since that is widely used across mobile hardware. The existing floating-point type allows for this compatibility. Breaking out the rescaling limits optimizations.

Also, some of our implementations indeed use int-based computation for the scale as an implementation detail, to avoid float computations. But I don't think it needs to show up in the spec.

Comment

Thanks for the comments. A different-scale quantized add implementation could still be used by combining the rescale operations with the add. The rescale would need a larger type to avoid loss of precision, but quantized add implementations generally need larger precision in the calculation for the same reason anyway. With reference to TFLite, one issue is that, since a TFLite flatbuffer contains floating-point scale values, it is hard to read such a flatbuffer on integer-only devices. Defining the operation at the higher, floating-point level does allow for more optimizations, but it also means less consistency of results.

A way forward could be the additional quantized type suggested by @sngyhan and @burmako in the thread below. This would keep the floating-point-scale quantized type, allowing wider optimizations, but also provide a mapping to pure integer behavior. This could ensure that at least a base behavior is defined consistently in a pure integer way when using this quantization type. As discussed in the thread below, there can be a pass to convert to this fully integer type, and implementations that benefit from consistency can use it.

Contributor

Thank you for the feedback! It looks like there's broad consensus among reviewers that exploring an integer-based QuantizedTensorType makes sense, so we opened #1404 and will follow up on it shortly in a separate PR.

docs/spec.md Outdated
* `storage_min = 0`, `storage_max = 2^d - 1`.
* (C5) For all `i`, `type(scales[i]) = expressed_type`.
* (C6) For all `i`, `scales[i] > 0`.
* (C7) For all `i`, `type(zero_points[i]) = i64`.

Comment

Hi. i64 seems a large type for the zero point. Is there a separate constraint limiting the zero-point range, for example to the quantized data type, to restrict the range of (value - zero_point) operations? Additionally, I think zero points are mostly used for small data types, such as 8-bit and below. Can zero-point usage be restricted to 8-bit and lower?

Member Author

Thanks @dominicsymes for your valuable feedback.

Since zero_point can fall outside the storage value range, the proposal uses i64, a type wider than the storage type, whose maximum width is 32 bits.

Additionally, I think zero points are most used for small data types – such 8-bit and below.

That is pretty useful information and can help in restricting the zero_point type. I will explore it further and welcome feedback from other reviewers.

Comment

I agree that adding smaller types like 8-bit can be useful for optimized kernels. Maybe the constraint can be modified to type(zero_points[i]) = i64 or type(zero_points[i]) = storage_type?

Comment

Thanks @sdasgup3 and @sngyhan for your reply.

I think the first point is resolved by the comment from @loganchien who points out that zero-point range must be restricted to storage range so the value 0 can be represented. From this I understand that zero-point will be limited to storage range?

On the second point, I’d like to expand a bit more the problems of having zero points on larger data types (greater than 8 bit). For example, an si16 data type with an si16 zero point can give data range of -65535 to +65535. Multiplying two such values will overflow a 32-bit result type. This then has an implementation cost. Since zero_point seems most effective on small data types which have limited range and precision, I’d suggest not to allow on larger data types unless there is a clear use case. I think also in TFLite the zero-point value for integer 16-bit quantized type is set to zero in usage.
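
A quick arithmetic check of the overflow concern (worst-case values chosen for illustration):

```python
diff = 32767 - (-32768)        # |value - zero_point| can reach 65535 for si16
product = diff * diff          # 4294836225
print(product > 2**31 - 1)     # True: the product no longer fits in int32
```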

Member Author

Thanks for the inputs!

In sync with @loganchien's comment that zero_point should lie within storage_min and storage_max, I proposed the following edit to the specification:

(C7) For all i, storage_min <= zero_points[i] <= storage_max.
(C8) For all i, type(zero_points[i]) = storage_type.

cc @sngyhan

Comment

I can't agree that we should treat StableHLO as a compiler IR and optimize for this use-case. StableHLO is a model representation schema that is to be shared by many consumers and producers, most of them not compilers (e.g. NN authoring frameworks, visualization and analysis tools, on-device interpreters). Thus, we should optimize StableHLO for the semantics of the models, and strive to make it impossible to have inconsistent or invalid data in the schema.

In TFLite quantization schema, we have the following zero-point cases:

  1. Unsigned 8-bit quantization, both weights and activations: zero point is in [0, 255] range.
  2. Signed 8-bit quantization for activations: zero point is in [-128, 127] range.
  3. Signed 8-bit quantization for weights: zero point is 0.
  4. Signed 16-bit quantization for activations: zero point is 0.
  5. Signed 32-bit quantization for biases: zero point is 0.
  6. Signed 64-bit quantization for biases: zero point is 0.

Thus, the maximum range for zero point is [-128, 255] and it could fit into a 16-bit signed integer. I'd be ok with extending it to 32-bit signed integer, but no further (64-bit signed integer has no use and makes no sense). Alternatively, we can constrain zero point based on the storage type, similarly to how @sdasgup3 suggested:

  1. For unsigned 8-bit quantized tensors, zero point is unsigned 8-bit.
  2. For signed 8-bit quantized tensors, zero point is signed 8-bit (must be 0 for Convolution/FullyConnected weights, but it is up to the operators to validate it).
  3. For signed 16-/32-/40-/48-/64-bit quantized tensors, zero point is 0 (implicit, not stored in the schema).
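
For concreteness, a sketch of how this storage-type-based alternative could be expressed as a validation rule (illustrative only, not the wording in spec.md):

```python
# Zero-point ranges implied by the storage-type-based proposal above
# (illustrative mapping only; wider 40-/48-bit types omitted for brevity).
ZERO_POINT_RANGE = {
    "ui8": (0, 255),
    "si8": (-128, 127),
    "si16": (0, 0),   # zero point must be 0
    "si32": (0, 0),
    "si64": (0, 0),
}

def zero_point_ok(storage_type: str, zero_point: int) -> bool:
    lo, hi = ZERO_POINT_RANGE[storage_type]
    return lo <= zero_point <= hi

print(zero_point_ok("si8", -5))    # True
print(zero_point_ok("si16", 12))   # False
```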

@loganchien loganchien Apr 11, 2023

IMO, an authoring tool/framework doesn't have to use a feature just because the spec can support/express such a feature. The authoring tool (producer) can still limit itself to zero points in [-128, 127] for i8/i16 if the narrower zero-point range accommodates its needs. Interpreters/validators may have a valid case because they are on the implementor side (consumer). But, like the other parts of the spec, I don't think an interpreter can efficiently cover all features. And we will still have per-op constraints which can be used to narrow down the scope as needed (all multiplicative ops may need a narrower range, which I agree with). For visualization, I don't think it adds constraints to the value range.

My point is not to enact a constraint too early. Maybe someday we will evolve to a stage with a real need in a real model.

My proposed wording is to limit the zero point to the storage type, i.e. i8 gets [-128, 127], u8 gets [0, 255], i16 gets [-32768, 32767], etc. Thus, only the i64 storage type can get an i64 zero point.

But I am fine with adding constraints (albeit not my preference). Actually, the implementation I am working on probably can't support all zero points in all operations either (in case the product overflows, as pointed out by @dominicsymes earlier). My comments are more about design symmetry and my personal view on aesthetics.

Member Author

Thanks everybody for the comments!
@sngyhan, can you please add your opinion as well?

Comment

I also prefer not to impose specific constraints based on what we've observed from the TFLite product. IMO, it might limit the adoption of new quantization schemes/kernels in the future, e.g., experimental quantization schemes.

Numerically, to cover all the possible cases, "original value range ⊂ zp range" might be the right constraint.

Contributor

This is a very lively discussion! Given the variety of proposals, I think that this conversation would benefit from a video meeting. I'll work with @theadactyl to set up a community meeting about quantization, and this will be one of the topics. We've opened #1405 to keep track.

@sdasgup3 sdasgup3 force-pushed the spec-quant-types branch 2 times, most recently from 2bbd784 to 1b6f292 Compare March 27, 2023 23:12
docs/spec.md Outdated
QuantizationScalesAndZeroPoints ::= (QuantizationScaleAndZero
| '{' QuantizationScaleAndZero {',' QuantizationScaleAndZero} '}')
QuantizationScaleAndZero ::= QuantizationScale ':' QuantizationZeroPoint
QuantizationScale ::= FloatConstant
@burmako burmako (Contributor) Mar 28, 2023

@dominicsymes Following up on integer only processing, what if we had two different versions of QuantizedType - one as proposed in the PR, and another one where a floating-point scale is replaced with integer multiplier and shift. Would that work for you with respect to #1149?

This idea was proposed by @sngyhan in an internal discussion, and I think that it's pretty neat. I think that it allows implementations to more easily detect different styles of quantization, which is nice. Also, through FP8, we already have a precedent for providing multiple ways of doing similar things through similar types, and this proposal goes along the same lines. What do you think?
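
To illustrate what the integer variant could encode, here is the commonly used way (e.g. in the gemmlowp/TFLite lineage) of approximating a floating-point scale by a normalized integer multiplier and a shift; this is a sketch, not a proposed spec addition:

```python
import math

def quantize_scale(scale: float):
    """Approximate a positive float scale as multiplier * 2**(shift - 31),
    with the multiplier a 31-bit fixed-point integer in [2**30, 2**31)."""
    assert scale > 0.0
    mantissa, shift = math.frexp(scale)       # scale = mantissa * 2**shift, 0.5 <= mantissa < 1
    multiplier = round(mantissa * (1 << 31))  # Q31 fixed-point mantissa
    if multiplier == (1 << 31):               # rounding pushed it up to 2**31
        multiplier //= 2
        shift += 1
    return multiplier, shift

m, s = quantize_scale(0.0037)
print(m, s, m * 2.0 ** (s - 31))  # reconstruction is close to 0.0037
```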

Comment

@dominicsymes is out of the office at the moment. I won't speak for him, but I'll add a bit here, and he can add more when he's back.
In my view, it's not a completely bad option. If we implemented this, there should be a pass that can convert between the two formats with a fully defined result. If there is agreement on that pass, then all integer-based implementations would agree, and an integer-based quantized tensor could be converted to a floating-point version.

Comment

Hi. Maybe in the future we can extend QuantizationScale to also accept a multiplier/shift type, with a defined conversion between the two formats.

Comment

Hi @burmako, @sngyhan,

Thanks for your comments. Having a quantization type with integer multiplier and shift does get part of the way there, provided (as @eric-k256 mentions) that there is a defined conversion between the two formats.
For implementations to agree, it also needs to be defined how to convert these into the actual scales to be applied.
In an element-wise multiply operation, the scale applied after the multiply is the ratio of the product of the input scales to the output scale. Similarly, for the element-wise add in this PR, it needs to be defined how the integer scales apply as input and output scales.
An advantage of separating out these scale changes is that they can then be handled in a standard way rather than being defined as part of each operation.
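
Writing out the rescaling factors implied by that description (a sketch of the relationship, not spec text):

```python
def multiply_rescale(s_lhs: float, s_rhs: float, s_out: float) -> float:
    # Single factor applied after the integer multiply of (q - zp) terms.
    return (s_lhs * s_rhs) / s_out

def add_rescales(s_lhs: float, s_rhs: float, s_out: float):
    # For add, each operand is brought into the output scale before summing.
    return s_lhs / s_out, s_rhs / s_out

print(multiply_rescale(0.02, 0.5, 0.1))  # ≈ 0.1
print(add_rescales(0.02, 0.05, 0.1))     # ≈ (0.2, 0.5)
```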

Comment

I agree that the spec should facilitate both implementations using floating-point scales and implementations using integer multiplier + shift representations of the scales. IMO, it is easier to achieve this using floating-point scales, because integer multiplier + shift may represent values outside of the single-precision floating-point range (e.g. an integer multiplier has 31/32 bits of precision vs 24 in single precision), and can be non-unique, while floating-point values are normalized. That said, we need to impose additional constraints on the floating-point scales:

  1. They must be strictly positive (>0.0).
  2. Scales can't take infinity or NaN values.
  3. Scales can't be subnormal floating-point numbers. There are two motivations for this constraint: it simplifies conversion between the floating-point and multiplier + shift representations, and it assures compatibility with systems which don't support subnormal numbers, where they will be treated as 0.
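
A sketch of what these checks could look like when the expressed type is f32 (constants and names are illustrative):

```python
import math

F32_SMALLEST_NORMAL = 2.0 ** -126   # ~1.18e-38, smallest normal f32 value

def scale_is_valid(scale: float) -> bool:
    # Checks constraints 1-3 above, treating the expressed type as f32.
    return (scale > 0.0
            and math.isfinite(scale)
            and scale >= F32_SMALLEST_NORMAL)

print(scale_is_valid(0.02))           # True
print(scale_is_valid(1e-45))          # False: subnormal as f32
print(scale_is_valid(float("inf")))   # False
```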

Member Author

Hi @Maratyszcza,
Thanks for the comments!

We already have (1), and I have added (2):

* (C6) For all `i`, `scales[i] > 0`.
* (C7) For all `i`, `is_finite(scales[i]) = true`.

About (3), "... assures compatibility with systems which don't support subnormal numbers, where they will be treated as 0":

Maybe I did not get this, but wouldn't the constraint be restrictive for the systems that do support subnormals?
If this check is relevant when converting a combined multiplier (e.g. S1*S2/S3 in eq. (5) of https://arxiv.org/pdf/1712.05877.pdf) to an integer multiplier/shift, then does it matter whether we add constraints to the individual scales? That is, the combined multiplier can still be subnormal even if the individual scales are not.

On a side note: for (3), can we explore this constraint when we deal with the "integer based quantized type" in a separate PR (as suggested in #1352 (review))?
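
A small numeric illustration of that last point, using artificial double-precision values chosen only to make the effect visible:

```python
import sys

s1 = s2 = 1e-160   # perfectly normal double-precision scales
s3 = 1.0
combined = s1 * s2 / s3
print(combined)                        # 1e-320
print(combined < sys.float_info.min)   # True: the combined multiplier is subnormal
```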

Comment

+1 for opening the possibility of both scaling mechanisms.

Contributor

As discussed above, let's move the work on integer-based QuantizedTensorType to a separate PR, given that there's broad interest in exploring this. We opened #1404 to track this.

@sdasgup3 sdasgup3 requested a review from burmako March 30, 2023 01:26
@sngyhan sngyhan left a comment

LGTM overall.

@burmako burmako assigned sdasgup3 and unassigned burmako Apr 6, 2023
@nutsiepully nutsiepully (Contributor) left a comment

LGTM, thanks.

@sdasgup3 sdasgup3 requested a review from burmako April 6, 2023 02:38
@sdasgup3 sdasgup3 force-pushed the spec-quant-types branch 2 times, most recently from 1ab4a44 to a8f3ef1 Compare April 8, 2023 02:07
@burmako burmako (Contributor) left a comment

Thank you, everyone, for the discussion! It looks like we have alignment on quite a few topics already, so I think it would be beneficial to merge this pull request and continue working out the rest in follow-up PRs.

More specifically, let's keep the definition of QuantizedTensorType, remove the definition of quantized AddOp for now and open tickets to follow up on the topics where we haven't reached consensus yet - #1404, #1405, #1406 and #1407.

From here, we'll be able to explore multiple directions simultaneously, including speccing integer-based QuantizedType, as well as discussing the semantics of quantized elementwise ops and convolutions.

@burmako burmako merged commit e83c5e0 into openxla:main Apr 14, 2023
burmako pushed a commit that referenced this pull request Apr 18, 2023
Following up on #1352, this pull request documents the unresolved
conversations from that PR review.
sdasgup3 added a commit that referenced this pull request May 10, 2023
## Summary 

This PR proposes the specification for the quantized add op.

## A few details

At some point we decided (#1352 (comment)) to drop the specification of this op,
mainly because we were unsure about the fate of #1406.
 
Please have a look at my revised proposal on
#1406 and let me know if I am
missing something. Otherwise, let us review this op and let me know your
feedback.

Side note: for those who are already aware of the context of the prior
introduction of this op, please note that the current proposal is almost the
same as before, except that it does not have any additional constraints
imposed by the op's semantics on `storage_min` or `storage_max`.