Most deep learning applications use 32-bit floating-point precision for inference, but low-precision data types such as FP8 are gaining attention because of the significant performance boost they offer. A key concern in adopting low precision is mitigating accuracy loss while still meeting predefined accuracy requirements.

Overview
--------

Intel® Neural Compressor aims to address this concern by extending PyTorch with accuracy-driven automatic tuning strategies that help users quickly find the best quantized model on Intel hardware.

- **Ease-of-use API:** Intel® Neural Compressor reuses the PyTorch ``prepare`` and ``convert`` APIs.
- **Accuracy-driven Tuning:** Intel® Neural Compressor supports an accuracy-driven automatic tuning process and provides an ``autotune`` API for this purpose.

Features
--------

- **Kinds of Quantization:** Intel® Neural Compressor supports a variety of quantization methods, including classic INT8 quantization, weight-only quantization, and the popular FP8 quantization. Neural Compressor also provides support for the latest research in simulation work, such as MX data type emulation quantization. For more details, please refer to `Supported Matrix <https://github.com/intel/neural-compressor/blob/master/docs/source/3x/PyTorch.md#supported-matrix>`_.

**Note**: Neural Compressor provides automatic accelerator detection, covering HPU, Intel GPU, CUDA, and CPU. To pin the target device, setting the ``INC_TARGET_DEVICE`` environment variable is suggested, e.g., ``export INC_TARGET_DEVICE=cpu``.

FP8 Quantization
^^^^^^^^^^^^^^^^

**FP8 Quantization** is supported by the Intel® Gaudi® 2 and 3 AI Accelerators (HPU). To prepare the environment, please refer to the `Intel® Gaudi® Documentation <https://docs.habana.ai/en/latest/index.html>`_.

Run the example:
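
The snippet below is a minimal sketch of the FP8 flow, assuming the ``FP8Config``, ``prepare``, ``convert``, and ``finalize_calibration`` APIs from ``neural_compressor.torch.quantization``; the toy model and random calibration data are placeholders, and a configured Gaudi (HPU) environment is assumed.

.. code-block:: python

    import torch
    from neural_compressor.torch.quantization import (
        FP8Config,
        convert,
        finalize_calibration,
        prepare,
    )

    # Toy float model standing in for a real workload (placeholder).
    model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.ReLU()).to("hpu")

    # Configure FP8 (E4M3) quantization and insert observers.
    config = FP8Config(fp8_config="E4M3")
    model = prepare(model, config)

    # Calibrate with representative data (random data here for illustration).
    model(torch.randn(1, 10).to("hpu"))
    finalize_calibration(model)

    # Convert the calibrated model to FP8 and run inference on the HPU.
    model = convert(model)
    output = model(torch.randn(1, 10).to("hpu"))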

Weight-only Quantization
^^^^^^^^^^^^^^^^^^^^^^^^

A GPTQ model from the HuggingFace Model Hub can be loaded directly with the ``load`` API:

.. code-block:: python

    from neural_compressor.torch.quantization import load

    # The model name comes from the HuggingFace Model Hub.
    model_name = "TheBloke/Llama-2-7B-GPTQ"
    model = load(
        model_name_or_path=model_name,
        # ... (additional arguments omitted here) ...
    )

**Note:** Intel Neural Compressor converts the model from the auto-gptq format to the HPU format on the first load and saves ``hpu_model.safetensors`` to the local cache directory for subsequent loads, so the first load may take a while.

Static Quantization with PT2E Backend
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The PT2E path uses ``torch.dynamo`` to capture the eager model into an FX graph model, then inserts the observers and Q/DQ pairs on it. Finally, it uses ``torch.compile`` to perform pattern matching and replace the Q/DQ pairs with optimized quantized operators.
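
Below is a minimal sketch of this flow, assuming the ``export`` helper from ``neural_compressor.torch.export`` and ``StaticQuantConfig``, ``prepare``, and ``convert`` from ``neural_compressor.torch.quantization``; the toy model, example inputs, and calibration data are placeholders.

.. code-block:: python

    import torch
    from neural_compressor.torch.export import export
    from neural_compressor.torch.quantization import StaticQuantConfig, convert, prepare

    # Toy eager model and example inputs (placeholders for a real workload).
    model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.ReLU())
    example_inputs = (torch.randn(1, 10),)

    # Capture the eager model into an FX graph model via torch.dynamo.
    exported_model = export(model=model, example_inputs=example_inputs)

    # Insert observers and Q/DQ pairs according to the static quantization config.
    quant_config = StaticQuantConfig()
    prepared_model = prepare(exported_model, quant_config=quant_config)

    # Calibrate with representative data, then convert to a quantized graph model.
    prepared_model(*example_inputs)
    q_model = convert(prepared_model)

    # torch.compile performs pattern matching and replaces the Q/DQ pairs
    # with optimized quantized operators.
    opt_model = torch.compile(q_model)
    opt_model(*example_inputs)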

Accuracy-driven Tuning
^^^^^^^^^^^^^^^^^^^^^^

To leverage accuracy-driven automatic tuning, a specified tuning space is necessary. ``autotune`` iterates over the tuning space, applies each configuration to the given high-precision model, and records and compares its evaluation result with the baseline. The tuning process stops when the exit policy is met.
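
Below is a minimal sketch, assuming the ``autotune``, ``TuningConfig``, and ``RTNConfig`` APIs from ``neural_compressor.torch.quantization``; ``user_model`` and ``evaluate_accuracy`` are hypothetical placeholders for the high-precision model and the user-defined evaluation routine.

.. code-block:: python

    from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

    def eval_fn(model) -> float:
        # Hypothetical user-defined evaluation; a higher value means better accuracy.
        return evaluate_accuracy(model)

    # Tuning space: candidate weight-only (RTN) configurations to iterate over.
    tune_config = TuningConfig(
        config_set=RTNConfig(use_sym=[False, True], group_size=[32, 128]),
        tolerable_loss=0.01,  # exit policy: stop once accuracy loss is within 1%
        max_trials=4,         # exit policy: stop after at most 4 trials
    )

    # `user_model` is the high-precision model to be tuned (placeholder).
    q_model = autotune(user_model, tune_config=tune_config, eval_fn=eval_fn)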

More detailed tutorials are available in the official Intel® Neural Compressor `documentation <https://intel.github.io/neural-compressor/latest/docs/source/Welcome.html>`_.