Commit 0215c63

recipes_source/intel_neural_compressor_for_pytorch.rst translation (#1033)

* Translation: add a translation of recipes_source/intel_neural_compressor_for_pytorch.rst
* Fix: clarify the wording about the latest simulation-based research
* fix: write the translator's name in Hangul
* fix: clarify the model-loading sentence and refine terminology

1 parent d093172 commit 0215c63

File tree: 1 file changed (+67 -53 lines changed)

recipes_source/intel_neural_compressor_for_pytorch.rst
Lines changed: 67 additions & 53 deletions
@@ -1,50 +1,53 @@
-Ease-of-use quantization for PyTorch with Intel® Neural Compressor
+Easy-to-use quantization for PyTorch with Intel® Neural Compressor
 ==================================================================

-Overview
---------
+**Translation**: `정휘수 <https://github.com/Brdy8294>`_

-Most deep learning applications are using 32-bits of floating-point precision for inference. But low precision data types, such as fp8, are getting more focus due to significant performance boost. A key concern in adopting low precision is mitigating accuracy loss while meeting predefined requirements.
+Overview
+--------------

-Intel® Neural Compressor aims to address the aforementioned concern by extending PyTorch with accuracy-driven automatic tuning strategies to help user quickly find out the best quantized model on Intel hardware.
+Most deep learning applications use 32-bit floating-point precision for inference.
+However, low-precision data types such as FP8 are drawing more and more attention because of their significant performance gains.
+The key challenge in adopting low precision is meeting predefined requirements while preserving as much accuracy as possible.

-Intel® Neural Compressor is an open-source project at `Github <https://github.com/intel/neural-compressor>`_.
+Intel® Neural Compressor addresses this problem by extending PyTorch with accuracy-driven auto-tuning strategies, helping users easily find the best-suited quantized model on Intel hardware.

-Features
---------
+Intel® Neural Compressor is an open-source project, available on `Github <https://github.com/intel/neural-compressor>`_.

-- **Ease-of-use API:** Intel® Neural Compressor is re-using the PyTorch ``prepare``, ``convert`` API for user usage.
-
-- **Accuracy-driven Tuning:** Intel® Neural Compressor supports accuracy-driven automatic tuning process, provides ``autotune`` API for user usage.
+Features
+---------------

-- **Kinds of Quantization:** Intel® Neural Compressor supports a variety of quantization methods, including classic INT8 quantization, weight-only quantization and the popular FP8 quantization. Neural compressor also provides the latest research in simulation work, such as MX data type emulation quantization. For more details, please refer to `Supported Matrix <https://github.com/intel/neural-compressor/blob/master/docs/source/3x/PyTorch.md#supported-matrix>`_.
+- Ease-of-use API: reuses the PyTorch ``prepare`` and ``convert`` APIs, so it is easy to apply.
+- Accuracy-driven Tuning: supports an accuracy-driven automatic tuning process and provides the ``autotune`` API.
+- Kinds of Quantization: supports classic INT8 quantization, weight-only quantization, and FP8 quantization.
+  The latest simulation-based research, such as MX data type emulation quantization, is also included.
+  For details, see the `Supported Matrix <https://github.com/intel/neural-compressor/blob/master/docs/source/3x/PyTorch.md#supported-matrix>`_.

-Getting Started
----------------
+Getting Started
+---------------------------

-Installation
-~~~~~~~~~~~~
+Installation
+~~~~~~~~~~~~~~~~~~

 .. code:: bash

-    # install stable version from pip
+    # Install the stable version via pip
     pip install neural-compressor-pt
 ..

-**Note**: Neural Compressor provides automatic accelerator detection, including HPU, Intel GPU, CUDA, and CPU. To specify the target device, ``INC_TARGET_DEVICE`` is suggested, e.g., ``export INC_TARGET_DEVICE=cpu``.
-
-
-Examples
-~~~~~~~~~~~~
+Note: Neural Compressor automatically detects accelerators such as HPU, Intel GPU, CUDA, and CPU.
+To target a specific device, use the ``INC_TARGET_DEVICE`` environment variable (e.g., ``export INC_TARGET_DEVICE=cpu``).

-This section shows examples of kinds of quantization with Intel® Neural compressor
+Examples
+~~~~~~~~~~~~~~

-FP8 Quantization
-^^^^^^^^^^^^^^^^
+This section shows examples of performing several kinds of quantization with Intel® Neural Compressor.

-**FP8 Quantization** is supported by Intel® Gaudi®2&3 AI Accelerator (HPU). To prepare the environment, please refer to `Intel® Gaudi® Documentation <https://docs.habana.ai/en/latest/index.html>`_.
+FP8 Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Run the example,
+FP8 quantization is supported on the Intel® Gaudi® 2 and 3 AI Accelerator (HPU).
+To set up the environment, see the `Intel® Gaudi® Documentation <https://docs.habana.ai/en/latest/index.html>`_.

 .. code-block:: python
@@ -58,37 +61,38 @@ Run the example,
     import torch
     import torchvision.models as models

-    # Load a pre-trained ResNet18 model
+    # Load a pretrained ResNet18 model
     model = models.resnet18()

-    # Configure FP8 quantization
+    # Configure FP8 quantization
     qconfig = FP8Config(fp8_config="E4M3")
     model = prepare(model, qconfig)

-    # Perform calibration (replace with actual calibration data)
+    # Perform calibration (replace with real calibration data)
     calibration_data = torch.randn(1, 3, 224, 224).to("hpu")
     model(calibration_data)

-    # Convert the model to FP8
+    # Convert to an FP8 model
     model = convert(model)

-    # Perform inference
+    # Perform inference
     input_data = torch.randn(1, 3, 224, 224).to("hpu")
     output = model(input_data).to("cpu")
     print(output)

 ..

-Weight-only Quantization
-^^^^^^^^^^^^^^^^^^^^^^^^
+Weight-only Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-**Weight-only Quantization** is also supported on Intel® Gaudi®2&3 AI Accelerator. The quantized model could be loaded as below.
+Weight-only quantization is also supported on the Intel® Gaudi® 2 and 3 AI Accelerator.
+A quantized model can be loaded as follows.

 .. code-block:: python

     from neural_compressor.torch.quantization import load

-    # The model name comes from HuggingFace Model Hub.
+    # The model name comes from the HuggingFace Model Hub.
     model_name = "TheBloke/Llama-2-7B-GPTQ"
     model = load(
         model_name_or_path=model_name,
@@ -98,45 +102,55 @@ Weight-only Quantization
     )
 ..

-**Note:** Intel Neural Compressor will convert the model format from auto-gptq to hpu format on the first load and save hpu_model.safetensors to the local cache directory for the next load. So it may take a while to load for the first time.
+Note: on the first load, Intel Neural Compressor converts the model from the auto-gptq format to the HPU format
+and saves an `hpu_model.safetensors` file to the local cache for subsequent loads.
+The first load may therefore take a while.

-Static Quantization with PT2E Backend
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Static Quantization with PT2E Backend
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The PT2E path uses ``torch.dynamo`` to capture the eager model into an FX graph model, and then inserts the observers and Q/QD pairs on it. Finally it uses the ``torch.compile`` to perform the pattern matching and replace the Q/DQ pairs with optimized quantized operators.
+The PT2E path captures the eager model into an FX graph model with ``torch.dynamo``,
+then inserts observers and Q/DQ pairs on it.
+Finally, ``torch.compile`` performs pattern matching and replaces the Q/DQ pairs with optimized quantized operators.

-There are four steps to perform W8A8 static quantization with PT2E backend: ``export``, ``prepare``, ``convert`` and ``compile``.
+W8A8 static quantization is a four-step procedure: ``export`` → ``prepare`` → ``convert`` → ``compile``.

 .. code-block:: python

     import torch
     from neural_compressor.torch.export import export
     from neural_compressor.torch.quantization import StaticQuantConfig, prepare, convert

-    # Prepare the float model and example inputs for export model
+    # Prepare the float model and example inputs
     model = UserFloatModel()
     example_inputs = ...

-    # Export eager model into FX graph model
+    # Export the eager model into an FX graph model
     exported_model = export(model=model, example_inputs=example_inputs)
-    # Quantize the model
+
+    # Quantize the model
     quant_config = StaticQuantConfig()
     prepared_model = prepare(exported_model, quant_config=quant_config)
-    # Calibrate
+
+    # Calibrate
     run_fn(prepared_model)
+
     q_model = convert(prepared_model)
-    # Compile the quantized model and replace the Q/DQ pattern with Q-operator
+
+    # Compile the quantized model, replacing the Q/DQ pattern with the Q-operator
     from torch._inductor import config

     config.freezing = True
     opt_model = torch.compile(q_model)
 ..

-Accuracy-driven Tuning
-^^^^^^^^^^^^^^^^^^^^^^
-
-To leverage accuracy-driven automatic tuning, a specified tuning space is necessary. The ``autotune`` iterates the tuning space and applies the configuration on given high-precision model then records and compares its evaluation result with the baseline. The tuning process stops when meeting the exit policy.
+Accuracy-driven Tuning
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+To use accuracy-driven automatic tuning, a tuning space must be specified.
+``autotune`` iterates over the tuning space, applies each configuration to the given high-precision model,
+and records its evaluation results, comparing them against the baseline.
+Tuning stops when the exit policy is met.

 .. code-block:: python
@@ -155,7 +169,7 @@ To leverage accuracy-driven automatic tuning, a specified tuning space is necess
     q_model = autotune(model, tune_config=tune_config, eval_fn=eval_fn)
 ..

-Tutorials
----------
+Tutorials
+-------------------

-More detailed tutorials are available in the official Intel® Neural Compressor `doc <https://intel.github.io/neural-compressor/latest/docs/source/Welcome.html>`_.
+More detailed tutorials are available on the official Intel® Neural Compressor documentation `site <https://intel.github.io/neural-compressor/latest/docs/source/Welcome.html>`_.

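Next, a self-contained sketch of the PT2E static-quantization flow. The diff leaves ``UserFloatModel``, ``example_inputs``, and ``run_fn`` as placeholders, so this fills them in with a toy two-layer network and a random-input calibration loop; it is a sketch under those assumptions, not the tutorial's own code.

.. code-block:: python

    import torch
    from neural_compressor.torch.export import export
    from neural_compressor.torch.quantization import StaticQuantConfig, prepare, convert

    class ToyModel(torch.nn.Module):
        """Stand-in for the tutorial's UserFloatModel placeholder."""
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(16, 32)
            self.fc2 = torch.nn.Linear(32, 8)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    model = ToyModel()
    example_inputs = (torch.randn(4, 16),)

    # export -> prepare -> convert -> compile, as in the diff
    exported_model = export(model=model, example_inputs=example_inputs)
    prepared_model = prepare(exported_model, quant_config=StaticQuantConfig())

    # Calibration: feed a few representative batches through the prepared model
    for _ in range(8):
        prepared_model(torch.randn(4, 16))

    q_model = convert(prepared_model)

    from torch._inductor import config
    config.freezing = True
    opt_model = torch.compile(q_model)
    print(opt_model(*example_inputs).shape)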
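Finally, the last hunk shows only the tail of the accuracy-driven tuning example. Below is a minimal sketch of what the elided setup might look like, assuming the Neural Compressor 3.x ``TuningConfig``/``RTNConfig`` API; class names and parameters may vary by version, and the constant ``eval_fn`` exists only to make the sketch runnable.

.. code-block:: python

    import torch
    from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

    model = torch.nn.Sequential(
        torch.nn.Linear(16, 16),
        torch.nn.ReLU(),
        torch.nn.Linear(16, 4),
    )

    def eval_fn(model) -> float:
        # A real implementation would run a validation set and return accuracy;
        # a constant stands in here so the sketch executes.
        return 1.0

    # Tuning space: try symmetric vs. asymmetric weight-only (RTN) quantization.
    tune_config = TuningConfig(config_set=[RTNConfig(use_sym=[False, True])], max_trials=2)

    # autotune walks the space, evaluates each candidate against the baseline,
    # and returns a model that satisfies the exit policy.
    q_model = autotune(model, tune_config=tune_config, eval_fn=eval_fn)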