Commit 2e4d306

update tutorial for 1.12 release (#942)
* update the runtime document
* update known issue of runtime extension
* [doc] update supported fusion patterns of fp32/bf16/int8 (#854)
  * update supported fusion patterns of fp32/bf16/int8
  * fix typo
* doc: editor review of all tutorial docs (#863)
  - Lots of edits across the tutorial documents for grammar, clarity, simplification, and spelling
  - Fixed malformed md and rst causing layout issues (including indenting)
  - Removed trailing whitespace
  - Fixed UTF-8 characters in code examples (e.g., curly quotes vs. straight quotes)
  - Changed pygments language (code highlight) to bash for unsupported cmd
  - Changed absolute links to relative where appropriate
  - Added toctree items to make documents visible in navigation menu
  Signed-off-by: David B. Kinder <[email protected]>
* update docs
* update int8.md
* update performance page with tunable parameters description
* update int8 example
* update torch-ccl package name
* update version in README
* update int8.md: change customer qconfig for dynamic quantization
* Add performance tuning guide for OneDNN primitive cache (#905)
  * Add performance tuning guide for OneDNN primitive cache
  * Update docs/tutorials/performance_tuning/tuning_guide.md
    Co-authored-by: Jiong Gong <[email protected]>
  * Update tuning_guide.md
    Co-authored-by: Jiong Gong <[email protected]>
* update doc for autocast (#899)
* add 2 known issues of MultiStreamModule
* update known issues
* update known issues
* update int8 doc
* add 1.12 release notes
* correct intel_extension_for_pytorch_structure.png
* update release notes, correct model zoo url in examples
* update docs
* update docs
* update graph_optimization.md
1 parent 1f633c0 commit 2e4d306

26 files changed: +1458 −690 lines

docs/design_doc/isa_dyndisp.md

+37 −33
@@ -1,10 +1,10 @@
-# IPEX CPU ISA Dynamic Dispatch Design Doc
+# Intel® Extension for PyTorch\* CPU ISA Dynamic Dispatch Design Doc
 
-This document explains the dynamic kernel dispatch mechanism based on CPU ISA. It is an extension to the similar mechanism in PyTorch.
+This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (IPEX) based on the CPU ISA. It is an extension of the similar mechanism in PyTorch.
 
 ## Overview
----
-IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. Besides that, IPEX add more CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`.
+
+IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. IPEX adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16`, and `AMX`.
 
 PyTorch & IPEX CPU ISA support statement:
 | | DEFAULT | AVX2 | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX |
@@ -23,19 +23,19 @@ PyTorch & IPEX CPU ISA support statement:
 | AVX512_BF16 | GCC 10.3+ |
 | AMX | GCC 11.2+ |
 
-\* Detailed compiler check, please check with `cmake/Modules/FindAVX.cmake`
+\* Check with `cmake/Modules/FindAVX.cmake` for detailed compiler checks.
 
 ## Dynamic Dispatch Design
----
-Dynamic dispatch major mechanism is to copy the kernel implementation source file to multiple folders for each ISA level. And then build each file using its ISA specific parameters. Each generated object file will contains its function body(**Kernel Implementation**).
 
-Kernel Implementation use anonymous namespace so that different cpu versions won't conflict.
+Dynamic dispatch copies the kernel implementation source files to multiple folders, one for each ISA level. It then builds each file using its ISA-specific parameters. Each generated object file will contain its function body (**Kernel Implementation**).
 
-**Kernel Stub** is a "virtual function" with polymorphic kernel implementations w.r.t. ISA levels.
+Kernel Implementation uses an anonymous namespace so that different CPU versions won't conflict.
 
-At the runtime, **Dispatch Stub implementation** will check CPUIDs and OS status to determins which ISA level pointer to best matching function body.
+**Kernel Stub** is a "virtual function" with polymorphic kernel implementations pertaining to ISA levels.
 
-### Code Folder Struct
+At runtime, the **Dispatch Stub implementation** checks CPUIDs and OS status to determine which ISA level's function body to dispatch to.
+
+### Code Folder Structure
 >#### **Kernel implementation:** `intel_extension_for_pytorch/csrc/aten/cpu/kernels/xyzKrnl.cpp`
 >#### **Kernel Stub:** `intel_extension_for_pytorch/csrc/aten/cpu/xyz.cpp` and `intel_extension_for_pytorch/csrc/aten/cpu/xyz.h`
 >#### **Dispatch Stub implementation:** `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.cpp` and `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`
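To make the three roles concrete, here is a minimal sketch of the pattern for a hypothetical `xyz` kernel, using the `DECLARE_DISPATCH`/`DEFINE_DISPATCH`/`REGISTER_DISPATCH` macros from the forked `DispatchStub.h`. The `xyz` names, the function signature, and the include path are illustrative assumptions, not code from this commit:

```c++
// Sketch only. xyz.h -- Kernel Stub declaration (common code, built once,
// without ISA-specific compiler flags).
#include <ATen/ATen.h>
#include "intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h" // assumed path

using xyz_kernel_fn = void (*)(at::Tensor& /*out*/, const at::Tensor& /*in*/);
DECLARE_DISPATCH(xyz_kernel_fn, xyz_kernel_stub);

// xyz.cpp -- Kernel Stub definition; callers invoke the stub with a device type.
DEFINE_DISPATCH(xyz_kernel_stub);

at::Tensor xyz(const at::Tensor& input) {
  at::Tensor output = at::empty_like(input);
  // At runtime the stub resolves to the best matching ISA level implementation.
  xyz_kernel_stub(at::kCPU, output, input);
  return output;
}

// kernels/xyzKrnl.cpp -- Kernel Implementation; the build system copies and
// rebuilds this file once per ISA level.
namespace {
// The anonymous namespace keeps the per-ISA symbols from colliding.
void xyz_kernel_impl(at::Tensor& out, const at::Tensor& in) {
  out.copy_(in); // placeholder body
}
} // namespace

REGISTER_DISPATCH(xyz_kernel_stub, &xyz_kernel_impl);
```

Callers only ever see `xyz()`; each ISA level's copy of `xyz_kernel_impl` stays private to its own object file.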
@@ -46,8 +46,10 @@ IPEX build system will generate code for each ISA level with specifiy complier p
 The CodeGen will copy each cpp file from **Kernel implementation**, and then add the ISA level as a new file suffix.
 
 > **Sample:**
+>
 > ----
-> **Origin file:**
+>
+> **Origin file:**
 >
 > `intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp`
 >
@@ -64,7 +66,9 @@ The CodeGen will copy each cpp files from **Kernel implementation**, and then ad
 > AVX512_BF16: `build/Release/intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16`
 >
 > AMX: `build/Release/intel_extension_for_pytorch/csrc/aten/cpu/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX`
+
 ---
+
 >**Note:**
 >1. DEFAULT level kernels are not fully implemented in IPEX. To align with PyTorch, we build the default level with AVX2 parameters instead. The minimal requirement for a machine executing IPEX is therefore AVX2 support.
 >2. `-D__AVX__` and `-D__AVX512F__` are defined for the dependency library [sleef](https://sleef.org/).
@@ -73,12 +77,12 @@ The CodeGen will copy each cpp files from **Kernel implementation**, and then ad
 >5. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower levels' ISA feature definitions. For example, AVX512_BF16 needs to contain `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX512_VNNI`. But AVX512 doesn't contain AVX2 definitions, because their vec registers have different widths.
 
 ## Add Custom Kernel
----
-If you want to add new custom kernel, and the kernel using CPU ISA instruction. Please reference to below steps.
 
-1. Please add CPU ISA related kernel implementation to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/kernels/NewKernelKrnl.cpp`
-2. Please add kernel stub to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/NewKernel.cpp`
-3. Please include header file: `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`, and reference to the comment in the header file.
+If you want to add a new custom kernel that uses CPU ISA instructions, refer to these tips:
+
+1. Add the CPU ISA related kernel implementation to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/kernels/NewKernelKrnl.cpp`
+2. Add the kernel stub to the folder: `intel_extension_for_pytorch/csrc/aten/cpu/NewKernel.cpp`
+3. Include the header file `intel_extension_for_pytorch/csrc/dyndisp/DispatchStub.h`, and refer to the comments in the header file.
 ```c++
 // Implements instruction set specific function dispatch.
 //
@@ -111,9 +115,9 @@ If you want to add new custom kernel, and the kernel using CPU ISA instruction.
 
 >**Note:**
 >
->1. Some kernel only call **oneDNN** or **iDeep** implementation, or other backend implementation. Which is not need to add kernel implementation. (Refer: `BatchNorm.cpp`)
->2. Vec related header file must be included in kernel implementation file, but can not be included in kernel stub. Kernel stub is common code for all ISA level, and can't pass ISA related compiler parameters.
->3. More intrinsics please check at [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
+>1. Some kernels only call the **oneDNN** or **iDeep** implementation, or another backend implementation, and do not need a kernel implementation of their own. (Refer: `BatchNorm.cpp`)
+>2. Vec related header files must be included in kernel implementation files, but cannot be included in kernel stubs. Kernel stubs are common code for all ISA levels, and can't pass ISA related compiler parameters.
+>3. For more intrinsics, check the [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
 
 ### ISA intrinsics specific kernel example:
 
@@ -163,7 +167,7 @@ void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
 ```
 Macros `CPU_CAPABILITY_AVX512` and `CPU_CAPABILITY_AVX512_BF16` are defined by compiler checks; they mean that the current compiler is capable of generating code for the defined ISA level.
 
-Because of `AVX512_BF16` is higher level than `AVX512`, and it compatible to `AVX512`. `CPU_CAPABILITY_AVX512_BF16` can be contained in `CPU_CAPABILITY_AVX512` region.
+Because `AVX512_BF16` is a higher ISA level than `AVX512` and is compatible with it, the `CPU_CAPABILITY_AVX512_BF16` region can be contained in the `CPU_CAPABILITY_AVX512` region.
 ```c++
 //csrc/aten/cpu/kernels/CvtFp32ToBf16Krnl.cpp
 
@@ -247,7 +251,7 @@ REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);
 ```
 
 ### Vec specific kernel example:
-This example show get data type size and Its Vec size. In different ISA, Vec has different register width, and it has different Vec size also.
+This example shows how to get the data type size and its Vec size. In different ISAs, Vec has a different register width and thus a different Vec size.
 
 ```c++
 //csrc/aten/cpu/GetVecLength.h
@@ -354,19 +358,19 @@ REGISTER_DISPATCH(
 
 ```
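The Vec size difference is easy to observe from a stand-alone program. A small sketch assuming PyTorch's `at::vec::Vectorized` wrapper (this is not the `GetVecLength.h` implementation above):

```c++
#include <ATen/cpu/vec/vec.h>
#include <iostream>

int main() {
  using Vec = at::vec::Vectorized<float>;
  // size() is the number of float lanes for the ISA level this translation
  // unit was compiled for, e.g. 8 under AVX2 and 16 under AVX512.
  constexpr int lanes = Vec::size();
  std::cout << "float Vec size: " << lanes << std::endl;

  float a[lanes], b[lanes];
  for (int i = 0; i < lanes; ++i) {
    a[i] = static_cast<float>(i);
    b[i] = 2.0f * i;
  }

  // Load one vector's worth of floats, add element-wise, and store back.
  Vec va = Vec::loadu(a);
  Vec vb = Vec::loadu(b);
  (va + vb).store(a);
  return 0;
}
```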
 ## Private Debug APIs
----
-Here three ISA related private APIs could do same debug work. Which contains:
+
+Here are three ISA-related private APIs that can help with debugging:
 1. Query current ISA level.
 2. Query max CPU supported ISA level.
 3. Query max binary supported ISA level.
 >**Note:**
 >
 >1. Max CPU supported ISA level only depends on CPU features.
 >2. Max binary supported ISA level only depends on the compiler version used at build time.
->3. Current ISA level, it is equal minimal of `max CPU ISA level` and `max binary ISA level`.
+>3. The current ISA level is the smaller of `max CPU ISA level` and `max binary ISA level`.
 
 ### Example:
-```cmd
+```bash
 python
 Python 3.9.7 (default, Sep 16 2021, 13:09:58)
 [GCC 7.5.0] :: Anaconda, Inc. on linux
@@ -382,24 +386,24 @@ Type "help", "copyright", "credits" or "license" for more information.
 ```
 
 ## Select ISA level manually.
----
-By default, IPEX dispatches to the kernels with maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (same environment variable from PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware.
+
+By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (the same environment variable as PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`}. The effective ISA level is the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware.
 ### Example:
-```cmd
+```bash
 $ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
 AMX
 $ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
 AVX2
 ```
 >**Note:**
 >
->`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purpose only and subjects to change.
+>`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purposes only and is subject to change.
 
 ## CPU feature check
----
+
 An additional CPU feature check tool is in the subfolder: `tests/cpu/isa`
 
-```cmd
+```bash
 $ cmake .
 -- The C compiler identification is GNU 11.2.1
 -- The CXX compiler identification is GNU 11.2.1
@@ -466,4 +470,4 @@ amx_tile: true
 amx_int8: true
 prefetchw: true
 prefetchwt1: false
-```
+```
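A similar feature query can be approximated with compiler builtins. A stand-alone sketch assuming a recent GCC or Clang (the accepted feature names depend on the compiler version; this is not part of the `tests/cpu/isa` tool):

```c++
// cpu_check.cpp -- build with: g++ cpu_check.cpp -o cpu_check && ./cpu_check
#include <cstdio>

int main() {
  __builtin_cpu_init();
  // __builtin_cpu_supports() queries the running CPU, similar in spirit to
  // the CPUID checks the dispatcher performs for each ISA level.
  std::printf("avx2:       %d\n", __builtin_cpu_supports("avx2") ? 1 : 0);
  std::printf("avx512f:    %d\n", __builtin_cpu_supports("avx512f") ? 1 : 0);
  std::printf("avx512vnni: %d\n", __builtin_cpu_supports("avx512vnni") ? 1 : 0);
  std::printf("avx512bf16: %d\n", __builtin_cpu_supports("avx512bf16") ? 1 : 0);
  return 0;
}
```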

docs/index.rst

+11 −6
@@ -3,19 +3,24 @@
 You can adapt this file completely to your liking, but it should at least
 contain the root `toctree` directive.
 
-Welcome to Intel® Extension for PyTorch* documentation!
-#######################################################
+Welcome to Intel® Extension for PyTorch* Documentation
+######################################################
 
-Intel® Extension for PyTorch* extends PyTorch with optimizations for extra performance boost on Intel hardware. Most of the optimizations will be included in stock PyTorch releases eventually, and the intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware, examples include AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX).
+Intel® Extension for PyTorch* extends PyTorch with up-to-date features and optimizations for an extra performance boost on Intel hardware. Example optimizations use AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). Over time, most of these optimizations will be included directly in stock PyTorch releases.
 
-Intel® Extension for PyTorch* is structured as the following figure. It is loaded as a Python module for Python programs or linked as a C++ library for C++ programs. Users can enable it dynamically in script by importing `intel_extension_for_pytorch`. It covers optimizations for both imperative mode and graph mode. Optimized operators and kernels are registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel hardware. During execution, Intel® Extension for PyTorch* intercepts invocation of ATen operators, and replace the original ones with these optimized ones. In graph mode, further operator fusions are applied manually by Intel engineers or through a tool named *oneDNN Graph* to reduce operator/kernel invocation overheads, and thus increase performance.
+Intel® Extension for PyTorch* provides optimizations for both eager mode and graph mode. However, compared to eager mode, graph mode in PyTorch normally yields better performance from optimization techniques such as operation fusion, and Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Therefore we recommend taking advantage of Intel® Extension for PyTorch* with `TorchScript <https://pytorch.org/docs/stable/jit.html>`_ whenever your workload supports it. You could choose to run with the `torch.jit.trace()` function or the `torch.jit.script()` function, but based on our evaluation, `torch.jit.trace()` supports more workloads, so we recommend you use `torch.jit.trace()` as your first choice. More detailed information can be found on the `pytorch.org website <https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#tracing-modules>`_.
 
-.. image:: ../images/intel_extension_for_pytorch_structure.png
+The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing `intel_extension_for_pytorch`.
+
+Intel® Extension for PyTorch* is structured as shown in the following figure:
+
+.. figure:: ../images/intel_extension_for_pytorch_structure.png
    :width: 800
    :align: center
    :alt: Structure of Intel® Extension for PyTorch*
 
-|
+
+PyTorch components are depicted with white boxes, while Intel extensions are depicted with blue boxes. Extra performance of the extension is delivered via both custom addons and overriding existing PyTorch components. In eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and an INT8 quantization API. Further performance boosting is available by converting the eager-mode model into graph mode via the extended graph fusion passes. Intel® Extension for PyTorch* dispatches the operators to their underlying kernels automatically based on the ISA that it detects, and leverages vectorization and matrix acceleration units available on Intel hardware as much as possible. The oneDNN library is used for computation-intensive operations. The Intel® Extension for PyTorch* runtime extension brings better efficiency with finer-grained thread runtime control and weight sharing.
 
 Intel® Extension for PyTorch* has been released as an open-source project at `Github <https://github.com/intel/intel-extension-for-pytorch>`_.
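Since the extension can also be linked as a C++ library, a model traced with `torch.jit.trace()` as recommended above can be consumed from C++ via libtorch. A minimal sketch, assuming a module saved as `model.pt` and a made-up input shape (linking the extension's C++ library is not shown):

```c++
#include <torch/script.h> // libtorch TorchScript API
#include <iostream>
#include <vector>

int main() {
  // Load a module previously saved in Python, e.g.:
  //   traced = torch.jit.trace(model, example_input); traced.save("model.pt")
  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.eval();

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::randn({1, 3, 224, 224})); // illustrative input shape

  torch::NoGradGuard no_grad; // inference only
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << output.sizes() << std::endl;
  return 0;
}
```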

docs/tutorials/api_doc.rst

+1 −2
@@ -13,8 +13,7 @@ Quantization
 ************
 
 .. automodule:: intel_extension_for_pytorch.quantization
-.. autofunction:: QuantConf
-.. autoclass:: calibrate
+.. autofunction:: prepare
 .. autofunction:: convert
 
 CPU Runtime

docs/tutorials/blogs_publications.md

+1 −0
@@ -1,6 +1,7 @@
 Blogs & Publications
 ====================
 
+* [Accelerating PyTorch with Intel® Extension for PyTorch\*](https://medium.com/pytorch/accelerating-pytorch-with-intel-extension-for-pytorch-3aef51ea3722)
 * [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intel-facebook-boost-bfloat16.html)
 * [Accelerate PyTorch with the extension and oneDNN using Intel BF16 Technology](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f)
 * *Note*: APIs mentioned in it are deprecated.
