Releases: uxlfoundation/oneDNN
v3.9.2
This is a patch release containing the following changes to v3.9.1:
- Fixed correctness issue in `int8` convolution on processors with Intel AVX2 and Intel DL Boost instruction set support (a7c4079, 78e781f)
- Fixed performance degradation for `f32` convolution primitive on processors with Intel AVX-512 instruction set support (74f23b4)
- Fixed performance regression for RNN primitive with LBR GRU cell type on Intel Arc GPUs (ae2844e)
- Fixed performance degradation for `int8` convolution primitive when using zero points (dbb8484)
- Fixed segmentation fault in matmul primitive when using `ONEDNN_VERBOSE=all` (7310aa2)
- Fixed correctness issue in multi-dimensional matmul primitive on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids) (642d18b)
- Reduced problem size in `test_sdpa_decomp` test (9bff06e)
- Restricted `test_sdpa_decomp` and `test_mqa_decomp` tests to `OMP` or `THREADPOOL` CPU runtimes (3cd9170)
- Fixed illegal instruction issue in pooling primitive on processors with Intel SSE4.1 support (d907c47)
- Fixed segmentation fault issue in `f16` backward convolution primitive on processors with Intel AVX2 and Intel DL Boost with float16 and bfloat16 support (50cc228, fcc7e5e)
- Restored support for `int8` matmul with `per_oc` scales and zero points on Intel Arc GPUs (1a5a454, 04c22c9)
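The `ONEDNN_VERBOSE=all` setting mentioned in the segmentation-fault fix above is a diagnostic knob that can be set from the environment; a minimal sketch (the application name is hypothetical):

```shell
# Log oneDNN primitive creation and execution details to stdout for any
# oneDNN-based application launched from this shell afterwards,
# e.g. `./my_dnnl_app` (hypothetical name).
export ONEDNN_VERBOSE=all
```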
v3.9.1
This is a patch release containing the following changes to v3.9:
- Reduced sizes in Graph API SDPA examples (257d689)
- Fixed correctness issue in `bf16` depthwise convolution with `bf16` bias on AArch64 CPUs (218b41d)
- Changed Intel GPU data alignment check from error to warning (5c5008a)
- Improved `bf16` matmul performance on processors with Intel AMX instruction set support (54b6354, 30c4d8d)
- Fixed PowerPC64 build by adding `-mcpu=power10` and `-mmma` flags (02ca915)
- Introduced support for `f16` destination in `int8` matmul and `int8` inner product on x64 CPUs (a62ed6b, 53c0a66, 0750043, 4f0f068)
- Introduced support for `per_tensor` zero points in `int8` matmul on Intel GPUs (db8e8ff, f783164, 4d458df, 80453a0, 7f90d50, a2200e2)
- Fixed correctness issue in `int8` reorder for cases with compensation on x64 CPUs (771ca54)
v3.9
Performance Optimizations
Intel Architecture Processors
- Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
- Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
- Improved performance of `fp8` convolution primitive with scales and `bf16` output.
- Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
- Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with implicit causal mask.
  - Grouped Query Attention (GQA) flavor specific to GEMMA models.
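The AVX 10.2 opt-in described above is controlled entirely through the environment; a minimal shell sketch using the values quoted in the notes (`ONEDNN_MAX_CPU_ISA` is an upper bound, so it only takes effect on hardware with the matching instruction sets):

```shell
# Opt in to the not-yet-default AVX 10.2 + AMX code paths for future
# Intel Xeon processors.
export ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2

# For future Intel Core processors (AVX 10.2 without AMX), use instead:
# export ONEDNN_MAX_CPU_ISA=AVX10_2_512
```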
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved matmul performance on Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved RNN primitive performance with LBR_GRU cell type.
- Improved `int8` convolution performance with plain weights and trivial filter.
- Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
- Improved `fp32` softmax performance.
- Improved performance of reorder when used with USM host memory.
- Improved performance of the following subgraphs with Graph API:
  - `fp32` SDPA with implicit causal mask.
  - `fp16` SDPA on Intel GPUs without Intel XMX cores.
AArch64-based Processors
- Improved `int8` convolution performance.
- Improved `bf16` depthwise convolution performance.
- Improved `f16` matmul performance with Arm Compute Library (ACL).
Functionality
Functional API
- Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
- Sparse memory objects and sparse matmul are promoted to production status.
Graph API
- Introduced support for tanh approximation in `GELU` operation.
- Extended Graph API `Softmax` operation to support optional `stats` output.
- Introduced fusion support for SDPA training forward and backward propagation.
- Introduced fusion support for SDPA with bottom-right implicit causal mask.
- Introduced `make_scalar_tensor()` API for engine-agnostic scalar tensor creation.
Microkernel API
- Introduced support for `fp8` data type.
Intel Architecture Processors
- Introduced support for select algorithm in binary post-op.
- Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives.
Intel Graphics Products
- Introduced support for select algorithm in binary primitive.
Generic GPU Vendor
- Introduced support for RNN Vanilla backward propagation.
Usability
- Enabled build with `-Wundef` compiler flag.
- [Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.
Validation
- Improved benchdnn performance by optimizing the input data filling and result comparison steps.
- Improved benchdnn graph driver performance mode by adding a CPU memory pool for the allocator.
Known Limitations
- Group normalization with `normalization_flags::use_scale` specified produces incorrect results for backward propagation kind in oneDNN v3.9 and earlier.
- Binary primitive with certain shapes and Graph API SDPA with bottom-right causal mask may hang with SYCL debug runtime on Windows.
- `fp8` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
- `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max Series.
- `bf16` pooling with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max Series.
- `bf16`/`fp16` matmul with large inner dimension has a performance regression on Intel Data Center GPU Max Series.
- `bf16`/`fp16` convolution with `NCHW` activations has a performance regression on Intel Data Center GPU Max Series.
- Softmax with non-trivial strides and blocked format may produce incorrect results.
- `bf16` layer normalization backpropagation may produce incorrect results on Intel Data Center GPU Max Series.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.
v3.9-rc
Performance Optimizations
Intel Architecture Processors
- Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
- Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
- Improved performance of `fp8` convolution primitive with scales and `bf16` output.
- Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
- Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with implicit causal mask.
  - Grouped Query Attention (GQA) flavor specific to GEMMA models.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved matmul performance on Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved RNN primitive performance with LBR_GRU cell type.
- Improved `int8` convolution performance with plain weights and trivial filter.
- Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
- Improved `fp32` softmax performance.
- Improved performance of reorder when used with USM host memory.
- Improved performance of the following subgraphs with Graph API:
  - SDPA with implicit causal mask.
  - SDPA with bottom-right implicit causal mask.
  - `fp32` SDPA.
  - `fp16` SDPA on Intel GPUs without Intel XMX cores.
AArch64-based Processors
- Improved `int8` convolution performance.
- Improved `bf16` depthwise convolution performance.
- Improved `f16` matmul performance with Arm Compute Library (ACL).
Functionality
Functional API
- Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
- Sparse memory objects and sparse matmul are promoted to production status.
Graph API
- Introduced support for tanh approximation in `GELU` operation.
- Extended Graph API `Softmax` operation to support optional `stats` output.
- Introduced support for SDPA training forward propagation and backpropagation.
Microkernel API
- Introduced support for `fp8` data type.
Intel Architecture Processors
- Introduced support for select algorithm in binary post-op.
- Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives.
Intel Graphics Products
- Introduced support for select algorithm in binary primitive.
Generic GPU Vendor
- Introduced support for RNN Vanilla backward propagation.
Usability
- Enabled build with `-Wundef` compiler flag.
- [Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.
Validation
- Improved benchdnn performance by optimizing the input data filling and result comparison steps.
Known Limitations
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.
v3.8.1
This is a patch release containing the following changes to v3.8:
- Fixed correctness issue in reorder primitive with non-trivial strides on Intel CPUs (a762d32)
- Fixed runtime error in convolution weight gradient on Xe2 architecture-based Intel GPUs (a8fac73, c409ef9)
- Fixed performance regression in `bf16` convolution on Intel Data Center GPU Max Series (98170d0, c6bae4a, c5edd53, bb1a591)
- Improved performance of `fp16` matmul with `fp8` compressed weights on Intel GPUs (58f3ec1, abff176, ffd7dd3, 3b1e855, 2e140de, 3429f79)
- Fixed runtime error in `fp16` pooling primitive on Xe2 architecture-based Intel GPUs (c0f6b6d)
- Improved performance of `fp16` matmul with `int4` weights and `32 < m <= 64` on Intel GPUs (2fa7072)
- Fixed correctness issues in `bf16` matmul with 3 or more dimensional tensors on processors with Intel AMX support (dd20965, ea1b4a1)
- Fixed performance regression in `fp16` or `bf16` matmul with transposed source and weight tensors on Intel Data Center GPU Max Series (e45e1aa)
- Improved performance of `bf16` matmul with `int4` weights on Intel GPUs (7a15c23)
- Fixed runtime error in `fp16` SDPA subgraph with head size `512` on Intel Core Ultra (Series 2) processor integrated GPU (bde6985)
v3.8
Performance Optimizations
Intel Architecture Processors
- Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
- Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
- Improved performance of `int8` convolution with zero points.
- Improved `fp32` convolution performance with `fp16` and `bf16` compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved `fp16`/`bf16` depthwise convolution performance with `fp32` bias, `sum` post-ops, or dilation.
- Improved `bf16` pooling backpropagation performance.
- Improved binary post-ops performance with `per_w` broadcast.
Intel Graphics Products
- Improved performance on Intel Arc graphics for future Intel Core Ultra processors (code name Panther Lake).
- Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra processor series 2 (formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved `int8` matmul performance with zero-points support for source and weight tensors.
- Improved `f4_e2m1` and `f4_e3m0` matmul and reorder performance.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with `int4` and `int8` compressed key and value.
  - `fp16`/`bf16` SDPA with `fp32` intermediate data types. Using `fp32` intermediate data types is recommended.
  - SDPA with head size 512 and 576.
  - Grouped Query Attention (GQA) with 5D input tensors.
AArch64-based Processors
- Improved `fp16` reorder performance.
- Improved `int8` matmul performance.
- Improved `bf16` inner product forward propagation performance with Arm Compute Library (ACL).
- Improved `bf16` eltwise performance.
- Improved convolution performance on processors with SVE support with ACL.
Functionality
Common
- Extended Graph API `Softmax` operation to support `inf_as_zero` mode. This functionality enables an SDPA subgraph compliant with PyTorch Safe Softmax semantics.
Intel Architecture Processors
- Introduced support for `f32` convolution with `fp16` compressed weights.
- Enabled `int8`/`int4` compressed weights support in matmul primitive.
Intel Graphics Products
- Introduced select algorithm support in binary primitive.
- Introduced support for `f4_e2m1` and `f4_e3m0` data types in convolution primitive.
- Introduced support for the GenIndex operation in Graph API.
Generic GPU Vendor
- Introduced support for:
- Vanilla RNN forward propagation.
- Inner product backpropagation.
- Group normalization.
- Improved accuracy of inner product primitive with sum post-ops for large shapes.
NVIDIA GPUs
- Introduced Graph API support.
Usability
- Added support for group normalization primitive with `ONEDNN_ENABLE_PRIMITIVE` build option.
- Enabled support for ROCm 6 on AMD GPUs.
- Improved CMake integration for oneDNN installation with Nvidia backend enabled.
- Reduced memory footprint for matmul primitive when using ACL.
Validation
- Added benchdnn option `--execution-mode` to test oneDNN functionality with SYCL Graph record/execute mode.
- Extended benchdnn option `--cold-cache` with support for cold TLB mode.
- Added benchdnn option `--bia-dt` to control bias data type for matmul, inner product, convolution, and deconvolution primitives.
- Extended syntax of benchdnn `--dt` option in Graph API driver to manage data types of individual tensors in a pattern.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Breaking Changes
- Removed the experimental Graph Compiler backend for Graph API.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.
v3.7.3
This is a patch release containing the following changes to v3.7.2:
- Fixed correctness issue in matmul with non-trivial strides for the first tensor on processors with Intel AMX instruction set support (e18c622)
- Removed spurious warning messages for SDPA subgraph on Intel GPUs (05541bb, 9e9a3a6)
- Fixed segfault in `fp32` matmul with `bf16` math mode on processors with Intel AVX2 instruction set support (7d495ae)
- Fixed performance regression in `bf16` 3D convolution backpropagation on processors with Intel AVX-512 and Intel DL Boost instruction set support (c38e02c, 67afc74)
- Worked around a GCC 12.3 bug causing accuracy issues in `fp8` functionality on Intel GPUs (69b38d7)
- Removed `-fcf-protection` build option for GCC 7 and earlier versions (813725d)
v3.8-rc
Performance Optimizations
Intel Architecture Processors
- Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
- Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
- Improved performance of `int8` convolution with zero points.
- Improved `fp32` convolution performance with `fp16` and `bf16` compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved `fp16`/`bf16` depthwise convolution performance with `fp32` bias, `sum` post-ops, or dilation.
- Improved `bf16` pooling backpropagation performance.
- Improved binary post-ops performance with `per_w` broadcast.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra (Series 2, formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved `int8` matmul performance with zero-points support for source and weight tensors.
- Improved `f4_e2m1` and `f4_e3m0` matmul and reorder performance.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with `int4` and `int8` compressed key and value.
  - `fp16`/`bf16` SDPA with `fp32` intermediate data types. Using `fp32` intermediate data types is recommended.
  - SDPA with head size 512 and 576.
  - Grouped Query Attention (GQA) with 5D input tensors.
AArch64-based Processors
- Improved `fp16` reorder performance.
- Improved `int8` matmul performance.
- Improved `bf16` inner product forward propagation performance with Arm Compute Library (ACL).
- Improved convolution performance on processors with SVE support with ACL.
Functionality
Common
- Extended Graph API `Softmax` operation to support `inf_as_zero` mode. This functionality enables an SDPA subgraph compliant with PyTorch Safe Softmax semantics.
Intel Architecture Processors
- Introduced support for `f32` convolution with `fp16` compressed weights.
- Enabled `int8`/`int4` compressed weights support in matmul primitive.
Intel Graphics Products
- Introduced select algorithm support in binary primitive.
- Introduced support for `f4_e2m1` and `f4_e3m0` data types in convolution.
- Introduced support for the GenIndex operation in Graph API.
Generic GPU Vendor
- Introduced support for:
- Vanilla RNN forward propagation
- Inner product backpropagation
- Group normalization
- Improved accuracy of inner product primitive with sum post-ops for large shapes.
NVIDIA GPUs
- Introduced Graph API support.
Usability
- Added support for group normalization primitive with `ONEDNN_ENABLE_PRIMITIVE` build option.
- Enabled support for ROCm 6 on AMD GPUs.
- Improved CMake integration for oneDNN installation with Nvidia backend enabled.
- Reduced memory footprint for matmul primitive when using ACL.
Validation
- Added benchdnn option `--execution-mode` to test oneDNN functionality with SYCL Graph record/execute mode.
- Extended benchdnn option `--cold-cache` with support for cold TLB mode.
- Added benchdnn option `--bia-dt` to control bias data type for matmul, inner product, convolution, and deconvolution.
- Extended syntax of benchdnn `--dt` option in Graph API driver to manage data types of individual tensors in a pattern.
Breaking Changes
- Removed the experimental Graph Compiler backend for Graph API.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.
v3.7.2
This is a patch release containing the following changes to v3.7.1:
- Fixed hang in matmul with odd shapes on Intel Arc GPUs (46e7499)
- Fixed out-of-registers error in matmul on Intel Arc GPUs (599c839)
- Fixed incorrect results in SDPA pattern on Intel GPUs (6343c73)
- Fixed integer overflow in convolution with large shapes on x64 CPUs (c541100)
- Fixed access violation issue in experimental Graph Compiler (8b0e626)
- Fixed access violation in pooling on Intel GPUs (cd2cd5d)
- Improved performance of `int8` matmul with `int4` weights on Intel GPUs (d6c98ec)
v3.7.1
This is a patch release containing the following changes to v3.7:
- Fixed correctness issue in `int8` matmul primitive with `int4` weights on Intel Arc graphics (b16184d)
- Fixed matmul performance regression on Intel Arc graphics (41e406b)
- Fixed potential integer overflow in `bf16` convolution for processors with Intel AVX-512 instruction set support (f882861)
- Fixed functional issue in matmul with dropout attribute on generic GPUs (8303330)
- Fixed functional issues in matmul with scales on NVIDIA GPUs (e8d8594)
- Fixed integer overflows for large shapes in convolution for x64 processors (fc3f17a, 31b079f)
- Worked around an MSVC 19.29.30158.0 bug that results in a crash at binary primitive creation on x64 processors (50dd6cc)