uxlfoundation · vgvozdeva · Oct 10, 2025 · Oct 10, 2025 · Oct 10, 2025 · Oct 10, 2025
@@ -0,0 +1,90 @@
+# Performance Optimizations
+## Intel Architecture Processors
+* Improved performance on future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support.
+  This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
+* Improved performance on future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
+* Improved performance of matmul primitive on processors with Intel AMX support.
+* Improved performance of `f32` matmul primitive for GEMV cases on on processors with Intel AVX2 instruction set support.
+* Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
+* Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and Intel AVX512 instruction set support.
+* Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX512, Intel DL Boost and bfloat16 instruction set support.
+* Improved performance of `int8` convolution primitive when using zero points.
+* Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
+* Improved performance of `f32` and `bf16` convolution primitive with `int8` destination.
+* Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
+* Improved performance of subgraphs containing sequence of multiple binary ops with Graph API.
+
+## Intel Graphics Products
+* Improve GEMM performance for small batch size on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
+* Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
+* Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
+* Improved `bf16` matmul performance with `fp8` weights.
+* Graph API optimizations:
+  * Improved [Scaled Dot Product Attention (SDPA)] subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
+  * Improved SDPA and GQA subgraphs performance when using host-side scalars.
+  * Improved performance of GQA subgraph for 2nd token scenarios.
+  * Improved performance of subgraphs containing sequence of multiple binary ops.
+  * Improved performance of [Grouped Query Attention (GQA)] subgraphs for training forward and backward propagation.
+
+[Grouped Query Attention (GQA)]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_graph_gqa.html#gqa-for-training-forward-propagation
+[Scaled Dot Product Attention (SDPA)]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_graph_sdpa.html
+## AArch64-based Processors
+* Improved performance of reorder primitive
+* Improved performance of `bf16` convolutions
+* Improved performance of convolutions on 128-bit SVE platforms
+* Improved performance of eltwise on Arm® Neoverse™ N1
+
+# Functionality
+## Functional API
+
+* Introduced [host-side scalar memory objects]. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
+
+[host-side scalar memory objects]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_host_side_scalars.html
+* Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in case of `int8` activations and `int8` weights with zero-point.
+
+## Graph API
+
+* Introduced [`host_scalar` property] for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define attention scale, sequence length, and the negative infinity value in SDPA/GQA subgraphs.
+* Introduced [accumulation mode attribute] support in `Matmul` op. This attribute allows relaxing `fp32` accumulation requirements to achieve performance benefits on some platforms.
+
+[`host_scalar` property]: https://uxlfoundation.github.io/oneDNN/v3.10/enum_dnnl_graph_logical_tensor_property_type.html
+[accumulation mode attribute]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_op_matmul.html
+
+## Intel Graphics Products
+* Introduced support for `fp4` weights in matmul primitive.
+* Introduced support for grouped quantization with group size 16 in matmul with `int8` compressed weights.
+* Introduced support group size 16 `int8` for decompressed weight with regular weights decompression.
+
+## Intel Architecture Processors
+* Introduced `fp4` weights support for `fp32` matmul and convolution  for future Intel Xeon processors with Intel AVX10.2 instruction set support.
+
+# Usability
+* Extended diagnostics available in verbose mode for primitive descriptor creation issues.
+* Extended dispatch diagnostics in verbose mode output for primitives implementations on Intel GPUs.
+
+## AArch64-based Processors
+* Fixed crashes in backward-pass convolutions
+* Fixed numerical errors in 4D matmul primitives
+* Fixed numerical errors in low-precision convolutions
+* Fixed numerical errors in reorders with compensation
+* Fixed illegal-instruction crashes on Arm® Neoverse™ N1
+* Fixed crashes in binary primitive in Debug builds
+* Fixed segmentation fault in `eltwise_log` post-ops for large kernels
+
+# Validation
+
+# Deprecated Functionality
+* [BLAS-like API] including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API consider switching to [matmul primitive].
+
+[BLAS-like API]: https://uxlfoundation.github.io/oneDNN/v3.10/group_dnnl_api_blas.html
+[matmul primitive]: https://uxlfoundation.github.io/oneDNN/v3.10/dev_guide_matmul.html
+
+# Breaking Changes
+## AArch64-based Processors
+* Bumped the minimum required [Arm® Compute Library](https://github.com/ARM-software/ComputeLibrary) version to 52.4.0
+
+# Thanks to our Contributors
+This release contains contributions from the [project core team] as well as Andrei Hutu @Anndrey24, 
+Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301, Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117, Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva, Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc, @focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0, @zhangjian29, and @xiazhuozhao.
+
+[project core team]: https://github.com/uxlfoundation/oneDNN/blob/rls-v3.10/MAINTAINERS.md