Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification #1658


Open · wants to merge 17 commits into base: main

Conversation

chunhuanMeng
Contributor

@chunhuanMeng chunhuanMeng commented May 14, 2025

Refactors and enhances the adaptive_avg_pool2d_backward_kernel implementation in the src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp file. Key changes include removing redundant template parameters, adding a new kernel functor for channels-last memory format, and optimizing memory usage and thread configurations for better performance and maintainability.

Refactoring and Simplification:

  • Removed the is_channels_last template parameter from both AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor, simplifying their implementations and eliminating conditional logic based on the memory format.

New Kernel Functor:

  • Introduced AdaptiveAvgPool2dBwdSLMChannelsLastKernelFunctor, specifically designed to handle the channels-last memory format. This functor precomputes indices and pooling factors for efficient gradient computation, leveraging shared memory for intermediate storage.

Memory and Thread Optimization:

  • Added constants (XPU_MAX_THREADS, GROUP_STRIDE) and optimized thread group configurations to improve performance and reduce the number of groups launched.
  • Updated shared memory usage calculations and introduced logic to dynamically adjust thread configurations if memory limits are exceeded.

General Improvements:

  • Replaced hardcoded dimensions with dynamically calculated values (isizeH, isizeW, osizeH, osizeW) for better readability and maintainability.
  • Removed unused or redundant code.
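The index math this refactor builds on can be sketched in isolation: adaptive pooling maps each output index to a half-open input range whose length varies by at most one. A minimal standalone sketch (the function names here are illustrative stand-ins for the kernel's START_IND_INT / END_IND_INT macros, not its actual code):

```cpp
#include <cassert>

// Map output index `o` (of `osize` outputs) to the half-open input range
// [start, end) over `isize` inputs, i.e. floor(o*isize/osize) and
// ceil((o+1)*isize/osize).
inline int start_ind(int o, int osize, int isize) {
  return (o * isize) / osize;
}
inline int end_ind(int o, int osize, int isize) {
  return ((o + 1) * isize + osize - 1) / osize;
}
```

For a 32 → 7 pooling, for example, the first window is [0, 5) and the last is [27, 32), so adjacent windows overlap by at most one element and every input position is covered.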

@chunhuanMeng chunhuanMeng changed the title from "Update AdaptiveAveragePooling2dKernels.cpp" to "Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification" on May 14, 2025
@chunhuanMeng chunhuanMeng requested a review from Copilot May 14, 2025 02:29
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance, simplify code logic, and add a new optimized kernel for channels-last format.

  • Removed the now redundant is_channels_last template parameter and its branches.
  • Introduced a new kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory and group-based processing for enhanced performance.
  • Updated kernel launch configurations and added utility macros for standardized index calculations.

#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)

#define XPU_MAX_THREADS 1024 // this is safe, in reality 256 is our limit

Copilot AI May 14, 2025


[nitpick] Consider clarifying the comment on XPU_MAX_THREADS to explain why 1024 is used despite the realistic limit being 256, to avoid future confusion for maintainers.


grad_input = at::empty_like(input_, smf);
}
template <typename index_t, typename scalar_t>
struct AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast

Copilot AI May 14, 2025


[nitpick] It would be beneficial to add inline comments describing the strategy of shared memory caching and the layout calculation in this new channels-last kernel to help future readers understand the complex index and memory computations.


@chunhuanMeng chunhuanMeng requested a review from Copilot May 14, 2025 06:35
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance and simplify the code by removing redundant paths and introducing a new kernel optimized for channels-last memory format. Key changes include:

  • Removal of the is_channels_last template parameter to streamline the kernel functors.
  • Addition of a new channels-last kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory caching.
  • Dynamic kernel launch configuration adjustments that ensure shared memory limits are respected.
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:440

  • [nitpick] Consider adding an inline comment explaining the rationale behind dynamically reducing max_threads in the do-while loop to aid clarity and future maintenance.
do { ... max_threads adjustment ... } while (!done && max_threads);
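The fallback the reviewer is pointing at can be sketched as a small helper (all names here — pick_group_size, smem_per_thread, local_mem_limit — are illustrative, not the kernel's actual variables): halve the work-group size until the per-group shared-memory footprint fits the device limit, bottoming out at zero instead of looping forever.

```cpp
#include <cstddef>

// Sketch of a shared-memory-driven fallback: shrink the work-group size
// until the per-group local-memory footprint fits `local_mem_limit`.
// The `max_threads > 0` guard prevents an infinite loop when even a
// single thread's footprint exceeds the limit.
inline int pick_group_size(int max_threads,
                           std::size_t smem_per_thread,
                           std::size_t local_mem_limit) {
  while (max_threads > 0 &&
         static_cast<std::size_t>(max_threads) * smem_per_thread > local_mem_limit) {
    max_threads /= 2;
  }
  return max_threads;  // 0 signals "cannot fit"; the caller must handle it
}
```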

@chunhuanMeng
Contributor Author

chunhuanMeng commented May 14, 2025

| dtype | op | shape | ChannelsLast | output_size | original | optimized |
| --- | --- | --- | --- | --- | --- | --- |
| torch.bfloat16 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 153.176 | 53.671 |
| torch.float16 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 151.984 | 48.032 |
| torch.float32 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 152.44 | 45.624 |
| torch.bfloat16 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 211.68 | 76.855 |
| torch.float16 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 210.32 | 71.912 |
| torch.float32 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 210.312 | 67.248 |

@chunhuanMeng chunhuanMeng requested a review from Copilot May 23, 2025 01:17
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the backward kernel for adaptive average pooling 2D on XPU by refactoring the code, removing the redundant templated memory format parameter, and introducing a dedicated kernel functor for channels-last format. Key changes include:

  • Simplification of the AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor by removing the is_channels_last template parameter.
  • Addition of AdaptiveAvgPool2dBwdSLMChannelsLastKernelFunctor, which precomputes indices and pooling factors for efficient gradient accumulation using shared memory.
  • Optimization of thread, group, and shared memory configurations to improve performance.
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:512

  • Consider adding a safeguard to break out of the loop if max_threads becomes too small or reaches zero, to prevent potential infinite loops when the shared memory requirements exceed the available limit.
max_threads /= 2;

START_IND_INT(i, osizeW_, isizeW_));
}

// each cta handles a portion of a single slice on batch dimension;

Copilot AI May 23, 2025


[nitpick] Clarify the assumptions behind the computation of batch_id and channel_id from the group index with an inline comment to aid future maintainers, ensuring that the ordering remains valid if the group configuration changes.

Suggested change
// each cta handles a portion of a single slice on batch dimension;
// each cta handles a portion of a single slice on batch dimension;
// Assumption: The group index (item.get_group(2)) is ordered such that
// the batch dimension is the least significant, and the channel dimension
// is the next. If the group configuration changes, this computation
// must be revisited to ensure correctness.
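Under the ordering the suggested comment describes, the decomposition can be written as a tiny helper (the names here are hypothetical; the real kernel's group layout may differ):

```cpp
// Decompose a flattened group index assuming the batch dimension is the
// least significant, as described in the suggested comment above.
// `nbatch` is a hypothetical parameter for illustration.
inline void decode_group(int group_idx, int nbatch,
                         int& batch_id, int& channel_id) {
  batch_id = group_idx % nbatch;
  channel_id = group_idx / nbatch;
}
```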


@chunhuanMeng chunhuanMeng requested a review from Copilot May 23, 2025 01:27
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors and enhances the Adaptive Average Pooling 2D backward kernel for improved performance, clarity, and maintainability.

  • Removed the redundant is_channels_last template parameter from kernel functors.
  • Introduced a new channels-last specific kernel functor and optimized thread and memory configurations.
  • Replaced hardcoded dimensions with dynamically computed values and streamlined shared memory usage.

Comment on lines +13 to +18
#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)

#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)


Copilot AI May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider replacing the macro definitions (e.g., START_IND, END_IND, and their INT variants) with inline functions for improved type safety and easier debugging.

Suggested change
#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)
#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)
namespace {
inline constexpr int64_t start_ind(int64_t a, int64_t b, int64_t c) {
  return (a / b) * c + ((a % b) * c) / b;
}
inline constexpr int64_t end_ind(int64_t a, int64_t b, int64_t c) {
  return 1 + ((a + 1) * c - 1) / b;
}
inline constexpr int start_ind_int(int a, int b, int c) {
  return (a * c) / b;
}
inline constexpr int end_ind_int(int a, int b, int c) {
  return ((a + 1) * c + b - 1) / b;
}
} // anonymous namespace
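One practical upside of constexpr replacements is that their equivalence to the macros can be spot-checked at compile time. A minimal sketch of such a check, reproducing the two 64-bit variants side by side (only a sanity check over a few sample values, not a proof):

```cpp
#include <cstdint>

#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)

inline constexpr int64_t start_ind(int64_t a, int64_t b, int64_t c) {
  return (a / b) * c + ((a % b) * c) / b;
}
inline constexpr int64_t end_ind(int64_t a, int64_t b, int64_t c) {
  return 1 + ((a + 1) * c - 1) / b;
}

// Spot-check a sample index/size combination at compile time.
static_assert(start_ind(3, 7, 32) == START_IND(int64_t{3}, int64_t{7}, int64_t{32}),
              "start_ind must match START_IND");
static_assert(end_ind(3, 7, 32) == END_IND(int64_t{3}, int64_t{7}, int64_t{32}),
              "end_ind must match END_IND");
```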


2 participants