Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification #1658
base: main
Conversation
Pull Request Overview
This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance, simplify code logic, and add a new optimized kernel for channels-last format.
- Removed the now redundant is_channels_last template parameter and its branches.
- Introduced a new kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory and group-based processing for enhanced performance.
- Updated kernel launch configurations and added utility macros for standardized index calculations.
#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)

#define XPU_MAX_THREADS 1024 // this is safe, in reality 256 is our limit
[nitpick] Consider clarifying the comment on XPU_MAX_THREADS to explain why 1024 is used despite the realistic limit being 256, to avoid future confusion for maintainers.
Copilot uses AI. Check for mistakes.
grad_input = at::empty_like(input_, smf);
}
template <typename index_t, typename scalar_t> | ||
struct AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast |
[nitpick] Add inline comments describing the shared-memory caching strategy and the layout calculation in this new channels-last kernel, so future readers can follow the index and memory computations.
Pull Request Overview
This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance and simplify the code by removing redundant paths and introducing a new kernel optimized for channels-last memory format. Key changes include:
- Removal of the is_channels_last template parameter to streamline the kernel functors.
- Addition of a new channels-last kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory caching.
- Dynamic kernel launch configuration adjustments that ensure shared memory limits are respected.
Comments suppressed due to low confidence (1)
src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:440
- [nitpick] Consider adding an inline comment explaining the rationale behind dynamically reducing max_threads in the do-while loop to aid clarity and future maintenance.
do { ... max_threads adjustment ... } while (!done && max_threads);
Co-authored-by: Copilot <[email protected]>
Pull Request Overview
This PR enhances the backward kernel for adaptive average pooling 2D on XPU by refactoring the code, removing the redundant templated memory format parameter, and introducing a dedicated kernel functor for channels-last format. Key changes include:
- Simplification of the AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor by removing the is_channels_last template parameter.
- Addition of AdaptiveAvgPool2dBwdSLMChannelsLastKernelFunctor, which precomputes indices and pooling factors for efficient gradient accumulation using shared memory.
- Optimization of thread, group, and shared memory configurations to improve performance.
Comments suppressed due to low confidence (1)
src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:512
- Consider adding a safeguard to break out of the loop if max_threads becomes too small or reaches zero, to prevent potential infinite loops when the shared memory requirements exceed the available limit.
max_threads /= 2;
START_IND_INT(i, osizeW_, isizeW_));
}

// each cta handles a portion of a single slice on batch dimension;
[nitpick] Clarify the assumptions behind the computation of batch_id and channel_id from the group index with an inline comment to aid future maintainers, ensuring that the ordering remains valid if the group configuration changes.

Suggested change:
// each cta handles a portion of a single slice on batch dimension;
// Assumption: The group index (item.get_group(2)) is ordered such that
// the batch dimension is the least significant, and the channel dimension
// is the next. If the group configuration changes, this computation
// must be revisited to ensure correctness.
Pull Request Overview
This PR refactors and enhances the Adaptive Average Pooling 2D backward kernel for improved performance, clarity, and maintainability.
- Removed the redundant is_channels_last template parameter from kernel functors.
- Introduced a new channels-last specific kernel functor and optimized thread and memory configurations.
- Replaced hardcoded dimensions with dynamically computed values and streamlined shared memory usage.
#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)

#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)
[nitpick] Consider replacing the macro definitions (e.g., START_IND, END_IND, and their INT variants) with inline functions for improved type safety and easier debugging.
Suggested change:
namespace {
inline constexpr int64_t start_ind(int64_t a, int64_t b, int64_t c) {
  return (a / b) * c + ((a % b) * c) / b;
}
inline constexpr int64_t end_ind(int64_t a, int64_t b, int64_t c) {
  return 1 + ((a + 1) * c - 1) / b;
}
inline constexpr int start_ind_int(int a, int b, int c) {
  return (a * c) / b;
}
inline constexpr int end_ind_int(int a, int b, int c) {
  return ((a + 1) * c + b - 1) / b;
}
} // anonymous namespace
Refactors and enhances the adaptive_avg_pool2d_backward_kernel implementation in the src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp file. Key changes include removing redundant template parameters, adding a new kernel functor for the channels-last memory format, and optimizing memory usage and thread configurations for better performance and maintainability.

Refactoring and Simplification:
- Removed the is_channels_last template parameter from both AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor, simplifying their implementations. This eliminates conditional logic based on memory format.

New Kernel Functor:
- Added AdaptiveAvgPool2dBwdSLMChannelsLastKernelFunctor, specifically designed to handle the channels-last memory format. This functor precomputes indices and pooling factors for efficient gradient computation, leveraging shared memory for intermediate storage.

Memory and Thread Optimization:
- Introduced utility macros (XPU_MAX_THREADS, GROUP_STRIDE) and optimized thread group configurations to improve performance and reduce the number of groups launched.

General Improvements:
- Replaced hardcoded dimensions with dynamically computed values (isizeH, isizeW, osizeH, osizeW) for better readability and maintainability.