Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification #1658


Open · wants to merge 17 commits into base: main

Conversation

chunhuanMeng
Contributor

@chunhuanMeng chunhuanMeng commented May 14, 2025

Refactors and enhances the adaptive_avg_pool2d_backward_kernel implementation in the src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp file. Key changes include removing redundant template parameters, adding a new kernel functor for channels-last memory format, and optimizing memory usage and thread configurations for better performance and maintainability.

Refactoring and Simplification:

  • Removed the is_channels_last template parameter from both AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor, simplifying their implementations and eliminating conditional logic based on the memory format.

New Kernel Functor:

  • Introduced AdaptiveAvgPool2dBwdSLMChannelsLastKernelFunctor, specifically designed to handle the channels-last memory format. This functor precomputes indices and pooling factors for efficient gradient computation, leveraging shared memory for intermediate storage.

Memory and Thread Optimization:

  • Added constants (XPU_MAX_THREADS, GROUP_STRIDE) and optimized thread group configurations to improve performance and reduce the number of groups launched.
  • Updated shared memory usage calculations and introduced logic to dynamically adjust thread configurations if memory limits are exceeded.

General Improvements:

  • Replaced hardcoded dimensions with dynamically calculated values (isizeH, isizeW, osizeH, osizeW) for better readability and maintainability.
  • Removed unused or redundant code.
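The index math this refactor builds on can be sketched in isolation: adaptive pooling maps each output index to a half-open input range whose length varies by at most one. A minimal standalone sketch (the function names here are illustrative stand-ins for the kernel's START_IND_INT / END_IND_INT macros, not its actual code):

```cpp
#include <cassert>

// Map output index `o` (of `osize` outputs) to the half-open input range
// [start, end) over `isize` inputs, i.e. floor(o*isize/osize) and
// ceil((o+1)*isize/osize).
inline int start_ind(int o, int osize, int isize) {
  return (o * isize) / osize;
}
inline int end_ind(int o, int osize, int isize) {
  return ((o + 1) * isize + osize - 1) / osize;
}
```

For a 32 → 7 pooling, for example, the first window is [0, 5) and the last is [27, 32), so adjacent windows overlap by at most one element and every input position is covered.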

@chunhuanMeng chunhuanMeng changed the title from "Update AdaptiveAveragePooling2dKernels.cpp" to "Enhanced Adaptive Average Pooling 2D Backward Kernel: Performance Improvements and Code Simplification" on May 14, 2025
@chunhuanMeng chunhuanMeng requested a review from Copilot May 14, 2025 02:29
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance, simplify code logic, and add a new optimized kernel for channels-last format.

  • Removed the now redundant is_channels_last template parameter and its branches.
  • Introduced a new kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory and group-based processing for enhanced performance.
  • Updated kernel launch configurations and added utility macros for standardized index calculations.

#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)

#define XPU_MAX_THREADS 1024 // this is safe, in reality 256 is our limit

Copilot AI May 14, 2025


[nitpick] Consider clarifying the comment on XPU_MAX_THREADS to explain why 1024 is used despite the realistic limit being 256, to avoid future confusion for maintainers.


grad_input = at::empty_like(input_, smf);
}
template <typename index_t, typename scalar_t>
struct AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast

Copilot AI May 14, 2025


[nitpick] It would be beneficial to add inline comments describing the strategy of shared memory caching and the layout calculation in this new channels-last kernel to help future readers understand the complex index and memory computations.


@chunhuanMeng chunhuanMeng requested a review from Copilot May 14, 2025 06:35
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors the Adaptive Average Pooling 2D backward kernel to improve performance and simplify the code by removing redundant paths and introducing a new kernel optimized for channels-last memory format. Key changes include:

  • Removal of the is_channels_last template parameter to streamline the kernel functors.
  • Addition of a new channels-last kernel (AdaptiveAvgPool2dBwdSLMKernelFunctorChannelLast) that leverages shared memory caching.
  • Dynamic kernel launch configuration adjustments that ensure shared memory limits are respected.
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:440

  • [nitpick] Consider adding an inline comment explaining the rationale behind dynamically reducing max_threads in the do-while loop to aid clarity and future maintenance.
do { ... max_threads adjustment ... } while (!done && max_threads);
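The fallback the reviewer is pointing at can be sketched as a small helper (all names here — pick_group_size, smem_per_thread, local_mem_limit — are illustrative, not the kernel's actual variables): halve the work-group size until the per-group shared-memory footprint fits the device limit, bottoming out at zero instead of looping forever.

```cpp
#include <cstddef>

// Sketch of a shared-memory-driven fallback: shrink the work-group size
// until the per-group local-memory footprint fits `local_mem_limit`.
// The `max_threads > 0` guard prevents an infinite loop when even a
// single thread's footprint exceeds the limit.
inline int pick_group_size(int max_threads,
                           std::size_t smem_per_thread,
                           std::size_t local_mem_limit) {
  while (max_threads > 0 &&
         static_cast<std::size_t>(max_threads) * smem_per_thread > local_mem_limit) {
    max_threads /= 2;
  }
  return max_threads;  // 0 signals "cannot fit"; the caller must handle it
}
```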

@chunhuanMeng
Contributor Author

chunhuanMeng commented May 14, 2025

| dtype | op | shape | ChannelsLast | output_size | original | optimized |
| --- | --- | --- | --- | --- | --- | --- |
| torch.bfloat16 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 153.176 | 53.671 |
| torch.float16 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 151.984 | 48.032 |
| torch.float32 | adaptive_avg_pool2d_backward | (8, 512, 32, 32) | TRUE | (7, 7) | 152.44 | 45.624 |
| torch.bfloat16 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 211.68 | 76.855 |
| torch.float16 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 210.32 | 71.912 |
| torch.float32 | adaptive_avg_pool2d_backward | (8, 256, 56, 56) | TRUE | (14, 14) | 210.312 | 67.248 |

@chunhuanMeng chunhuanMeng requested a review from Copilot May 23, 2025 01:17
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances the backward kernel for adaptive average pooling 2D on XPU by refactoring the code, removing the redundant templated memory format parameter, and introducing a dedicated kernel functor for channels-last format. Key changes include:

  • Simplification of the AdaptiveAvgPool2dBwdKernelFunctor and AdaptiveAvgPool2dBwdSLMKernelFunctor by removing the is_channels_last template parameter.
  • Addition of AdaptiveAvgPool2dBwdSLMChannelsLastKernelFunctor, which precomputes indices and pooling factors for efficient gradient accumulation using shared memory.
  • Optimization of thread, group, and shared memory configurations to improve performance.
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/AdaptiveAveragePooling2dKernels.cpp:512

  • Consider adding a safeguard to break out of the loop if max_threads becomes too small or reaches zero, to prevent potential infinite loops when the shared memory requirements exceed the available limit.
max_threads /= 2;

START_IND_INT(i, osizeW_, isizeW_));
}

// each cta handles a portion of a single slice on batch dimension;

Copilot AI May 23, 2025


[nitpick] Clarify the assumptions behind the computation of batch_id and channel_id from the group index with an inline comment to aid future maintainers, ensuring that the ordering remains valid if the group configuration changes.

Suggested change
// each cta handles a portion of a single slice on batch dimension;
// each cta handles a portion of a single slice on batch dimension;
// Assumption: The group index (item.get_group(2)) is ordered such that
// the batch dimension is the least significant, and the channel dimension
// is the next. If the group configuration changes, this computation
// must be revisited to ensure correctness.
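Under the ordering the suggested comment describes, the decomposition can be written as a tiny helper (the names here are hypothetical; the real kernel's group layout may differ):

```cpp
// Decompose a flattened group index assuming the batch dimension is the
// least significant, as described in the suggested comment above.
// `nbatch` is a hypothetical parameter for illustration.
inline void decode_group(int group_idx, int nbatch,
                         int& batch_id, int& channel_id) {
  batch_id = group_idx % nbatch;
  channel_id = group_idx / nbatch;
}
```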


@chunhuanMeng chunhuanMeng requested a review from Copilot May 23, 2025 01:27
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors and enhances the Adaptive Average Pooling 2D backward kernel for improved performance, clarity, and maintainability.

  • Removed the redundant is_channels_last template parameter from kernel functors.
  • Introduced a new channels-last specific kernel functor and optimized thread and memory configurations.
  • Replaced hardcoded dimensions with dynamically computed values and streamlined shared memory usage.

Comment on lines +13 to +18
#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)

#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)


Copilot AI May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider replacing the macro definitions (e.g., START_IND, END_IND, and their INT variants) with inline functions for improved type safety and easier debugging.

Suggested change
#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)
#define START_IND_INT(a, b, c) ((a * c) / b)
#define END_IND_INT(a, b, c) (((a + 1) * c + b - 1) / b)
namespace {
inline constexpr int64_t start_ind(int64_t a, int64_t b, int64_t c) {
  return (a / b) * c + ((a % b) * c) / b;
}
inline constexpr int64_t end_ind(int64_t a, int64_t b, int64_t c) {
  return 1 + ((a + 1) * c - 1) / b;
}
inline constexpr int start_ind_int(int a, int b, int c) {
  return (a * c) / b;
}
inline constexpr int end_ind_int(int a, int b, int c) {
  return ((a + 1) * c + b - 1) / b;
}
} // anonymous namespace
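One practical upside of constexpr replacements is that their equivalence to the macros can be spot-checked at compile time. A minimal sketch of such a check, reproducing the two 64-bit variants side by side (only a sanity check over a few sample values, not a proof):

```cpp
#include <cstdint>

#define START_IND(a, b, c) ((int64_t)((a / b) * c + ((a % b) * c) / b))
#define END_IND(a, b, c) (1 + ((int64_t)(a + 1) * c - 1) / b)

inline constexpr int64_t start_ind(int64_t a, int64_t b, int64_t c) {
  return (a / b) * c + ((a % b) * c) / b;
}
inline constexpr int64_t end_ind(int64_t a, int64_t b, int64_t c) {
  return 1 + ((a + 1) * c - 1) / b;
}

// Spot-check a sample index/size combination at compile time.
static_assert(start_ind(3, 7, 32) == START_IND(int64_t{3}, int64_t{7}, int64_t{32}),
              "start_ind must match START_IND");
static_assert(end_ind(3, 7, 32) == END_IND(int64_t{3}, int64_t{7}, int64_t{32}),
              "end_ind must match END_IND");
```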


2 participants