
Enhance Performance by Adding index_t Template Parameter #1610


Open · wants to merge 6 commits into main
Conversation

@chunhuanMeng (Contributor) commented Apr 25, 2025

Select int64_t or int32_t as the index type at the start of GPU kernel computation, and pass the chosen type into the kernel as a template parameter. This optimization improves performance for smaller shapes.

Performance Improvement Reasons:

  1. Better hardware support: GPUs execute 32-bit integer operations more efficiently than 64-bit ones.
  2. Reduced memory usage: 32-bit indices consume less register space and memory bandwidth, improving cache efficiency.
  3. Faster instructions: 32-bit index arithmetic compiles to fewer instructions, speeding up address computation.

@EikanWang EikanWang requested a review from Copilot April 26, 2025 16:01

@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR enhances performance by introducing a new index_t template parameter for average pooling kernels, allowing more efficient indexing.

  • Added index_t as a template parameter in AvgPool2dKernelFunctor and AvgPool2dChannelsLastKernelFunctor
  • Updated associated launching functions and constructor parameters to use index_t instead of int64_t
Comments suppressed due to low confidence (2)

src/ATen/native/xpu/sycl/AveragePool2dKernels.cpp:25

  • [nitpick] Consider adding a space after the comma before 'typename index_t' for improved readability and consistency in template parameter lists.
template <typename scalar_t, typename accscalar_t,typename index_t>

src/ATen/native/xpu/sycl/AveragePool2dKernels.cpp:667

  • [nitpick] The dispatch macro identifier 'avg_pool2d_backward_xpu' appears inconsistent with the forward kernel functionality; consider renaming it to better reflect the context, such as 'avg_pool2d_xpu'.
AT_DISPATCH_INDEX_TYPES(at::native::canUse32BitIndexMath(output, INT_MAX) ? ScalarType::Int : ScalarType::Long, "avg_pool2d_backward_xpu", [&] {

@EikanWang (Contributor) commented:

@chunhuanMeng, please share the detailed performance improvement numbers here. Also, the PR description is uninformative as written; please elaborate on what index_t is and why it improves performance.

@EikanWang (Contributor) left a comment

Update PR description.

@chunhuanMeng (Contributor, Author) commented:

Performance Comparison

| dtype | args (input, kernel_size, stride, padding, ceil_mode, count_include_pad, divisor_override) | original (µs) | optimized (µs) |
|---|---|---|---|
| fp16 | float16[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 45.904 | 38.368 |
| fp16 | float16[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 30.336 | 24.664 |
| fp16 | float16[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 2970.816 | 2686.872 |
| fp16 | float16[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 12278.568 | 10279.36 |
| fp16_ChannelsLast | float16[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 62.472 | 42.744 |
| fp16_ChannelsLast | float16[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 31.128 | 25.272 |
| fp16_ChannelsLast | float16[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 8104.144 | 6541.616 |
| fp16_ChannelsLast | float16[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 23269.616 | 17246.784 |
| fp32 | float32[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 49.616 | 38.816 |
| fp32 | float32[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 28.032 | 20.392 |
| fp32 | float32[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 3353.68 | 3324.376 |
| fp32 | float32[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 12431.688 | 10453.808 |
| fp32_ChannelsLast | float32[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 69.2 | 50.416 |
| fp32_ChannelsLast | float32[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 36.056 | 28.848 |
| fp32_ChannelsLast | float32[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 8016.92 | 6475.76 |
| fp32_ChannelsLast | float32[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 22995.984 | 17047.776 |

@chunhuanMeng chunhuanMeng requested a review from EikanWang April 28, 2025 01:43