
Enhance Performance by Adding index_t Template Parameter #1610


Open · wants to merge 6 commits into main
Conversation

@chunhuanMeng (Contributor) commented Apr 25, 2025

Select int64_t or int32_t as the index type at the start of GPU kernel computation, and pass the chosen type into the kernel as a template parameter. This optimization improves performance for smaller shapes.

Performance Improvement Reasons:

  1. Better hardware support: GPUs execute 32-bit integer operations more efficiently than 64-bit ones.
  2. Reduced memory usage: 32-bit indices consume less register space and memory bandwidth, improving cache efficiency.
  3. Faster instructions: 32-bit index arithmetic compiles to fewer instructions, speeding up address computation.

@EikanWang EikanWang requested a review from Copilot April 26, 2025 16:01

@Copilot (Copilot AI) left a comment


Pull Request Overview

This PR enhances performance by introducing a new index_t template parameter for average pooling kernels, allowing more efficient indexing.

  • Added index_t as a template parameter in AvgPool2dKernelFunctor and AvgPool2dChannelsLastKernelFunctor
  • Updated associated launching functions and constructor parameters to use index_t instead of int64_t
Comments suppressed due to low confidence (2)

src/ATen/native/xpu/sycl/AveragePool2dKernels.cpp:25

  • [nitpick] Consider adding a space after the comma before 'typename index_t' for improved readability and consistency in template parameter lists.
template <typename scalar_t, typename accscalar_t,typename index_t>

src/ATen/native/xpu/sycl/AveragePool2dKernels.cpp:667

  • [nitpick] The dispatch macro identifier 'avg_pool2d_backward_xpu' appears inconsistent with the forward kernel functionality; consider renaming it to better reflect the context, such as 'avg_pool2d_xpu'.
AT_DISPATCH_INDEX_TYPES(at::native::canUse32BitIndexMath(output, INT_MAX) ? ScalarType::Int : ScalarType::Long, "avg_pool2d_backward_xpu", [&] {

@EikanWang (Contributor) commented:

@chunhuanMeng, please share the detailed performance improvement numbers here. Also, the PR description is uninformative as written; please elaborate on what index_t is and why it improves performance.

@EikanWang (Contributor) left a comment

Update PR description.

@chunhuanMeng (Contributor, Author) commented:

Performance Comparison

| dtype | args (input, kernel_size, stride, padding, ceil_mode, count_include_pad, divisor_override) | original (µs) | optimized (µs) |
|---|---|---|---|
| fp16 | float16[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 45.904 | 38.368 |
| fp16 | float16[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 30.336 | 24.664 |
| fp16 | float16[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 2970.816 | 2686.872 |
| fp16 | float16[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 12278.568 | 10279.36 |
| fp16_ChannelsLast | float16[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 62.472 | 42.744 |
| fp16_ChannelsLast | float16[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 31.128 | 25.272 |
| fp16_ChannelsLast | float16[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 8104.144 | 6541.616 |
| fp16_ChannelsLast | float16[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 23269.616 | 17246.784 |
| fp32 | float32[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 49.616 | 38.816 |
| fp32 | float32[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 28.032 | 20.392 |
| fp32 | float32[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 3353.68 | 3324.376 |
| fp32 | float32[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 12431.688 | 10453.808 |
| fp32_ChannelsLast | float32[16, 24, 112, 112], [3, 3], [2, 2], [0, 0], False, True, None | 69.2 | 50.416 |
| fp32_ChannelsLast | float32[16, 1984, 7, 7], [3, 2], [2, 1], [0, 0], False, True, None | 36.056 | 28.848 |
| fp32_ChannelsLast | float32[64, 1024, 112, 112], [6, 6], [4, 4], [0, 0], False, True, None | 8016.92 | 6475.76 |
| fp32_ChannelsLast | float32[16, 2048, 224, 224], [3, 3], [2, 2], [0, 0], False, True, None | 22995.984 | 17047.776 |

@chunhuanMeng chunhuanMeng requested a review from EikanWang April 28, 2025 01:43