[PHI] Fix local index in strided elementwise kernel#79162
Conversation
|
你的PR提交成功,感谢你对开源项目的贡献! |
There was a problem hiding this comment.
Pull request overview
This PR fixes an out-of-bounds write in the CUDA/HIP strided elementwise kernels by correcting how per-thread local arrays are indexed, and adds a CUDA regression test that reproduces the failure scenario using a non-contiguous strided slice compared with a scalar.
Changes:
- Fix
BinaryElementwiseKernel/UnaryElementwiseKernelto write thread-local cached inputs intoargs[0](instead of incorrectly using the global linear indexidx). - Add CUDA-only regression tests covering
!=and==with a non-contiguous tensor slice having stride[64, 1, 1].
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/legacy_test/test_compare_op_stride.py | Adds CUDA regression tests for scalar comparisons on a non-contiguous strided slice. |
| paddle/phi/kernels/stride/elementwise_stride_base.cu.h | Fixes incorrect indexing into per-thread local args arrays to prevent OOB writes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
0accd9e to
b3b085a
Compare
risemeup1111
left a comment
There was a problem hiding this comment.
已复查本次改动,未发现需要阻塞合入的问题。修复点与新增 CUDA 回归用例匹配;当前仍有审批类检查未通过,代码相关 CI 请以 GitHub 最新结果为准。
b3b085a to
3c8811c
Compare
ShigureNyako
left a comment
There was a problem hiding this comment.
本轮先 request changes。核心 CUDA/HIP kernel 修复方向是对的:args 是线程本地、长度为 STRIDE_VEC_SIZE(当前为 1),这里写 args[0] 才能和后续 SameDimsElementwisePrimitiveCaller(..., read_lens=1) / result[0] 对齐,避免用全局 idx 写越界。
阻塞点:新增回归用例当前没有显式放到 CUDA/custom device 上,可能无法覆盖本 PR 实际修改的 stride CUDA/HIP kernel。请先补齐这一点后再合入。
|
|
||
| class TestCompareStridedSliceWithScalar(unittest.TestCase): | ||
| def test_not_equal_with_size_one_dim_stride(self): | ||
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) |
There was a problem hiding this comment.
这里没有显式把输入放到 CUDA/custom device 上;paddle.arange 会按当前默认 device 创建 Tensor(通常是 CPU),因此这个回归用例可能只跑到 CPU elementwise 路径,覆盖不到本 PR 修改的 PADDLE_WITH_CUDA/HIP stride kernel。建议像上面的 TestBinaryElementwiseOp_Stride 一样加 device guard,并用 get_device_place()/paddle.to_tensor(..., place=...)(或显式 paddle.set_device(get_device()))确保 label != -100 / label == 1 在目标 GPU/custom place 上执行。
risemeup1111
left a comment
There was a problem hiding this comment.
已复查最新 force-push。内核修复本身仍然保持上一轮判断,但新增回归测试在当前版本里可能落到 static 模式,无法稳定覆盖这次修复的 eager/CUDA stride kernel;具体修改建议已放在行级评论中。
| class TestCompareStridedSliceWithScalar(unittest.TestCase): | ||
| def test_not_equal_with_size_one_dim_stride(self): | ||
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) | ||
| label = base[:, :63, :] | ||
|
|
||
| self.assertEqual(list(label.shape), [1, 63, 1]) | ||
| self.assertEqual(label.strides, [64, 1, 1]) | ||
| self.assertFalse(label.is_contiguous()) | ||
|
|
||
| out = label != -100 | ||
| np.testing.assert_array_equal( | ||
| out.numpy(), np.ones([1, 63, 1], dtype=np.bool_) | ||
| ) | ||
|
|
||
| def test_equal_with_size_one_dim_stride(self): | ||
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) | ||
| label = base[:, :63, :] | ||
|
|
||
| out = label == 1 | ||
| expected = np.arange(63, dtype=np.int64).reshape([1, 63, 1]) == 1 | ||
| np.testing.assert_array_equal(out.numpy(), expected) |
There was a problem hiding this comment.
这个新增用例依赖动态图中的 strided Tensor,但当前没有显式进入 dygraph。 同文件已有 TestBinaryElementwiseOp_Stride.test_dygraph_api_arithmetic 在用例结束时调用 paddle.enable_static(),整文件运行时后续这里的 paddle.arange(...).reshape(...) 可能落在 static 模式,out.numpy() 也就不能稳定作为 eager/CUDA stride kernel 的回归检查执行。请恢复 CUDA-only guard,并在每个用例内显式切到目标 place 的动态图模式后再恢复 static。
| class TestCompareStridedSliceWithScalar(unittest.TestCase): | |
| def test_not_equal_with_size_one_dim_stride(self): | |
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) | |
| label = base[:, :63, :] | |
| self.assertEqual(list(label.shape), [1, 63, 1]) | |
| self.assertEqual(label.strides, [64, 1, 1]) | |
| self.assertFalse(label.is_contiguous()) | |
| out = label != -100 | |
| np.testing.assert_array_equal( | |
| out.numpy(), np.ones([1, 63, 1], dtype=np.bool_) | |
| ) | |
| def test_equal_with_size_one_dim_stride(self): | |
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) | |
| label = base[:, :63, :] | |
| out = label == 1 | |
| expected = np.arange(63, dtype=np.int64).reshape([1, 63, 1]) == 1 | |
| np.testing.assert_array_equal(out.numpy(), expected) | |
| @unittest.skipIf( | |
| not paddle.core.is_compiled_with_cuda(), | |
| "core is not compiled with CUDA", | |
| ) | |
| class TestCompareStridedSliceWithScalar(unittest.TestCase): | |
| def setUp(self): | |
| self.place = get_device_place() | |
| def test_not_equal_with_size_one_dim_stride(self): | |
| paddle.disable_static(self.place) | |
| try: | |
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) | |
| label = base[:, :63, :] | |
| self.assertEqual(list(label.shape), [1, 63, 1]) | |
| self.assertEqual(label.strides, [64, 1, 1]) | |
| self.assertFalse(label.is_contiguous()) | |
| out = label != -100 | |
| np.testing.assert_array_equal( | |
| out.numpy(), np.ones([1, 63, 1], dtype=np.bool_) | |
| ) | |
| finally: | |
| paddle.enable_static() | |
| def test_equal_with_size_one_dim_stride(self): | |
| paddle.disable_static(self.place) | |
| try: | |
| base = paddle.arange(64, dtype="int64").reshape([1, 64, 1]) | |
| label = base[:, :63, :] | |
| out = label == 1 | |
| expected = np.arange(63, dtype=np.int64).reshape([1, 63, 1]) == 1 | |
| np.testing.assert_array_equal(out.numpy(), expected) | |
| finally: | |
| paddle.enable_static() |
There was a problem hiding this comment.
最新提交删除了同文件旧用例里的 paddle.enable_static() / paddle.disable_static() 切换,这部分已经消除了我上一条评论里“后续用例可能落到 static 模式”的证据。但新增回归用例仍然直接用 paddle.arange(...) 创建 base,没有跳过非 CUDA/custom 环境,也没有把输入放到 get_device_place();当默认 device 是 CPU 时,它仍可能只覆盖 CPU elementwise 路径,不能稳定验证本 PR 修改的 PADDLE_WITH_CUDA/HIP stride kernel。请保留当前 dynamic 处理,同时把这个用例显式绑定到目标 place,例如两个测试都复用下面的 helper:
@unittest.skipIf(
not (paddle.core.is_compiled_with_cuda() or is_custom_device()),
"core is not compiled with CUDA",
)
class TestCompareStridedSliceWithScalar(unittest.TestCase):
def setUp(self):
self.place = get_device_place()
def _make_label(self):
base = paddle.to_tensor(
np.arange(64, dtype=np.int64).reshape([1, 64, 1]),
place=self.place,
)
return base[:, :63, :]3c8811c to
af2ae49
Compare
risemeup1111
left a comment
There was a problem hiding this comment.
已复查最新提交。static/dygraph 模式风险已通过移除同文件旧用例里的显式模式切换得到处理;但新增回归测试仍未显式运行在 CUDA/custom place 上,无法稳定覆盖本 PR 修改的 stride CUDA/HIP kernel。具体建议已更新在原行级线程中。
Co-authored-by: Codex <noreply@openai.com>
af2ae49 to
db107cf
Compare
risemeup1111
left a comment
There was a problem hiding this comment.
已复查最新提交。之前指出的回归测试 device 覆盖问题已通过 get_device_place() 和 paddle.to_tensor(..., place=self.place) 处理;本轮未发现需要阻塞合入的新问题。CI 仍有任务进行中,请以 GitHub 最新结果为准。
ShigureNyako
left a comment
There was a problem hiding this comment.
已复查最新提交 db107cf2e0a72c3e120f526d9c258bff4a97d62a,我这边可以 approve。
本轮新增的 _make_label() 已通过 paddle.to_tensor(..., place=self.place) 将输入显式放到 get_device_place(),在 CUDA/custom CI 环境下会覆盖目标 stride CUDA/HIP kernel;同时移除了旧用例里的 static/dygraph 显式切换,避免后续回归用例落到 static/PIR 模式。核心 kernel 中线程本地 args[0] 的修复仍然正确。
当前 Check approval 属于 paddle/phi 变更的审批 gate;其余 CI 以最新 head 结果为准。
ShigureNyako
left a comment
There was a problem hiding this comment.
本轮需要重新 request changes。
最新 head db107cf2e0a72c3e120f526d9c258bff4a97d62a 的 CUDA/custom place 绑定方向是对的,但新增用例仍没有像同文件原有 stride 用例一样加 CUDA/custom guard。当前 Linux-XPU / Test 已在 test_compare_op_stride 上失败:test_not_equal_with_size_one_dim_stride 里 label.strides 实际为 [63, 1, 1],断言期望 [64, 1, 1]。这说明该回归用例会在非目标环境中运行,且没有稳定覆盖本 PR 修改的 CUDA/HIP stride kernel。
请按现有线程建议,为 TestCompareStridedSliceWithScalar 加上与上方用例一致的 guard,例如 not (paddle.core.is_compiled_with_cuda() or is_custom_device()) 时 skip。当前 Check approval 仍是 RD approval gate,单独等待即可。
Co-authored-by: Codex <codex@openai.com>
risemeup1111
left a comment
There was a problem hiding this comment.
已复查最新提交。新增回归用例现在与同文件现有 stride 用例一致,在非 CUDA/custom 环境会跳过,并且在目标环境中通过 paddle.to_tensor(..., place=self.place) 绑定到 get_device_place();此前的测试覆盖问题已解决。本轮未发现需要阻塞合入的新问题,CI 仍请以 GitHub 最新结果为准。
PR Category
Operator Mechanism
PR Types
Bug fixes
Description
修复 stride elementwise CUDA kernel 中线程本地数组索引错误的问题。
在
BinaryElementwiseKernel/UnaryElementwiseKernel中,线程本地数组args的长度为STRIDE_VEC_SIZE,当前为 1,但原实现使用全局线性下标idx写入args[idx]。当处理非 contiguous 输入的 compare/logical/elementwise stride kernel 时,线程号大于 0 会写越线程本地数组边界,在 CUDA 13.2 / sm_103 环境中可由label != -100触发 illegal memory access。本 PR 将本地输入缓存写入改为
args[0],并补充base[:, :63, :]形成 shape[1, 63, 1]、stride[64, 1, 1]后执行!=/==的 CUDA 回归用例。本地验证:
prekpython3 -m py_compile test/legacy_test/test_compare_op_stride.pygit diff --cached --checkrg "args\\[idx\\]|result\\[idx\\]" paddle/phi/kernels/stride/elementwise_stride_base.cu.h无命中是否引起精度变化
否。该修复只更正 stride elementwise kernel 的线程本地数组索引,不改变数学计算逻辑。