Skip to content

[PHI] Fix local index in strided elementwise kernel#79162

Open
SigureMo wants to merge 2 commits into
PaddlePaddle:developfrom
cattidea:codex/fix-stride-elementwise-local-index
Open

[PHI] Fix local index in strided elementwise kernel#79162
SigureMo wants to merge 2 commits into
PaddlePaddle:developfrom
cattidea:codex/fix-stride-elementwise-local-index

Conversation

@SigureMo
Copy link
Copy Markdown
Member

@SigureMo SigureMo commented May 27, 2026

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

修复 stride elementwise CUDA kernel 中线程本地数组索引错误的问题。

BinaryElementwiseKernel / UnaryElementwiseKernel 中,线程本地数组 args 的长度为 STRIDE_VEC_SIZE,当前为 1,但原实现使用全局线性下标 idx 写入 args[idx]。当处理非 contiguous 输入的 compare/logical/elementwise stride kernel 时,线程号大于 0 会写越线程本地数组边界,在 CUDA 13.2 / sm_103 环境中可由 label != -100 触发 illegal memory access。

本 PR 将本地输入缓存写入改为 args[0],并补充 base[:, :63, :] 形成 shape [1, 63, 1]、stride [64, 1, 1] 后执行 != / == 的 CUDA 回归用例。

本地验证:

  • prek
  • python3 -m py_compile test/legacy_test/test_compare_op_stride.py
  • git diff --cached --check
  • rg "args\\[idx\\]|result\\[idx\\]" paddle/phi/kernels/stride/elementwise_stride_base.cu.h 无命中
  • 未在本机执行 CUDA 回归测试:本地没有 CUDA build/runtime 环境。

是否引起精度变化

否。该修复只更正 stride elementwise kernel 的线程本地数组索引,不改变数学计算逻辑。

This PR is authored by @codex (gpt-5.5 xhigh)

Copilot AI review requested due to automatic review settings May 27, 2026 18:46
@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 27, 2026

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an out-of-bounds write in the CUDA/HIP strided elementwise kernels by correcting how per-thread local arrays are indexed, and adds a CUDA regression test that reproduces the failure scenario using a non-contiguous strided slice compared with a scalar.

Changes:

  • Fix BinaryElementwiseKernel / UnaryElementwiseKernel to write thread-local cached inputs into args[0] (instead of incorrectly using the global linear index idx).
  • Add CUDA-only regression tests covering != and == with a non-contiguous tensor slice having stride [64, 1, 1].

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
test/legacy_test/test_compare_op_stride.py Adds CUDA regression tests for scalar comparisons on a non-contiguous strided slice.
paddle/phi/kernels/stride/elementwise_stride_base.cu.h Fixes incorrect indexing into per-thread local args arrays to prevent OOB writes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@SigureMo SigureMo changed the title [CUDA] Fix local index in strided elementwise kernel [PHI] Fix local index in strided elementwise kernel May 27, 2026
@SigureMo SigureMo force-pushed the codex/fix-stride-elementwise-local-index branch 2 times, most recently from 0accd9e to b3b085a Compare May 27, 2026 19:01
Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查本次改动,未发现需要阻塞合入的问题。修复点与新增 CUDA 回归用例匹配;当前仍有审批类检查未通过,代码相关 CI 请以 GitHub 最新结果为准。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

@SigureMo SigureMo force-pushed the codex/fix-stride-elementwise-local-index branch from b3b085a to 3c8811c Compare May 27, 2026 19:06
Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

本轮先 request changes。核心 CUDA/HIP kernel 修复方向是对的:args 是线程本地、长度为 STRIDE_VEC_SIZE(当前为 1),这里写 args[0] 才能和后续 SameDimsElementwisePrimitiveCaller(..., read_lens=1) / result[0] 对齐,避免用全局 idx 写越界。

阻塞点:新增回归用例当前没有显式放到 CUDA/custom device 上,可能无法覆盖本 PR 实际修改的 stride CUDA/HIP kernel。请先补齐这一点后再合入。


class TestCompareStridedSliceWithScalar(unittest.TestCase):
def test_not_equal_with_size_one_dim_stride(self):
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里没有显式把输入放到 CUDA/custom device 上;paddle.arange 会按当前默认 device 创建 Tensor(通常是 CPU),因此这个回归用例可能只跑到 CPU elementwise 路径,覆盖不到本 PR 修改的 PADDLE_WITH_CUDA/HIP stride kernel。建议像上面的 TestBinaryElementwiseOp_Stride 一样加 device guard,并用 get_device_place()/paddle.to_tensor(..., place=...)(或显式 paddle.set_device(get_device()))确保 label != -100 / label == 1 在目标 GPU/custom place 上执行。

Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查最新 force-push。内核修复本身仍然保持上一轮判断,但新增回归测试在当前版本里可能落到 static 模式,无法稳定覆盖这次修复的 eager/CUDA stride kernel;具体修改建议已放在行级评论中。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment on lines +207 to +227
class TestCompareStridedSliceWithScalar(unittest.TestCase):
def test_not_equal_with_size_one_dim_stride(self):
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
label = base[:, :63, :]

self.assertEqual(list(label.shape), [1, 63, 1])
self.assertEqual(label.strides, [64, 1, 1])
self.assertFalse(label.is_contiguous())

out = label != -100
np.testing.assert_array_equal(
out.numpy(), np.ones([1, 63, 1], dtype=np.bool_)
)

def test_equal_with_size_one_dim_stride(self):
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
label = base[:, :63, :]

out = label == 1
expected = np.arange(63, dtype=np.int64).reshape([1, 63, 1]) == 1
np.testing.assert_array_equal(out.numpy(), expected)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 优先级:P1

这个新增用例依赖动态图中的 strided Tensor,但当前没有显式进入 dygraph。 同文件已有 TestBinaryElementwiseOp_Stride.test_dygraph_api_arithmetic 在用例结束时调用 paddle.enable_static(),整文件运行时后续这里的 paddle.arange(...).reshape(...) 可能落在 static 模式,out.numpy() 也就不能稳定作为 eager/CUDA stride kernel 的回归检查执行。请恢复 CUDA-only guard,并在每个用例内显式切到目标 place 的动态图模式后再恢复 static。

Suggested change
class TestCompareStridedSliceWithScalar(unittest.TestCase):
def test_not_equal_with_size_one_dim_stride(self):
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
label = base[:, :63, :]
self.assertEqual(list(label.shape), [1, 63, 1])
self.assertEqual(label.strides, [64, 1, 1])
self.assertFalse(label.is_contiguous())
out = label != -100
np.testing.assert_array_equal(
out.numpy(), np.ones([1, 63, 1], dtype=np.bool_)
)
def test_equal_with_size_one_dim_stride(self):
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
label = base[:, :63, :]
out = label == 1
expected = np.arange(63, dtype=np.int64).reshape([1, 63, 1]) == 1
np.testing.assert_array_equal(out.numpy(), expected)
@unittest.skipIf(
not paddle.core.is_compiled_with_cuda(),
"core is not compiled with CUDA",
)
class TestCompareStridedSliceWithScalar(unittest.TestCase):
def setUp(self):
self.place = get_device_place()
def test_not_equal_with_size_one_dim_stride(self):
paddle.disable_static(self.place)
try:
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
label = base[:, :63, :]
self.assertEqual(list(label.shape), [1, 63, 1])
self.assertEqual(label.strides, [64, 1, 1])
self.assertFalse(label.is_contiguous())
out = label != -100
np.testing.assert_array_equal(
out.numpy(), np.ones([1, 63, 1], dtype=np.bool_)
)
finally:
paddle.enable_static()
def test_equal_with_size_one_dim_stride(self):
paddle.disable_static(self.place)
try:
base = paddle.arange(64, dtype="int64").reshape([1, 64, 1])
label = base[:, :63, :]
out = label == 1
expected = np.arange(63, dtype=np.int64).reshape([1, 63, 1]) == 1
np.testing.assert_array_equal(out.numpy(), expected)
finally:
paddle.enable_static()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 优先级:P1

最新提交删除了同文件旧用例里的 paddle.enable_static() / paddle.disable_static() 切换,这部分已经消除了我上一条评论里“后续用例可能落到 static 模式”的证据。但新增回归用例仍然直接用 paddle.arange(...) 创建 base,没有跳过非 CUDA/custom 环境,也没有把输入放到 get_device_place();当默认 device 是 CPU 时,它仍可能只覆盖 CPU elementwise 路径,不能稳定验证本 PR 修改的 PADDLE_WITH_CUDA/HIP stride kernel。请保留当前 dynamic 处理,同时把这个用例显式绑定到目标 place,例如两个测试都复用下面的 helper:

@unittest.skipIf(
    not (paddle.core.is_compiled_with_cuda() or is_custom_device()),
    "core is not compiled with CUDA",
)
class TestCompareStridedSliceWithScalar(unittest.TestCase):
    def setUp(self):
        self.place = get_device_place()

    def _make_label(self):
        base = paddle.to_tensor(
            np.arange(64, dtype=np.int64).reshape([1, 64, 1]),
            place=self.place,
        )
        return base[:, :63, :]

@SigureMo SigureMo force-pushed the codex/fix-stride-elementwise-local-index branch from 3c8811c to af2ae49 Compare May 28, 2026 01:39
Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查最新提交。static/dygraph 模式风险已通过移除同文件旧用例里的显式模式切换得到处理;但新增回归测试仍未显式运行在 CUDA/custom place 上,无法稳定覆盖本 PR 修改的 stride CUDA/HIP kernel。具体建议已更新在原行级线程中。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Co-authored-by: Codex <noreply@openai.com>
@SigureMo SigureMo force-pushed the codex/fix-stride-elementwise-local-index branch from af2ae49 to db107cf Compare May 28, 2026 01:54
Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查最新提交。之前指出的回归测试 device 覆盖问题已通过 get_device_place()paddle.to_tensor(..., place=self.place) 处理;本轮未发现需要阻塞合入的新问题。CI 仍有任务进行中,请以 GitHub 最新结果为准。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🐾

已复查最新提交 db107cf2e0a72c3e120f526d9c258bff4a97d62a,我这边可以 approve。

本轮新增的 _make_label() 已通过 paddle.to_tensor(..., place=self.place) 将输入显式放到 get_device_place(),在 CUDA/custom CI 环境下会覆盖目标 stride CUDA/HIP kernel;同时移除了旧用例里的 static/dygraph 显式切换,避免后续回归用例落到 static/PIR 模式。核心 kernel 中线程本地 args[0] 的修复仍然正确。

当前 Check approval 属于 paddle/phi 变更的审批 gate;其余 CI 以最新 head 结果为准。

Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

本轮需要重新 request changes。

最新 head db107cf2e0a72c3e120f526d9c258bff4a97d62a 的 CUDA/custom place 绑定方向是对的,但新增用例仍没有像同文件原有 stride 用例一样加 CUDA/custom guard。当前 Linux-XPU / Test 已在 test_compare_op_stride 上失败:test_not_equal_with_size_one_dim_stridelabel.strides 实际为 [63, 1, 1],断言期望 [64, 1, 1]。这说明该回归用例会在非目标环境中运行,且没有稳定覆盖本 PR 修改的 CUDA/HIP stride kernel。

请按现有线程建议,为 TestCompareStridedSliceWithScalar 加上与上方用例一致的 guard,例如 not (paddle.core.is_compiled_with_cuda() or is_custom_device()) 时 skip。当前 Check approval 仍是 RD approval gate,单独等待即可。

Co-authored-by: Codex <codex@openai.com>
Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🎬+🐾
LGTMeow 🐾

已确认最新提交在新增回归类上补上了与同文件原有 stride 用例一致的 CUDA/custom guard,并继续通过 paddle.to_tensor(..., place=self.place) 显式绑定目标 place。之前 Linux-XPU / Test 暴露的非目标环境运行问题已修复;当前剩余的 Check approval 属于 RD approval gate。

Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查最新提交。新增回归用例现在与同文件现有 stride 用例一致,在非 CUDA/custom 环境会跳过,并且在目标环境中通过 paddle.to_tensor(..., place=self.place) 绑定到 get_device_place();此前的测试覆盖问题已解决。本轮未发现需要阻塞合入的新问题,CI 仍请以 GitHub 最新结果为准。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants