[Execute Infrastructure] Add cublasSetStream to support non-default stream in cublas kernel by xuanyuanminzheng · Pull Request #79155 · PaddlePaddle/Paddle

xuanyuanminzheng · 2026-05-27T02:44:11Z

PR Category

Execute Infrastructure

PR Types

Improvements

Description

FD在进行MOE的共享专家双流并行提速时，发现Paddle的cublas算子缺少设置stream的能力，加以补充。
增加了一个cublasSetStream_v2，以支持cublas算子在非主stream运行.

是否引起精度变化

否

paddle-bot · 2026-05-27T02:44:17Z

你的PR提交成功，感谢你对开源项目的贡献!
请关注后续CI自动化测试结果，详情请参考Paddle-CI手册。
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

risemeup1111

本次处理事件：poll（当前 head dcba329f8b40dd0c4043772db64a4f62f3312aee）。

结论：REQUEST_CHANGES。已发布 2 条 inline 评论，细节请看行内评论；主要阻塞点是当前实现会破坏 HIP/ROCm 编译，并且 CUDA 分支在 cublasSetStream_v2 后会丢失已设置的 cuBLAS workspace。

优先级：P3 非行级：PR 描述和 CI 模板检查无法绑定到 diff 行。当前 Check PR Template 仍失败，日志显示 PR Category 被解析成包含开头描述文本的异常内容；请按最新 PR 模板重整 body，使 ### PR Category 独立填写一个合法类别，并补充对应测试/风险说明。

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

risemeup1111 · 2026-05-27T02:59:05Z

    });
    if (blas_tf32_tensor_core_handle_ && phi::AllowTF32Cublas()) {
      std::lock_guard<std::mutex> guard(blas_tf32_mtx_);
+      PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(


优先级：P0

这里无条件调用 CUDA 的 phi::dynload::cublasSetStream_v2，但 gpu_context.cc 同时会在 PADDLE_WITH_HIP 下编译；HIP 分支只 include 了 rocblas.h，blasHandle_t 也是 rocblas_handle，因此 ROCm/HIP 构建会找不到 cublasSetStream_v2 并编译失败。请把设置 stream 的逻辑抽成同时覆盖 CUDA/HIP 的 helper，并在下面所有新增调用点统一使用它，例如：

void SetBlasStream(blasHandle_t handle) { #ifdef PADDLE_WITH_HIP PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::rocblas_set_stream(handle, stream())); #else PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(handle, stream())); #endif }

risemeup1111 · 2026-05-27T02:59:30Z

    });
    if (blas_tf32_tensor_core_handle_ && phi::AllowTF32Cublas()) {
      std::lock_guard<std::mutex> guard(blas_tf32_mtx_);
+      PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(


优先级：P1

这里在 std::call_once 中已经为非 legacy GEMM 调过 SetCublasWorkspace(...)，但 CUDA 的 cublasSetStream 会把用户设置的 workspace 重置为默认 workspace pool。现在每次进入 CublasCall 都先 cublasSetStream_v2 再执行 GEMM，会导致前面配置的持久 cublas_workspace_ 不再生效，回退现有大 workspace 优化。请在 CUDA 分支设置 stream 后重新设置 workspace，或把两者封装成同一个 helper，保证 cublasSetStream_v2 之后立即恢复 workspace：

void SetBlasStream(blasHandle_t handle) { #ifdef PADDLE_WITH_HIP PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::rocblas_set_stream(handle, stream())); #else PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(handle, stream())); #if !defined(_WIN32) if (!FLAGS_use_legacy_gemm) { SetCublasWorkspace(handle); } #endif #endif }

risemeup1111

已复查新的提交。之前关于 HIP/ROCm 编译分支和 cublasSetStream 后 workspace 被重置的两个阻塞点，看起来已经通过 SetBlasStream 统一处理并重新设置 workspace 修复。

但当前实现仍漏掉了直接通过 dev_ctx.cublas_handle() 获取裸 handle 的 cuBLAS 调用路径，相关细节已放在新的行内评论里；这个问题会让部分 cuBLAS kernel 在切到非默认 stream 后仍可能使用旧 stream，需要继续修正后再合入。

优先级：P3 非行级：PR 模板检查仍失败，CI 日志显示 PR Category 仍被解析成包含开头描述文本的异常内容。请按仓库最新模板整理 PR body，使 ### PR Category 独立填写合法类别，并补充测试/风险说明。

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

risemeup1111 · 2026-05-27T03:48:42Z

+  // On CUDA (non-Windows, non-legacy_gemm) the pre-allocated workspace is
+  // re-applied after the stream switch because cublasSetStream_v2 resets the
+  // handle's workspace pointer to the default pool.
+  void SetBlasStream(blasHandle_t handle) {


优先级：P1

这个 helper 目前只在 CublasCall / TensorCoreCublasCallIfAvailable 路径里调用，但仓库里仍有 cuBLAS kernel 直接通过 dev_ctx.cublas_handle() 获取裸 handle，例如 paddle/phi/kernels/legacy/gpu/batched_gemm.cu 的 grouped GEMM 和 paddle/phi/kernels/gpu/eig_grad_kernel.cu。这些路径不会经过这里的 stream 绑定；当同一个 GPUContext 被 SetCUDAStream 切到非默认 stream 后，直接使用 cublas_handle() 的算子仍可能继续在 handle 上一次绑定的 stream 上执行，MOE 的 batched_gemm 场景也就没有被这个 PR 覆盖完整。

请把裸 handle 获取路径也纳入 stream 绑定，至少在 GetBlasHandle() 返回前按当前 stream 调用同一套逻辑；更稳妥的做法是把这些直接使用 cuBLAS 的 kernel 迁移到带 callback 的接口，保证设置 stream 和实际 cuBLAS 调用处在同一个受保护区间。例如修复形态可以是：

blasHandle_t GetBlasHandle() { // lazy init existing handles... std::lock_guard<std::mutex> guard(blas_mtx_); gpuStream_t cur = stream(); if (cur != blas_handle_stream_) { SetBlasStream(blas_handle_); blas_handle_stream_ = cur; } return blas_handle_; }

或者将直接调用方改成：

dev_ctx.CublasCall([&](blasHandle_t handle) { // all cuBLAS calls using this handle });

这是算子实现者选用的方式问题，如果修改涉及面较广，且在当前场景下没有实际影响。

risemeup1111

已复查新的提交。作者已经把 cublas_handle() 获取裸 handle 的路径纳入了 stream 绑定，之前指出的覆盖缺口有推进。

不过当前实现只在返回裸 handle 前短暂持锁，真正的 cuBLAS 调用仍在锁外执行，同一个 handle 仍可能被其他 stream 的调用切换；这个并发问题已在新的行内评论里说明，需要继续修正后再合入。

当前可见 CI 失败主要是 paddle/phi 目录变更需要 RD 审批；未看到新的代码格式问题。

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

risemeup1111 · 2026-05-27T04:26:29Z

+    // Bind the handle to the current stream so callers that obtain the raw
+    // handle via cublas_handle() (e.g. batched_gemm, eig_grad) dispatch to
+    // the correct stream even when SetCUDAStream has switched the context.
+    std::lock_guard<std::mutex> guard(blas_mtx_);


优先级：P1

这里虽然在返回裸 blas_handle_ 前设置了当前 stream，但锁会在 GetBlasHandle() 返回时释放，调用方随后执行真正的 cuBLAS 调用时不再受 blas_mtx_ 保护。batched_gemm 会先保存 dev_ctx.cublas_handle() 的返回值，再循环发多次 GEMM；如果另一个线程/stream 同时进入 CublasCall()、TensorCoreCublasCallIfAvailable() 或再次调用 cublas_handle()，同一个 blas_handle_ 的 stream 仍可能在这些 GEMM 之间被切走，非默认 stream 场景仍不可靠。

请不要用“返回裸 handle 前短暂加锁”来修复这类调用。更稳妥的做法是让直接使用 cuBLAS 的调用点改为 CublasCall 回调接口，使设置 stream 和所有 cuBLAS 调用处在同一个锁保护区间；如果必须保留裸 handle API，也需要提供一个持锁的 RAII/回调接口而不是单独返回 handle。例如 batched_gemm 这类路径应改成：

dev_ctx.CublasCall([&](blasHandle_t handle) { for (int64_t i = 0; i < num_experts; ++i) { CublasGemm(handle, /* existing arguments */); } });

risemeup1111

已复查新的提交。之前关于 HIP/ROCm 编译分支和 cublasSetStream 后 workspace 恢复的问题仍看起来已处理；新增的锁内 stream 绑定也覆盖了 CublasCall / TensorCoreCublasCallIfAvailable 路径。

但上一轮关于裸 cublas_handle() 调用路径的覆盖缺口仍未解决，新的行内评论里已说明需要把这些直接 cuBLAS 调用迁移到持锁回调或等价接口后再合入。当前 CI 也仍有审批失败和若干运行中任务需要等待。

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

risemeup1111 · 2026-05-27T04:48:31Z

+  // On CUDA (non-Windows, non-legacy_gemm) the pre-allocated workspace is
+  // re-applied after the stream switch because cublasSetStream_v2 resets the
+  // handle's workspace pointer to the default pool.
+  void SetBlasStream(blasHandle_t handle) {


优先级：P1

最新提交移除了 GetBlasHandle() 里的短锁绑定，但 SetBlasStream() 仍只在 CublasCall() / TensorCoreCublasCallIfAvailable() 回调路径中调用；现有 paddle/phi/kernels/legacy/gpu/batched_gemm.cu 和 paddle/phi/kernels/gpu/eig_grad_kernel.cu 仍直接通过 dev_ctx.cublas_handle() 取裸 handle。这样在 SetCUDAStream() 切换同一个 GPUContext 的 stream 后，这些直接调用路径不会重新绑定当前 stream，batched_gemm 的 MOE 场景仍可能继续跑在 handle 上一次绑定的 stream 上。

请把直接使用 cuBLAS 的调用点迁移到持锁的回调接口，或者提供等价的持锁 RAII/回调 API，保证设置 stream 和实际 cuBLAS 调用处于同一个保护区间。例如：

dev_ctx.CublasCall([&](blasHandle_t handle) { for (int64_t i = 0; i < num_experts; ++i) { CublasGemm(handle, /* existing arguments */); } });

eig_grad 中连续的 cublas{C,Z}geam 调用也需要按同样方式放进 CublasCall，避免继续绕过这里新增的 stream 绑定。

这是算子实现者选用的方式问题，如果修改涉及面较广，且在当前场景下没有实际影响。

codecov-commenter · 2026-05-27T08:19:37Z

Codecov Report

❌ Patch coverage is 61.90476% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@05e095f). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
paddle/phi/backends/gpu/gpu_context.cc	61.90%	8 Missing ⚠️

❌ Your patch status has failed because the patch coverage (61.90%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop   #79155   +/-   ##
==========================================
  Coverage           ?   61.90%           
==========================================
  Files              ?        1           
  Lines              ?       21           
  Branches           ?        0           
==========================================
  Hits               ?       13           
  Misses             ?        8           
  Partials           ?        0

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

risemeup1111

已复查新的提交。新增测试覆盖了 GetBlas(...).MatMul 通过回调接口走非默认 stream 的场景，但上一轮关于裸 cublas_handle() 直接调用路径的阻塞点仍未通过代码修复解决，细节已放在新的行内评论里。

另外新增的 stream cache 测试当前没有真正复用同一个 GPUContext 切换 stream，测试覆盖也需要调整。当前 CI 仍有审批失败和多个运行中任务需要等待。

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

risemeup1111 · 2026-05-27T08:46:10Z

+  // Alternate between stream_a and stream_b for 4 iterations.
+  for (int iter = 0; iter < 4; ++iter) {
+    gpuStream_t cur_raw = (iter % 2 == 0) ? stream_a : stream_b;
+    auto ctx = MakeCtxOnStream(gpu_place, cur_raw);


优先级：P2

这个测试注释说要“Switch the same GPUContext between two auxiliary streams”，但循环内每次都 MakeCtxOnStream(...) 新建一个 GPUContext，因此每个 context 的 blas_handle_stream_ 初始都是 nullptr，实际没有覆盖“同一个 cuBLAS handle 在 stream_a 和 stream_b 之间反复切换时缓存失效并重新 SetBlasStream”的路径。这样即使缓存字段没有在同一 context 切流时正确更新，这个测试也不会失败。

请把 ctx 移到循环外，只创建一个 GPUContext，然后在每轮对同一个 context 调用 SetCUDAStream(...) 切到 stream_a / stream_b 后再执行 MatMul；或者直接重写为一个 helper，确保同一个 blas_handle_ 被复用。例如：

auto ctx = MakeCtxOnStream(gpu_place, stream_a); for (int iter = 0; iter < 4; ++iter) { gpuStream_t cur_raw = (iter % 2 == 0) ? stream_a : stream_b; ctx->SetCUDAStream(new phi::CUDAStream(gpu_place, cur_raw), /*clear=*/true); // run MatMul with the same ctx and verify result }

如果采用 new CUDAStream，还需要同时处理 wrapper 的所有权，避免测试循环泄漏 CUDAStream 对象。

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-27 16:59 CST

📋 Review 摘要

PR 概述：为 GPUContext 中的 cuBLAS/rocBLAS handle 增加延迟 stream 绑定能力，支持在非默认 stream 上运行 cublas 算子。
变更范围：paddle/phi/backends/gpu/gpu_context.cc、test/cpp/phi/kernels/test_math_function.cu
影响面 Tag：[Operator Mechanism]

问题

级别	文件	概述
❓ 疑问	`paddle/phi/backends/gpu/gpu_context.cc`	[F2] HIP/rocBLAS 切换 stream 后未说明是否需要重新 apply workspace
❓ 疑问	`paddle/phi/backends/gpu/gpu_context.cc`	[F3] Windows 路径下 cublasSetStream_v2 后未重新 apply workspace，原因未说明

历史 Findings 修复情况

Finding	问题	状态
F1	每次 cublas 调用都执行 cublasSetStream_v2，即使 stream 未变化	✅ 已修复
F2	HIP/rocBLAS 路径切换 stream 后是否需要重新 apply workspace	⚠️ 仍存在
F3	Windows 路径下 cublasSetStream_v2 后未重新 apply workspace	⚠️ 仍存在

📝 PR 规范检查

标题 Tag [Execute Infrastructure] 与变更路径（paddle/phi/backends/gpu/）不符，paddle/phi/backends/ 属于 [Operator Mechanism] 范畴，建议修正。PR 描述四个必填 section 均已填写，结构合规。

标题建议（可直接复制）：

[Operator Mechanism] Add cublasSetStream to support non-default stream in cublas kernel

总体评价

本 PR 核心逻辑清晰，通过缓存每个 handle 最后绑定的 stream 来避免重复调用 cublasSetStream_v2，F1 已得到修复。F2（rocBLAS workspace）和 F3（Windows workspace）两个疑问建议在代码注释中补充说明，以提升可维护性。测试覆盖了正确性、多次流切换缓存失效和并发流独立性三个场景，质量较好。

risemeup1111

已复查新的提交。上一轮关于 stream cache 测试没有复用同一个 GPUContext 的问题看起来已处理，新测试现在会在同一个 context 上切换 stream。

但裸 cublas_handle() 直接调用路径仍未通过代码修复解决，当前 head 里已补充新的行内评论说明仍需把 batched_gemm / eig_grad 这类路径纳入持锁回调或等价接口。当前 CI 仍有审批失败和多个运行中任务需要等待。

^{Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.}

xuanyuanminzheng · 2026-05-27T11:17:46Z

/re-run all-failed

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-27T00:00:00Z

📋 Review 摘要

PR 概述：为 GPUContext 的 cuBLAS/rocBLAS handle 增加非默认 stream 支持，通过缓存上次绑定的 stream 避免冗余的 cublasSetStream_v2 调用。
变更范围：paddle/phi/backends/gpu/gpu_context.cc、test/cpp/phi/kernels/test_math_function.cu
影响面 Tag：[Operator Mechanism] [Execute Infrastructure]

问题

级别	文件	概述
🟡 建议	`paddle/phi/backends/gpu/gpu_context.cc:349`	`SetBlasStream` 内部重复调用 `stream()`，存在 TOCTOU 隐患，建议直接传入 stream 参数

历史 Findings 修复情况

Finding	问题	状态
F1	每次 cublas 调用都执行 `cublasSetStream_v2`，即使 stream 未变化也会产生额外开销	✅ 已修复
F2	HIP/rocBLAS 路径切换 stream 后是否需要重新 apply workspace	⚠️ 仍存在
F3	Windows 路径下 `cublasSetStream_v2` 后未重新 apply workspace	⚠️ 仍存在

📝 PR 规范检查

标题 Tag [Execute Infrastructure] 与变更文件路径不符（paddle/phi/backends/gpu/ 更接近 [Operator Mechanism]），建议修正（历史已报告，未修复）。

标题建议（可直接复制）：

[Operator Mechanism] add cublasSetStream to support non-default stream in cublas kernel

总体评价

F1 缓存优化已落地，代码整体质量较好，测试覆盖了单流、多流切换和并发三个场景。F2/F3（HIP workspace 和 Windows workspace 重新 apply）仍未处理，建议作者确认是否有意为之并在 PR 描述中说明。SetBlasStream 内部重复调用 stream() 存在轻微 TOCTOU 隐患，建议改为传参方式。

PaddlePaddle-bot · 2026-05-27T15:45:09Z

+
  // Persistent cublasLt workspace: grow-only, freed in destructor.
  // Returns {ptr, size}. Thread-safe via mutex for grow path.
  std::pair<void*, size_t> GetCublasLtWorkspace(size_t required_size) {


🟡 建议 SetBlasStream 内部再次调用 stream() 存在 TOCTOU 隐患。

调用方已经通过 gpuStream_t cur = stream() 获取了当前 stream，但 SetBlasStream 内部又调用了一次 stream()。若两次调用之间 stream 发生变化（虽然概率低），cublas handle 会被绑定到新 stream，但 blas_handle_stream_ 却被设置为旧的 cur，导致缓存状态不一致。

建议修改 SetBlasStream 签名，直接接受 stream 参数：

void SetBlasStream(blasHandle_t handle, gpuStream_t s) { #ifdef PADDLE_WITH_HIP PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::rocblas_set_stream(handle, s)); #else PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(handle, s)); #if !defined(_WIN32) if (!FLAGS_use_legacy_gemm) { SetCublasWorkspace(handle); } #endif #endif }

调用处改为 SetBlasStream(blas_handle_, cur);，消除重复调用和 TOCTOU 风险。

This comment was marked as outdated.

Sign in to view

xuanyuanminzheng force-pushed the feature_cublas_stream branch from 29cf712 to dcba329 Compare May 27, 2026 02:54

xuanyuanminzheng changed the title ~~[Execute Infrastructure] add cublasSetStream for change stream in cublas kernel.~~ [Operator Mechanism] add cublasSetStream to support non-default stream in cublas kernel May 27, 2026

risemeup1111 suggested changes May 27, 2026

View reviewed changes

xuanyuanminzheng changed the title ~~[Operator Mechanism] add cublasSetStream to support non-default stream in cublas kernel~~ [Execute Infrastructure] add cublasSetStream to support non-default stream in cublas kernel May 27, 2026

This comment was marked as outdated.

Sign in to view

xuanyuanminzheng changed the title ~~[Execute Infrastructure] add cublasSetStream to support non-default stream in cublas kernel~~ 【Execute Infrastructure】 add cublasSetStream to support non-default stream in cublas kernel May 27, 2026

This comment was marked as outdated.

Sign in to view

xuanyuanminzheng changed the title ~~【Execute Infrastructure】 add cublasSetStream to support non-default stream in cublas kernel~~ [Execute Infrastructure] Add cublasSetStream to support non-default stream in cublas kernel May 27, 2026

risemeup1111 suggested changes May 27, 2026

View reviewed changes

This comment was marked as outdated.

Sign in to view

xuanyuanminzheng force-pushed the feature_cublas_stream branch from 869b162 to 003f6e5 Compare May 27, 2026 04:42

This comment was marked as outdated.

Sign in to view

risemeup1111 suggested changes May 27, 2026

View reviewed changes

PaddlePaddle-bot reviewed May 27, 2026

View reviewed changes

upgrade change stream in cublas.

7d9372d

xuanyuanminzheng force-pushed the feature_cublas_stream branch from db08478 to 7d9372d Compare May 27, 2026 09:05

risemeup1111 suggested changes May 27, 2026

View reviewed changes

Comment thread paddle/phi/backends/gpu/gpu_context.cc

PaddlePaddle-bot reviewed May 27, 2026

View reviewed changes

Conversation

xuanyuanminzheng commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Category

PR Types

Description

是否引起精度变化

Uh oh!

paddle-bot Bot commented May 27, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

risemeup1111 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

risemeup1111 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

risemeup1111 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

risemeup1111 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 27, 2026

Codecov Report

Uh oh!

risemeup1111 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

问题

历史 Findings 修复情况

📝 PR 规范检查

总体评价

Uh oh!

risemeup1111 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xuanyuanminzheng commented May 27, 2026

Uh oh!

PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

问题

历史 Findings 修复情况

📝 PR 规范检查

总体评价

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

xuanyuanminzheng commented May 27, 2026 •

edited

Loading