Skip to content

[Execute Infrastructure] Add cublasSetStream to support non-default stream in cublas kernel#79155

Open
xuanyuanminzheng wants to merge 1 commit into
PaddlePaddle:developfrom
xuanyuanminzheng:feature_cublas_stream
Open

[Execute Infrastructure] Add cublasSetStream to support non-default stream in cublas kernel#79155
xuanyuanminzheng wants to merge 1 commit into
PaddlePaddle:developfrom
xuanyuanminzheng:feature_cublas_stream

Conversation

@xuanyuanminzheng
Copy link
Copy Markdown
Contributor

@xuanyuanminzheng xuanyuanminzheng commented May 27, 2026

PR Category

Execute Infrastructure

PR Types

Improvements

Description

FD在进行MOE的共享专家双流并行提速时,发现Paddle的cublas算子缺少设置stream的能力,加以补充。
增加了一个cublasSetStream_v2,以支持cublas算子在非主stream运行.

是否引起精度变化

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented May 27, 2026

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

PaddlePaddle-bot

This comment was marked as outdated.

@xuanyuanminzheng xuanyuanminzheng force-pushed the feature_cublas_stream branch from 29cf712 to dcba329 Compare May 27, 2026 02:54
@xuanyuanminzheng xuanyuanminzheng changed the title [Execute Infrastructure] add cublasSetStream for change stream in cublas kernel. [Operator Mechanism] add cublasSetStream to support non-default stream in cublas kernel May 27, 2026
Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

本次处理事件:poll(当前 head dcba329f8b40dd0c4043772db64a4f62f3312aee)。

结论:REQUEST_CHANGES。已发布 2 条 inline 评论,细节请看行内评论;主要阻塞点是当前实现会破坏 HIP/ROCm 编译,并且 CUDA 分支在 cublasSetStream_v2 后会丢失已设置的 cuBLAS workspace。

  • P3 优先级:P3 非行级:PR 描述和 CI 模板检查无法绑定到 diff 行。当前 Check PR Template 仍失败,日志显示 PR Category 被解析成包含开头描述文本的异常内容;请按最新 PR 模板重整 body,使 ### PR Category 独立填写一个合法类别,并补充对应测试/风险说明。
Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment thread paddle/phi/backends/gpu/gpu_context.cc Outdated
});
if (blas_tf32_tensor_core_handle_ && phi::AllowTF32Cublas()) {
std::lock_guard<std::mutex> guard(blas_tf32_mtx_);
PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 优先级:P0

这里无条件调用 CUDA 的 phi::dynload::cublasSetStream_v2,但 gpu_context.cc 同时会在 PADDLE_WITH_HIP 下编译;HIP 分支只 include 了 rocblas.hblasHandle_t 也是 rocblas_handle,因此 ROCm/HIP 构建会找不到 cublasSetStream_v2 并编译失败。请把设置 stream 的逻辑抽成同时覆盖 CUDA/HIP 的 helper,并在下面所有新增调用点统一使用它,例如:

void SetBlasStream(blasHandle_t handle) {
#ifdef PADDLE_WITH_HIP
  PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::rocblas_set_stream(handle, stream()));
#else
  PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(handle, stream()));
#endif
}

Comment thread paddle/phi/backends/gpu/gpu_context.cc Outdated
});
if (blas_tf32_tensor_core_handle_ && phi::AllowTF32Cublas()) {
std::lock_guard<std::mutex> guard(blas_tf32_mtx_);
PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 优先级:P1

这里在 std::call_once 中已经为非 legacy GEMM 调过 SetCublasWorkspace(...),但 CUDA 的 cublasSetStream 会把用户设置的 workspace 重置为默认 workspace pool。现在每次进入 CublasCall 都先 cublasSetStream_v2 再执行 GEMM,会导致前面配置的持久 cublas_workspace_ 不再生效,回退现有大 workspace 优化。请在 CUDA 分支设置 stream 后重新设置 workspace,或把两者封装成同一个 helper,保证 cublasSetStream_v2 之后立即恢复 workspace:

void SetBlasStream(blasHandle_t handle) {
#ifdef PADDLE_WITH_HIP
  PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::rocblas_set_stream(handle, stream()));
#else
  PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(handle, stream()));
#if !defined(_WIN32)
  if (!FLAGS_use_legacy_gemm) {
    SetCublasWorkspace(handle);
  }
#endif
#endif
}

@xuanyuanminzheng xuanyuanminzheng changed the title [Operator Mechanism] add cublasSetStream to support non-default stream in cublas kernel [Execute Infrastructure] add cublasSetStream to support non-default stream in cublas kernel May 27, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@xuanyuanminzheng xuanyuanminzheng changed the title [Execute Infrastructure] add cublasSetStream to support non-default stream in cublas kernel 【Execute Infrastructure】 add cublasSetStream to support non-default stream in cublas kernel May 27, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@xuanyuanminzheng xuanyuanminzheng changed the title 【Execute Infrastructure】 add cublasSetStream to support non-default stream in cublas kernel [Execute Infrastructure] Add cublasSetStream to support non-default stream in cublas kernel May 27, 2026
Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查新的提交。之前关于 HIP/ROCm 编译分支和 cublasSetStream 后 workspace 被重置的两个阻塞点,看起来已经通过 SetBlasStream 统一处理并重新设置 workspace 修复。

但当前实现仍漏掉了直接通过 dev_ctx.cublas_handle() 获取裸 handle 的 cuBLAS 调用路径,相关细节已放在新的行内评论里;这个问题会让部分 cuBLAS kernel 在切到非默认 stream 后仍可能使用旧 stream,需要继续修正后再合入。

  • P3 优先级:P3 非行级:PR 模板检查仍失败,CI 日志显示 PR Category 仍被解析成包含开头描述文本的异常内容。请按仓库最新模板整理 PR body,使 ### PR Category 独立填写合法类别,并补充测试/风险说明。
Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

// On CUDA (non-Windows, non-legacy_gemm) the pre-allocated workspace is
// re-applied after the stream switch because cublasSetStream_v2 resets the
// handle's workspace pointer to the default pool.
void SetBlasStream(blasHandle_t handle) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 优先级:P1

这个 helper 目前只在 CublasCall / TensorCoreCublasCallIfAvailable 路径里调用,但仓库里仍有 cuBLAS kernel 直接通过 dev_ctx.cublas_handle() 获取裸 handle,例如 paddle/phi/kernels/legacy/gpu/batched_gemm.cu 的 grouped GEMM 和 paddle/phi/kernels/gpu/eig_grad_kernel.cu。这些路径不会经过这里的 stream 绑定;当同一个 GPUContextSetCUDAStream 切到非默认 stream 后,直接使用 cublas_handle() 的算子仍可能继续在 handle 上一次绑定的 stream 上执行,MOE 的 batched_gemm 场景也就没有被这个 PR 覆盖完整。

请把裸 handle 获取路径也纳入 stream 绑定,至少在 GetBlasHandle() 返回前按当前 stream 调用同一套逻辑;更稳妥的做法是把这些直接使用 cuBLAS 的 kernel 迁移到带 callback 的接口,保证设置 stream 和实际 cuBLAS 调用处在同一个受保护区间。例如修复形态可以是:

blasHandle_t GetBlasHandle() {
  // lazy init existing handles...
  std::lock_guard<std::mutex> guard(blas_mtx_);
  gpuStream_t cur = stream();
  if (cur != blas_handle_stream_) {
    SetBlasStream(blas_handle_);
    blas_handle_stream_ = cur;
  }
  return blas_handle_;
}

或者将直接调用方改成:

dev_ctx.CublasCall([&](blasHandle_t handle) {
  // all cuBLAS calls using this handle
});

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这是算子实现者选用的方式问题,如果修改涉及面较广,且在当前场景下没有实际影响。

Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查新的提交。作者已经把 cublas_handle() 获取裸 handle 的路径纳入了 stream 绑定,之前指出的覆盖缺口有推进。

不过当前实现只在返回裸 handle 前短暂持锁,真正的 cuBLAS 调用仍在锁外执行,同一个 handle 仍可能被其他 stream 的调用切换;这个并发问题已在新的行内评论里说明,需要继续修正后再合入。

当前可见 CI 失败主要是 paddle/phi 目录变更需要 RD 审批;未看到新的代码格式问题。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment thread paddle/phi/backends/gpu/gpu_context.cc Outdated
// Bind the handle to the current stream so callers that obtain the raw
// handle via cublas_handle() (e.g. batched_gemm, eig_grad) dispatch to
// the correct stream even when SetCUDAStream has switched the context.
std::lock_guard<std::mutex> guard(blas_mtx_);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 优先级:P1

这里虽然在返回裸 blas_handle_ 前设置了当前 stream,但锁会在 GetBlasHandle() 返回时释放,调用方随后执行真正的 cuBLAS 调用时不再受 blas_mtx_ 保护。batched_gemm 会先保存 dev_ctx.cublas_handle() 的返回值,再循环发多次 GEMM;如果另一个线程/stream 同时进入 CublasCall()TensorCoreCublasCallIfAvailable() 或再次调用 cublas_handle(),同一个 blas_handle_ 的 stream 仍可能在这些 GEMM 之间被切走,非默认 stream 场景仍不可靠。

请不要用“返回裸 handle 前短暂加锁”来修复这类调用。更稳妥的做法是让直接使用 cuBLAS 的调用点改为 CublasCall 回调接口,使设置 stream 和所有 cuBLAS 调用处在同一个锁保护区间;如果必须保留裸 handle API,也需要提供一个持锁的 RAII/回调接口而不是单独返回 handle。例如 batched_gemm 这类路径应改成:

dev_ctx.CublasCall([&](blasHandle_t handle) {
  for (int64_t i = 0; i < num_experts; ++i) {
    CublasGemm(handle, /* existing arguments */);
  }
});

PaddlePaddle-bot

This comment was marked as outdated.

@xuanyuanminzheng xuanyuanminzheng force-pushed the feature_cublas_stream branch from 869b162 to 003f6e5 Compare May 27, 2026 04:42
PaddlePaddle-bot

This comment was marked as outdated.

Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查新的提交。之前关于 HIP/ROCm 编译分支和 cublasSetStream 后 workspace 恢复的问题仍看起来已处理;新增的锁内 stream 绑定也覆盖了 CublasCall / TensorCoreCublasCallIfAvailable 路径。

但上一轮关于裸 cublas_handle() 调用路径的覆盖缺口仍未解决,新的行内评论里已说明需要把这些直接 cuBLAS 调用迁移到持锁回调或等价接口后再合入。当前 CI 也仍有审批失败和若干运行中任务需要等待。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

// On CUDA (non-Windows, non-legacy_gemm) the pre-allocated workspace is
// re-applied after the stream switch because cublasSetStream_v2 resets the
// handle's workspace pointer to the default pool.
void SetBlasStream(blasHandle_t handle) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 优先级:P1

最新提交移除了 GetBlasHandle() 里的短锁绑定,但 SetBlasStream() 仍只在 CublasCall() / TensorCoreCublasCallIfAvailable() 回调路径中调用;现有 paddle/phi/kernels/legacy/gpu/batched_gemm.cupaddle/phi/kernels/gpu/eig_grad_kernel.cu 仍直接通过 dev_ctx.cublas_handle() 取裸 handle。这样在 SetCUDAStream() 切换同一个 GPUContext 的 stream 后,这些直接调用路径不会重新绑定当前 stream,batched_gemm 的 MOE 场景仍可能继续跑在 handle 上一次绑定的 stream 上。

请把直接使用 cuBLAS 的调用点迁移到持锁的回调接口,或者提供等价的持锁 RAII/回调 API,保证设置 stream 和实际 cuBLAS 调用处于同一个保护区间。例如:

dev_ctx.CublasCall([&](blasHandle_t handle) {
  for (int64_t i = 0; i < num_experts; ++i) {
    CublasGemm(handle, /* existing arguments */);
  }
});

eig_grad 中连续的 cublas{C,Z}geam 调用也需要按同样方式放进 CublasCall,避免继续绕过这里新增的 stream 绑定。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这是算子实现者选用的方式问题,如果修改涉及面较广,且在当前场景下没有实际影响。

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 61.90476% with 8 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@05e095f). Learn more about missing BASE report.

Files with missing lines Patch % Lines
paddle/phi/backends/gpu/gpu_context.cc 61.90% 8 Missing ⚠️

❌ Your patch status has failed because the patch coverage (61.90%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #79155   +/-   ##
==========================================
  Coverage           ?   61.90%           
==========================================
  Files              ?        1           
  Lines              ?       21           
  Branches           ?        0           
==========================================
  Hits               ?       13           
  Misses             ?        8           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查新的提交。新增测试覆盖了 GetBlas(...).MatMul 通过回调接口走非默认 stream 的场景,但上一轮关于裸 cublas_handle() 直接调用路径的阻塞点仍未通过代码修复解决,细节已放在新的行内评论里。

另外新增的 stream cache 测试当前没有真正复用同一个 GPUContext 切换 stream,测试覆盖也需要调整。当前 CI 仍有审批失败和多个运行中任务需要等待。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment thread paddle/phi/backends/gpu/gpu_context.cc
// Alternate between stream_a and stream_b for 4 iterations.
for (int iter = 0; iter < 4; ++iter) {
gpuStream_t cur_raw = (iter % 2 == 0) ? stream_a : stream_b;
auto ctx = MakeCtxOnStream(gpu_place, cur_raw);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 优先级:P2

这个测试注释说要“Switch the same GPUContext between two auxiliary streams”,但循环内每次都 MakeCtxOnStream(...) 新建一个 GPUContext,因此每个 context 的 blas_handle_stream_ 初始都是 nullptr,实际没有覆盖“同一个 cuBLAS handle 在 stream_a 和 stream_b 之间反复切换时缓存失效并重新 SetBlasStream”的路径。这样即使缓存字段没有在同一 context 切流时正确更新,这个测试也不会失败。

请把 ctx 移到循环外,只创建一个 GPUContext,然后在每轮对同一个 context 调用 SetCUDAStream(...) 切到 stream_a / stream_b 后再执行 MatMul;或者直接重写为一个 helper,确保同一个 blas_handle_ 被复用。例如:

auto ctx = MakeCtxOnStream(gpu_place, stream_a);
for (int iter = 0; iter < 4; ++iter) {
  gpuStream_t cur_raw = (iter % 2 == 0) ? stream_a : stream_b;
  ctx->SetCUDAStream(new phi::CUDAStream(gpu_place, cur_raw), /*clear=*/true);
  // run MatMul with the same ctx and verify result
}

如果采用 new CUDAStream,还需要同时处理 wrapper 的所有权,避免测试循环泄漏 CUDAStream 对象。

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-27 16:59 CST

📋 Review 摘要

PR 概述:为 GPUContext 中的 cuBLAS/rocBLAS handle 增加延迟 stream 绑定能力,支持在非默认 stream 上运行 cublas 算子。
变更范围paddle/phi/backends/gpu/gpu_context.cctest/cpp/phi/kernels/test_math_function.cu
影响面 Tag[Operator Mechanism]

问题

级别 文件 概述
❓ 疑问 paddle/phi/backends/gpu/gpu_context.cc [F2] HIP/rocBLAS 切换 stream 后未说明是否需要重新 apply workspace
❓ 疑问 paddle/phi/backends/gpu/gpu_context.cc [F3] Windows 路径下 cublasSetStream_v2 后未重新 apply workspace,原因未说明

历史 Findings 修复情况

Finding 问题 状态
F1 每次 cublas 调用都执行 cublasSetStream_v2,即使 stream 未变化 ✅ 已修复
F2 HIP/rocBLAS 路径切换 stream 后是否需要重新 apply workspace ⚠️ 仍存在
F3 Windows 路径下 cublasSetStream_v2 后未重新 apply workspace ⚠️ 仍存在

📝 PR 规范检查

标题 Tag [Execute Infrastructure] 与变更路径(paddle/phi/backends/gpu/)不符,paddle/phi/backends/ 属于 [Operator Mechanism] 范畴,建议修正。PR 描述四个必填 section 均已填写,结构合规。

标题建议(可直接复制):

  • [Operator Mechanism] Add cublasSetStream to support non-default stream in cublas kernel

总体评价

本 PR 核心逻辑清晰,通过缓存每个 handle 最后绑定的 stream 来避免重复调用 cublasSetStream_v2,F1 已得到修复。F2(rocBLAS workspace)和 F3(Windows workspace)两个疑问建议在代码注释中补充说明,以提升可维护性。测试覆盖了正确性、多次流切换缓存失效和并发流独立性三个场景,质量较好。

@xuanyuanminzheng xuanyuanminzheng force-pushed the feature_cublas_stream branch from db08478 to 7d9372d Compare May 27, 2026 09:05
Copy link
Copy Markdown

@risemeup1111 risemeup1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已复查新的提交。上一轮关于 stream cache 测试没有复用同一个 GPUContext 的问题看起来已处理,新测试现在会在同一个 context 上切换 stream。

但裸 cublas_handle() 直接调用路径仍未通过代码修复解决,当前 head 里已补充新的行内评论说明仍需把 batched_gemm / eig_grad 这类路径纳入持锁回调或等价接口。当前 CI 仍有审批失败和多个运行中任务需要等待。

Powered by Nyanpasu with gpt-5.5 xhigh, please check the suggestions carefully.

Comment thread paddle/phi/backends/gpu/gpu_context.cc
@xuanyuanminzheng
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-05-27T00:00:00Z

📋 Review 摘要

PR 概述:为 GPUContext 的 cuBLAS/rocBLAS handle 增加非默认 stream 支持,通过缓存上次绑定的 stream 避免冗余的 cublasSetStream_v2 调用。
变更范围paddle/phi/backends/gpu/gpu_context.cctest/cpp/phi/kernels/test_math_function.cu
影响面 Tag[Operator Mechanism] [Execute Infrastructure]

问题

级别 文件 概述
🟡 建议 paddle/phi/backends/gpu/gpu_context.cc:349 SetBlasStream 内部重复调用 stream(),存在 TOCTOU 隐患,建议直接传入 stream 参数

历史 Findings 修复情况

Finding 问题 状态
F1 每次 cublas 调用都执行 cublasSetStream_v2,即使 stream 未变化也会产生额外开销 ✅ 已修复
F2 HIP/rocBLAS 路径切换 stream 后是否需要重新 apply workspace ⚠️ 仍存在
F3 Windows 路径下 cublasSetStream_v2 后未重新 apply workspace ⚠️ 仍存在

📝 PR 规范检查

标题 Tag [Execute Infrastructure] 与变更文件路径不符(paddle/phi/backends/gpu/ 更接近 [Operator Mechanism]),建议修正(历史已报告,未修复)。

标题建议(可直接复制):

  • [Operator Mechanism] add cublasSetStream to support non-default stream in cublas kernel

总体评价

F1 缓存优化已落地,代码整体质量较好,测试覆盖了单流、多流切换和并发三个场景。F2/F3(HIP workspace 和 Windows workspace 重新 apply)仍未处理,建议作者确认是否有意为之并在 PR 描述中说明。SetBlasStream 内部重复调用 stream() 存在轻微 TOCTOU 隐患,建议改为传参方式。


// Persistent cublasLt workspace: grow-only, freed in destructor.
// Returns {ptr, size}. Thread-safe via mutex for grow path.
std::pair<void*, size_t> GetCublasLtWorkspace(size_t required_size) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 SetBlasStream 内部再次调用 stream() 存在 TOCTOU 隐患。

调用方已经通过 gpuStream_t cur = stream() 获取了当前 stream,但 SetBlasStream 内部又调用了一次 stream()。若两次调用之间 stream 发生变化(虽然概率低),cublas handle 会被绑定到新 stream,但 blas_handle_stream_ 却被设置为旧的 cur,导致缓存状态不一致。

建议修改 SetBlasStream 签名,直接接受 stream 参数:

void SetBlasStream(blasHandle_t handle, gpuStream_t s) {
#ifdef PADDLE_WITH_HIP
  PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::rocblas_set_stream(handle, s));
#else
  PADDLE_RETRY_CUDA_SUCCESS(phi::dynload::cublasSetStream_v2(handle, s));
#if !defined(_WIN32)
  if (!FLAGS_use_legacy_gemm) {
    SetCublasWorkspace(handle);
  }
#endif
#endif
}

调用处改为 SetBlasStream(blas_handle_, cur);,消除重复调用和 TOCTOU 风险。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants