Error with driver API's lazy load of cuStream ops #3907

samnordmann · 2025-02-17T11:08:59Z

I am trying to use cuStreamWriteValue32, which is part of the CUDA driver API (context: #3894). The build succeeds, but at runtime I get CUDA_ERROR_NOT_SUPPORTED. This operation should be supported, since I am using a DGX H100 node with CUDA 12.8 inside the pjnl-latest Docker image.
Repro:

error: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/125237761
PR: [WIP] test with cuStreamWriteValue32 #3496
docker image: gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest
Driver Version: 550.90.07 CUDA Version: 12.8 (I also tried with more recent drivers)

The source of the problem can be narrowed down to the lazy loading of /usr/local/cuda/compat/lib.real/libcuda.so.1 in the pjnl container: the bug comes either from the lazy-loading mechanism or from the library itself.

To support this, note that the following patch, which links against cuda explicitly (non-lazily), makes the error go away:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9d7d7b32..3e51bce8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -334,6 +334,7 @@ target_link_libraries(codegen_internal PUBLIC
   ${LIBCUPTI}
   ${TORCH_LIBRARIES}
   dl
+  cuda
)

add_library(nvfuser_codegen SHARED $<TARGET_OBJECTS:codegen_internal>)
diff --git a/csrc/driver_api.h b/csrc/driver_api.h
index 41072a22..b8c413a4 100644
--- a/csrc/driver_api.h
+++ b/csrc/driver_api.h
@@ -37,7 +37,6 @@ namespace nvfuser {
#if (CUDA_VERSION >= 12000)
#define ALL_DRIVER_API_WRAPPER(fn)   \
   ALL_DRIVER_API_WRAPPER_CUDA11(fn); \
-  fn(cuStreamWriteValue32);          \
   fn(cuTensorMapEncodeTiled)
#else
#define ALL_DRIVER_API_WRAPPER ALL_DRIVER_API_WRAPPER_CUDA11
diff --git a/tests/cpp/test_gpu3.cpp b/tests/cpp/test_gpu3.cpp
index 9570bb9b..a7236c6a 100644
--- a/tests/cpp/test_gpu3.cpp
+++ b/tests/cpp/test_gpu3.cpp
@@ -56,7 +56,8 @@
#include <sstream>
#include "parallel_dimension_map.h"

-#include <driver_api.h>
+// #include <driver_api.h>
+#include <cuda.h>
namespace nvfuser {

using namespace at::indexing;

Note also that cuda-gdb gives the following backtrace for the error:

#0  0x00007fff37f740f0 in cudbgReportDriverApiError () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007fff381e312b in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007fff2f4c0d47 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#3  0x00007fff2f49c29e in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#4  0x00007fff2f4af56d in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#5  0x00007fff2f5aebd6 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#6  0x00007fff380c05d0 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x0000555555a67b3e in lazilyLoadAndInvoke (args#0=0x7fff2ad0d618, args#1=140724802682880, args#2=3, args#3=0) at /opt/pytorch/Fuser2/csrc/driver_api.cpp:95
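
For readers unfamiliar with the lazy loading discussed above, here is a minimal sketch of the general technique: opening a shared library at run time and resolving a function by name, instead of linking against it. This is not nvFuser's `lazilyLoadAndInvoke` implementation; `lookupSymbol` and `owningLibrary` are hypothetical helper names introduced only for illustration, and the sketch assumes a Linux system with glibc.

```cpp
#include <cassert>
#include <cstdio>
#include <dlfcn.h>
#include <string>

// Hypothetical helper (not nvFuser code): open `library` at run time and
// return the address of `symbol`, or nullptr if either step fails.
void* lookupSymbol(const char* library, const char* symbol) {
  assert(library != nullptr && symbol != nullptr);
  void* handle = dlopen(library, RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    const char* err = dlerror();
    std::fprintf(stderr, "dlopen(%s) failed: %s\n", library,
                 err != nullptr ? err : "unknown error");
    return nullptr;
  }
  // The handle is intentionally never closed, so the returned pointer
  // stays valid for the lifetime of the process.
  void* fn = dlsym(handle, symbol);
  if (fn == nullptr) {
    const char* err = dlerror();
    std::fprintf(stderr, "dlsym(%s) failed: %s\n", symbol,
                 err != nullptr ? err : "unknown error");
  }
  return fn;
}

// Hypothetical helper: report which shared object file an address
// belongs to, or "<unknown>" if it cannot be determined.
std::string owningLibrary(void* address) {
  Dl_info info;
  if (address != nullptr && dladdr(address, &info) != 0 &&
      info.dli_fname != nullptr) {
    return info.dli_fname;
  }
  return "<unknown>";
}
```

On a machine with the driver installed, calling `lookupSymbol("libcuda.so.1", "cuStreamWriteValue32")` and passing the result to `owningLibrary` would show which libcuda file the lookup actually resolved to (for example, the compat copy seen in the backtrace). One thing that may be worth checking, stated here as an assumption to verify against the installed cuda.h rather than as a diagnosis: if the header maps `cuStreamWriteValue32` to a versioned symbol through a macro, a by-name lookup of the unversioned string can resolve to a different entry point than a directly linked call would. On older glibc versions this sketch needs to be linked with `-ldl`.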


samnordmann · 2025-02-17T11:11:34Z

@naoyam thanks a lot for helping so far
@wujingyue I am stuck on this one. Could you please take a look or tag someone who might know about it?


This was referenced Feb 17, 2025

[CudaIpc Tuto] Minimal snippet example #3912

Open

[CudaIpc 3/3]: p2p get-Zcopy #3911

Open
