Error with driver API's lazy load of cuStream ops #3907

Open
samnordmann opened this issue Feb 17, 2025 · 2 comments

samnordmann commented Feb 17, 2025

related Team's thread

I am trying to use cuStreamWriteValue32, which is part of the CUDA driver API (context: #3894). The build succeeds, but at runtime I get CUDA_ERROR_NOT_SUPPORTED. The call should be supported: I am running on a DGX H100 node with CUDA 12.8, inside the latest pjnl Docker image.
Repro:

The problem can be narrowed down to the lazy loading of /usr/local/cuda/compat/lib.real/libcuda.so.1 in the pjnl container: the bug comes either from the lazy-loading mechanism or from the library itself.
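One way to confirm which libcuda the process has actually mapped is to scan /proc/self/maps after the first driver call has gone through the lazy loader. The following is a minimal sketch under that assumption; the helper names `mappedLibraries` and `mappedLibrariesOfThisProcess` are hypothetical and not part of nvFuser, and the approach is Linux only.

```cpp
#include <fstream>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Returns the distinct file paths in a /proc/<pid>/maps style stream whose
// path contains `needle` (for example "libcuda.so"). Each maps line has the
// form
//   address perms offset dev inode [pathname]
// so the pathname, when present, is everything after the fifth field.
inline std::set<std::string> mappedLibraries(std::istream& maps,
                                             const std::string& needle) {
  std::set<std::string> paths;
  std::string line;
  while (std::getline(maps, line)) {
    std::istringstream fields(line);
    std::string address, perms, offset, dev, inode, path;
    if (!(fields >> address >> perms >> offset >> dev >> inode)) {
      continue;  // not a well-formed maps line
    }
    // Anonymous mappings have no pathname; `path` then stays empty.
    std::getline(fields >> std::ws, path);
    if (!path.empty() && path.find(needle) != std::string::npos) {
      paths.insert(path);
    }
  }
  return paths;
}

// Convenience wrapper for the current process (Linux only).
inline std::set<std::string> mappedLibrariesOfThisProcess(
    const std::string& needle) {
  std::ifstream maps("/proc/self/maps");
  return mappedLibraries(maps, needle);
}
```

Calling `mappedLibrariesOfThisProcess("libcuda.so")` right after the failing call would list every libcuda path mapped into the process. With lazy loading the library only shows up once the loader has run, so the call site matters.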

Evidence that lazy loading is involved: the following patch, which links against libcuda explicitly (non-lazily), makes the error go away:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9d7d7b32..3e51bce8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -334,6 +334,7 @@ target_link_libraries(codegen_internal PUBLIC
   ${LIBCUPTI}
   ${TORCH_LIBRARIES}
   dl
+  cuda
 )

 add_library(nvfuser_codegen SHARED $<TARGET_OBJECTS:codegen_internal>)
diff --git a/csrc/driver_api.h b/csrc/driver_api.h
index 41072a22..b8c413a4 100644
--- a/csrc/driver_api.h
+++ b/csrc/driver_api.h
@@ -37,7 +37,6 @@ namespace nvfuser {
 #if (CUDA_VERSION >= 12000)
 #define ALL_DRIVER_API_WRAPPER(fn)   \
   ALL_DRIVER_API_WRAPPER_CUDA11(fn); \
-  fn(cuStreamWriteValue32);          \
   fn(cuTensorMapEncodeTiled)
 #else
 #define ALL_DRIVER_API_WRAPPER ALL_DRIVER_API_WRAPPER_CUDA11
diff --git a/tests/cpp/test_gpu3.cpp b/tests/cpp/test_gpu3.cpp
index 9570bb9b..a7236c6a 100644
--- a/tests/cpp/test_gpu3.cpp
+++ b/tests/cpp/test_gpu3.cpp
@@ -56,7 +56,8 @@
 #include <sstream>
 #include "parallel_dimension_map.h"

-#include <driver_api.h>
+// #include <driver_api.h>
+#include <cuda.h>
 namespace nvfuser {

 using namespace at::indexing;

Note also that cuda-gdb gives the following backtrace for the error:

#0  0x00007fff37f740f0 in cudbgReportDriverApiError () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007fff381e312b in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007fff2f4c0d47 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#3  0x00007fff2f49c29e in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#4  0x00007fff2f4af56d in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#5  0x00007fff2f5aebd6 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#6  0x00007fff380c05d0 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x0000555555a67b3e in lazilyLoadAndInvoke (args#0=0x7fff2ad0d618, args#1=140724802682880, args#2=3, args#3=0) at /opt/pytorch/Fuser2/csrc/driver_api.cpp:95
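The `lazilyLoadAndInvoke` frame suggests a resolve-on-first-call wrapper. As a rough illustration of that general pattern (not nvFuser's actual code), the sketch below looks a symbol up by name on the first call and caches the pointer. The `LazyFunction` class and its `Resolver` callback are hypothetical; in a real setup the resolver would wrap something like `dlsym` or `cudaGetDriverEntryPoint`.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Resolve-on-first-call wrapper around a C-style function pointer.
// `Resolver` stands in for whatever looks the symbol up at run time;
// it is an assumption of this sketch.
template <typename Signature>
class LazyFunction;

template <typename Ret, typename... Args>
class LazyFunction<Ret(Args...)> {
 public:
  using Pointer = Ret (*)(Args...);
  using Resolver = std::function<void*(const std::string&)>;

  LazyFunction(std::string symbol, Resolver resolver)
      : symbol_(std::move(symbol)), resolver_(std::move(resolver)) {}

  Ret operator()(Args... args) {
    if (fn_ == nullptr) {
      // First call: look the symbol up by name and cache the result.
      void* address = resolver_(symbol_);
      if (address == nullptr) {
        throw std::runtime_error("cannot resolve " + symbol_);
      }
      fn_ = reinterpret_cast<Pointer>(address);
    }
    return fn_(args...);
  }

  bool resolved() const { return fn_ != nullptr; }

 private:
  std::string symbol_;
  Resolver resolver_;
  Pointer fn_ = nullptr;
};
```

With such a wrapper the entry point is chosen by symbol name at run time, whereas linking against libcuda binds whatever name the `cuda.h` declarations expand to at compile time, so the two paths need not reach the same entry point.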

@samnordmann

@naoyam thanks a lot for your help so far.
@wujingyue I am stuck on this one. Could you please take a look, or tag someone who might know about it?
