gpulite

A lightweight C++ library for dynamic CUDA runtime compilation and kernel caching. gpulite simplifies building and deploying CUDA-dependent applications by providing runtime symbol resolution and automated kernel compilation with caching.

No CUDA SDK Required at Build Time! - Compile your applications without installing the CUDA SDK.

Features

  • No Build-Time CUDA Dependencies: Compiles without the CUDA SDK installed - only a C++17 compiler is required
  • Dynamic Symbol Resolution: Loads CUDA libraries (libcuda.so, libcudart.so, libnvrtc.so) at runtime
  • Runtime Compilation: Compiles CUDA kernels using NVRTC with automatic compute capability detection
  • Kernel Caching: Caches compiled kernels so identical kernels are never recompiled
  • Easy Integration: Header-only design for simple project integration
  • Cross-Platform Support: Supports Linux and Windows platforms

Why gpulite?

Traditional CUDA applications require the CUDA SDK to be installed at build time and often have complex deployment requirements. gpulite solves this by:

  1. Eliminating build-time CUDA dependencies - No need for CUDA SDK during compilation
  2. Simplifying deployment - Applications can run on any system with CUDA drivers installed
  3. Reducing compilation overhead - Kernels are compiled once and cached for subsequent runs
  4. Providing runtime flexibility - Kernels can be modified or optimized at runtime (see the sketch below)
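
For example, here is a minimal sketch of point 4, assuming the KernelFactory API introduced in the Quick Start below; use_fast_path is a hypothetical runtime flag, and the kernel body is assembled as a string at run time:

// Hypothetical flag chosen at run time (e.g. from config or autotuning)
std::string op = use_fast_path ? "a[idx] * b[idx]"
                               : "a[idx] + b[idx]";
std::string source =
    "extern \"C\" __global__ void apply(float* a, float* b, float* c, int n) {\n"
    "    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (idx < n) c[idx] = " + op + ";\n"
    "}\n";

// Identical source hits the cache; a different `op` triggers a fresh compile
auto* kernel = KernelFactory::instance().create("apply", source.c_str(), "apply.cu", {"-std=c++17"});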

Quick Start

Basic Usage

#include <gpulite/gpulite.hpp>

#include <stdexcept>
#include <vector>

int main() {
    // Check if CUDA is available
    if (!CUDA_DRIVER_INSTANCE.loaded() || !NVRTC_INSTANCE.loaded()) {
        throw std::runtime_error("CUDA runtime not available");
    }

    // Your CUDA kernel code as a string
    const char* kernel_source = R"(
        extern "C" __global__ void vector_add(float* a, float* b, float* c, int n) {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx < n) {
                c[idx] = a[idx] + b[idx];
            }
        }
    )";

    // Create kernel factory and cache the kernel
    auto& factory = KernelFactory::instance();
    auto* kernel = factory.create(
        "vector_add",           // kernel name
        kernel_source,          // kernel source code
        "vector_add.cu",        // virtual source filename
        {"-std=c++17"}          // compilation options
    );

    // Allocate device memory and launch kernel
    float *d_a, *d_b, *d_c;
    int n = 1024;

    // Use the cleaner GPULITE_CUDART_CALL macro (hides the global instance)
    GPULITE_CUDART_CALL(cudaMalloc(reinterpret_cast<void**>(&d_a), n * sizeof(float)));
    GPULITE_CUDART_CALL(cudaMalloc(reinterpret_cast<void**>(&d_b), n * sizeof(float)));
    GPULITE_CUDART_CALL(cudaMalloc(reinterpret_cast<void**>(&d_c), n * sizeof(float)));
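
    // In a real application, copy the input data into d_a and d_b here
    // (e.g. cudaMemcpy wrapped in GPULITE_CUDART_CALL)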

    // Prepare kernel arguments
    std::vector<void*> args = {&d_a, &d_b, &d_c, &n};

    // Launch kernel
    kernel->launch(
        dim3((n + 127) / 128),  // grid size
        dim3(128),              // block size
        0,                      // shared memory size
        nullptr,                // stream
        args,                   // kernel arguments
        true                    // synchronize after launch
    );

    // Clean up
    GPULITE_CUDART_CALL(cudaFree(d_a));
    GPULITE_CUDART_CALL(cudaFree(d_b));
    GPULITE_CUDART_CALL(cudaFree(d_c));

    return 0;
}

Compilation:

Linux:

# Save the above code as main.cpp, then compile:
g++ -std=c++17 main.cpp -ldl -pthread -o my_gpu_app

# Run the application:
./my_gpu_app

Windows (with Visual Studio):

# Save the above code as main.cpp, then compile:
cl /std:c++17 /EHsc main.cpp /Fe:my_gpu_app.exe

# Run the application:
my_gpu_app.exe

Requirements:

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CUDA libraries available at run time - no CUDA SDK needed at build time!
  • Linux or Windows

Loading Kernels from Files

// Load kernel from a .cu file
auto* kernel = KernelFactory::instance().createFromSource(
    "my_kernel",
    "/path/to/kernel.cu",
    "kernel.cu",
    {"-std=c++17"}
);

Template Kernel Names

For templated kernels, use the getKernelName helper:

// For a templated kernel like: template<typename T> __global__ void process(T* data)
std::string kernel_name = getKernelName<float>("process");
auto* kernel = factory.create(kernel_name, source, "template_kernel.cu", {});
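
On the source side, a minimal sketch of what such a kernel might look like; scale is a hypothetical templated kernel, and how gpulite resolves the instantiated name under NVRTC is assumed to be handled internally by the factory:

const char* source = R"(
    template<typename T>
    __global__ void scale(T* data, T factor, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] *= factor;
    }
)";

// getKernelName<float>("scale") yields the instantiated name, e.g. "scale<float>"
auto* kernel = factory.create(getKernelName<float>("scale"), source, "scale.cu", {"-std=c++17"});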

CMake Integration

Method 1: Using FetchContent (Recommended)

cmake_minimum_required(VERSION 3.14)
project(MyProject LANGUAGES CXX)

include(FetchContent)

FetchContent_Declare(
    gpulite
    GIT_REPOSITORY https://github.com/rubber-duck-debug/gpu-lite.git
    GIT_TAG v1.0.0
)

FetchContent_MakeAvailable(gpulite)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE gpulite)

This is the recommended approach because:

  • Dependencies (${CMAKE_DL_LIBS}, Threads) are automatically linked
  • Include paths are automatically configured
  • Use a specific tag (e.g., v1.0.0) for reproducible builds

Method 2: Using add_subdirectory

Clone or copy the gpulite repo to external/gpulite, then:

cmake_minimum_required(VERSION 3.12)
project(MyProject LANGUAGES CXX)

add_subdirectory(external/gpulite)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE gpulite)

Method 3: Manual Header Copy

Copy gpulite/gpulite.hpp to your project and link dependencies manually:

cmake_minimum_required(VERSION 3.12)
project(MyProject LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)

add_executable(my_app main.cpp)
target_include_directories(my_app PRIVATE path/to/gpulite)

if(NOT WIN32)
    target_link_libraries(my_app PRIVATE ${CMAKE_DL_LIBS})
endif()

Advanced Usage

Custom Compilation Options

std::vector<std::string> options = {
    "-std=c++17",                   // C++ standard
    "--use_fast_math",              // Fast math operations
    "-DBLOCK_SIZE=256",             // Preprocessor definitions
    "--maxrregcount=32"             // Maximum register count
};

auto* kernel = factory.create("optimized_kernel", source, "kernel.cu", options);

Kernel Function Attributes

// Set maximum dynamic shared memory
kernel->setFuncAttribute(CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, 8192);

// Get kernel register usage
int reg_count = kernel->getFuncAttribute(CU_FUNC_ATTRIBUTE_NUM_REGS);
std::cout << "Kernel uses " << reg_count << " registers per thread" << std::endl;

Asynchronous Execution

// Create CUDA stream
cudaStream_t stream;
GPULITE_CUDART_CALL(cudaStreamCreate(&stream));

// Launch kernel asynchronously
kernel->launch(
    grid, block, shared_mem_size,
    reinterpret_cast<void*>(stream),
    args,
    false  // don't synchronize
);

// Do other work...

// Synchronize when needed
GPULITE_CUDART_CALL(cudaStreamSynchronize(stream));
GPULITE_CUDART_CALL(cudaStreamDestroy(stream));

Error Handling

gpulite provides comprehensive error checking with detailed error messages:

try {
    auto* kernel = factory.create("my_kernel", source, "kernel.cu", {});
    kernel->launch(grid, block, 0, nullptr, args);
} catch (const std::runtime_error& e) {
    std::cerr << "CUDA Error: " << e.what() << std::endl;
    // Error message includes file and line information
}

Requirements

Runtime Requirements

  • NVIDIA GPU with CUDA capability 3.0 or higher
  • NVIDIA CUDA drivers installed
  • Linux or Windows operating system

Build Requirements (No CUDA SDK Needed!)

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.12 or higher
  • Standard C++ library with threading support
  • No CUDA SDK installation required - gpulite uses minimal type wrappers

CUDA Libraries (loaded dynamically at runtime)

Linux:

  • libcuda.so - CUDA Driver API
  • libcudart.so - CUDA Runtime API
  • libnvrtc.so - NVIDIA Runtime Compilation API

Windows:

  • nvcuda.dll - CUDA Driver API
  • cudart64_*.dll - CUDA Runtime API
  • nvrtc64_*.dll - NVIDIA Runtime Compilation API

Note: These libraries are required only at runtime; the driver API library ships with the NVIDIA drivers, while the runtime and NVRTC libraries ship with the CUDA toolkit.

Platform Support

Platform   Status
Linux      ✅ Supported
Windows    ✅ Supported
macOS      ❌ Not applicable (no CUDA support)

CUDA/HIP Support

Backend   Status
CUDA      ✅ Supported
HIP       🚧 Planned

Performance Considerations

  • First Launch: Kernels are compiled on first use, which may add initial latency
  • Subsequent Launches: Cached kernels launch immediately with minimal overhead (see the sketch below)
  • Memory Usage: Compiled kernels are kept in memory for the application lifetime
  • Context Switching: gpulite automatically handles CUDA context management
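
A minimal sketch of the caching behavior described above, assuming a repeated create with identical name, source, and options is served from the cache (the timing code is illustrative):

#include <chrono>
#include <iostream>

auto t0 = std::chrono::steady_clock::now();
auto* k1 = factory.create("vector_add", kernel_source, "vector_add.cu", {"-std=c++17"});  // first use: NVRTC compile
auto t1 = std::chrono::steady_clock::now();
auto* k2 = factory.create("vector_add", kernel_source, "vector_add.cu", {"-std=c++17"});  // identical request: cache
auto t2 = std::chrono::steady_clock::now();

std::cout << "first create:  " << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n"
          << "second create: " << std::chrono::duration<double, std::milli>(t2 - t1).count() << " ms\n";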

Troubleshooting

"CUDA runtime not available" Error

Linux:

  • Ensure NVIDIA drivers are installed: nvidia-smi
  • Verify CUDA libraries are in library path: ldconfig -p | grep cuda

Windows:

  • Ensure NVIDIA drivers are installed: nvidia-smi
  • Verify CUDA libraries are in system PATH or same directory as executable
  • Check that CUDA toolkit is installed and DLLs are accessible

Compilation Errors

  • Check kernel syntax using nvcc offline
  • Verify compute capability compatibility
  • Review compilation options for conflicts

Runtime Errors

  • Enable CUDA error checking in debug builds (see the sketch below)
  • Use cuda-gdb and compute-sanitizer for kernel debugging
  • Check memory alignment and access patterns
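
For the first bullet, one conventional pattern (not gpulite-specific) is to force a synchronize after launches in debug builds, so that asynchronous errors surface at the failing call:

// Debug builds only: surface asynchronous launch errors immediately
#ifndef NDEBUG
GPULITE_CUDART_CALL(cudaDeviceSynchronize());
#endif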

Build Scripts

For convenience, platform-specific build scripts are provided in the scripts/ directory.

Linux

# Run from the gpulite root directory
./scripts/build_examples_linux.sh

Windows

# Run from the gpulite root directory
scripts\build_examples_windows.bat

Manual Build Instructions

You can also build manually with CMake:

Linux

# Create build directory
mkdir build && cd build

# Generate Makefiles
cmake ..

# Build the project
make -j$(nproc)

Windows

# Create build directory
mkdir build
cd build

# Generate Visual Studio project files
cmake .. -G "Visual Studio 16 2019" -A x64

# Build the project
cmake --build . --config Release

Note: Ensure you have:

  • Visual Studio 2017 or later with C++17 support
  • CMake 3.12 or later
  • NVIDIA drivers installed (CUDA SDK not required at build time)

License

This project is licensed under the MIT License - see the LICENSE file for details.

For more examples and advanced usage patterns, see the examples/ directory in this repository.
