
CUDA Dynamic Parallelism Synchronization (CUDA > 11.6) #349

Open
ug0x01 opened this issue Mar 7, 2025 · 2 comments

Comments
ug0x01 commented Mar 7, 2025

Hello, I need a cudaDeviceSynchronize-style call to wait for a child kernel to finish, but no kind of synchronization can be called from device functions after CUDA 11.6.

May I request an example for it?

Here's the code that I'm trying to run:

__global__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weigths, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int outputidx = idx / outputsize;
    int inputidx = idx % outputsize;

    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }

    atomicAdd(&result[outputidx], values[inputidx] * weigths[outputsize*outputidx + inputidx]);
}

__global__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx >= size) {
        return;
    }

    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weigths, result, inputsize, outputsize);
    // normally a cudaDeviceSynchronize()-style call would go here to wait for the child kernel to finish
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}

Thanks!

rwarmstr (Collaborator) commented:

I'm not sure what you're referring to; cudaDeviceSynchronize() is still a valid and supported API. Usually, though, you don't want to launch all of your work into the default stream, but rather use streams explicitly, in which case you'd use cudaStreamSynchronize(). Also keep in mind that any work launched into a stream will still complete sequentially: if you launch kernel_a, kernel_b, and kernel_c into a stream, they run in that order.
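A minimal host-side sketch of that suggestion, using the kernel names and launch sizes from the code above (the wrapper function name, the assumed THREADS_PER_BLOCK value, and the stream being passed in by the caller are all assumptions, not part of the original code). Because work in a single stream executes in launch order, the activation kernel cannot start until the matrix-multiplication kernel has finished, with no device-side synchronization at all:

```cuda
#define THREADS_PER_BLOCK 256  // assumed value; the original defines it elsewhere

// Forward declarations of the kernels from the code above.
__global__ void NNFeedForwardNormalMultiple(double*, double*, double*, int, int);
__global__ void NNFeedForwardNormalActivate(double*, double*, int);

// Hypothetical host-side driver replacing the device-side parent kernel.
void NNFeedForwardNormalHost(double* values, double* weigths, double* result,
                             double* biases, int inputsize, int outputsize,
                             cudaStream_t stream) {
    int grid1 = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int grid2 = (outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    // Launches into the same stream execute in order: Activate will not
    // begin until Multiple has completed.
    NNFeedForwardNormalMultiple<<<grid1, THREADS_PER_BLOCK, 0, stream>>>(
        values, weigths, result, inputsize, outputsize);
    NNFeedForwardNormalActivate<<<grid2, THREADS_PER_BLOCK, 0, stream>>>(
        biases, result, outputsize);

    // Block the host only when the final result is actually needed.
    cudaStreamSynchronize(stream);
}
```

The ordering guarantee comes from the stream itself, so the cudaStreamSynchronize() at the end is only needed to make the result visible to the host, not to order the two kernels.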


ug0x01 commented Mar 11, 2025

Hello @rwarmstr,

I'm not sure what you're referring to; cudaDeviceSynchronize() is still a valid and supported API

No, it cannot be called from a device or global function after CUDA 11.6. I also tried cudaStreamSynchronize and other synchronization calls, but none of them can be called from a device function.

Also keep in mind that any work launched into a stream will still complete sequentially

Also no: when you launch two kernels from device code without synchronization between them, they run almost concurrently. In a case like mine, where the first kernel does a lengthy computation, the second kernel starts activating neurons before the matrix multiplication finishes, and more values get added after activation because the first kernel is still running.
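For keeping the launch on the device, CUDA 12's dynamic parallelism (CDP2) replaces the removed device-side cudaDeviceSynchronize() with named launch streams: per the CUDA C++ Programming Guide, a grid launched into cudaStreamTailLaunch begins only after the parent grid, including its fire-and-forget children, has completed. A sketch of the parent kernel rewritten that way (this assumes the parent is launched with a single thread, that the two child kernels above are in scope, and that the code is compiled with -rdc=true):

```cuda
__global__ void NNFeedForwardNormal(double* __restrict__ values,
                                    double* __restrict__ weigths,
                                    double* result,
                                    double* __restrict__ biases,
                                    int inputsize, int outputsize) {
    int gridsize = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    // Fire-and-forget child: runs concurrently with the rest of this grid.
    NNFeedForwardNormalMultiple<<<gridsize, THREADS_PER_BLOCK, 0,
                                  cudaStreamFireAndForget>>>(
        values, weigths, result, inputsize, outputsize);

    // Tail launch: deferred until the parent grid and the child above have
    // finished, so the activation never races the matrix multiplication.
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
                                  THREADS_PER_BLOCK, 0, cudaStreamTailLaunch>>>(
        biases, result, outputsize);
}
```

The tail launch takes over the role of the old device-side synchronize: the ordering is expressed at launch time instead of with a blocking call.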
