
CUDA Dynamic Parallelism Synchronization (CUDA > 11.6) #349

Open
ug0x01 opened this issue Mar 7, 2025 · 2 comments

Comments
ug0x01 commented Mar 7, 2025

Hello, I need a cudaDeviceSynchronize-style call to wait for a child kernel to finish, but no kind of synchronization can be called from device functions after CUDA 11.6.

May I request an example for it?

Here's the code that I'm trying to run:

__global__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weigths, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int outputidx = idx / outputsize;
    int inputidx = idx % outputsize;

    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }

    atomicAdd(&result[outputidx], values[inputidx] * weigths[outputsize*outputidx + inputidx]);
}

__global__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx >= size) {
        return;
    }

    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weigths, result, inputsize, outputsize);
    // normally a cudaDeviceSynchronize()-style call would go here to wait for the child kernel to finish
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}

Thanks!

rwarmstr (Collaborator) commented:

I'm not sure what you're referring to; cudaDeviceSynchronize() is still a valid and supported API. Usually, though, you don't want to launch all of your work into the default stream, but rather use streams explicitly, in which case you'd use cudaStreamSynchronize(). Also keep in mind that any work launched into a stream will still complete sequentially: if you launch kernel_a, kernel_b, and kernel_c into a stream, they run in that order.
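A minimal host-side sketch of that suggestion, using the kernel names and launch sizes from the code above (the wrapper function name, the assumed THREADS_PER_BLOCK value, and the stream being passed in by the caller are all assumptions, not part of the original code). Because work in a single stream executes in launch order, the activation kernel cannot start until the matrix-multiplication kernel has finished, with no device-side synchronization at all:

```cuda
#define THREADS_PER_BLOCK 256  // assumed value; the original defines it elsewhere

// Forward declarations of the kernels from the code above.
__global__ void NNFeedForwardNormalMultiple(double*, double*, double*, int, int);
__global__ void NNFeedForwardNormalActivate(double*, double*, int);

// Hypothetical host-side driver replacing the device-side parent kernel.
void NNFeedForwardNormalHost(double* values, double* weigths, double* result,
                             double* biases, int inputsize, int outputsize,
                             cudaStream_t stream) {
    int grid1 = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    int grid2 = (outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    // Launches into the same stream execute in order: Activate will not
    // begin until Multiple has completed.
    NNFeedForwardNormalMultiple<<<grid1, THREADS_PER_BLOCK, 0, stream>>>(
        values, weigths, result, inputsize, outputsize);
    NNFeedForwardNormalActivate<<<grid2, THREADS_PER_BLOCK, 0, stream>>>(
        biases, result, outputsize);

    // Block the host only when the final result is actually needed.
    cudaStreamSynchronize(stream);
}
```

The ordering guarantee comes from the stream itself, so the cudaStreamSynchronize() at the end is only needed to make the result visible to the host, not to order the two kernels.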


ug0x01 commented Mar 11, 2025

Hello @rwarmstr,

I'm not sure what you're referring to; cudaDeviceSynchronize() is still a valid and supported API

No, it cannot be called from a device or global function after CUDA 11.6. I also tried cudaStreamSynchronize and other synchronization calls, but none of them can be called from a device function.

Also keep in mind that any work launched into a stream will still complete sequentially

Also no: when you launch two kernels from device code without synchronization between them, they run almost concurrently. In a case like mine, where the first kernel does a lengthy computation, the second kernel starts activating neurons before the matrix multiplication finishes, and more values get added after activation because the first kernel is still running.
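For keeping the launch on the device, CUDA 12's dynamic parallelism (CDP2) replaces the removed device-side cudaDeviceSynchronize() with named launch streams: per the CUDA C++ Programming Guide, a grid launched into cudaStreamTailLaunch begins only after the parent grid, including its fire-and-forget children, has completed. A sketch of the parent kernel rewritten that way (this assumes the parent is launched with a single thread, that the two child kernels above are in scope, and that the code is compiled with -rdc=true):

```cuda
__global__ void NNFeedForwardNormal(double* __restrict__ values,
                                    double* __restrict__ weigths,
                                    double* result,
                                    double* __restrict__ biases,
                                    int inputsize, int outputsize) {
    int gridsize = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    // Fire-and-forget child: runs concurrently with the rest of this grid.
    NNFeedForwardNormalMultiple<<<gridsize, THREADS_PER_BLOCK, 0,
                                  cudaStreamFireAndForget>>>(
        values, weigths, result, inputsize, outputsize);

    // Tail launch: deferred until the parent grid and the child above have
    // finished, so the activation never races the matrix multiplication.
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
                                  THREADS_PER_BLOCK, 0, cudaStreamTailLaunch>>>(
        biases, result, outputsize);
}
```

The tail launch takes over the role of the old device-side synchronize: the ordering is expressed at launch time instead of with a blocking call.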
