
Incorrect results produced by warp shuffles in gpgpu-sim #230

Open

ueqri opened this issue Aug 11, 2021 · 1 comment

ueqri commented Aug 11, 2021

Hello everyone,

I am opening this issue to ask for suggestions on how to resolve the incorrect results produced by warp primitives when running in gpgpu-sim.

Code snippet: The minimal example comes from the official CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-examples-broadcast.

#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)        // Note unused variable for
        value = arg;        // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);   // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();

    return 0;
}

Build environment: I used the image jonghyun1215/gpgpu:gpgpusim4 from Docker Hub, with GCC 7.5, gpgpu-sim 4.0.0 (commit 90ec33997, the latest commit on the dev branch at the time), and CUDA 10.1.

Situation: The sample code should not print any failure message, and when tested on a real GPU it produces the expected result. When run through gpgpu-sim, however, whether in performance simulation or functional simulation mode, the results are wrong. 😕

Investigation: I tried the other warp samples in that tutorial, such as __shfl_down_sync and __shfl_xor_sync, and the correctness error persists. For comparison, I also wrote a simple reduction in two ways: 1) with shared memory, and 2) with warp shuffles (a minimal sketch of the warp-shuffle variant is below). The shared-memory version produces the correct result, but the warp-shuffle version does not, which confused me a lot. Thus, I suspect there are bugs in the implementation of warp primitives in gpgpu-sim.
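
To make the comparison concrete, here is a minimal sketch of the warp-shuffle pattern I mean (not the exact code I ran): each lane contributes its lane ID and the values are summed with __shfl_down_sync, so on hardware lane 0 should end up with 0 + 1 + ... + 31 = 496.

#include <stdio.h>

// Minimal warp-shuffle sum reduction (a sketch, not the exact code I tested).
// Each lane contributes its lane ID, so the warp total should be 496.
__global__ void warpReduceSum() {
    int laneId = threadIdx.x & 0x1f;
    int value = laneId;
    // Tree reduction within a single warp using shuffle-down.
    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_down_sync(0xffffffff, value, offset);
    // After the loop, lane 0 holds the sum of all 32 lane values.
    if (laneId == 0)
        printf("Warp sum = %d (expected 496)\n", value);
}

int main() {
    warpReduceSum<<< 1, 32 >>>();
    cudaDeviceSynchronize();

    return 0;
}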

Possible parts: To locate the relevant part of the gpgpu-sim codebase, I searched for the shfl PTX operator and found the implementation here: link. Not being very experienced with the gpgpu-sim code, I have been stuck at this step for a few days.

I would sincerely appreciate any help with this problem. Thanks for your consideration! ☺️

ueqri (Author) commented Aug 15, 2021

Thanks @mkhairy for the reply to this issue ☺️, please see here: https://groups.google.com/g/accel-sim/c/SxtFMYrshXg/m/pTYTsZesAQAJ
