
Incorrect results produced by warp shuffles in gpgpu-sim #230

Open

ueqri opened this issue Aug 11, 2021 · 1 comment

ueqri commented Aug 11, 2021

Hello everyone,

I am opening this issue to ask for suggestions on how to resolve the incorrect results produced by warp primitives when running in gpgpu-sim.

Code snippet: The minimal example comes from the official CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-examples-broadcast.

#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)        // Note unused variable for
        value = arg;        // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);   // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();

    return 0;
}

Build environment: I used the image jonghyun1215/gpgpu:gpgpusim4 from Docker Hub, with GCC 7.5, gpgpu-sim 4.0.0 (commit 90ec33997, the latest commit on the dev branch at the time), and CUDA 10.1.

Situation: The sample code should not print any failure message, and when tested on a real GPU it produces the expected result. When run through gpgpu-sim, however, whether in performance simulation or functional simulation mode, the results are wrong. 😕

Investigation: I tried the other warp samples in that tutorial, such as __shfl_down_sync and __shfl_xor_sync, and the correctness error persists. For comparison, I also wrote a simple reduction in two ways: 1) with shared memory, and 2) with warp shuffles (a minimal sketch of the warp-shuffle variant is below). The shared-memory version produces the correct result, but the warp-shuffle version does not, which confused me a lot. Thus, I suspect there are bugs in the implementation of warp primitives in gpgpu-sim.
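
To make the comparison concrete, here is a minimal sketch of the warp-shuffle pattern I mean (not the exact code I ran): each lane contributes its lane ID and the values are summed with __shfl_down_sync, so on hardware lane 0 should end up with 0 + 1 + ... + 31 = 496.

#include <stdio.h>

// Minimal warp-shuffle sum reduction (a sketch, not the exact code I tested).
// Each lane contributes its lane ID, so the warp total should be 496.
__global__ void warpReduceSum() {
    int laneId = threadIdx.x & 0x1f;
    int value = laneId;
    // Tree reduction within a single warp using shuffle-down.
    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_down_sync(0xffffffff, value, offset);
    // After the loop, lane 0 holds the sum of all 32 lane values.
    if (laneId == 0)
        printf("Warp sum = %d (expected 496)\n", value);
}

int main() {
    warpReduceSum<<< 1, 32 >>>();
    cudaDeviceSynchronize();

    return 0;
}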

Possible parts: To locate the relevant part of the gpgpu-sim codebase, I searched for the shfl PTX operator and found the implementation here: link. Not being very experienced with the gpgpu-sim code, I have been stuck at this step for a few days.

I would sincerely appreciate any help with this problem. Thanks for your consideration! ☺️

ueqri (Author) commented Aug 15, 2021

Thanks @mkhairy for the reply to this issue ☺️, please see here: https://groups.google.com/g/accel-sim/c/SxtFMYrshXg/m/pTYTsZesAQAJ
