Hello everyone,

I am writing this issue to ask for suggestions on how to resolve the inaccurate results produced by warp primitives when running in gpgpu-sim.

Code snippet: The minimal code comes from the official CUDA tutorial: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-examples-broadcast.
#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)  // Note unused variable for
        value = arg;  // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);  // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
Build environment: I used the image jonghyun1215/gpgpu:gpgpusim4 from Docker Hub, with GCC 7.5, gpgpu-sim 4.0.0 (commit ID: 90ec33997, the latest commit on the dev branch), and CUDA 10.1.
Situation: The sample code should not print the failure message, and when I tested it on a real GPU it produced the expected result. But when run through gpgpu-sim, whether in performance simulation or functional simulation mode, the results are wrong. 😕
Investigation: I tried the other warp samples from that tutorial, such as __shfl_down_sync and __shfl_xor_sync, and the correctness error persists. For comparison, I also wrote a simple reduction two ways: 1) shared memory and 2) warp shuffle. The shared-memory version is exactly correct, but the warp-shuffle version is not, which confuses me a lot. I therefore suspect there are bugs in gpgpu-sim's implementation of the warp primitives. A sketch of the warp-shuffle reduction is shown below.
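For reference, this is roughly the warp-shuffle reduction I compared against (a minimal sketch; the kernel name and launch setup here are illustrative, not my exact code):

#include <stdio.h>

// Sum 32 ints within one warp using __shfl_down_sync.
__global__ void warpReduce(const int *in, int *out) {
    int val = in[threadIdx.x];
    // Halve the stride each step; after 5 steps lane 0 holds the warp-wide sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0)
        *out = val;
}

On real hardware this agrees with the shared-memory version; under gpgpu-sim it is the variant that produces the wrong result.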
Possible parts: To locate the relevant code in the gpgpu-sim codebase, I searched for the shfl PTX operator and found the implementation here: link. Since I am not very experienced with the gpgpu-sim code, I have been stuck at this step for a few days.
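In case it helps with triage: as far as I understand, __shfl_sync lowers to the shfl.sync PTX instruction, so the broadcast in the sample should appear in the generated PTX as something like the following (the register operands here are illustrative, not actual compiler output):

shfl.sync.idx.b32  %r5|%p1, %r4, 0, 0x1f, 0xffffffff;

That is, a broadcast from lane 0 across the full warp, which is the instruction whose functional implementation in gpgpu-sim I was trying to inspect.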
I would sincerely appreciate any help with this issue. Thanks for your consideration! ☺️