Reductions with CUDA are sometimes wrong #892
Would you be able to craft a reproducer? That would help a lot with tracking down the issue.
The only way I found to reliably reproduce the bug is through our buildkite on our cluster with a clean depot. If I run the same test on another machine, the test always passes. If I run from an interactive shell, any change seems to affect the outcome. When running interactively, the test could go from always failing to always passing seemingly just from adding a print statement. I know it's not the best, but I can probably give you a way to run something in the same way. I also just ran this test: reverting #877, the build passes. It's hard to see how #877 could be the problem, so I don't know if I am reading the tea leaves here. I can also bisect #877 further.
Yeah, that could be useful. I'm also struggling to see how that change would have any effect on platforms like x86-64, where the MPI extent and Julia's aligned size agreed already before that PR. Also, the bug observed in #853 was that we were reading garbage memory (because of misaligned sizes of array elements); a factor-of-2 difference is a bit too specific.
In your example,
I agree with Erik. That code piece is the most suspect. One must synchronize the CUDA stream before using CUDA memory in an MPI call. The allocation of the copy and the copy itself are all stream ordered. You could try adding a synchronize before the barrier and use a
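For reference, the general pattern being described would look roughly like this (a minimal sketch assuming CUDA.jl, MPI.jl, and a CUDA-aware MPI build; the buffers here are made up for illustration and are not the code from this issue):

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# Creating and filling these buffers is stream-ordered device work:
# the host call can return before the GPU has actually finished.
send = CUDA.fill(Float64(MPI.Comm_rank(comm)), 1024)
recv = CUDA.zeros(Float64, 1024)

# Make sure all queued work on the stream is done before MPI touches the memory.
CUDA.synchronize()

MPI.Reduce!(send, recv, +, comm; root = 0)
```

The key point is that `CUDA.synchronize()` sits between the last stream-ordered operation on the buffers and the MPI call.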
As suggested in JuliaParallel/MPI.jl#892, `remapper._interpolated_values[remapper.colons..., begin]` allocates a new copy, which can trip up CUDA's synchronization.
@Sbozzolo I see you're working around this issue upstream in CliMA/ClimaCore.jl#2169, but did you get an intuition of what the problem could be? Is there anything we can do here to prevent similar situations elsewhere?
(Sorry for the lack of communication over the past few days. I was travelling and wanted to write a longer summary for this issue.) I found a few more things:
Let me also point out that I did try to synchronize the CUDA streams before the MPI call. I tested something like:

```
MPI.barrier(remapper.comms_ctx.mpicomm)
CUDA.synchronize()
MPI.Reduce!(
    remapper._interpolated_values[:, :, :, begin],
    dest,
    +,
    remapper.comms_ctx.mpicomm;
    root = 0
)
```

This did not solve the issue. (In fact, most of my tests were running with this.) It is still not clear to me whether this pattern is supposed to work or not. If not, I can add a note about it to the docs of MPI.jl.
I think what @eschnett and @vchuravy were suggesting was that the slice is evaluated after your `CUDA.synchronize()` and allocates a new copy on the CUDA stream, so the synchronization does not cover it. You could fix it by:

```
src = remapper._interpolated_values[:, :, :, begin]
CUDA.synchronize()
MPI.Reduce!(
    src,
    dest,
    +,
    remapper.comms_ctx.mpicomm;
    root = 0
)
```

or by using a view:

```
MPI.Reduce!(
    @view(remapper._interpolated_values[:, :, :, begin]),
    dest,
    +,
    remapper.comms_ctx.mpicomm;
    root = 0
)
```
Yes, sorry if I didn't make it clear in my previous message. I changed the code to use the suggested pattern. Do I understand correctly that slicing allocates a new array rather than returning a view?
I presume the problem is the fact that this is data which lives on the GPU rather than the CPU? In Julia, slicing a regular `Array` gives you a new `Array`, not a view. But also note that slicing always copies, so depending on what you want to do you may be doing something wrong.
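To illustrate the copy-versus-view distinction (a small sketch assuming CUDA.jl; the array here is arbitrary):

```julia
using CUDA

A = CUDA.rand(4, 4, 4, 2)

# Indexing with ranges produces a brand-new CuArray: the element copy is
# launched on the CUDA stream and may still be in flight when this returns.
slice = A[:, :, :, 1]

# A view wraps the original storage instead; no device copy is launched.
v = @view A[:, :, :, 1]
```

Either way, any pending device work writing to the underlying array still has to be synchronized before the memory is handed to MPI.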
Starting in mid-December, some of my MPI reductions sometimes produce wrong results when I am working with CUDA (and only when working with CUDA).
I came to this conclusion via printf-debugging:
that produces:
The value 249.99999999999997 is wrong and it should be 500 (0.0 + 500.0 = 500.0). I tried forcing `CUDA.synchronize` and/or `CUDA.sync`, and sometimes results change (for example, I can get 450.0 instead of 250). Note also that 250 could be the correct value for another point.

The code above is embedded in a more complex test case, and I can reproduce the problem reliably when running in a relatively isolated environment. When I work interactively or when I change the test, the test becomes very flaky and extremely hard to reproduce. To ensure consistent results, I have to clean the depot.
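The original snippet and its output are not reproduced above, but the check described would look roughly like this (a hypothetical two-rank sketch with made-up buffer names, not the actual code from the issue):

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Rank 0 contributes 0.0 and rank 1 contributes 500.0,
# so the reduced value at the root should be 500.0.
src = CUDA.fill(rank == 0 ? 0.0 : 500.0, 1)
dest = CUDA.zeros(Float64, 1)

println("rank $rank: local value = $(Array(src)[1])")

CUDA.synchronize()
MPI.Reduce!(src, dest, +, comm; root = 0)

rank == 0 && println("rank 0: reduced value = $(Array(dest)[1])")
```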
I initially thought it was a buggy MPI implementation, but it seems that this problem also appears in production on a different machine (with simulations that run for multiple days). On that machine, the same test never triggers the bug (as checked by running the test 100 times).
Considering the bug is hard to reproduce, this is what I think I concluded:

- `local` in front of datatype, as suggested on Slack, does not fix the issue (build)