Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] implement
record_stream
when using CUDA streams during group offloading #11081[feat] implement
record_stream
when using CUDA streams during group offloading #11081Changes from 4 commits
ffce2d1
f25ea18
2a28f6d
41ea4c8
f5b69b0
9281e84
637f84e
612136f
d5afea5
fb59f36
4a6eeba
87a93fe
1d4ca61
535dcd1
2ff9112
b4deedc
622aba7
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_stream should be self.stream here I think. We need to tell pytorch that the param.data and buffer.data here is owned by the non-default stream. Currently, we're telling it that it is owned by the default stream, which seems incorrect to me
Sorry for the back and forth but I think we will have to run the benchmark once more with the change 😅
Apart from this, everything else looks good. We can button up the docs and merge after @DN6 gives a look. Let's make sure to mention that this may use more memory in comparison to
record_stream=False
in certain cases and link to the torch::Tensor::record_stream docsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nevermind, ignore my comment.
We don't create anything on the non-default stream, so torch.cuda.current_stream is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-r-r-o-w no problem. Better to be rigorous and double-check everything.
Just ran the benchmark after merging
main
into this branch (on DGX on a single 80GB A100):Feel free to run it yourself if you want.
Absolutely. I will mention in the docstrings. Is there any other place you wanted me to mention it?