Fix random seed on shuffle and interleave_datasets #7823
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cool! To avoid unwanted side effects, it could be implemented for every class instead of using … Returning a new object is actually quite important; otherwise, iterating on the dataset multiple times would shift the RNGs every time.
Thanks for the review. I managed to return instances of …
I see, it sounds good to me as is, we can improve later if needed
closes #7567
Add a `shift_rngs` method to `ExamplesIterable` that is called directly after sharding. If a generator is available (not the case for all subclasses), we update the seed of the generator by shifting it by the worker_id.

This is just the fix for `shuffle`; the corresponding issue mentions `interleave_datasets` as well, which won't be fixed with this approach.

EDIT: This is a fix for both `shuffle` and `interleave_datasets`. Adding recursion to `shift_rngs` solved `interleave_datasets` as well. I'm not sure, though, whether this is completely safe or whether it could break something. I don't think so, but I could be wrong and would appreciate some guidance from the maintainers. I also checked that with a single worker we always pass `index=0`, so that case preserves the seed the user specified.
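To illustrate the idea described above, here is a minimal sketch of a recursive seed shift that returns new objects instead of mutating in place. The classes `ToyIterable`, `ToySource`, `ToyShuffled`, and `ToyInterleaved` are hypothetical stand-ins written for this example, not the classes in `datasets`, and the sketch uses the standard library's `random.Random` rather than whatever generator the library uses internally.

```python
import random


class ToyIterable:
    """Hypothetical base class: iterables without an RNG are returned unchanged."""

    def shift_rngs(self, value):
        return self


class ToySource(ToyIterable):
    """Hypothetical leaf iterable that just yields its items in order."""

    def __init__(self, items):
        self.items = list(items)

    def __iter__(self):
        return iter(self.items)


class ToyShuffled(ToyIterable):
    """Hypothetical shuffling wrapper with a seed that can be shifted per worker."""

    def __init__(self, source, seed):
        self.source = source
        self.seed = seed

    def shift_rngs(self, value):
        # Return a NEW object so the original keeps its seed, and recurse
        # into the wrapped iterable so nested RNGs are shifted too.
        return ToyShuffled(self.source.shift_rngs(value), self.seed + value)

    def __iter__(self):
        # A fresh RNG per iteration keeps repeated iteration reproducible.
        rng = random.Random(self.seed)
        items = list(self.source)
        rng.shuffle(items)
        return iter(items)


class ToyInterleaved(ToyIterable):
    """Hypothetical random interleaving of several iterables."""

    def __init__(self, sources, seed):
        self.sources = list(sources)
        self.seed = seed

    def shift_rngs(self, value):
        # Recursion is what lets the shift reach the child iterables.
        shifted = [s.shift_rngs(value) for s in self.sources]
        return ToyInterleaved(shifted, self.seed + value)

    def __iter__(self):
        rng = random.Random(self.seed)
        iterators = [iter(s) for s in self.sources]
        while iterators:
            it = rng.choice(iterators)
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)


base = ToyShuffled(ToySource(range(10)), seed=42)
worker_0 = base.shift_rngs(0)  # index 0 keeps the user's seed
worker_1 = base.shift_rngs(1)  # other workers get a shifted seed
print(worker_0.seed, worker_1.seed, base.seed)
```

Because `shift_rngs` builds a new object in this sketch, calling it repeatedly on `base` never accumulates shifts, which is the concern raised in the review about iterating on the dataset multiple times.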