Skip to content

Conversation

@CloseChoice
Copy link
Contributor

@CloseChoice CloseChoice commented Oct 17, 2025

closes #7567

Add shift_rngs method to ExamplesIterable that is called directly after sharding. If a generator is available (not the case for all subclasses) we update the seed of the generator by shifting by the worker_id.

This is just the fix for shuffle, in the corresponding issue interleave_datasets is mentioned as well, which won't be fixed with this approach.

EDIT: This is a fix for shuffle and interleave_datasets. Adding recursivity to shift_rngs solved interleave_datasets as well. Not sure though if this is completely safe or if we could destroy something with that. I don't think so but could be wrong and appreciate some guidance from the maintainers. I also checked, on a single_worker we are always handing over index=0 so that case preserves the seed the user specified.

@CloseChoice CloseChoice changed the title Fix shuffle seed 7567 Fix random seed when shuffling Oct 17, 2025
@CloseChoice CloseChoice changed the title Fix random seed when shuffling Fix random seed when shuffling and interleaving_datasets Oct 17, 2025
@CloseChoice CloseChoice changed the title Fix random seed when shuffling and interleaving_datasets Fix random seed on shuffle and interleave_datasets Oct 17, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq
Copy link
Member

lhoestq commented Oct 20, 2025

Cool ! To avoid unwanted side effects it could be implemented for every class instead of using hasattr, and also return a new instance instead of modifying it in place. Some other functions work like that already like shuffle_data_sources and shard_data_sources.

Returning a new object is actually quite important otherwise iterating on the dataset multiple times would shift the RNGs every time

@CloseChoice
Copy link
Contributor Author

CloseChoice commented Oct 23, 2025

Cool ! To avoid unwanted side effects it could be implemented for every class instead of using hasattr, and also return a new instance instead of modifying it in place. Some other functions work like that already like shuffle_data_sources and shard_data_sources.

Returning a new object is actually quite important otherwise iterating on the dataset multiple times would shift the RNGs every time

Thanks for the review.

I managed to return instances of _BaseExamplesIterable instead of modifying inplace and this shifts the seed correctly even in the case of multiple iterations (see additional test). But I kept the hasattr for the reason that we have chained ex_iterables (so an ex_iterable can have other ex_iterables) and for the child ex_iterable it is not clear if they do have a generator attribute (e.g. ArrowExamplesIterable, RebatchedArrowExamplesIterable, SelectColumnsIterable, StepExamplesIterable, SkipExamplesIterable, RepeatExamplesIterable all don't have that). We could implement a generic method shift_rngs on the baseclass and then just return the ex_iterable and get rid of hasattr but I'd prefer the hasattr solution here. Let me now what you think about that.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it sounds good to me as is, we can improve later if needed

@lhoestq lhoestq merged commit a7600ac into huggingface:main Oct 24, 2025
10 of 14 checks passed
@CloseChoice CloseChoice deleted the fix-shuffle-seed-7567 branch October 24, 2025 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

interleave_datasets seed with multiple workers

3 participants