nmusolino

Currently, when the Python package's Dataset class is constructed with a list of user-supplied Sequence objects, the Dataset samples rows at random, one row at a time. (The number of samples is controlled by the bin_construct_sample_cnt parameter.)

With this PR, the Dataset class samples rows in batches: the randomly generated row indices are grouped according to each Sequence object's length and batch size. When all rows are being sampled, the Sequence objects are indexed with slices.
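As a rough illustration of the grouping described above, here is a minimal sketch in plain Python. It is not the code in this PR: the helper name batch_row_indices and its return shape are hypothetical, and it assumes that sampled indices are global row numbers across the concatenated Sequence objects and that every Sequence uses the same batch size.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def batch_row_indices(indices, seq_lengths, batch_size):
    """Group global row indices by (sequence, batch); hypothetical helper.

    Returns a list of (seq_id, local_indices) pairs, where local_indices
    are positions within that sequence that all fall in the same batch.
    """
    # ends[i] is the global index one past the last row of sequence i.
    ends = list(accumulate(seq_lengths))

    def locate(global_idx):
        seq_id = bisect_right(ends, global_idx)
        start = ends[seq_id - 1] if seq_id > 0 else 0
        local = global_idx - start
        return seq_id, local // batch_size, local

    # Sorting first keeps rows of the same sequence and batch adjacent.
    located = [locate(i) for i in sorted(indices)]
    return [
        (seq_id, [local for _, _, local in group])
        for (seq_id, _), group in groupby(located, key=lambda t: (t[0], t[1]))
    ]


# Two sequences of 5 and 4 rows, batch size 2, five sampled rows.
groups = batch_row_indices([8, 0, 1, 6, 3], seq_lengths=[5, 4], batch_size=2)
print(groups)
```

Each resulting group can then be requested from its Sequence in one indexing call instead of one call per row.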

The goal is to allow user-defined Sequence classes to provide data more efficiently or with better overall performance.
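To make the potential saving concrete, the sketch below uses a hypothetical stand-in class, CountingRows, rather than lightgbm.Sequence itself. It assumes a Sequence-like object that accepts an int, a slice, or a list of ints, and it counts indexing calls as a proxy for round trips to the underlying storage. The exact index types the Dataset passes are determined by the PR's diff, which is not shown here.

```python
class CountingRows:
    """Hypothetical stand-in for a user-defined Sequence (not lightgbm.Sequence)."""

    batch_size = 4

    def __init__(self, rows):
        self._rows = list(rows)
        self.calls = 0  # number of __getitem__ calls made so far

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        self.calls += 1
        if isinstance(idx, (int, slice)):
            return self._rows[idx]
        # Assumed here: a list of ints selects several rows in one call.
        return [self._rows[i] for i in idx]


rows = CountingRows([[float(i), float(i) * 2] for i in range(10)])
wanted = [1, 2, 3, 6]  # rows 1-3 fall in the first batch, row 6 in the second

# Row-by-row access: one call per sampled row.
one_by_one = [rows[i] for i in wanted]
calls_row_by_row = rows.calls

# Batched access: one call per batch that contains sampled rows.
rows.calls = 0
batched = rows[wanted[:3]] + rows[wanted[3:]]
calls_batched = rows.calls

print(calls_row_by_row, calls_batched)
```

For a Sequence backed by a file or a remote store, fewer indexing calls can mean fewer reads, which is where a user-defined class would be able to benefit.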

See #7006.

@nmusolino nmusolino changed the title from "[python-package] Sample from Sequence objects in batches, rather than row-by-row" to "Draft: [python-package] Sample from Sequence objects in batches, rather than row-by-row" on Aug 21, 2025
@nmusolino nmusolino marked this pull request as draft August 21, 2025 21:23
@nmusolino
Author

@nmusolino please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

@microsoft-github-policy-service agree
