
Support Cycling a CombinedStreamingDataset #658

@shaoyuancc

🚀 Feature

Allow a length to be passed as an input argument to CombinedStreamingDataset, so that the epoch length is decoupled from the number of samples in the dataset. This would mirror what ParallelStreamingDataset already supports.
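To make the requested semantics concrete, here is a rough sketch in plain Python. It does not use litdata and is not how the library implements anything; the function name `cycled_weighted_mix` and its parameters are hypothetical, chosen only for illustration. It assumes every source is non-empty and can be re-iterated from the start.

```python
import random
from typing import Iterator, List, Sequence


def cycled_weighted_mix(
    sources: List[Sequence], weights: Sequence[float], length: int, seed: int = 42
) -> Iterator:
    """Yield exactly `length` samples, picking a source per step by weight.

    A source that runs out is restarted from its beginning, so `length`
    is independent of how many samples the sources contain.
    Hypothetical helper: assumes every source is non-empty.
    """
    rng = random.Random(seed)
    iterators = [iter(src) for src in sources]
    for _ in range(length):
        # Choose which source to draw from, proportionally to its weight.
        idx = rng.choices(range(len(sources)), weights=weights, k=1)[0]
        try:
            sample = next(iterators[idx])
        except StopIteration:
            # The chosen source is exhausted: cycle it and draw again.
            iterators[idx] = iter(sources[idx])
            sample = next(iterators[idx])
        yield sample


# One "epoch" of 10 samples drawn from sources holding only 3 samples in total.
epoch = list(cycled_weighted_mix([["a0", "a1"], ["b0"]], [0.7, 0.3], length=10))
```

With a single source `[1, 2]` and `length=5`, the sketch yields `1, 2, 1, 2, 1`: the epoch length is set by `length`, not by the size of the data.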

Motivation

I want to create a CombinedStreamingDataset that is a weighted combination of StreamingDatasets, while being able to specify the number of training steps arbitrarily, cycling the CombinedStreamingDataset as needed. As discussed with @tchaton.

Related to #524

Alternatives

I'm not sure whether this would work, but conceptually one workaround might be to wrap the CombinedStreamingDataset in a ParallelStreamingDataset, e.g.:

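from litdata import CombinedStreamingDataset, ParallelStreamingDataset, StreamingDataset

ds1 = StreamingDataset(...)
ds2 = StreamingDataset(...)
# weights passed by keyword; the second positional parameter is assumed to be the seed
cds = CombinedStreamingDataset([ds1, ds2], weights=weights)
# untested idea: wrap the combined dataset so it is cycled to a fixed length
pds = ParallelStreamingDataset([cds], length=100)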

Labels: enhancement (New feature or request)