Convert iterator-style raw datasets to map-style raw datasets #1296

Open
@parmeet

🚀 Feature

Motivation

torchtext provides several open-source NLP datasets in raw form. These datasets are provided as iterables. However, there are times when users may prefer map-style datasets.

Pitch

We would like to implement functionality that converts iterable datasets into map-style datasets. It could be implemented as a standalone function that takes the raw iterable dataset as input. Alternatively, it could be provided directly as a member function of the raw iterable dataset class.
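To make the functional variant concrete, here is a minimal sketch. The names `MapStyleDataset`, `to_map_style`, and `fake_raw_iter` are hypothetical and not part of torchtext; the generator merely stands in for a raw dataset iterator that yields (label, text) pairs. The sketch assumes that a map-style dataset only needs `__len__` and `__getitem__`. Because it materializes the data into a Python list, the multi-process memory concern raised in this issue still applies to it.

```python
class MapStyleDataset:
    """Hypothetical map-style view over a fully materialized iterable."""

    def __init__(self, iterable):
        # Consume the iterator once and keep every item in memory.
        self._data = list(iterable)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]


def to_map_style(iterable):
    """Functional form: takes a raw iterable and returns a map-style dataset."""
    return MapStyleDataset(iterable)


def fake_raw_iter():
    # Stand-in for a raw dataset iterator (assumption: items are (label, text)).
    yield ("pos", "a great movie")
    yield ("neg", "a dull movie")


dataset = to_map_style(fake_raw_iter())
print(len(dataset), dataset[1])
```

The member-function variant would expose the same conversion as a method on the raw iterable dataset class instead of a free function.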

Alternatives

A naive approach would be to simply materialize the iterator into a list as follows:

from torchtext.datasets import IMDB
train_iter = IMDB(split='train')
train_dataset = list(train_iter)

Unfortunately, passing a list to the PyTorch DataLoader would cause memory regressions with multi-process data loading. More details below.

Additional context

Note that the solution stated in Alternatives would have issues with multi-process data loading. This issue is discussed in detail here (pytorch/pytorch#13246). Another option would be to explore NumPy arrays, but it seems they would suffer from the same issue as lists; see this comment (pytorch/pytorch#13246 (comment)).

Potential Solution

Thanks to @cpuhrsch for proposing it here (#1281 (comment)).
One idea would be to create a thin C++ wrapper in which the data is stored in a std::array<std::string> data structure. This wrapper can then be bound to Python using pybind11. One potential caveat, though, is a performance regression due to the overhead of querying through the binding functions. Hopefully this cost is negligible compared to downstream processing.
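The C++ wrapper itself is not sketched here. As a rough pure-Python analogue of the storage idea only (keeping the strings out of per-item Python objects), the following sketch packs all strings into one bytes buffer plus an offsets array and decodes an item on access. This is a swapped-in technique for illustration, not the proposed pybind11 implementation; the class name `PackedStringDataset` is hypothetical, and whether this layout actually avoids the memory growth described in pytorch/pytorch#13246 is not verified here.

```python
from array import array


class PackedStringDataset:
    """Hypothetical map-style dataset storing strings in one contiguous buffer."""

    def __init__(self, strings):
        # offsets[i] and offsets[i + 1] delimit the UTF-8 bytes of item i.
        self._offsets = array("Q", [0])
        buf = bytearray()
        for s in strings:
            buf += s.encode("utf-8")
            self._offsets.append(len(buf))
        self._buf = bytes(buf)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError("index out of range")
        start, end = self._offsets[idx], self._offsets[idx + 1]
        # A new str object is created on every access; this per-query cost is
        # loosely comparable to the binding overhead mentioned above.
        return self._buf[start:end].decode("utf-8")


packed = PackedStringDataset(["a great movie", "", "naïve plot"])
print(len(packed), packed[2])
```

Each lookup pays a decode cost, which mirrors the trade-off noted for the bound wrapper: per-query overhead in exchange for a storage layout that does not hold one Python object per item.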

cc: @cpuhrsch
