Description
🚀 Feature
Motivation
torchtext provides several open-source NLP datasets in raw form. These datasets are provided as iterables, though at times users may prefer map-style datasets.
Pitch
We would like to implement functionality that converts an iterable dataset into a map-style dataset. This could be implemented either as a functional API whose input is the raw iterable dataset, or directly as a member function on the raw iterable dataset class.
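As a sketch of what either variant might look like, the wrapper below materializes the iterable once and exposes `__getitem__`/`__len__` (the only interface PyTorch's `DataLoader` needs from a map-style dataset). The class and function names here are hypothetical, not an existing torchtext API:

```python
class MapStyleDataset:
    """Hypothetical wrapper turning an iterable dataset into a map-style one."""

    def __init__(self, iter_data):
        # Materialize the iterable once; random access afterwards is O(1).
        self._data = list(iter_data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]


def to_map_style_dataset(iter_data):
    """Hypothetical functional form of the same conversion."""
    return MapStyleDataset(iter_data)
```

Usage would be e.g. `train_dataset = to_map_style_dataset(IMDB(split='train'))`. Note that this sketch still stores a Python list internally, so it inherits the multi-processing memory caveat discussed under Alternatives.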
Alternatives
A naive approach would be to simply materialize iterator into list as follows:
```python
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
train_dataset = list(train_iter)
```
Unfortunately, passing a list to PyTorch's DataLoader causes memory regressions with multi-process data loading. More details below.
Additional context
Note that the solution stated in Alternatives runs into issues with multi-process data loading. This is discussed in detail in pytorch/pytorch#13246. Another option would be to explore numpy arrays, but they seem to suffer from the same issue as lists; see this comment (pytorch/pytorch#13246 (comment)).
Potential Solution
Thanks to @cpuhrsch for proposing it here (#1281 (comment))
One idea would be to create a thin C++ wrapper wherein the data is stored in a std::array<std::string> data structure. This wrapper could then be bound to Python using pybind11. One potential caveat is a performance regression due to the overhead of calling through the binding functions; hopefully this cost is negligible compared to downstream processing.
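To illustrate why contiguous storage sidesteps the copy-on-write problem, here is a pure-Python stand-in for such a wrapper: all strings live in one bytes buffer plus an offsets array, and individual Python string objects are created only on access. Everything below is an illustrative sketch of the idea, not the proposed C++ code:

```python
from array import array


class PackedStringStore:
    """Sketch of contiguous string storage, mimicking what a C++
    string-vector-backed wrapper bound via pybind11 would hold.

    Because no per-item Python objects exist until access, fork-based
    DataLoader workers do not touch per-item refcounts, so copy-on-write
    pages of the stored data can stay shared.
    """

    def __init__(self, strings):
        offsets = array("Q", [0])  # cumulative byte offsets into the buffer
        chunks = []
        for s in strings:
            b = s.encode("utf-8")
            chunks.append(b)
            offsets.append(offsets[-1] + len(b))
        self._buf = b"".join(chunks)  # one contiguous buffer
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start, end = self._offsets[idx], self._offsets[idx + 1]
        # A Python str is materialized here, on demand.
        return self._buf[start:end].decode("utf-8")
```

The same on-demand materialization is where the caveat about binding overhead comes in: every `__getitem__` call crosses the wrapper boundary and constructs a fresh Python string.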
cc: @cpuhrsch