Convert iterator-style raw datasets to map-style raw datasets

## 🚀 Feature


**Motivation**

torchtext provides several open-source NLP datasets in raw form. These datasets are provided as [Iterables](https://pytorch.org/docs/stable/data.html#iterable-style-datasets). However, there are times when users may prefer [map-style](https://pytorch.org/docs/stable/data.html#map-style-datasets) datasets.

**Pitch**

We would like to implement functionality that converts iterable datasets into map-style datasets. It could be implemented as a standalone function that takes the raw iterable dataset as input. Alternatively, it could be provided directly as a member function of the [raw dataset iterable class](https://github.com/pytorch/text/blob/master/torchtext/data/datasets_utils.py#L282).
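To make the functional variant concrete, here is a minimal sketch of what the interface could look like. The names `MapStyleDataset` and `to_map_style_dataset` are hypothetical and only illustrate the shape of the API; the class relies on the assumption that a map-style dataset only needs `__len__` and `__getitem__`. Note that it stores samples in a plain Python list, so it would inherit the multi-processing memory concern described under Alternatives.

```python
from typing import Any, Iterable, List


class MapStyleDataset:
    """Hypothetical map-style wrapper: supports len() and integer indexing."""

    def __init__(self, iterable: Iterable[Any]) -> None:
        # Consume the iterable once and keep every sample in memory.
        self._data: List[Any] = list(iterable)

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int) -> Any:
        return self._data[idx]


def to_map_style_dataset(iterable: Iterable[Any]) -> MapStyleDataset:
    """Hypothetical functional entry point: iterable dataset in, map-style out."""
    return MapStyleDataset(iterable)


# Stand-in for a raw iterable dataset yielding (label, text) pairs.
samples = iter([("neg", "bad movie"), ("pos", "good movie")])
dataset = to_map_style_dataset(samples)
```

With this shape, `len(dataset)` and `dataset[i]` work as expected for random access and shuffling.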

**Alternatives**

A naive approach would be to simply materialize the iterator into a list as follows:

```
from torchtext.datasets import IMDB
train_iter = IMDB(split='train')
train_dataset = list(train_iter)
```

Unfortunately, passing a list to the PyTorch DataLoader would cause memory regressions with multi-process data loading. More details below.

**Additional context**

Note that the solution stated in Alternatives would have issues when doing multi-process data loading. This issue is discussed in detail here (https://github.com/pytorch/pytorch/issues/13246). Another option would be to explore NumPy arrays, but it seems they would suffer from the same issue as lists; refer to this comment (https://github.com/pytorch/pytorch/issues/13246#issuecomment-445770039).

**Potential Solution**

Thanks to @cpuhrsch for proposing it here (https://github.com/pytorch/text/pull/1281#discussion_r618908965).
One idea would be to create a thin C++ wrapper in which the data is stored in a `std::vector<std::string>` data structure. This wrapper can then be bound to Python using [pybind11](https://github.com/pybind/pybind11). One potential caveat, though, is a performance regression due to the overhead of querying through the binding functions. Hopefully this cost is negligible compared to downstream processing.
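
The sketch below is not the proposed C++/pybind11 wrapper. It is a pure-Python stand-in for the same underlying idea: keep all text in one contiguous buffer instead of many small Python objects, and decode a sample only when it is requested. `PackedStringDataset` is a hypothetical name, and whether this layout actually avoids the memory growth described in the linked PyTorch issue is an assumption that would need to be measured.

```python
from array import array
from typing import Iterable


class PackedStringDataset:
    """Hypothetical map-style store: one bytes buffer plus an offsets array."""

    def __init__(self, strings: Iterable[str]) -> None:
        chunks = []
        # offsets[i] and offsets[i + 1] delimit the i-th string in the buffer.
        self._offsets = array("Q", [0])
        total = 0
        for s in strings:
            encoded = s.encode("utf-8")
            chunks.append(encoded)
            total += len(encoded)
            self._offsets.append(total)
        # A single contiguous buffer holding every encoded string.
        self._buffer = b"".join(chunks)

    def __len__(self) -> int:
        return len(self._offsets) - 1

    def __getitem__(self, idx: int) -> str:
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start, end = self._offsets[idx], self._offsets[idx + 1]
        # Decode lazily, so only the requested sample becomes a Python str.
        return self._buffer[start:end].decode("utf-8")


packed = PackedStringDataset(["good movie", "très bien", ""])
```

The trade-off mirrors the caveat above: every lookup pays a slicing and decoding cost, in exchange for holding far fewer long-lived Python objects.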

cc: @cpuhrsch
