Description
🚀 Feature
Motivation
torchtext provides several open-source NLP datasets in raw form. These datasets are provided as Iterables. There are times when a user wants to start the iterator at an arbitrary offset rather than from the beginning. One reference issue on this is #1265.
Pitch
We would like to have functionality that lets users start the raw iterator at an arbitrary offset.
Alternatives
A naive approach would be to simply materialize the iterator into a list, as follows, and then start the iteration at an arbitrary offset:
```python
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
train_dataset = list(train_iter)  # materializes the entire dataset in memory
offset = 1000
train_dataset = train_dataset[offset:]
```
Unfortunately, this gives up the upside of having Iterable datasets: we have to materialize them explicitly, which can cause memory issues for large datasets.
Some attempts have been made in the past; have a look at PR #1281 for additional details. The implementation proposed in that PR is naive in the sense that it still iterates internally over the dataset, which can be time-consuming if the offset is huge.
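That linear-skip behavior can be expressed without materializing a list, e.g. with `itertools.islice`; a minimal sketch (using a plain generator as a stand-in for a raw dataset iterator) still pays the full cost of walking past the skipped samples:

```python
from itertools import islice

def skip_to_offset(iterator, offset):
    """Return an iterator starting at `offset`; still O(offset) time."""
    return islice(iterator, offset, None)

# A plain generator standing in for a raw dataset iterator.
samples = (f"sample-{i}" for i in range(10))
first_after_offset = next(skip_to_offset(samples, 3))
```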
The other extreme is to store offsets for all locations. This provides instant access to the iterator at any given offset, but it doesn't come for free: we have to store these offsets and make them accessible to the user in some form. There are some discussions on this here: #1281 (comment)
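For intuition, storing an offset per sample amounts to building a byte-offset index of the underlying file once and then seeking directly. A sketch of that idea (hypothetical helper names, a plain line-per-sample text file assumed) makes the O(n) storage cost visible:

```python
def build_line_index(path):
    """Record the byte offset where every line starts: O(n) storage."""
    index = []
    with open(path, 'rb') as f:
        pos = f.tell()
        for _ in f:
            index.append(pos)
            pos = f.tell()
    return index

def read_from(path, index, offset):
    """Seek directly to the `offset`-th line using the precomputed index."""
    with open(path, 'rb') as f:
        f.seek(index[offset])
        for line in f:
            yield line.decode('utf-8').rstrip('\n')
```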
Proposed solution
There is an interesting middle ground proposed by @cpuhrsch here #1281 (comment).
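The details live in the linked comment; roughly, the middle ground is to store an offset only every N samples ("checkpoints"), seek to the nearest checkpoint, and iterate over the short remainder. A sketch under those assumptions (names are illustrative, not the actual proposal's API):

```python
from itertools import islice

def build_checkpoints(path, every=1000):
    """Store a byte offset only every `every` lines: O(n / every) storage."""
    checkpoints = []
    with open(path, 'rb') as f:
        pos = f.tell()
        for i, _ in enumerate(f):
            if i % every == 0:
                checkpoints.append(pos)
            pos = f.tell()
    return checkpoints

def iter_from(path, checkpoints, offset, every=1000):
    """Seek to the nearest checkpoint, then skip at most `every` - 1 lines."""
    with open(path, 'rb') as f:
        f.seek(checkpoints[offset // every])
        lines = (line.decode('utf-8').rstrip('\n') for line in f)
        yield from islice(lines, offset % every, None)
```

This trades a bounded amount of extra iteration (at most `every - 1` samples) for a much smaller index.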
Background on datasets
torchtext NLP datasets can come in the following flavors, and the approach to providing an offset can vary between them:
1. Every line in the file corresponds to one data sample.
In this case, we simply read each line from the file, perform some operations on it (if necessary), and yield the data sample.
Example: AG_NEWS
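For category 1 the offset maps one-to-one to a line number, so a dataset iterator (a hypothetical sketch, not the actual torchtext helper) only needs to skip `offset` lines before yielding:

```python
def line_dataset(path, offset=0):
    """Category 1 sketch: one sample per line, starting at `offset` (0-based)."""
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i < offset:
                continue  # skip without doing any per-sample processing
            yield line.rstrip('\n')  # a real dataset would also parse the label
```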
2. Multiple lines in the file correspond to one data sample
In this case, multiple lines are read from the file and combined to form a data sample, which is then yielded.
Example: CoNLL2000Chunking
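For category 2 the offset counts samples, not lines. For a CoNLL-style file where blank lines separate samples, a sketch (hypothetical, simplified) could look like:

```python
def chunked_dataset(path, offset=0):
    """Category 2 sketch: yield blank-line-separated multi-line samples from `offset`."""
    sample, count = [], 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line:
                sample.append(line)
            elif sample:                  # a blank line closes the current sample
                if count >= offset:
                    yield sample
                sample, count = [], count + 1
        if sample and count >= offset:    # last sample may lack a trailing blank line
            yield sample
```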
3. Formatted datasets
In this case, the dataset is provided in some standard format (XML, JSON, etc.). We read the data from the file, remove the formatting, and perform collation (if necessary) to create the data sample, which is eventually yielded by the iterator.
Example: SQuAD1
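For category 3 the file must be parsed before individual samples exist at all, so the offset can only be applied after (or during) de-formatting. A sketch for a SQuAD-like JSON layout (the field names here are assumptions for illustration, not the exact schema):

```python
import json

def json_qa_dataset(path, offset=0):
    """Category 3 sketch: parse a SQuAD-like JSON file, then apply the offset."""
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    count = 0
    for article in data['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                if count >= offset:
                    yield qa['question'], qa['answers']
                count += 1
```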
Note that sometimes a dataset in category 3 above is converted into category 1 first. An example of such a case is the IWSLT2016 dataset.
Proposed workflow
- Pick an un-ticked dataset (or multiple datasets if you like :) ) from the list below, identify its category, and see if there is any reference PR in that category that can serve as a guide. (If there is none, you will be the first and lead the way, yay!!)
- Based on this idea, store the offsets in the same file where the dataset is defined. All the torchtext datasets have their own dedicated file in this folder.
- Modify the dataset function to accept an offset argument and implement the functionality. Note that you may also need to modify helper functions used inside the dataset functions (if this is not already done). For example, this function may need modification to accept an offset argument (if not done already).
- TESTING: Make sure to add a test that your offset works. For this you can simply follow the same testing as is done here for the first data sample yielded by the iterator; the only difference is that you would run the same test at some offset. Note that you would also need to modify this file to add a couple more items: the offset value and the hash of the corresponding data entry. For reference, see PR [WIP] adding offset and to map-style dataset options to Raw Text Dataset Iterator #1281 for how this has been done.
- Create a PR, and if the PR is merged successfully, come back here and tick the dataset(s).
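Putting the workflow together, a modified dataset function could look roughly like this (the `offset` keyword and both function names are illustrative, not the final torchtext API):

```python
from itertools import islice

def _apply_offset(iterator, offset=0):
    """Hypothetical helper: skip the first `offset` samples before yielding."""
    return islice(iterator, offset, None)

def dataset_with_offset(split='train', offset=0):
    """Sketch of a dataset function signature extended with an `offset` argument."""
    raw_iter = (f'{split}-sample-{i}' for i in range(100))  # stand-in for the real reader
    return _apply_offset(raw_iter, offset)
```

A checkpoint-based implementation would replace the body of `_apply_offset` with a seek to the nearest stored offset, as described in the proposed solution above.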
Datasets
- [ ] AG_NEWS
- [ ] AmazonReviewFull
- [ ] AmazonReviewPolarity
- [ ] CoNLL2000Chunking
- [ ] DBpedia
- [ ] EnWik9
- [ ] IMDB
- [ ] IWSLT2016
- [ ] IWSLT2017
- [ ] PennTreebank
- [ ] SQuAD1
- [ ] SQuAD2
- [ ] SogouNews
- [ ] UDPOS
- [ ] WikiText103
- [ ] WikiText2
- [ ] YahooAnswers
- [ ] YelpReviewFull
- [ ] YelpReviewPolarity
- [ ] Multi30k