Description
🚀 Feature
Motivation
torchtext provides several open-source NLP datasets in raw form. These datasets are provided as Iterables. There are times when a user wants to start the iterator at an arbitrary offset rather than from the beginning. One reference issue on this is #1265.
Pitch
We would like to have functionality that lets users start the raw iterator at an arbitrary offset.
Alternatives
A naive approach would be to simply materialize the iterator into a list, as follows, and then start the iteration at an arbitrary offset:
```python
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')
train_dataset = list(train_iter)  # materializes the entire dataset in memory
offset = 1000
train_dataset = train_dataset[offset:]
```
Unfortunately, this gives up the upside of having Iterable datasets: we have to materialize them explicitly, which can cause memory issues for large datasets.
Some attempts have been made in the past; have a look at PR #1281 for additional details. The implementation proposed in that PR is naive in the sense that it still iterates internally over the dataset, which can be time-consuming if the offset is huge.
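That linear-skip behavior can be expressed without materializing a list, e.g. with `itertools.islice`; a minimal sketch (using a plain generator as a stand-in for a raw dataset iterator) still pays the full cost of walking past the skipped samples:

```python
from itertools import islice

def skip_to_offset(iterator, offset):
    """Return an iterator starting at `offset`; still O(offset) time."""
    return islice(iterator, offset, None)

# A plain generator standing in for a raw dataset iterator.
samples = (f"sample-{i}" for i in range(10))
first_after_offset = next(skip_to_offset(samples, 3))
```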
The other extreme is to store offsets for all locations. This provides instant access to the iterator at any given offset, but it doesn't come for free: we have to store these offsets and make them accessible to the user in some form. There are some discussions on this here: #1281 (comment)
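For intuition, storing an offset per sample amounts to building a byte-offset index of the underlying file once and then seeking directly. A sketch of that idea (hypothetical helper names, a plain line-per-sample text file assumed) makes the O(n) storage cost visible:

```python
def build_line_index(path):
    """Record the byte offset where every line starts: O(n) storage."""
    index = []
    with open(path, 'rb') as f:
        pos = f.tell()
        for _ in f:
            index.append(pos)
            pos = f.tell()
    return index

def read_from(path, index, offset):
    """Seek directly to the `offset`-th line using the precomputed index."""
    with open(path, 'rb') as f:
        f.seek(index[offset])
        for line in f:
            yield line.decode('utf-8').rstrip('\n')
```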
Proposed solution
There is an interesting middle ground proposed by @cpuhrsch here #1281 (comment).
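The details live in the linked comment; roughly, the middle ground is to store an offset only every N samples ("checkpoints"), seek to the nearest checkpoint, and iterate over the short remainder. A sketch under those assumptions (names are illustrative, not the actual proposal's API):

```python
from itertools import islice

def build_checkpoints(path, every=1000):
    """Store a byte offset only every `every` lines: O(n / every) storage."""
    checkpoints = []
    with open(path, 'rb') as f:
        pos = f.tell()
        for i, _ in enumerate(f):
            if i % every == 0:
                checkpoints.append(pos)
            pos = f.tell()
    return checkpoints

def iter_from(path, checkpoints, offset, every=1000):
    """Seek to the nearest checkpoint, then skip at most `every` - 1 lines."""
    with open(path, 'rb') as f:
        f.seek(checkpoints[offset // every])
        lines = (line.decode('utf-8').rstrip('\n') for line in f)
        yield from islice(lines, offset % every, None)
```

This trades a bounded amount of extra iteration (at most `every - 1` samples) for a much smaller index.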
Background on datasets
torchtext NLP datasets can come in the following flavors, and the approach to providing an offset can vary between them:
1. Every line in the file corresponds to one data sample.
In this case, we simply read each line from the file, perform some operations on it (if necessary), and yield the data sample.
Example: AG_NEWS
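For category 1 the offset maps one-to-one to a line number, so a dataset iterator (a hypothetical sketch, not the actual torchtext helper) only needs to skip `offset` lines before yielding:

```python
def line_dataset(path, offset=0):
    """Category 1 sketch: one sample per line, starting at `offset` (0-based)."""
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i < offset:
                continue  # skip without doing any per-sample processing
            yield line.rstrip('\n')  # a real dataset would also parse the label
```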
2. Multiple lines in the file correspond to one data sample
In this case, multiple lines are read from the file and combined to form a data sample, which is then yielded.
Example: CoNLL2000Chunking
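For category 2 the offset counts samples, not lines. For a CoNLL-style file where blank lines separate samples, a sketch (hypothetical, simplified) could look like:

```python
def chunked_dataset(path, offset=0):
    """Category 2 sketch: yield blank-line-separated multi-line samples from `offset`."""
    sample, count = [], 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line:
                sample.append(line)
            elif sample:                  # a blank line closes the current sample
                if count >= offset:
                    yield sample
                sample, count = [], count + 1
        if sample and count >= offset:    # last sample may lack a trailing blank line
            yield sample
```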
3. Formatted datasets
In this case, the dataset is provided in some standard format (XML, JSON, etc.). We read the data from the file, remove the formatting, and perform collation (if necessary) to create the data sample, which is eventually yielded by the iterator.
Example: SQuAD1
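For category 3 the file must be parsed before individual samples exist at all, so the offset can only be applied after (or during) de-formatting. A sketch for a SQuAD-like JSON layout (the field names here are assumptions for illustration, not the exact schema):

```python
import json

def json_qa_dataset(path, offset=0):
    """Category 3 sketch: parse a SQuAD-like JSON file, then apply the offset."""
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    count = 0
    for article in data['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                if count >= offset:
                    yield qa['question'], qa['answers']
                count += 1
```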
Note that sometimes a dataset in category 3 above is converted into category 1 first. An example of such a case is the IWSLT2016 dataset.
Proposed workflow
- Pick an un-ticked dataset (or multiple datasets if you like :) ) from the list below, identify its category, and see if there is any reference PR in that category that can serve as a guide. (If there is none, you will be the first and lead the way, yay!!)
- Based on this idea, store the offsets in the same file where the dataset is defined. All the torchtext datasets have their own dedicated file in this folder.
- Modify the dataset function to accept an offset argument and implement the functionality. Note that you may also need to modify helper functions used inside the dataset functions (if this is not already done). For example, this function may need modification to accept an offset argument (if not done already).
- TESTING: Make sure to add a test that your offset works. For this you can simply follow the same testing as is done here for the first data sample yielded by the iterator; the only difference is that you would run the same test at some offset. Note that you would also need to modify this file to add a couple more items: the offset value and the hash of the corresponding data entry. For reference, see PR [WIP] adding offset and to map-style dataset options to Raw Text Dataset Iterator #1281 for how this has been done.
- Create a PR, and if the PR is merged successfully, come back here and tick the dataset(s).
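Putting the workflow together, a modified dataset function could look roughly like this (the `offset` keyword and both function names are illustrative, not the final torchtext API):

```python
from itertools import islice

def _apply_offset(iterator, offset=0):
    """Hypothetical helper: skip the first `offset` samples before yielding."""
    return islice(iterator, offset, None)

def dataset_with_offset(split='train', offset=0):
    """Sketch of a dataset function signature extended with an `offset` argument."""
    raw_iter = (f'{split}-sample-{i}' for i in range(100))  # stand-in for the real reader
    return _apply_offset(raw_iter, offset)
```

A checkpoint-based implementation would replace the body of `_apply_offset` with a seek to the nearest stored offset, as described in the proposed solution above.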
Datasets
- [ ] AG_NEWS
- [ ] AmazonReviewFull
- [ ] AmazonReviewPolarity
- [ ] CoNLL2000Chunking
- [ ] DBpedia
- [ ] EnWik9
- [ ] IMDB
- [ ] IWSLT2016
- [ ] IWSLT2017
- [ ] PennTreebank
- [ ] SQuAD1
- [ ] SQuAD2
- [ ] SogouNews
- [ ] UDPOS
- [ ] WikiText103
- [ ] WikiText2
- [ ] YahooAnswers
- [ ] YelpReviewFull
- [ ] YelpReviewPolarity
- [ ] Multi30k