multi30k dataset only contains test set for French and no Czech data #762

Open
@bentrevett

The multi30k dataset provided by torchtext.datasets contains the training, validation, and test sets for English and German. For French, however, it only contains a test set, even though training and validation sets are available.

It also does not contain any of the Czech data, or any of the 2017 test data (which is not available in Czech).

The official multi30k repository (https://github.com/multi30k/dataset) contains all of the data at https://github.com/multi30k/dataset/tree/master/data/task1/raw, although getting the data from GitHub is slightly trickier than downloading it directly from the URLs torchtext already uses for multi30k.

I'm guessing that adding this can wait until #751 is finished?
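
For reference, here is a rough sketch of what fetching a file from the GitHub repository might look like. It is not how torchtext does it. Two things are assumptions on my part and should be checked against the repository listing: that raw files are served from raw.githubusercontent.com under the same path as the tree URL above, and that files are named `<split>.<lang>.gz` (for example `train.fr.gz`). The helper names `raw_url` and `fetch_lines` are hypothetical.

```python
import gzip
import urllib.request
from typing import List

# Assumption: the raw-content host mirrors the repository tree path given above.
RAW_BASE = "https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw"


def raw_url(split: str, lang: str) -> str:
    """Build the download URL for one split/language pair.

    Assumption: files are named "<split>.<lang>.gz", e.g. "train.fr.gz".
    """
    return f"{RAW_BASE}/{split}.{lang}.gz"


def fetch_lines(split: str, lang: str) -> List[str]:
    """Download one gzipped file and return its sentences, one per line."""
    with urllib.request.urlopen(raw_url(split, lang)) as response:
        data = gzip.decompress(response.read())
    return data.decode("utf-8").splitlines()
```

If those assumptions hold, `fetch_lines("train", "fr")` would return the French training sentences, and the same call with `"cs"` would cover Czech.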
