# Dataset

This submodule includes helper functions for downloading datasets and formatting them appropriately, as well as utilities for splitting data for training and testing.
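As a minimal sketch of the splitting step, assuming the loaded data is a pandas DataFrame, a random train/test split can be done with scikit-learn's `train_test_split` (the repository's own split utilities may differ in API):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy DataFrame standing in for a loaded dataset.
df = pd.DataFrame({"sentence": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# 80/20 random split with a fixed seed for reproducibility.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```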

## Data Loading

There are dataloaders for several datasets. For example, the `snli` module loads the SNLI dataset into a pandas DataFrame, with an option to limit the number of rows loaded, which is useful for quickly testing algorithms and running performance benchmarks. Most datasets can be loaded by split (train, dev, or test), for example:

```python
from utils_nlp.dataset.snli import load_pandas_df

# DATA_FOLDER is a placeholder for a local directory where the dataset
# is downloaded and cached.
DATA_FOLDER = "./data"

df = load_pandas_df(DATA_FOLDER, file_split="train", nrows=1000)
```
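For instance, here is a sketch that loads all three splits at once, assuming `"dev"` and `"test"` are also valid `file_split` values for the SNLI loader:

```python
from utils_nlp.dataset.snli import load_pandas_df

DATA_FOLDER = "./data"  # placeholder local cache directory

# Load each split; nrows caps the rows loaded for quick experimentation.
splits = {
    split: load_pandas_df(DATA_FOLDER, file_split=split, nrows=1000)
    for split in ("train", "dev", "test")
}
print({name: len(df) for name, df in splits.items()})
```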

## Dataset List

| Dataset | Dataloader script |
| --- | --- |
| Microsoft Research Paraphrase Corpus | `msrpc.py` |
| The Multi-Genre NLI (MultiNLI) Corpus | `multinli.py` |
| The Stanford Natural Language Inference (SNLI) Corpus | `snli.py` |
| Wikigold NER | `wikigold.py` |
| The Cross-Lingual NLI (XNLI) Corpus | `xnli.py` |
| The STSbenchmark dataset | `stsbenchmark.py` |
| The Stanford Question Answering Dataset (SQuAD) | `squad.py` |
| CNN/Daily Mail (CNN/DM) Dataset | `cnndm.py` |
| Preprocessed CNN/Daily Mail (CNN/DM) Dataset for Extractive Summarization | `cnndm.py` |
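Most of these modules are expected to follow the same `load_pandas_df` convention shown above; as a sketch, assuming `multinli.py` exposes a matching signature, loading MultiNLI would look analogous:

```python
from utils_nlp.dataset.multinli import load_pandas_df

# Placeholder local cache directory.
DATA_FOLDER = "./data"

# Assumed to mirror the SNLI loader: download, cache, and return one split.
df = load_pandas_df(DATA_FOLDER, file_split="train")
print(df.shape)
```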

## Dataset References

Please see the Dataset References document for notices and information regarding the datasets used.