# Dataset

This submodule includes helper functions for downloading datasets and formatting them appropriately, as well as utilities for splitting data for training and testing.
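As a minimal sketch of the splitting step, assuming the loaded data is a pandas DataFrame, a random train/test split can be done with scikit-learn's `train_test_split` (the repository's own split utilities may differ in API):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy DataFrame standing in for a loaded dataset.
df = pd.DataFrame({"sentence": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# 80/20 random split with a fixed seed for reproducibility.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```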

## Data Loading

There are dataloaders for several datasets. For example, the `snli` module loads the SNLI dataset into a pandas DataFrame, with an option to limit the number of rows loaded, which is useful for quickly testing algorithms and running performance benchmarks. Most datasets can be loaded by split (train, dev, or test), for example:

```python
from utils_nlp.dataset.snli import load_pandas_df

# DATA_FOLDER is a placeholder for a local directory where the dataset
# is downloaded and cached.
DATA_FOLDER = "./data"

df = load_pandas_df(DATA_FOLDER, file_split="train", nrows=1000)
```
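For instance, here is a sketch that loads all three splits at once, assuming `"dev"` and `"test"` are also valid `file_split` values for the SNLI loader:

```python
from utils_nlp.dataset.snli import load_pandas_df

DATA_FOLDER = "./data"  # placeholder local cache directory

# Load each split; nrows caps the rows loaded for quick experimentation.
splits = {
    split: load_pandas_df(DATA_FOLDER, file_split=split, nrows=1000)
    for split in ("train", "dev", "test")
}
print({name: len(df) for name, df in splits.items()})
```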

## Dataset List

| Dataset | Dataloader script |
| --- | --- |
| Microsoft Research Paraphrase Corpus | `msrpc.py` |
| The Multi-Genre NLI (MultiNLI) Corpus | `multinli.py` |
| The Stanford Natural Language Inference (SNLI) Corpus | `snli.py` |
| Wikigold NER | `wikigold.py` |
| The Cross-Lingual NLI (XNLI) Corpus | `xnli.py` |
| The STSbenchmark dataset | `stsbenchmark.py` |
| The Stanford Question Answering Dataset (SQuAD) | `squad.py` |
| CNN/Daily Mail (CNN/DM) Dataset | `cnndm.py` |
| Preprocessed CNN/Daily Mail (CNN/DM) Dataset for Extractive Summarization | `cnndm.py` |
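Most of these modules are expected to follow the same `load_pandas_df` convention shown above; as a sketch, assuming `multinli.py` exposes a matching signature, loading MultiNLI would look analogous:

```python
from utils_nlp.dataset.multinli import load_pandas_df

# Placeholder local cache directory.
DATA_FOLDER = "./data"

# Assumed to mirror the SNLI loader: download, cache, and return one split.
df = load_pandas_df(DATA_FOLDER, file_split="train")
print(df.shape)
```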

## Dataset References

Please see the Dataset References document for notices and information regarding the datasets used.