Releases: huggingface/datasets
1.5.0
Datasets changes
- New: Europarl Bilingual #1874 (@lucadiliello)
- New: Stanford Sentiment Treebank #1961 (@patpizio)
- New: RO-STS #1978 (@lorinczb)
- New: newspop #1871 (@frankier)
- New: FashionMNIST #1999 (@gchhablani)
- New: Common voice #1886 (@BirgerMoell), #2063 (@patrickvonplaten)
- New: Cryptonite #2013 (@theo-m)
- New: RoSent #2011 (@gchhablani)
- New: PersiNLU reading-comprehension #2028 (@danyaljj)
- New: conllpp #1991 (@ZihanWangKi)
- New: LaRoSeDa #2004 (@MihaelaGaman)
- Update: unnecessary docstart check in conll-like datasets #2020 (@mariosasko)
- Update: semeval 2020 task 11 - add article_id and process test set template #1979 (@hemildesai)
- Update: Md gender - card update #2018 (@mcmillanmajora)
- Update: XQuAD - add Romanian #2023 (@M-Salti)
- Update: DROP - all answers #1980 (@KaijuML)
- Fix: TIMIT ASR - Make sure not only the first sample is used #1995 (@patrickvonplaten)
- Fix: Wikipedia - save memory by replacing root.clear with elem.clear #2037 (@miyamonz)
- Fix: Doc2dial update data_infos and data_loaders #2041 (@songfeng)
- Fix: ZEST - update download link #2057 (@matt-peters)
- Fix: ted_talks_iwslt - fix version error #2064 (@mariosasko)
Datasets Features
- Implement Dataset from CSV #1946 (@albertvillanova)
- Implement Dataset from JSON and JSON Lines #1943 (@albertvillanova)
- Implement Dataset from text #2030 (@albertvillanova)
- Optimize int precision for tokenization #1985 (@albertvillanova)
- This can save 75%+ of space when tokenizing a dataset
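As a rough back-of-the-envelope check on that figure, the sketch below compares per-token storage when every tokenizer column is a 64-bit integer against narrower types. The column names and the chosen widths (32-bit ids, 8-bit masks) are assumptions made for illustration; the release note does not say which types the library picks.

```python
import struct

# Bytes per value for fixed-width signed integers (standard sizes).
INT64 = struct.calcsize("<q")
INT32 = struct.calcsize("<i")
INT8 = struct.calcsize("<b")

# Hypothetical tokenizer output columns, one value per token each.
columns = ["input_ids", "token_type_ids", "attention_mask"]

# Assumed narrower types: ids need a wide range, masks only hold small values.
optimized = {"input_ids": INT32, "token_type_ids": INT8, "attention_mask": INT8}

bytes_before = len(columns) * INT64
bytes_after = sum(optimized[name] for name in columns)
saving = 1 - bytes_after / bytes_before

print(f"{bytes_before} -> {bytes_after} bytes per token ({saving:.0%} saved)")
```

Under these assumptions the saving comes entirely from the masks and ids no longer paying for 64 bits each; a different column mix would give a different percentage.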
General Bug fixes and improvements
- Fix ArrowWriter closes stream at exit #1971 (@albertvillanova)
- feat(docs): navigate with left/right arrow keys #1974 (@ydcjeff)
- Fix various typos/grammar in the docs #2008 (@mariosasko)
- Update format columns in Dataset.rename_columns #2027 (@mariosasko)
- Replace print with logging in dataset scripts #2019 (@mariosasko)
- Raise an error for outdated sacrebleu versions #2033 (@lhoestq)
- Not all languages have 2 digit codes. #2016 (@asiddhant)
- Fix arrow memory checks issue in tests #2042 (@lhoestq)
- Support pickle protocol for dataset splits defined as ReadInstruction #2043 (@mariosasko)
- Preserve column ordering in Dataset.rename_column #2045 (@mariosasko)
- Fix text-classification tags #2049 (@gchhablani)
- Fix docstring rendering of Dataset/DatasetDict.from_csv args #2066 (@albertvillanova)
- Fixes check of TF_AVAILABLE and TORCH_AVAILABLE #2073 (@philschmid)
- Add and fix docstring for NamedSplit #2069 (@albertvillanova)
- Bump huggingface_hub version #2077 (@SBrandeis)
- Fix docstring issues #2072 (@albertvillanova)
1.4.1
1.4.0
Datasets Changes
- New: iapp_wiki_qa_squad #1873 (@cstorm125)
- New: Financial PhraseBank #1866 (@frankier)
- New: CoVoST2 #1935 (@patil-suraj)
- New: TIMIT #1903 (@vrindaprabhu)
- New: Mlama (multilingual lama) #1931 (@pdufter)
- New: FewRel #1823 (@gchhablani)
- New: CCAligned Multilingual Dataset #1815 (@gchhablani)
- New: Turkish News Category Lite #1967 (@yavuzKomecoglu)
- Update: WMT - use mirror links #1912 for better download speed (@lhoestq)
- Update: multi_nli - add missing fields #1950 (@bhavitvyamalik)
- Fix: ALT - fix duplicated examples in alt-parallel #1899 (@lhoestq)
- Fix: WMT datasets - fix download errors #1901 (@YangWang92), #1902 (@lhoestq)
- Fix: QA4MRE - fix download URLs #1918 (@M-Salti)
- Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" #1925 (@lhoestq)
- Fix: Wiki_dpr - add missing scalar quantizer #1926 (@lhoestq)
- Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM #1970 (@yjernite)
Datasets Features
- Add to_dict and to_pandas for Dataset #1889 (@SBrandeis)
- Add to_csv for Dataset #1887 (@SBrandeis)
- Add keep_linebreaks parameter to text loader #1913 (@lhoestq)
- Add not-in-place implementations for several dataset transforms #1883 (@SBrandeis):
- This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
- The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
- Make DownloadManager downloaded/extracted paths accessible #1846 (@albertvillanova)
- Add cross-platform support for datasets-cli #1951 (@mariosasko)
Metrics Changes
Offline loading
- Handle timeouts #1952 (@lhoestq)
- Add datasets full offline mode with HF_DATASETS_OFFLINE #1976 (@lhoestq)
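A minimal sketch of switching offline mode on from Python. It assumes the variable is read when the library is first imported (so it has to be set beforehand) and that "1" is an accepted value; the helper name is hypothetical.

```python
import os

def enable_datasets_offline(env=None):
    """Set HF_DATASETS_OFFLINE in the given mapping (default: os.environ)."""
    target = os.environ if env is None else env
    target["HF_DATASETS_OFFLINE"] = "1"
    return target

# Set the flag first, then import the library in the same process.
enable_datasets_offline()
# import datasets  # uncomment where the library is installed

# The same helper can prepare an environment for a child process:
child_env = enable_datasets_offline(dict(os.environ))
```

Setting the variable in the shell before launching Python has the same effect and avoids any dependence on import order.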
General improvements and bugfixes
- Replace flatten_nested #1879 (@albertvillanova)
- add missing info on how to add large files #1885 (@stas00)
- Docs for adding new column on formatted dataset #1888 (@lhoestq)
- Fix PandasArrayExtensionArray conversion to native type #1897 (@lhoestq)
- Bugfix for string_to_arrow timestamp[ns] support #1900 (@justin-yan)
- Fix to_pandas for boolean ArrayXD #1904 (@lhoestq)
- Fix logging imports and make all datasets use library logger #1914 (@albertvillanova)
- Standardizing datasets dtypes #1921 (@justin-yan)
- Remove unused py_utils objects #1916 (@albertvillanova)
- Fix save_to_disk with relative path #1923 (@lhoestq)
- Updating old cards #1928 (@mcmillanmajora)
- Improve typing and style and fix some inconsistencies #1929 (@mariosasko)
- Fix builder config creation with data_dir #1932 (@lhoestq)
- Disallow ClassLabel with no names #1938 (@lhoestq)
- Update documentation with not in place transforms and update DatasetDict #1947 (@lhoestq)
- Documentation for to_csv, to_pandas and to_dict #1953 (@lhoestq)
- typos + grammar #1955 (@stas00)
- Fix unused arguments #1962 (@mariosasko)
- Fix metrics collision in separate multiprocessed experiments #1966 (@lhoestq)
1.3.0
Dataset Features
- On-the-fly data transforms (#1795)
- Add S3 support for downloading and uploading processed datasets (#1723)
- Allow loading dataset in-memory (#1792)
- Support future datasets (#1813)
- Enable/disable caching (#1703)
- Offline dataset loading (#1726)
Datasets Hub Features
- Loading from the Datasets Hub (#1860)
This allows users to create their own dataset repositories in the Datasets Hub and then load them using the library.
Repositories can be created on the website: https://huggingface.co/new-dataset or using the huggingface-cli. More information is available in the dataset sharing section of the documentation.
Dataset Changes
- New: LJ Speech (#1878)
- New: Add Hindi Discourse Analysis Natural Language Inference Dataset (#1822)
- New: cord 19 (#1850)
- New: Tweet Eval Dataset (#1829)
- New: CIFAR-100 Dataset (#1812)
- New: SICK (#1804)
- New: BBC Hindi NLI Dataset (#1158)
- New: Freebase QA Dataset (#1814)
- New: Arabic sarcasm (#1798)
- New: Semantic Scholar Open Research Corpus (#1606)
- New: DuoRC Dataset (#1800)
- New: Aggregated dataset for the GEM benchmark (#1807)
- New: CC-News dataset of English language articles (#1323)
- New: irc disentangle (#1586)
- New: Narrative QA Manual (#1778)
- New: Universal Morphologies (#1174)
- New: SILICONE (#1761)
- New: Librispeech ASR (#1767)
- New: OSCAR (#1694, #1868, #1833)
- New: CANER Corpus (#1684)
- New: Arabic Speech Corpus (#1852)
- New: id_liputan6 (#1740)
- New: Structured Argument Extraction for Korean dataset (#1748)
- New: TurkCorpus (#1732)
- New: Hatexplain Dataset (#1716)
- New: adversarialQA (#1714)
- Update: Doc2dial - reading comprehension update to latest version (#1816)
- Update: OPUS Open Subtitles - add with metadata information (#1865)
- Update: SWDA - use all metadata features (#1799)
- Update: SWDA - add metadata and correct splits (#1749)
- Update: CommonGen - update citation information (#1787)
- Update: SciFact - update URL (#1780)
- Update: BrWaC - update features name (#1736)
- Update: TLC - update urls to be github links (#1737)
- Update: Ted Talks IWSLT - add new version: WIT3 (#1676)
- Fix: multi_woz_v22 - fix checksums (#1880)
- Fix: limit - fix url (#1861)
- Fix: WebNLG - fix test set + more fields (#1739)
- Fix: PAWS-X - fix csv Dictreader splitting data on quotes (#1763)
- Fix: reuters - add missing "brief" entries (#1744)
- Fix: thainer - empty token bug (#1734)
- Fix: lst20 - empty token bug (#1734)
Metrics Changes
- New: Word Error Metric (#1847)
- New: COMET (#1577, #1753)
- Fix: bert_score - set version dependency (#1851)
Metric Docs
- Add metrics usage examples and tests (#1820)
CLI Changes
- [BREAKING] remove outdated commands (#1869):
- remove outdated "datasets-cli upload_dataset" and "datasets-cli upload_metric"
- instead, use the huggingface-hub CLI
Bug fixes
- fix writing GPU Faiss index (#1862)
- update pyarrow import warning (#1782)
- Ignore definition line number of functions for caching (#1779)
- update saving and loading methods for faiss index to accept path-like objects (#1663)
- Print error message with filename when malformed CSV (#1826)
- Fix default tensors precision when format is set to PyTorch and TensorFlow (#1795)
Refactoring
- Refactoring: Create config module (#1848)
- Use a config id in the cache directory names for custom configs (#1754)
Logging
- Enable logging propagation and remove logging handler (#1845)
1.2.1
New Features
- Fast start up (#1690): Importing datasets is now significantly faster.
Datasets Changes
- New: MNIST (#1730)
- New: Korean intonation-aided intention identification dataset (#1715)
- New: Switchboard Dialog Act Corpus (#1678)
- Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (#1722)
- Update: Scientific papers - Mirror datasets zip (#1721)
- Update: Update DBRD dataset card and download URL (#1699)
- Fix: Thainer - fix ner_tag bugs (#1695)
- Fix: reuters21578 - metadata parsing errors (#1693)
- Fix: ade_corpus_v2 - fix config names (#1689)
- Fix: DaNE - fix last example (#1688)
Datasets tagging
- rename "part-of-speech-tagging" tag in some dataset cards (#1645)
Bug Fixes
- Fix column list comparison in transmit format (#1719)
- Fix windows path scheme in cached path (#1711)
Docs
- Add information about caching and verifications in "Load a Dataset" docs (#1705)
Moreover, many dataset cards of datasets added during the sprint were updated! Thanks to all the contributors :)
1.2.0
1.1.3
Datasets changes
- New: NLI-Tr (#787)
- New: Amazon Reviews (#791)(#844)(#845)(#799)
- New: ASNQ - answer sentence selection (#780)
- New: OpenBookCorpus (#856)
- New: ASLG-PC12 - sign language translation (#731)
- New: Quail - question answering dataset (#747)
- Update: SNLI: Created dataset card snli.md (#663)
- Update: csv - Use pandas reader in csv (#857)
- Better memory management
- Breaking: the previous read_options, parse_options and convert_options are replaced with plain parameters like pandas.read_csv
- Update: conll2000, conll2003, germeval_14, wnut_17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (#850)
- Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
- Update: XNLI - Add XNLI train set (#781)
- Update: XSUM - Use full released xsum dataset (#754)
- Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (#748)
- Update: CLUE - add OCNLI, a new CLUE dataset (#742)
- Fix: KOR-NLI - Fix csv reader (#855)
- Fix: Discofuse - fix discofuse urls (#793)
- Fix: Emotion - fix description (#745)
- Fix: TREC - update urls (#740)
Metrics changes
- New: accuracy, precision, recall and F1 metrics (#825)
- Fix: squad_v2 (#840)
- Fix: seqeval (#810)(#738)
- Fix: Rouge - fix description (#774)
- Fix: GLUE - fix description (#734)
- Fix: BertScore - fix custom baseline (#763)
Command line tools
- add clear_cache parameter in the test command (#863)
Dependencies
- Integrate file_lock inside the lib for better logging control (#859)
Dataset features
- Add writer_batch_size attribute to GeneratorBasedBuilder (#828)
- pretty print dataset objects (#725)
- allow custom split names in text dataset (#776)
Tests
- All configs is a slow test now
Bug fixes
1.1.2
1.1.0: Windows support, Better Multiprocessing, New Datasets
Windows support
- Add Windows support (#644):
- add tests and CI for Windows
- fix numerous windows specific issues
- The library now fully supports Windows
Dataset changes
- New: HotpotQA (#703)
- New: OpenWebText (#660)
- New: Winogrande - add debiased subset (#655)
- Update: XNLI - update download link (#695)
- Update: text - switch to pandas reader, better memory usage, fix delimiter issues (#689)
- Update: csv - add features parameter to CSV (#685)
- Fix: GAP - fix wrong computation of boolean features (#680)
- Fix: C4 - fix manual instruction function (#681)
Metric changes
- Update: ROUGE - Add rouge 2 and rouge Lsum to rouge metric outputs by default (#701, #702)
- Fix: SQuAD - fix kwargs description (#670)
Dataset Features
- Use multiprocess from pathos for multiprocessing (#656):
- allow lambda functions in multiprocessed map
- allow local functions in multiprocessed map
- and more! As long as functions are compatible with dill
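To see why a dill-based backend matters here, the sketch below shows the limitation of the standard library's pickle, which serializes functions by name and so cannot handle a lambda. It uses only the standard library; dill and pathos are not imported, and the helper name is made up for the example.

```python
import pickle

def can_pickle(obj):
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# A lambda, as one might pass to a multiprocessed map.
double = lambda x: 2 * x

print(can_pickle(len))     # built-in functions are pickled by reference
print(can_pickle(double))  # a lambda has no importable name to refer to
```

Worker processes receive the mapped function in serialized form, so a serializer that can capture the function body itself is what lets lambdas and local functions cross the process boundary.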
Bug fixes
- Datasets: fix possible program hanging with tokenizers - Disable tokenizers parallelism in multiprocessed map (#688)
- Datasets: fix cast with unordered features - fix column order issue in cast (#684)
- Datasets: fix first time creation of cache directory - move cache dir root creation in builder's init (#677)
- Datasets: fix OverflowError when using negative ids - fix negative ids in slicing with an array (#679)
- Datasets: fix empty dictionaries after multiprocessing - keep new columns in transmit format (#659)
- Datasets: fix type inference for nested types - handle data alteration when trying type (#653)
- Metrics: fix compute metric with empty input - pass metric features to the reader (#654)
Documentation
- Elasticsearch integration documentation (#696)
Tests
- Use GitHub instead of AWS in remote dataset tests (#694)
1.0.2
Dataset changes:
- New: CoNLL-2003 (#613)
- New: CoNLL-2000 (#634)
- New: MATINF (ACL 2020) (#637)
- New: Polyglot-NER (#641)
- Update: GLUE - update GLUE urls (now hosted on FB) (#626)
- Update: GLUE/qqp - update download checksum (#639)
- Update: MLQA - feature names update (#627)
- Update: LinCE - update feature names - Consistent ner features (#636)
- Update: WNUT 17: update feature names - Consistent ner features (#642)
- Update: XTREME/PAN-X - update feature names - Consistent ner features (#636)
- Update: RACE - update dataset checksum + add new configurations (#540)
- Fix: text - fix delimiter (#631)
- Fix: Wiki DPR - fix download error in wiki_dpr (f38a871)
Logging:
- Set level to warning (previously info) (#635)