Releases · huggingface/datasets

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in #5902
Fix Quickstart notebook link by @mariosasko in #6070
Remove README link to deprecated Colab notebook by @mariosasko in #6080
Misc doc improvements by @mariosasko in #6074

Full Changelog: 2.14.0...2.14.1

Contributors

alvarobartt, lhoestq, and mariosasko

24 Jul 15:54

lhoestq

2.14.0

88896a7

2.14.0

Important: caching

Datasets downloaded and cached with datasets>=2.14.0 may not be reloadable from the cache by older versions of datasets, which will re-download them instead.
Datasets that were already cached with an older version are still supported.
This affects datasets on Hugging Face without dataset scripts, e.g. datasets made of pure parquet, csv, jsonl, etc. files.
The reason is that the default configuration name for those datasets was fixed (from "username--dataset_name" to "default") in #5331.
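
To make the rename concrete, here is a small sketch of how the two default configuration names relate. Both helper functions are hypothetical and are not part of the datasets API; only the two name formats come from the note above.

```python
def legacy_default_config_name(repo_id: str) -> str:
    """Default config name before 2.14.0 for script-less Hub datasets.

    Illustrative helper only, not a datasets function.
    """
    return repo_id.replace("/", "--")


def current_default_config_name(repo_id: str) -> str:
    """Default config name from 2.14.0 onward (illustrative helper only)."""
    return "default"


repo_id = "username/dataset_name"
print(legacy_default_config_name(repo_id))   # username--dataset_name
print(current_default_config_name(repo_id))  # default
```

Because the configuration name is part of the cache location, a cache written under the new name is not found by a version that looks for the old one.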

Dataset Configuration

Support for multiple configs via metadata yaml info by @polinaeterna in #5331

Configure your dataset using YAML at the top of your dataset card
Choose which file goes into which split

  ---
  configs:
  - config_name: default
    data_files:
    - split: train
      path: data.csv
    - split: test
      path: holdout.csv
  ---

Define multiple dataset configurations

  ---
  configs:
  - config_name: main_data
    data_files: main_data.csv
  - config_name: additional_data
    data_files: additional_data.csv
  ---

Dataset Features

Support for multiple configs via metadata yaml info by @polinaeterna in #5331

push_to_hub() additional dataset configurations

ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")

Support returning dataframe in map transform by @mariosasko in #5995

What's Changed

Deprecate errors param in favor of encoding_errors in text builder by @mariosasko in #5974
Fix select_columns columns order by @lhoestq in #5994
Replace metadata utils with huggingface_hub's RepoCard API by @mariosasko in #5949
Pin joblib to avoid joblibspark test failures by @mariosasko in #6000
Align column_names type check with type hint in sort by @mariosasko in #6001
Deprecate use_auth_token in favor of token by @mariosasko in #5996
Drop Python 3.7 support by @mariosasko in #6005
Misc improvements by @mariosasko in #6004
Make IterableDataset.from_spark more efficient by @mathewjacob1002 in #5986
Fix cast for dictionaries with no keys by @mariosasko in #6009
Avoid stuck map operation when subprocesses crashes by @pappacena in #5976
Deprecate task api by @mariosasko in #5865
Add metadata ui screenshot in docs by @lhoestq in #6015
Fix ClassLabel min max check for None values by @mariosasko in #6023
[docs] Update return statement of index search by @stevhliu in #6021
Improve logging by @mariosasko in #6019
Fix style with ruff 0.0.278 by @lhoestq in #6026
Don't reference self in Spark._validate_cache_dir by @maddiedawson in #6024
Delete task_templates in IterableDataset when they are no longer valid by @mariosasko in #6027
[docs] Fix link by @stevhliu in #6029
fixed typo in comment by @NightMachinery in #6030
Fix legacy_dataset_infos by @lhoestq in #6040
Flatten repository_structure docs on yaml by @lhoestq in #6041
Use new hffs by @lhoestq in #6028
Bump dev version by @lhoestq in #6047
Fix unused DatasetInfosDict code in push_to_hub by @lhoestq in #6042
Rename "pattern" to "path" in YAML data_files configs by @lhoestq in #6044
Remove HfFileSystem and deprecate S3FileSystem by @mariosasko in #6052
Dill 3.7 support by @mariosasko in #6061
Improve Dataset.from_list docstring by @mariosasko in #6062
Check if column names match in Parquet loader only when config features are specified by @mariosasko in #6045
Release: 2.14.0 by @lhoestq in #6063

New Contributors

@mathewjacob1002 made their first contribution in #5986
@pappacena made their first contribution in #5976

Full Changelog: 2.13.1...2.14.0

Contributors

pappacena, polinaeterna, and 6 other contributors

22 Jun 18:31

lhoestq

2.13.1

682d21e

2.13.1

General improvements and bug fixes

Fix JSON generation in benchmarks CI by @mariosasko in #5966
Always return list in list_datasets by @mariosasko in #5964
Add encoding and errors params to JSON loader by @mariosasko in #5969
Filter unsupported extensions by @lhoestq in #5972

Full Changelog: 2.13.0...2.13.1

Contributors

lhoestq and mariosasko

14 Jun 16:25

lhoestq

2.13.0

9aaee6f

2.13.0

Dataset Features

Add IterableDataset.from_spark by @maddiedawson in #5770

Stream the data from your Spark DataFrame directly to your training pipeline

from datasets import IterableDataset
from torch.utils.data import DataLoader

ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...

IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:

IterableDataset Arrow formatting by @lhoestq in #5821
Iterable torch formatting by @lhoestq in #5852

from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.

Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in #5893

from datasets import IterableDataset

ids = IterableDataset.from_file("path/to/data.arrow")

Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in #5944

from datasets import load_dataset

ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})

Experimental

Add parallel module using joblib for Spark by @es94129 in #5924

General improvements and bug fixes

Preserve stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in #5816
Fix incomplete docstring for BuilderConfig by @Laurent2916 in #5824
[docs] Custom decoding transforms by @stevhliu in #5836
Add accelerate as metric's test dependency to fix CI error by @mariosasko in #5848
Add date_format param to the CSV reader by @mariosasko in #5845
[docs] Redirects, migrated from nginx by @julien-c in #5853
Fix infer module for uppercase extensions by @albertvillanova in #5872
Minor tqdm optim by @lhoestq in #5860
Always set nullable fields in the writer by @lhoestq in #5835
Add fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in #5810
Better error message when combining dataset dicts instead of datasets by @lhoestq in #5861
Force overwrite existing filesystem protocol by @baskrahmer in #5894
Support working_dir in from_spark by @maddiedawson in #5826
Raise TypeError when indexing a dataset with bool by @albertvillanova in #5859
Fix minor typo in docs loading.mdx by @albertvillanova in #5900
Fix FixedSizeListArray casting by @mariosasko in #5897
Unpin responses by @mariosasko in #5916
Validate name parameter in make_file_instructions by @albertvillanova in #5904
Raise error in DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in #5915
Refactor extensions by @albertvillanova in #5917
Use more efficient and idiomatic way to construct list. by @ttsugriy in #5909
Add flatten_indices to DatasetDict by @maximxlss in #5907
Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in #5920
Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in #5901
Fix streaming parquet with image feature in schema by @lhoestq in #5921
canonicalize data dir in config ID hash by @kylrth in #5899
Fix link to quickstart docs in README.md by @mariosasko in #5928
Fix string-encoding, make batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in #5883
Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in #5863
[doc build] Use secrets by @mishig25 in #5932
Fix to_numpy when None values in the sequence by @qgallouedec in #5933
Better row group size in push_to_hub by @lhoestq in #5935
Avoid parallel redownload in cache by @albertvillanova in #5937
Better filenotfound for gated by @lhoestq in #5954
Make get_from_cache use custom temp filename that is locked by @albertvillanova in #5938
Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in #5956
Add Arrow builder docs by @lhoestq in #5952
Fix sequence of array support for most dtype by @qgallouedec in #5948

New Contributors

@Laurent2916 made their first contribution in #5824
@yuukicammy made their first contribution in #5810
@baskrahmer made their first contribution in #5894
@ttsugriy made their first contribution in #5909
@maximxlss made their first contribution in #5907
@mariusz-jachimowicz-83 made their first contribution in #5893
@kylrth made their first contribution in #5899
@qgallouedec made their first contribution in #5933
@es94129 made their first contribution in #5924

Full Changelog: 2.12.0...2.13.0

Contributors

ttsugriy, julien-c, and 16 other contributors

28 Apr 10:09

lhoestq

2.12.0

8e1af7b

2.12.0

Datasets Features

Add Dataset.from_spark by @maddiedawson in #5701
- Get a Dataset from a Spark DataFrame (docs):
```
>>> from datasets import Dataset
>>> ds = Dataset.from_spark(df)
```

Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689

Stream data from Wikipedia:

>>> from datasets import load_dataset
>>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
>>> next(iter(ds["train"]))
{'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}

Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735

Use interleaved datasets in a distributed setup or with a DataLoader

>>> from datasets import load_dataset, interleave_datasets
>>> from torch.utils.data import DataLoader
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
>>> c4 = load_dataset("c4", "en", split="train", streaming=True)
>>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
>>> dataloader = DataLoader(merged, num_workers=4)

Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751
- Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
- Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
- Allow converting the variable-shaped ArrayND to Pandas

General improvements and bug fixes

Fix a description error for interleave_datasets. by @QizhiPei in #5680
[docs] Split pattern search order by @stevhliu in #5693
Raise an error on missing distributed seed by @lhoestq in #5697
Fix xnumpy_load for .npz files by @albertvillanova in #5714
Temporarily pin fsspec by @albertvillanova in #5731
Unpin fsspec by @albertvillanova in #5733
Fix CI warnings by @albertvillanova in #5741
Fix CI mock filesystem fixtures by @albertvillanova in #5740
Fix link in docs by @bbbxyz in #5746
fix typo: "mow" -> "now" by @csris in #5763
[docs] Compress data files by @stevhliu in #5691
Fix style by @lhoestq in #5774
Minor tqdm fixes by @mariosasko in #5754
Fixes #5757 by @eli-osherovich in #5758
Fix JSON builder when missing keys in first row by @albertvillanova in #5772
Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in #5742
Prepare tests for hfh 0.14 by @Wauplin in #5788
Call fs.makedirs in save_to_disk by @lhoestq in #5779
Allow to run CI on push to ci-branch by @albertvillanova in #5790
Fix nondeterministic sharded data split order by @albertvillanova in #5729
Raise subprocesses traceback when interrupting by @lhoestq in #5784
Fix spark imports by @lhoestq in #5795
Change downloaded file permission based on umask by @albertvillanova in #5800
Fix inferring module for unsupported data files by @albertvillanova in #5787
Reorder default data splits to have validation before test by @albertvillanova in #5718
Validate non-empty data_files by @albertvillanova in #5802
Spark docs by @lhoestq in #5796
Release: 2.12.0 by @lhoestq in #5803

New Contributors

@QizhiPei made their first contribution in #5680
@bbbxyz made their first contribution in #5746
@csris made their first contribution in #5763
@eli-osherovich made their first contribution in #5758
@maddiedawson made their first contribution in #5701

Full Changelog: 2.11.0...2.12.0

Contributors

csris, eli-osherovich, and 10 other contributors

29 Mar 18:23

lhoestq

2.11.0

3b16e08

2.11.0

Important

Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in #5573
- this removes the need for a PyTorch dependency to decode audio files
- this is possible thanks to soundfile 0.12, which bundles a recent version of the libsndfile binaries with MP3 support
Deprecated batch_size on Dataset.to_dict()

Datasets Features

Add writer_batch_size for ArrowBasedBuilder by @lhoestq in #5565
- allows specifying the row group / record batch size when you download_and_prepare() a dataset
Experimental support of cloud storage in load_dataset():
- Support cloud storage in load_dataset via fsspec by @dwyatte in #5580
- Pass down storage options by @dwyatte in #5673
Support PyArrow arrays as column values in from_dict by @mariosasko in #5643
Allow direct cast from binary to Audio/Image by @mariosasko in #5644
Add column_names to IterableDataset by @patrickloeber in #5582
pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in #5569
add Dataset.to_list by @kyoto7250 in #5611

General improvements and bug fixes

Update csv.py by @xdoubleu in #5562
Remove instructions for ffmpeg system package installation on Colab by @polinaeterna in #5558
Apply ruff flake8-comprehension checks by @Skylion007 in #5549
Fix datasets.load_from_disk, DatasetDict.load_from_disk and Dataset.load_from_disk by @alvarobartt in #5529
Add pre-commit config yaml file to enable automatic code formatting by @polinaeterna in #5561
Add huggingface_hub version to env cli command by @mariosasko in #5578
Do not write index by default when exporting a dataset by @mariosasko in #5583
Flatten dataset on the fly in save_to_disk by @mariosasko in #5588
Fix sort with indices mapping by @mariosasko in #5587
Fix docstring example by @stevhliu in #5592
Fix push_to_hub with no dataset_infos by @lhoestq in #5598
Don't compute checksums if not necessary in datasets-cli test by @lhoestq in #5603
Update README logo by @gary149 in #5605
Fix CI by temporarily pinning fsspec < 2023.3.0 by @albertvillanova in #5617
Fix archive fs test by @lhoestq in #5614
unpin fsspec by @lhoestq in #5619
Bump pyarrow to 8.0.0 by @lhoestq in #5620
Remove set_access_token usage + fail tests if FutureWarning by @Wauplin in #5623
Fix outdated verification_mode values by @polinaeterna in #5607
Adding Oracle Cloud to docs by @ahosler in #5621
Fix CI: ignore C901 ("some_func" is too complex) in ruff by @polinaeterna in #5636
add kwargs to index search by @SaulLu in #5628
Less zip false positives by @lhoestq in #5640
Allow self as key in Features by @mariosasko in #5646
Bump hfh to 0.11.0 by @lhoestq in #5642
Support streaming datasets with numpy.load by @albertvillanova in #5626
Fix unnecessary dict comprehension by @albertvillanova in #5662
Fix CI by temporarily pinning tensorflow < 2.12.0 by @albertvillanova in #5664
Copy features by @lhoestq in #5652
Improve features decoding in to_iterable_dataset by @lhoestq in #5655
Fix fsspec.open when using an HTTP proxy by @bryant1410 in #5656
Jax requires jaxlib by @lhoestq in #5667
docs: Update num_shards docs to mention num_proc on Dataset and DatasetDict by @connor-henderson in #5658
Allow loading/saving of FAISS index using fsspec by @Dref360 in #5526
Fix verification_mode when ignore_verifications is passed by @albertvillanova in #5683
Release: 2.11.0 by @lhoestq in #5684

New Contributors

@xdoubleu made their first contribution in #5562
@Skylion007 made their first contribution in #5549
@Hubert-Bonisseur made their first contribution in #5569
@ahosler made their first contribution in #5621
@patrickloeber made their first contribution in #5582
@SaulLu made their first contribution in #5628
@connor-henderson made their first contribution in #5658
@kyoto7250 made their first contribution in #5611

Full Changelog: 2.10.0...2.11.0

Contributors

Skylion007, dwyatte, and 17 other contributors

28 Feb 18:08

lhoestq

2.10.1

2843fce

2.10.1

What's Changed

Fix sort with indices mapping by @mariosasko in #5587
- Fix IndexError when doing ds.filter(...).sort(...) or ds.select(...).sort(...)

Full Changelog: 2.10.0...2.10.1

Contributors

mariosasko

22 Feb 12:58

lhoestq

2.10.0

cac733f

2.10.0

Important

Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
- Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
Skip dataset verifications by default by @mariosasko in #5303
- introduces multiple verification_mode values you can pass to load_dataset()
- the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

Single TQDM bar in multi-proc map by @mariosasko in #5455
- No more stacked TQDM bars when calling .map() in multiprocessing
Map-style Dataset to IterableDataset by @lhoestq in #5410
- introduces .to_iterable_dataset() to get an IterableDataset from a Dataset
- see all the advantages of IterableDataset in the documentation about the differences between Dataset and IterableDataset
Select columns of Dataset or DatasetDict by @daskol in #5480
- introduces .select_columns() to return a dataset only containing the requested columns
Added functionality: sort datasets by multiple keys by @MichlF in #5502
- introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
Add JAX device selection when formatting by @alvarobartt in #5547
- introduces ds = ds.with_format("jax", device=device)
Reload features from Parquet metadata by @MFreidank in #5516
Speed up batched PyTorch DataLoader by @lhoestq in #5512

Documentation

Add section in tutorial for IterableDataset by @stevhliu in #5485
- https://huggingface.co/docs/datasets/main/en/access#iterabledataset
Tutorial for creating a dataset by @stevhliu in #5540
- https://huggingface.co/docs/datasets/main/en/create_dataset
Add JAX-formatting documentation by @alvarobartt in #5535
- https://huggingface.co/docs/datasets/main/en/use_with_jax

General improvements and bug fixes

Pin sqlalchemy by @lhoestq in #5476
Update dataset card creation by @stevhliu in #5470
Add num_test_batches option by @amyeroberts in #5471
Tip for recomputing metadata by @stevhliu in #5478
Disable aiohttp requoting of redirection URL by @albertvillanova in #5459
[MINOR] Typo by @cakiki in #5491
Pin dill lower version by @albertvillanova in #5489
Improved error message for gated/private repos by @osanseviero in #5497
Update docs for nyu_depth_v2 dataset by @awsaf49 in #5484
don't zero copy timestamps by @dwyatte in #5504
Remove unused load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in #5493
Do not add index column by default when exporting to CSV by @albertvillanova in #5490
Fix bug when casting empty array to class labels by @marioga in #5521
Fix benchmarks CI - pin protobuf by @lhoestq in #5527
Remove py.typed by @mariosasko in #5518
Add missing license in NumpyFormatter by @alvarobartt in #5530
Unify load_from_cache_file type and logic by @HallerPatrick in #5515
Format code with ruff by @mariosasko in #5519
Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in #5522
Resolve four broken refs in the docs by @tomaarsen in #5550
Use default audio resampling type by @lhoestq in #5556
- resampy is no longer needed to resample audio data
improved message error row formatting by @Plutone11011 in #5553
Make tiktoken tokenizers hashable by @mariosasko in #5552
Suggest scikit-learn instead of sklearn by @osbm in #5551
Add filter desc by @lhoestq in #5557
Fix map suffix_template by @lhoestq in #5559
Ensure last tqdm update in map by @mariosasko in #5560

New Contributors

@amyeroberts made their first contribution in #5471
@awsaf49 made their first contribution in #5484
@dwyatte made their first contribution in #5504
@marioga made their first contribution in #5521
@MFreidank made their first contribution in #5516
@daskol made their first contribution in #5480
@Plutone11011 made their first contribution in #5553
@osbm made their first contribution in #5551
@MichlF made their first contribution in #5502

Full Changelog: 2.9.0...2.10.0

Contributors

dwyatte, cakiki, and 17 other contributors
