Releases: huggingface/datasets

2.12.0

28 Apr 10:09
8e1af7b

Datasets Features

  • Add Dataset.from_spark by @maddiedawson in #5701

    • Get a Dataset from a Spark DataFrame (docs):
    >>> from datasets import Dataset
    >>> ds = Dataset.from_spark(df)
  • Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in #5689

    • Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...'}
  • Implement sharding on merged iterable datasets by @Hubert-Bonisseur in #5735

    • Use interleaved datasets in a distributed setup or with a DataLoader:
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
  • Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in #5751

    • Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
    • Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
    • Allow converting the variable-shaped ArrayND to Pandas

General improvements and bug fixes

New Contributors

Full Changelog: 2.11.0...2.12.0

2.11.0

29 Mar 18:23
3b16e08

Important

  • Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in #5573
    • this removes the need for PyTorch when decoding audio files
    • this is possible since soundfile 0.12, which bundles recent libsndfile binaries with MP3 support
  • Deprecated batch_size on Dataset.to_dict()

Datasets Features

  • Add writer_batch_size for ArrowBasedBuilder by @lhoestq in #5565
    • allows you to specify the row group / record batch size when you download_and_prepare() a dataset
  • Experimental support of cloud storage in load_dataset():
  • Support PyArrow arrays as column values in from_dict by @mariosasko in #5643
  • Allow direct cast from binary to Audio/Image by @mariosasko in #5644
  • Add column_names to IterableDataset by @patrickloeber in #5582
  • Pass the dataset features to the IterableDataset.from_generator function by @Hubert-Bonisseur in #5569
  • Add Dataset.to_list by @kyoto7250 in #5611

General improvements and bug fixes

New Contributors

Full Changelog: 2.10.0...2.11.0

2.10.1

28 Feb 18:08
2843fce

What's Changed

  • Fix sort with indices mapping by @mariosasko in #5587
    • Fix IndexError when doing ds.filter(...).sort(...) or ds.select(...).sort(...)

Full Changelog: 2.10.0...2.10.1

2.10.0

22 Feb 12:58
cac733f

Important

  • Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in #5542
    • Large speedups for .flatten_indices() (2x) and save_to_disk()/load_from_disk() (100x) on selected/shuffled datasets
  • Skip dataset verifications by default by @mariosasko in #5303
    • introduces multiple verification_mode values you can pass to load_dataset()
    • the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

  • Single TQDM bar in multi-proc map by @mariosasko in #5455
    • No more stacked TQDM bars when calling .map() in multiprocessing
  • Map-style Dataset to IterableDataset by @lhoestq in #5410
  • Select columns of Dataset or DatasetDict by @daskol in #5480
    • introduces .select_columns() to return a dataset only containing the requested columns
  • Added functionality: sort datasets by multiple keys by @MichlF in #5502
    • introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
  • Add JAX device selection when formatting by @alvarobartt in #5547
    • introduces ds = ds.with_format("jax", device=device)
  • Reload features from Parquet metadata by @MFreidank in #5516
  • Speed up batched PyTorch DataLoader by @lhoestq in #5512

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.9.0...2.10.0

2.9.0

26 Jan 19:33
b5672a9

Datasets Features

  • Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377

    • Pass num_workers= to .to_tf_dataset() to speed up data loading with multiprocessing
  • Distributed support by @lhoestq in #5369

    • Split your dataset across nodes for distributed training
    • It supports both Dataset and IterableDataset (e.g. in streaming mode)
    • See the documentation for more details
    import os
    from datasets.distributed import split_dataset_by_node
    
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
  • Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400

  • Tqdm progress bar for to_parquet by @zanussbaum in #5456

  • ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379

  • Support other formats than uint8 for image arrays by @vigsterkr in #5365

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: 2.8.0...2.9.0

2.8.0

19 Dec 10:55
037c9b5

Important

  • Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
    • From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
    • The new model uses strings instead of integers for the ids in label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
    • Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.

Datasets Features

  • Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
    • Datasets in streaming mode now update their features after column renaming or removal
  • Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
    • Use multiprocessing to load multiple files in parallel
  • Add features param to IterableDataset.map by @alvarobartt in #5311
  • Sharded save_to_disk + multiprocessing by @lhoestq in #5268
    • Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
    • Pass num_proc to use multiprocessing.
  • Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
  • Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
    • You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    ds = load_dataset("c4", "en", streaming=True, split="train")
    dataloader = DataLoader(ds, batch_size=32, num_workers=4)

Docs

General improvements and bug fixes

New Contributors

Full Changelog: 2.7.0...2.8.0

2.7.1

22 Nov 17:27
5ef1ab1

Bug fixes

Full Changelog: 2.7.0...2.7.1

2.6.2

22 Nov 17:49
a6a5a1c

Bug fixes

Full Changelog: 2.6.1...2.6.2

2.7.0

16 Nov 10:11
edf1902

Dataset Features

  • Multiprocessed dataset builder by @TevenLeScao in #5107
    • Load big datasets faster than before using multiprocessing:
    from datasets import load_dataset
    ds = load_dataset("imagenet-1k", num_proc=4)
  • Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
    • Functions passed to map or filter that use tensors or pipelines can now be cached
  • Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
  • TextConfig: added "errors" by @NightMachinery in #5155

Audio setup

Docs

General improvements and bug fixes

New Contributors

Full Changelog: 2.6.1...2.7.0

2.6.1

14 Oct 15:45

Bug fixes

  • Fix filter indices when batched by @albertvillanova in #5113
    • fixed a bug where filter could return examples with the wrong indices
  • Fix iter_batches by @lhoestq in #5115
    • fixed a bug where map with batched=True could return a dataset with fewer examples
  • Fix a typo in arrow_dataset.py by @yangky11 in #5108

New Contributors

Full Changelog: 2.6.0...2.6.1