Support cloud storage in load_dataset via fsspec #5580

dwyatte · 2023-02-27T04:06:05Z

This PR uses fsspec to support datasets on cloud storage (tested manually with GCS). ETags are currently unsupported for cloud storage. In general, a much larger refactor could be done to just use fsspec for all schemes (ftp, http/s, s3, gcs) to unify the interfaces here, but I ultimately opted to leave that out of this PR.

I didn't create a GCS filesystem class in datasets.filesystems since the S3 one appears to be a wrapper around s3fs.S3FileSystem and mainly used to generate docs.

HuggingFaceDocBuilderDev · 2023-02-27T04:10:25Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Nice thank you !

I added a few comments.

Regarding the tests I think it should be possible to use the mockfs fixture, it allows to play with a dummy fsspec FileSystem with the "mock://" protocol.

However it requires some storage_options to be passed. Maybe it can be added to DownloadConfig which is passed to cached_path, so that fsspec_get and fsspec_head can use the user's storage_options ?

lhoestq · 2023-02-28T10:53:31Z

src/datasets/utils/file_utils.py

+def fsspec_get(url, temp_file, timeout=10.0):
+    _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
+    try:
+        fsspec.filesystem(urlparse(url).scheme).get(url, temp_file, timeout=timeout)


Would be cool to have a tqdm bar as in http_get

Also you may need to use fsspec.get_fs_token_paths first to instantiate the filesystem in case some filesystem kwargs can be parsed from the URL (it concerns all the filesystems that implement _get_kwargs_from_urls including gcsfs)
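
A minimal sketch of the suggestion above, assuming fsspec is installed: the filesystem is built with fsspec.get_fs_token_paths so that options encoded in the URL and any user-supplied storage_options are honored, rather than instantiating from the scheme alone. The helper name fsspec_get_sketch and the memory:// path are illustrative only and are not the code merged in this PR; the progress bar is left out.

```python
import os
import tempfile

import fsspec


def fsspec_get_sketch(url, local_path, storage_options=None):
    """Download `url` to `local_path` (hypothetical helper, not the PR's code)."""
    # get_fs_token_paths returns (filesystem, token, [paths]); the filesystem is
    # built from the URL plus any storage_options, not from the scheme alone.
    fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options)
    fs.get_file(paths[0], local_path)
    return local_path


# Usage with fsspec's built-in in-memory filesystem, so no cloud account is needed.
fsspec.filesystem("memory").pipe_file("/sketch/data.txt", b"hello")
target = os.path.join(tempfile.mkdtemp(), "data.txt")
fsspec_get_sketch("memory://sketch/data.txt", target)
```

For a real bucket the same call would take a gs:// or s3:// URL, with credentials passed through storage_options if they are not picked up from the environment.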

lhoestq · 2023-02-28T10:54:16Z

src/datasets/utils/file_utils.py

+def fsspec_head(url, timeout=10.0):
+    _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
+    try:
+        fsspec.filesystem(urlparse(url).scheme).info(url, timeout=timeout)


What about using a hash of the file info as a pseudo ETag ? We can use it as a normal ETag to invalidate the cache if the remote file changed

same here, you may need to use fsspec.get_fs_token_paths

What about using a hash of the file info as a pseudo ETag ? We can use it as a normal ETag to invalidate the cache if the remote file changed

Interesting idea. This actually returns quite a bit of info including an ETag on GCS, so as long as it's deterministic, I think we could. In the worst case if a response has a uuid or similar, we always invalidate the cache, but maybe that's the safer thing to do
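
To make the pseudo-ETag idea concrete, here is a small sketch that hashes a file-info mapping (the kind of dict a filesystem's info() call returns) into a stable string. The function name pseudo_etag and the choice of SHA-256 over a sorted JSON dump are assumptions for illustration, not necessarily what the PR ended up doing.

```python
import hashlib
import json
from typing import Any, Mapping


def pseudo_etag(info: Mapping[str, Any]) -> str:
    """Hash a file-info mapping into a hex digest (hypothetical helper)."""
    # sort_keys makes the result independent of key order; default=str covers
    # values such as datetimes that json cannot encode natively.
    payload = json.dumps(info, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative info dicts; the field names are made up for the example.
info_a = {"name": "bucket/file.pq", "size": 10, "etag": "abc"}
info_b = {"etag": "abc", "size": 10, "name": "bucket/file.pq"}
info_changed = {**info_a, "size": 11}
```

As noted above, if the info contains a value that changes on every request (a uuid or a timestamp of the request itself), the digest changes every time and the cache is always invalidated.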

dwyatte · 2023-03-02T04:08:35Z

Regarding the tests I think it should be possible to use the mockfs fixture, it allows to play with a dummy fsspec FileSystem with the "mock://" protocol.

However it requires some storage_options to be passed. Maybe it can be added to DownloadConfig which is passed to cached_path, so that fsspec_get and fsspec_head can use the user's storage_options ?

@lhoestq I went ahead and tested this with a patch so that I could assign the mockfs as a return value. Let me know if I'm missing something though and we need to pass storage_options down

lhoestq

Instead of patching, I think it would be better to have a new filesystem TmpDirFileSystem (tmpfs) that doesn't need storage_options for the tests, and that is based on a temporary directory created just for the fixture. Maybe something like this ?

class TmpDirFileSystem(MockFileSystem):
    protocol = "tmp"
    tmp_dir = None

    def __init__(self):
        assert self.tmp_dir is not None, "TmpDirFileSystem.tmp_dir is not set"
        super().__init__(local_root_dir=self.tmp_dir, auto_mkdir=True)


@pytest.fixture
def mock_fsspec():
    original_registry = fsspec.registry.copy()
    fsspec.register_implementation("mock", MockFileSystem)
    fsspec.register_implementation("tmp", TmpDirFileSystem)
    yield
    fsspec.registry = original_registry


@pytest.fixture
def tmpfs(tmp_path_factory, mock_fsspec):
    tmp_fs_dir = tmp_path_factory.mktemp("tmpfs")
    with patch.object(TmpDirFileSystem, "tmp_dir", tmp_fs_dir):
        yield TmpDirFileSystem()

lhoestq · 2023-03-02T13:45:50Z

tests/test_file_utils.py

+def mockfs_file(mockfs):
+    with open(os.path.join(mockfs.local_root_dir, FILE_PATH), "w") as f:
+        f.write(FILE_CONTENT)
+    return mockfs


Since the fixture is named mockfs_file I'd expect it to return the file path inside the mock filesystem ?

Suggested change

-    return mockfs
+    return FILE_PATH

In this case, we need to return the fs itself (which has been seeded with the file) to patch the fs in file_utils.fsspec_get so we can test get_from_cache

Maybe mockfs_with_file is a better fixture name, but let me also explore the tmpfs solution above too

The tmpfs solution feels pretty clean, thanks for the recommendation!

lhoestq · 2023-03-02T13:52:32Z

tests/fixtures/fsspec.py

+    def get_file(self, rpath, lpath, *args, **kwargs):
+        rpath = posixpath.join(self.local_root_dir, self._strip_protocol(rpath))
+        return self._fs.get_file(rpath, lpath, *args, **kwargs)


Why is it needed ? IIRC it's already implemented as part of the AbstractFileSystem and uses self.open() under the hood

Ah, you're right. Will remove in next commit

dwyatte · 2023-03-05T17:16:15Z

Instead of patching, I think it would be better to have a new filesystem TmpDirFileSystem (tmpfs) that doesn't need storage_options for the tests, and that is based on a temporary directory created just for the fixture. Maybe something like this ?

Thanks for the recommendation, this works great.

lhoestq · 2023-03-10T15:34:37Z

Feel free to merge main into your PR to fix the CI :)

lhoestq

Awesome thanks ! I added a few suggestions and we can merge

dwyatte · 2023-03-10T21:39:34Z

Should be good to go. Thanks!

lhoestq

thanks !

github-actions · 2023-03-11T01:02:49Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006183 / 0.011353 (-0.005170)	0.004180 / 0.011008 (-0.006829)	0.095965 / 0.038508 (0.057457)	0.026754 / 0.023109 (0.003645)	0.339724 / 0.275898 (0.063826)	0.381628 / 0.323480 (0.058149)	0.004615 / 0.007986 (-0.003371)	0.004469 / 0.004328 (0.000140)	0.074035 / 0.004250 (0.069784)	0.035089 / 0.037052 (-0.001963)	0.352253 / 0.258489 (0.093764)	0.389598 / 0.293841 (0.095757)	0.032262 / 0.128546 (-0.096285)	0.011392 / 0.075646 (-0.064254)	0.323884 / 0.419271 (-0.095388)	0.042658 / 0.043533 (-0.000874)	0.331533 / 0.255139 (0.076394)	0.364723 / 0.283200 (0.081523)	0.086349 / 0.141683 (-0.055334)	1.465687 / 1.452155 (0.013533)	1.559782 / 1.492716 (0.067066)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.198562 / 0.018006 (0.180556)	0.457170 / 0.000490 (0.456680)	0.000409 / 0.000200 (0.000209)	0.000061 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022439 / 0.037411 (-0.014973)	0.096551 / 0.014526 (0.082025)	0.102230 / 0.176557 (-0.074326)	0.160878 / 0.737135 (-0.576257)	0.109348 / 0.296338 (-0.186990)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.456635 / 0.215209 (0.241426)	4.563571 / 2.077655 (2.485916)	2.313048 / 1.504120 (0.808928)	2.117433 / 1.541195 (0.576239)	2.127478 / 1.468490 (0.658988)	0.699478 / 4.584777 (-3.885299)	3.358955 / 3.745712 (-0.386757)	1.821437 / 5.269862 (-3.448424)	1.158239 / 4.565676 (-3.407438)	0.083207 / 0.424275 (-0.341068)	0.012925 / 0.007607 (0.005318)	0.556526 / 0.226044 (0.330482)	5.552364 / 2.268929 (3.283435)	2.744696 / 55.444624 (-52.699928)	2.374455 / 6.876477 (-4.502022)	2.442021 / 2.142072 (0.299949)	0.809393 / 4.805227 (-3.995834)	0.152305 / 6.500664 (-6.348359)	0.066164 / 0.075469 (-0.009305)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.258268 / 1.841788 (-0.583520)	13.402391 / 8.074308 (5.328083)	13.816927 / 10.191392 (3.625535)	0.148466 / 0.680424 (-0.531958)	0.016487 / 0.534201 (-0.517714)	0.385888 / 0.579283 (-0.193395)	0.378840 / 0.434364 (-0.055524)	0.444527 / 0.540337 (-0.095810)	0.531011 / 1.386936 (-0.855925)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006230 / 0.011353 (-0.005123)	0.004488 / 0.011008 (-0.006520)	0.077539 / 0.038508 (0.039031)	0.026611 / 0.023109 (0.003502)	0.342093 / 0.275898 (0.066195)	0.371555 / 0.323480 (0.048075)	0.004665 / 0.007986 (-0.003321)	0.003289 / 0.004328 (-0.001039)	0.078378 / 0.004250 (0.074128)	0.035223 / 0.037052 (-0.001829)	0.339972 / 0.258489 (0.081483)	0.378755 / 0.293841 (0.084914)	0.031331 / 0.128546 (-0.097215)	0.011406 / 0.075646 (-0.064241)	0.086891 / 0.419271 (-0.332381)	0.047713 / 0.043533 (0.004180)	0.342678 / 0.255139 (0.087539)	0.364536 / 0.283200 (0.081337)	0.092132 / 0.141683 (-0.049551)	1.537050 / 1.452155 (0.084895)	1.639927 / 1.492716 (0.147211)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.219933 / 0.018006 (0.201927)	0.391627 / 0.000490 (0.391137)	0.002238 / 0.000200 (0.002038)	0.000072 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024890 / 0.037411 (-0.012521)	0.098989 / 0.014526 (0.084464)	0.104505 / 0.176557 (-0.072052)	0.156252 / 0.737135 (-0.580884)	0.108027 / 0.296338 (-0.188312)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.443957 / 0.215209 (0.228748)	4.450850 / 2.077655 (2.373196)	2.076043 / 1.504120 (0.571923)	1.866396 / 1.541195 (0.325202)	1.902692 / 1.468490 (0.434202)	0.703160 / 4.584777 (-3.881617)	3.373761 / 3.745712 (-0.371951)	2.615649 / 5.269862 (-2.654213)	1.340612 / 4.565676 (-3.225065)	0.083836 / 0.424275 (-0.340439)	0.012619 / 0.007607 (0.005012)	0.553410 / 0.226044 (0.327365)	5.526500 / 2.268929 (3.257571)	2.513213 / 55.444624 (-52.931411)	2.152701 / 6.876477 (-4.723776)	2.165092 / 2.142072 (0.023019)	0.818381 / 4.805227 (-3.986846)	0.152118 / 6.500664 (-6.348546)	0.066950 / 0.075469 (-0.008519)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.291468 / 1.841788 (-0.550320)	13.694828 / 8.074308 (5.620520)	13.821019 / 10.191392 (3.629627)	0.126077 / 0.680424 (-0.554347)	0.016543 / 0.534201 (-0.517658)	0.381399 / 0.579283 (-0.197884)	0.377326 / 0.434364 (-0.057038)	0.439275 / 0.540337 (-0.101063)	0.524021 / 1.386936 (-0.862915)

Xe · 2024-11-14T17:02:46Z

@dwyatte

(tested manually with GCS)

Can you please paste the code you used to test this with? It's not clear how one would go about actually using this to access datasets in Google Cloud Storage or S3.

dwyatte · 2024-11-27T01:25:38Z

@Xe

With GCS, this can be completely seamless if you have an activated set of credentials with access to the file (which is how I use the functionality). You should be able to pass storage_options with credentials too

dwyatte-mac:tmp dwyatte$ gcloud auth list
     Credentialed Accounts
ACTIVE  ACCOUNT
*       [YOUR_EMAIL]

In [1]: from datasets import load_dataset
   ...: 
   ...: dataset = load_dataset("parquet", data_files="gs://path/to/file.pq")

Downloading data: 100%|████████████████████████████████████████| 26.5M/26.5M [00:01<00:00, 20.7MB/s]
Generating train split: 2000 examples [00:00, 26434.38 examples/s]
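
For the storage_options route mentioned above, a hedged sketch follows. It assumes a datasets version whose load_dataset accepts a storage_options argument and that gcsfs is installed; the helper names and the key-file path are hypothetical, and the "token" key is the option gcsfs uses for credentials (other backends use different keys).

```python
from typing import Any, Dict, Optional


def cloud_load_kwargs(data_file: str, token: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for load_dataset (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"path": "parquet", "data_files": data_file}
    if token is not None:
        # Assumption: the backend is gcsfs, which takes credentials via `token`
        # (for example, a path to a service-account JSON key file).
        kwargs["storage_options"] = {"token": token}
    return kwargs


def load_from_cloud(data_file: str, token: Optional[str] = None):
    """Call load_dataset with the kwargs above; needs datasets and network access."""
    from datasets import load_dataset  # imported lazily so the sketch runs without it

    return load_dataset(**cloud_load_kwargs(data_file, token))


# Example arguments; the key-file path is a placeholder.
with_default_credentials = cloud_load_kwargs("gs://path/to/file.pq")
with_explicit_token = cloud_load_kwargs("gs://path/to/file.pq", token="/path/to/key.json")
```

With active gcloud credentials, the first form matches the session shown above; the second only adds the storage_options entry.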

dwyatte mentioned this pull request Feb 27, 2023

Support cloud storage in load_dataset #5281

Open

dwyatte force-pushed the fsspec branch 3 times, most recently from b7f309e to 2fceb77 on February 27, 2023 14:38

lhoestq reviewed Feb 28, 2023

dwyatte force-pushed the fsspec branch from b0d527a to 17d88d1 on February 28, 2023 15:50

alvarobartt reviewed Feb 28, 2023

dwyatte force-pushed the fsspec branch 2 times, most recently from 0d041d4 to 50f2f64 on March 2, 2023 04:01

dwyatte requested a review from lhoestq March 2, 2023 04:11

lhoestq reviewed Mar 2, 2023

dwyatte force-pushed the fsspec branch from 141b5db to 92d81fa on March 5, 2023 17:17

dwyatte requested a review from lhoestq March 5, 2023 17:18

dwyatte force-pushed the fsspec branch from 92d81fa to f618135 on March 5, 2023 17:20

lhoestq reviewed Mar 10, 2023

dwyatte and others added 9 commits March 10, 2023 14:18

support cloud storage in load_dataset via fsspec

118642d

fsspec get uses tqdm, tries to handle additional protocols, and computes pseudo etag from head response

2869af6

Update setup.py

a9e058e

add test

c5d74b1

Update setup.py

9eae6d4

Co-authored-by: Alvaro Bartolome <[email protected]>

Update tests/test_file_utils.py

c37215a

Co-authored-by: Quentin Lhoest <[email protected]>

add tmpfs and use to test fsspec in get_from_cache

341bda5

Update src/datasets/utils/file_utils.py

74c0c45

Co-authored-by: Quentin Lhoest <[email protected]>

Update src/datasets/utils/file_utils.py

b2c958a

Co-authored-by: Quentin Lhoest <[email protected]>

dwyatte force-pushed the fsspec branch from 7d491e1 to b2c958a on March 10, 2023 21:18

remove comment

b9b1075

lhoestq approved these changes Mar 11, 2023

lhoestq merged commit 3e62699 into huggingface:main Mar 11, 2023

Nilabhra mentioned this pull request May 11, 2023

Streaming support for load_from_disk #5838

Closed

janineguo mentioned this pull request Jun 3, 2023

add support for storage_options for load_dataset API #5919

Closed
