Pass down storage options #5673
Conversation
Force-pushed: 76d9ad4 to 94bc9ef
The documentation is not available anymore as the PR was closed or merged.
Awesome! Just one comment: `download_and_prepare` is not called when streaming a dataset, so we may need to have `storage_options` in the `DatasetBuilder.__init__`? This way it could also be passed later to `as_streaming_dataset` and the `StreamingDownloadManager`.
Currently the `storage_options` parameter in `download_and_prepare` is for the target filesystem where the dataset must be downloaded and prepared as arrow files.
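A minimal sketch of the idea in this comment (the class and method names below are stand-ins for illustration, not the real `datasets` API): accept `storage_options` at builder construction time so both the prepare path and the streaming path can reuse the same dict.

```python
# Hypothetical sketch, not the real datasets.DatasetBuilder: store
# storage_options at __init__ time so every later code path
# (download_and_prepare, streaming) can forward the same dict.
class BuilderSketch:
    def __init__(self, storage_options=None):
        # kept around for both prepare and streaming use
        self.storage_options = dict(storage_options or {})

    def as_streaming_dataset(self):
        # a real builder would hand this to a StreamingDownloadManager
        return {"storage_options": self.storage_options}

builder = BuilderSketch(storage_options={"anon": True})
print(builder.as_streaming_dataset()["storage_options"])
```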
Ah, I noted this when looking for ways to plumb down
Amazing! It works like a charm :)
Just before we merge I wanted to mention this in the docstring, let me know what you think:
Noting as experimental SGTM. The only tests I can think of to add at the moment would be mocks that assert the storage options get passed all the way down using
Co-authored-by: Quentin Lhoest <[email protected]>
I think adding tests with the mockfs fixture will do the job. Tests and docs can be added when `request_etag` and `is_remote_url` support fsspec (right now they would fail with mockfs). Let's see in a subsequent PR, this is exciting! :)
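A hedged sketch of the kind of mock-based test discussed above. The helper name and signature here are made up for illustration (the real helpers live in `file_utils`); the point is that a mock can assert the `storage_options` dict is forwarded all the way down.

```python
from unittest import mock

# Hypothetical helper standing in for a file_utils.fsspec_get-style
# function: it forwards storage_options to an injected opener callable.
def fsspec_get_sketch(url, storage_options=None, opener=None):
    return opener(url, **(storage_options or {}))

# The mock lets the test assert the options were passed all the way down.
opener = mock.Mock(return_value=b"data")
result = fsspec_get_sketch(
    "s3://bucket/file.txt", storage_options={"key": "abc"}, opener=opener
)
opener.assert_called_once_with("s3://bucket/file.txt", key="abc")
assert result == b"data"
```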
Benchmarks (PyArrow==8.0.0):
Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Remove implementation-specific kwargs from `file_utils.fsspec_get` and `file_utils.fsspec_head`, instead allowing them to be passed down via `storage_options`. This fixes an issue where s3fs did not recognize a timeout arg, as well as an issue mentioned in #5281, by allowing users to pass down `storage_options` all the way from `datasets.load_dataset` to support implementation-specific credentials. Supports something like the following to provide credentials explicitly instead of relying on boto's methods of locating them: