-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for storage_options for load_dataset API #5919
Conversation
hi @lhoestq,
|
I just re-ran the CI, hopefully it's fixed |
The documentation is not available anymore as the PR was closed or merged. |
I just checked, still has the same error, maybe need someone to fix it |
I think the issue comes from this PR somehow, since the CI fail is related to loading private repositories and this PR touches authentication related code. Let me check what's the issue, and I'll also review your PR later (sorry I don't have a ton of bandwidth atm) |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
@lhoestq Hi sorry to bother you, the CI check_code_quality failed and it said |
I just ran |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just deprecated use_auth_token
in favor of token
in DownloadConfig
, so I added suggestions to fix this as well as some comments
src/datasets/arrow_dataset.py
Outdated
@@ -5316,7 +5316,12 @@ def path_in_repo(_index, shard): | |||
for data_file in data_files | |||
if data_file.startswith(f"data/{split}-") and data_file not in shards_path_in_repo | |||
] | |||
deleted_size = sum(xgetsize(hf_hub_url(repo_id, data_file), token=token) for data_file in data_files_to_delete) | |||
|
|||
download_config = DownloadConfig(use_auth_token=token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=token) | |
download_config = DownloadConfig(token=token) |
def _prepare_http_url_kwargs(url: str, token: Optional[Union[str, bool]] = None) -> Tuple[str, dict]: | ||
def _validate_servers(urlpath: str): | ||
server = urlpath.split("://")[0] | ||
return server in SUPPORTED_REMOTE_SERVER_TYPE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to support any fsspec protocol here instead of hardcoding a list
if not rest_hops and _validate_servers(main_hop): | ||
if not implemented: | ||
raise NotImplementedError("Currently not extended to support URLs in streaming mode") | ||
main_hop, http_kwargs = _prepare_http_url_kwargs(main_hop, download_config=download_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_prepare_http_url_kwargs should only be called on https URLs no ?
And for any other fsspec URI you can return the path unchanged and the storage_options from download_config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, _prepare_http_url_kwargs should only be called for URLs where we need to set the kwargs.
that really makes the code clearer and more elegant.
src/datasets/features/audio.py
Outdated
|
||
with xopen(path, "rb", token=token) as f: | ||
download_config = DownloadConfig(use_auth_token=use_auth_token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=use_auth_token) | |
download_config = DownloadConfig(token=token) |
token = token_per_repo_id[repo_id] | ||
token_per_repo_id[repo_id] | ||
except (ValueError, KeyError): | ||
token = None | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should revert this no ? we need to get the token from token_per_repo_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need, I can change the setting of DownloadConfig.
assert xisdir("zip://main_dir::" + root_url, token=hf_token) is True | ||
assert xisdir("zip://qwertyuiop::" + root_url, token=hf_token) is False | ||
|
||
download_config = DownloadConfig(use_auth_token=hf_token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=hf_token) | |
download_config = DownloadConfig(token=hf_token) |
@@ -374,8 +379,9 @@ def test_xisfile(input_path, isfile, tmp_path, mock_fsspec): | |||
@pytest.mark.integration | |||
def test_xisfile_private(hf_private_dataset_repo_txt_data, hf_token): | |||
root_url = hf_hub_url(hf_private_dataset_repo_txt_data, "") | |||
assert xisfile(root_url + "data/text_data.txt", token=hf_token) is True | |||
assert xisfile(root_url + "qwertyuiop", token=hf_token) is False | |||
download_config = DownloadConfig(use_auth_token=hf_token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=hf_token) | |
download_config = DownloadConfig(token=hf_token) |
@@ -397,9 +403,10 @@ def test_xgetsize(input_path, size, tmp_path, mock_fsspec): | |||
@pytest.mark.integration | |||
def test_xgetsize_private(hf_private_dataset_repo_txt_data, hf_token): | |||
root_url = hf_hub_url(hf_private_dataset_repo_txt_data, "") | |||
assert xgetsize(root_url + "data/text_data.txt", token=hf_token) == 39 | |||
download_config = DownloadConfig(use_auth_token=hf_token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=hf_token) | |
download_config = DownloadConfig(token=hf_token) |
@@ -440,8 +447,9 @@ def test_xglob(input_path, expected_paths, tmp_path, mock_fsspec): | |||
@pytest.mark.integration | |||
def test_xglob_private(hf_private_dataset_repo_zipped_txt_data, hf_token): | |||
root_url = hf_hub_url(hf_private_dataset_repo_zipped_txt_data, "data.zip") | |||
assert len(xglob("zip://**::" + root_url, token=hf_token)) == 3 | |||
assert len(xglob("zip://qwertyuiop/*::" + root_url, token=hf_token)) == 0 | |||
download_config = DownloadConfig(use_auth_token=hf_token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=hf_token) | |
download_config = DownloadConfig(token=hf_token) |
assert len(list(xwalk("zip://::" + root_url, token=hf_token))) == 2 | ||
assert len(list(xwalk("zip://main_dir::" + root_url, token=hf_token))) == 1 | ||
assert len(list(xwalk("zip://qwertyuiop::" + root_url, token=hf_token))) == 0 | ||
download_config = DownloadConfig(use_auth_token=hf_token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
download_config = DownloadConfig(use_auth_token=hf_token) | |
download_config = DownloadConfig(token=hf_token) |
I am working on this issue right now #6017 which is strongly connected to your PR, and I might end up cherry-picking some of your commits (keeping attribution of course !). Would you be ok with that ? |
it's totally ok for me, I just wish the S3 File system could support streaming too. |
I already adjust the code and test on my local Mac, you can check it now, and you can make any changes to it. |
Closing this PR in favor of #6028 which includes your contribution :) |
to solve the issue in #5880
add s3 support in the link check step, previous we only check
http
andhttps
,change the parameter of
use_auth_token
todownload_config
to support bothstorage_options
anduse_auth_token
parameter when trying to handle(list, open, read, etc,.) the remote files.integrate the check part's duplicate code to make adding or deleting other sources easier.