add support for storage_options for load_dataset API #5919

Closed
wants to merge 6 commits into from

Conversation

janineguo
Contributor

@janineguo janineguo commented Jun 1, 2023

This PR addresses the issue in #5880:

  1. Add S3 support to the link check step; previously we only checked http and https.

  2. Replace the use_auth_token parameter with download_config, so that both storage_options and use_auth_token are available when handling (listing, opening, reading, etc.) remote files.

  3. Consolidate the duplicated code in the check step to make it easier to add or remove other sources.
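
To make points 1 and 3 concrete, here is a minimal sketch of a single protocol check that every call site could share. The names `SUPPORTED_PROTOCOLS` and `is_supported_remote` are hypothetical and chosen for illustration only; they are not necessarily the names used in this PR.

```python
# Hypothetical sketch: one shared check instead of repeated http/https tests.
SUPPORTED_PROTOCOLS = frozenset({"http", "https", "s3"})


def is_supported_remote(urlpath: str) -> bool:
    """Return True if urlpath starts with one of the supported remote protocols."""
    if "://" not in urlpath:
        # Plain local paths carry no protocol prefix.
        return False
    protocol = urlpath.split("://", 1)[0].lower()
    return protocol in SUPPORTED_PROTOCOLS


if __name__ == "__main__":
    for candidate in ("s3://my-bucket/train.csv", "/home/user/train.csv"):
        print(candidate, is_supported_remote(candidate))
```

With a single set like this, adding or removing a source is a one-line change rather than an edit to every check.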

@janineguo
Contributor Author

janineguo commented Jun 10, 2023

Hi @lhoestq,
I saw some errors in the tests, and all of the failures are FileNotFoundError raised by test_load_streaming_private_dataset_with_zipped_data and test_load_dataset_private_zipped_images in test_load.py. When I run pytest on my own Windows and Ubuntu systems, all the tests in test_load.py pass. Could you help me check the test environment of our server?

2023-06-08T16:50:48.0828281Z FAILED tests/test_load.py::test_load_streaming_private_dataset_with_zipped_data - FileNotFoundError: Couldn't find a dataset script at D:\a\datasets\datasets\__DUMMY_TRANSFORMERS_USER__\repo_zipped_txt_data-16862429577813\repo_zipped_txt_data-16862429577813.py or any data file in the same directory. Couldn't find '__DUMMY_TRANSFORMERS_USER__/repo_zipped_txt_data-16862429577813' on the Hugging Face Hub either: FileNotFoundError: No (supported) data files or dataset script found in __DUMMY_TRANSFORMERS_USER__/repo_zipped_txt_data-16862429577813
2023-06-08T16:50:48.0830602Z FAILED tests/test_load.py::test_load_dataset_private_zipped_images[False-False] - FileNotFoundError: Couldn't find a dataset script at D:\a\datasets\datasets\__DUMMY_TRANSFORMERS_USER__\repo_zipped_img_data-16862429594168\repo_zipped_img_data-16862429594168.py or any data file in the same directory. Couldn't find '__DUMMY_TRANSFORMERS_USER__/repo_zipped_img_data-16862429594168' on the Hugging Face Hub either: FileNotFoundError: No (supported) data files or dataset script found in __DUMMY_TRANSFORMERS_USER__/repo_zipped_img_data-16862429594168

@janineguo janineguo closed this Jun 13, 2023
@janineguo janineguo reopened this Jun 13, 2023
@lhoestq
Member

lhoestq commented Jun 13, 2023

I just re-ran the CI, hopefully it's fixed

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 13, 2023

The documentation is not available anymore as the PR was closed or merged.

@janineguo
Contributor Author

> I just re-ran the CI, hopefully it's fixed

I just checked and it still has the same error; maybe someone needs to fix it.

@lhoestq
Member

lhoestq commented Jun 13, 2023

I think the issue comes from this PR somehow, since the CI failure is related to loading private repositories and this PR touches authentication-related code. Let me check what the issue is, and I'll also review your PR later (sorry, I don't have a ton of bandwidth atm).

@janineguo janineguo closed this Jul 10, 2023
@janineguo janineguo reopened this Jul 10, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@janineguo
Contributor Author

janineguo commented Jul 11, 2023

@lhoestq Hi, sorry to bother you. The CI check_code_quality job failed, saying it would reformat /home/runner/work/datasets/datasets/src/datasets/download/streaming_download_manager.py, but I can't see any changes when I run python3 -m black --check tests src benchmarks metrics and python3 -m ruff tests src benchmarks metrics on my own computer. Are there any version requirements for these tools? I didn't specify a version.

@lhoestq
Member

lhoestq commented Jul 11, 2023

I just ran make style and pushed the changes.
You can install the right versions of black and ruff using pip install -e .[quality] ;)

Member

@lhoestq lhoestq left a comment

We just deprecated use_auth_token in favor of token in DownloadConfig, so I added suggestions to fix this as well as some comments

@@ -5316,7 +5316,12 @@ def path_in_repo(_index, shard):
for data_file in data_files
if data_file.startswith(f"data/{split}-") and data_file not in shards_path_in_repo
]
deleted_size = sum(xgetsize(hf_hub_url(repo_id, data_file), token=token) for data_file in data_files_to_delete)

download_config = DownloadConfig(use_auth_token=token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=token)
+ download_config = DownloadConfig(token=token)
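
For readers unfamiliar with this kind of rename, the following is a minimal sketch of how a deprecated keyword can keep working while callers migrate. The helper `resolve_token` is hypothetical and is not the mechanism used in `datasets`; it only illustrates the general pattern of accepting the old name with a warning.

```python
import warnings
from typing import Optional, Union

TokenType = Optional[Union[str, bool]]


def resolve_token(token: TokenType = None, use_auth_token: TokenType = None) -> TokenType:
    """Prefer `token`; accept the deprecated `use_auth_token` with a warning.

    Hypothetical helper for illustration only.
    """
    if use_auth_token is not None:
        warnings.warn(
            "'use_auth_token' is deprecated; pass 'token' instead.",
            FutureWarning,
            stacklevel=2,
        )
        if token is None:
            # Fall back to the old argument only when the new one is absent.
            token = use_auth_token
    return token
```

Under this sketch, a caller that passes both arguments gets the value of `token`, and the warning is still emitted so the stale argument gets noticed.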

def _prepare_http_url_kwargs(url: str, token: Optional[Union[str, bool]] = None) -> Tuple[str, dict]:
def _validate_servers(urlpath: str):
server = urlpath.split("://")[0]
return server in SUPPORTED_REMOTE_SERVER_TYPE
Member

It would be nice to support any fsspec protocol here instead of hardcoding a list.
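
One possible reading of this suggestion, sketched under assumptions: ask fsspec which protocol names it knows about rather than maintaining a list by hand. `fsspec.available_protocols()` is an existing fsspec function; the helper names `known_protocols` and `is_fsspec_uri`, and the fallback set used when fsspec is not installed, are hypothetical.

```python
from typing import FrozenSet

# Fallback used only when fsspec is not installed (an assumption of this sketch).
_FALLBACK_PROTOCOLS = frozenset({"file", "http", "https", "s3"})


def known_protocols() -> FrozenSet[str]:
    """Collect the protocol names fsspec knows about, if fsspec is available."""
    try:
        import fsspec
    except ImportError:
        return _FALLBACK_PROTOCOLS
    # available_protocols() lists known names; a backend such as s3fs may
    # still need to be installed before the protocol can actually be used.
    return frozenset(fsspec.available_protocols())


def is_fsspec_uri(urlpath: str) -> bool:
    """Return True if urlpath starts with a protocol that fsspec recognizes."""
    if "://" not in urlpath:
        return False
    protocol = urlpath.split("://", 1)[0]
    return protocol in known_protocols()
```

Note that a recognized protocol name does not guarantee that the matching backend package is installed, so opening the file can still fail later.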

if not rest_hops and _validate_servers(main_hop):
if not implemented:
raise NotImplementedError("Currently not extended to support URLs in streaming mode")
main_hop, http_kwargs = _prepare_http_url_kwargs(main_hop, download_config=download_config)
Member

_prepare_http_url_kwargs should only be called on https URLs, no?

For any other fsspec URI, you can return the path unchanged along with the storage_options from download_config.
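
A rough sketch of the dispatch described in this comment, under stated assumptions: `SketchDownloadConfig` is a hypothetical stand-in for `DownloadConfig` with only two fields, and `prepare_single_hop` is an invented name. It is not the code that was eventually merged.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class SketchDownloadConfig:
    """Hypothetical stand-in for DownloadConfig, reduced to two fields."""

    token: Optional[str] = None
    storage_options: Dict[str, Any] = field(default_factory=dict)


def prepare_single_hop(url: str, config: SketchDownloadConfig) -> Tuple[str, Dict[str, Any]]:
    """Return the URL together with the keyword arguments to open it with."""
    protocol = url.split("://", 1)[0] if "://" in url else "file"
    if protocol in ("http", "https"):
        # Only HTTP(S) URLs get auth headers built from the token. This sketch
        # attaches the token to every HTTP(S) URL; a real implementation would
        # likely restrict it to hosts it trusts.
        headers = {"authorization": f"Bearer {config.token}"} if config.token else {}
        return url, {"headers": headers}
    # Any other fsspec URI: path unchanged, storage_options passed through.
    return url, dict(config.storage_options)


if __name__ == "__main__":
    cfg = SketchDownloadConfig(token="abc", storage_options={"anon": True})
    print(prepare_single_hop("s3://bucket/f.txt", cfg))
```

Keeping the HTTP-specific logic in one branch means that adding a new fsspec backend needs no change here, only the right storage_options from the caller.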

Contributor Author

Yes, _prepare_http_url_kwargs should only be called for URLs where we need to set the kwargs.
That really makes the code clearer and more elegant.


with xopen(path, "rb", token=token) as f:
download_config = DownloadConfig(use_auth_token=use_auth_token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=use_auth_token)
+ download_config = DownloadConfig(token=token)

Comment on lines -177 to +180
-     token = token_per_repo_id[repo_id]
+     token_per_repo_id[repo_id]
  except (ValueError, KeyError):
-     token = None
+     pass
Member

You should revert this, no? We need to get the token from token_per_repo_id.

Contributor Author

@janineguo janineguo Jul 12, 2023

No need; I can change the setting of DownloadConfig.
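
To illustrate what the per-repository lookup discussed in this thread is for, here is a minimal sketch. The URL pattern, the host `example.org`, and the helper `token_for_url` are all hypothetical; the sketch uses `dict.get` rather than the try/except shown in the diff, and the real Hub URL layout may differ.

```python
import re
from typing import Dict, Optional

# Hypothetical URL layout for this sketch only.
_REPO_URL = re.compile(r"^https://example\.org/datasets/(?P<repo_id>[^/]+/[^/]+)/resolve/")


def token_for_url(url: str, token_per_repo_id: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the token registered for the repository a URL points into, if any."""
    match = _REPO_URL.match(url)
    if match is None:
        # Not a repository URL: no token applies.
        return None
    # Unknown repositories fall back to None, mirroring the except branch.
    return token_per_repo_id.get(match.group("repo_id"))
```

Whichever object ends up carrying the token, the lookup result has to reach it; otherwise files in private repositories are requested without credentials.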

assert xisdir("zip://main_dir::" + root_url, token=hf_token) is True
assert xisdir("zip://qwertyuiop::" + root_url, token=hf_token) is False

download_config = DownloadConfig(use_auth_token=hf_token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=hf_token)
+ download_config = DownloadConfig(token=hf_token)

@@ -374,8 +379,9 @@ def test_xisfile(input_path, isfile, tmp_path, mock_fsspec):
@pytest.mark.integration
def test_xisfile_private(hf_private_dataset_repo_txt_data, hf_token):
root_url = hf_hub_url(hf_private_dataset_repo_txt_data, "")
assert xisfile(root_url + "data/text_data.txt", token=hf_token) is True
assert xisfile(root_url + "qwertyuiop", token=hf_token) is False
download_config = DownloadConfig(use_auth_token=hf_token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=hf_token)
+ download_config = DownloadConfig(token=hf_token)

@@ -397,9 +403,10 @@ def test_xgetsize(input_path, size, tmp_path, mock_fsspec):
@pytest.mark.integration
def test_xgetsize_private(hf_private_dataset_repo_txt_data, hf_token):
root_url = hf_hub_url(hf_private_dataset_repo_txt_data, "")
assert xgetsize(root_url + "data/text_data.txt", token=hf_token) == 39
download_config = DownloadConfig(use_auth_token=hf_token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=hf_token)
+ download_config = DownloadConfig(token=hf_token)

@@ -440,8 +447,9 @@ def test_xglob(input_path, expected_paths, tmp_path, mock_fsspec):
@pytest.mark.integration
def test_xglob_private(hf_private_dataset_repo_zipped_txt_data, hf_token):
root_url = hf_hub_url(hf_private_dataset_repo_zipped_txt_data, "data.zip")
assert len(xglob("zip://**::" + root_url, token=hf_token)) == 3
assert len(xglob("zip://qwertyuiop/*::" + root_url, token=hf_token)) == 0
download_config = DownloadConfig(use_auth_token=hf_token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=hf_token)
+ download_config = DownloadConfig(token=hf_token)

assert len(list(xwalk("zip://::" + root_url, token=hf_token))) == 2
assert len(list(xwalk("zip://main_dir::" + root_url, token=hf_token))) == 1
assert len(list(xwalk("zip://qwertyuiop::" + root_url, token=hf_token))) == 0
download_config = DownloadConfig(use_auth_token=hf_token)
Member

Suggested change
- download_config = DownloadConfig(use_auth_token=hf_token)
+ download_config = DownloadConfig(token=hf_token)

@lhoestq
Member

lhoestq commented Jul 12, 2023

I am working on issue #6017 right now, which is strongly connected to your PR, and I might end up cherry-picking some of your commits (keeping attribution, of course!). Would you be OK with that?

@janineguo
Contributor Author

That's totally OK with me. I just hope the S3 file system can support streaming too.

@janineguo
Contributor Author

I have already adjusted the code and tested it on my local Mac. You can check it now, and feel free to make any changes to it.

@lhoestq lhoestq mentioned this pull request Jul 13, 2023
@lhoestq
Member

lhoestq commented Jul 17, 2023

Closing this PR in favor of #6028 which includes your contribution :)

@lhoestq lhoestq closed this Jul 17, 2023
@janineguo janineguo deleted the S3Support_bug#5880 branch July 18, 2023 06:14