Force JSON format regardless of file naming on S3 #5012

Closed
junwang-wish opened this issue Sep 22, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@junwang-wish

I have a file on S3 created by Data Version Control (DVC). Its path looks like s3://dvc/ac/badff5b134382a0f25248f1b45d7b2, with no file extension, but its content is JSON. If I run

from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files="s3://dvc/ac/badff5b134382a0f25248f1b45d7b2",
)

It gives me

InvalidSchema: No connection adapters were found for 's3://dvc/ac/badff5b134382a0f25248f1b45d7b2'

However, I cannot rename the file on S3. Is there a way to force loading an S3 URL with a specific decoder (JSON, CSV, etc.) regardless of how the S3 URL is named?

@junwang-wish junwang-wish added the enhancement New feature or request label Sep 22, 2022
@lhoestq
Member

lhoestq commented Sep 26, 2022

Hi! Support for URIs like s3://... is not implemented yet in data_files=. In the meantime, if your data is public, you can use the HTTP URL instead.
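
For readers trying the HTTP workaround, here is a minimal sketch. It assumes the object is publicly readable and that the bucket is reachable through the usual virtual-hosted-style S3 endpoint. The helper names `s3_uri_to_https_url` and `load_public_json` are hypothetical and not part of `datasets`. The `"json"` argument is what selects the builder, so the missing file extension should not matter.

```python
from urllib.parse import quote, urlparse


def s3_uri_to_https_url(s3_uri, region=None):
    """Turn s3://bucket/key into a virtual-hosted-style HTTPS URL.

    Hypothetical helper: only useful when the object is publicly readable.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    # Include the region in the host name only when the caller provides one.
    if region:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    else:
        host = f"{bucket}.s3.amazonaws.com"
    # quote() keeps "/" as-is and percent-encodes characters such as spaces.
    return f"https://{host}/{quote(key)}"


def load_public_json(s3_uri, region=None):
    """Load a public, extension-less JSON object through its HTTP URL."""
    # Imported lazily so the URL helper above works without `datasets`.
    from datasets import load_dataset

    return load_dataset("json", data_files=s3_uri_to_https_url(s3_uri, region))
```

With the path from the original post, `s3_uri_to_https_url("s3://dvc/ac/badff5b134382a0f25248f1b45d7b2")` gives `https://dvc.s3.amazonaws.com/ac/badff5b134382a0f25248f1b45d7b2`. Whether that URL is actually reachable depends on the bucket's access policy.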

@rdpratti

rdpratti commented Apr 3, 2023

Hi,
I want to make sure I understand this response. I have a set of files on S3 that are private for security reasons. Does this mean that, because they are not public, I cannot read those files (many are Parquet) into my HF notebooks on Kaggle? That can't be correct, can it?

@lhoestq
Member

lhoestq commented Apr 4, 2023

Hi! There is a discussion at #5281.

With the latest datasets 2.11, you can try passing fsspec URLs that point to private buckets to data_files in load_dataset(). This is still experimental and undocumented, so feedback is welcome. You may not get the best experience, since performance and caching haven't been tested properly yet.
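
A rough sketch of what trying this could look like, given the experimental status described above. It assumes `s3fs` is installed so that fsspec can resolve `s3://` URLs, and that credentials are passed through the `storage_options` argument of `load_dataset()` using the `key`, `secret` and `token` names that `s3fs` accepts. The helper names `s3_storage_options` and `load_private_json` are hypothetical.

```python
import os


def s3_storage_options(env=None):
    """Build s3fs-style credentials from AWS environment variables.

    Hypothetical helper. Returns an empty dict when no key pair is set,
    which leaves credential lookup to s3fs's own defaults.
    """
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        return {}
    options = {"key": key, "secret": secret}
    # Temporary credentials also carry a session token.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        options["token"] = token
    return options


def load_private_json(s3_uri):
    """Try loading a private S3 object with the JSON builder (experimental)."""
    # Imported lazily so the helper above works without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "json",  # the builder is chosen here, not from the file extension
        data_files=s3_uri,
        storage_options=s3_storage_options(),
    )
```

Calling `load_private_json("s3://dvc/ac/badff5b134382a0f25248f1b45d7b2")` would then attempt the load from the original post. Since this path is untested for S3, treat any caching or performance behavior as unknown.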

@lhoestq
Member

lhoestq commented Aug 16, 2023

Closing this one since data_files now supports fsspec URLs (though this is still experimental, untested and undocumented for S3).

@lhoestq lhoestq closed this as completed Aug 16, 2023