Download a very large dataset #5707
Replies: 1 comment
Hi! Loading from Parquet is already very fast, but you can make it even faster by calling |
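The exact call is not spelled out above, so purely as a hedged sketch: the `datasets` library does expose a `num_proc` argument on `load_dataset` for parallelizing download and preparation of the Parquet shards, and a `streaming=True` mode for iterating over them lazily. The repo id below is a placeholder, not one from this thread.

```python
# A minimal sketch, assuming a dataset that was pushed to the Hub as Parquet shards.
# "username/my-large-dataset" is a placeholder repo id.
from datasets import load_dataset

# num_proc spawns multiple worker processes to download and prepare shards in parallel.
ds = load_dataset("username/my-large-dataset", num_proc=8)

# Alternatively, streaming=True avoids downloading everything up front and
# iterates over the Parquet shards lazily.
streamed = load_dataset("username/my-large-dataset", split="train", streaming=True)
```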
Hello, I want to upload a very large dataset to the Hub and would like to make sure that users can download it efficiently. I know that when I upload it, it will automatically be divided into shards, which is great. What will be the most efficient way to download the dataset afterwards, e.g. making use of the largest number of concurrent processes/threads?
Also, if I upload with push_to_hub, do I need to add a custom download script to make the download more efficient? Locally, loading the dataset with load_from_disk works fine.
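As a hedged sketch of one way to pull the shards concurrently: `snapshot_download` from `huggingface_hub` accepts a `max_workers` argument for parallel file downloads, and a dataset pushed with `push_to_hub` is stored as Parquet, so plain `load_dataset` can read it without a custom script. The repo id and worker count below are placeholders.

```python
# A minimal sketch, assuming the goal is to fetch every shard of a Hub dataset
# repo with concurrent downloads. Repo id and worker count are placeholders.
from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download all files in the dataset repo; max_workers controls how many
# files are fetched concurrently.
local_dir = snapshot_download(
    repo_id="username/my-large-dataset",
    repo_type="dataset",
    max_workers=8,
)

# Datasets pushed with push_to_hub are stored as Parquet, so load_dataset
# can read them directly; no custom loading script is required.
ds = load_dataset("username/my-large-dataset")
```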