Streaming and creating refactored dataset with shards using Generator #7235
WillPowellUk asked this question in Q&A (unanswered)
I am trying to stream a dataset (i.e. read it from disk rather than loading it all into memory), refactor it using a generator and `map`, and then push it back to the Hub. The following methodology achieves this, but it is slow because of the following warning:

```
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
```
N.B. there was a related GitHub issue, but I could not build a working solution from it with `gen_kwargs`.
Here is my minimal reproducible code:
The `from_generator` examples say it should be implemented as follows:
Therefore I experimented with a new script that uses `gen_kwargs` to take in a series of shards from another dataset. This removes the warning (so I assume `num_proc` really is set to 16), yet it is even slower than using one CPU.