-
Hi @skunkwerk, yes, we support it through the Ray backend, as you suggested. Running with the Ray backend can be pretty easy; you can even run it locally on your laptop just by specifying the following in your config:
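```yaml
backend:
  type: ray
```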
In most cases, though, you'll want to use it with a distributed cluster running in the cloud. For that, we have the Running Ludwig on Ray guide and the Backend configuration section of the docs. Happy to help more if you have follow-up questions.
-
Thanks, Travis.
-
I'm unsure how to pass a directory of Parquet files to Ludwig's dataset parameter. If I pass a path to a folder, like "/data/folder", I get an error. I was able to side-step this by adding a "data_format" parameter to the create_auto_config method here, which then passes it to the call to load_dataset; however, I still get an error further down in that code because it doesn't recognize the extension. If I pass a wildcard path, like "/data/folder/*.parquet", I get "no such file or directory". What's the proper way to do this?
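For reference, this is the kind of thing I'm trying to end up with (an untested sketch; I'm not sure Ludwig accepts a Dask dataframe directly, and "config.yaml" stands in for my real config):

```python
import dask.dataframe as dd
from ludwig.api import LudwigModel

# Read the whole directory of Parquet files with Dask ourselves,
# then hand the dataframe to Ludwig instead of a path.
df = dd.read_parquet("/data/folder")

model = LudwigModel(config="config.yaml", backend="ray")
train_stats, _, output_dir = model.train(dataset=df)
```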
-
Also, I seem to be having trouble with Ray reading and training on an 8 GB Parquet file used as the dataset in Ludwig. Using the Ray backend on a single-node (local) instance, I get errors like this: "2 Workers (tasks / actors) killed due to memory pressure (OOM)". My machine has 16 GB of RAM; any tips on how to configure Ray so it doesn't run out of memory? thanks
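For what it's worth, the only knob I've found so far is capping Ray's object store before Ludwig starts it (the 2 GiB figure below is a guess, not a tuned value):

```python
import ray

# Start Ray with an explicit object-store cap before running Ludwig, so the
# shared-memory store can't claim its default share of RAM on a 16 GB machine.
ray.init(object_store_memory=2 * 1024**3)  # 2 GiB; illustrative only
```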
-
@tgaddair is there any way I could use the local backend and feed batches of a large Parquet file (converted to pandas dataframes) into Ludwig? I know how to do the Parquet batching; I just don't know where I would need to patch the code. Ray on a single-node cluster doesn't seem to work, and I can't figure out how to fix it from their docs.
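Something like this is what I have in mind; it's untested, and I'm only assuming train_online is meant for this kind of incremental loop:

```python
import pyarrow.dataset as ds
from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")  # local backend

# Stream the Parquet directory in chunks and feed each chunk to Ludwig.
parquet = ds.dataset("/data/folder", format="parquet")
for batch in parquet.to_batches(batch_size=100_000):
    model.train_online(batch.to_pandas())  # one pass per chunk
```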
-
I have some Parquet files I'd like to use with Ludwig. Does Ludwig support training on data that is larger than available memory?
I'm guessing it will use either Ray or Dask under the hood, based on what I've configured?
Are there any tutorials on how to do this efficiently?
thanks