-
Hi @skunkwerk, yes, we support it through the Ray backend, as you suggested. Running with the Ray backend can be pretty easy; you can even run it locally on your laptop just by specifying the following in your config:
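```yaml
backend:
  type: ray
```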
In most cases, though, you'll want to use it with a distributed cluster running in the cloud. For that, we have the Running Ludwig on Ray guide and the Backend configuration section of the docs. Happy to help more if you have follow-up questions.
-
Thanks, Travis.
-
I'm unsure how to pass a directory of Parquet files to Ludwig's dataset parameter. If I pass a path to a folder, like "/data/folder", I get an error. I was able to side-step this by adding a "data_format" parameter to the create_auto_config method here, which then passes it to the call to load_dataset; however, I still get an error further down in that code because it doesn't recognize the extension. If I pass a wildcard path, like "/data/folder/*.parquet", I get "no such file or directory". What's the proper way to do this?
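For reference, this is the kind of thing I'm trying to end up with (an untested sketch; I'm not sure Ludwig accepts a Dask dataframe directly, and "config.yaml" stands in for my real config):

```python
import dask.dataframe as dd
from ludwig.api import LudwigModel

# Read the whole directory of Parquet files with Dask ourselves,
# then hand the dataframe to Ludwig instead of a path.
df = dd.read_parquet("/data/folder")

model = LudwigModel(config="config.yaml", backend="ray")
train_stats, _, output_dir = model.train(dataset=df)
```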
-
Also, I seem to be having trouble with Ray reading and training on an 8 GB Parquet file used as the dataset in Ludwig. Using the Ray backend on a single-node (local) instance, I get errors like this: "2 Workers (tasks / actors) killed due to memory pressure (OOM)". My machine has 16 GB of RAM; any tips on how to configure Ray so it doesn't run out of memory? thanks
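For what it's worth, the only knob I've found so far is capping Ray's object store before Ludwig starts it (the 2 GiB figure below is a guess, not a tuned value):

```python
import ray

# Start Ray with an explicit object-store cap before running Ludwig, so the
# shared-memory store can't claim its default share of RAM on a 16 GB machine.
ray.init(object_store_memory=2 * 1024**3)  # 2 GiB; illustrative only
```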
-
@tgaddair is there any way I could use the local backend and feed batches of a large Parquet file (converted to pandas dataframes) into Ludwig? I know how to do the Parquet batching; I just don't know where I would need to patch the code. Ray on a single-node cluster doesn't seem to work, and I can't figure out how to fix it from their docs.
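Something like this is what I have in mind; it's untested, and I'm only assuming train_online is meant for this kind of incremental loop:

```python
import pyarrow.dataset as ds
from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")  # local backend

# Stream the Parquet directory in chunks and feed each chunk to Ludwig.
parquet = ds.dataset("/data/folder", format="parquet")
for batch in parquet.to_batches(batch_size=100_000):
    model.train_online(batch.to_pandas())  # one pass per chunk
```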
-
I have some Parquet files I'd like to use with Ludwig. Does Ludwig support training on data that is larger than available memory?
I'm guessing it will use either Ray or Dask under the hood, based on what I've configured?
Are there any tutorials on how to do this efficiently?
thanks