Batch size for many input files #80

Fazel-AVB · 2025-01-07T13:06:11Z

Hi, thank you for developing this useful tool.
I have a large number of genomes (> 100k) along with their gbk files, and I want to annotate them with Phold. Would a large --batch_size (e.g. 128) help process the files faster? I ask because the documentation mentions that a batch size of 1 is usually faster.
More generally, should I combine all of my gbk files into a single input file, or can I pass different gbk files to phold predict in parallel?

bw


gbouras13 · 2025-01-09T23:50:27Z

Hi @Fazel-AVB ,

Really interesting.

In terms of the --batch_size, I found that a batch size of 1 was fastest on my hardware (RTX4090), but larger batch sizes really should be more efficient. I am finalising a 'production release' of Phold now, so I will look into it.

In terms of the gbk input, I would run them in chunks (of e.g. 1000 or 5000 genomes), which I have done in the past. I'm not sure if you are running this in a cluster environment, but chunking would also let you distribute the work across multiple GPUs. I have found it to be the most efficient approach for cluster environments and also just generally more robust (running 100k genomes will take hours or days, and if there is an error, you will lose the intermediate steps). You can't run different gbks in parallel (as it uses a single GPU).
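
The chunking step could be sketched roughly as below. This is only an illustration, not part of Phold: the helper names `chunked` and `write_chunks`, the `*.gbk` glob pattern, and the `chunk_00000.gbk` output naming are all assumptions. It relies on the fact that GenBank records end with `//`, so plain concatenation of files gives a valid multi-record file.

```python
from pathlib import Path
from typing import Iterator, List


def chunked(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items` with at most `size` entries each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_chunks(gbk_dir: Path, out_dir: Path, size: int = 1000) -> List[Path]:
    """Concatenate the .gbk files in `gbk_dir` into combined files.

    Each combined file holds the records of at most `size` input files.
    Hypothetical helper: file pattern and output names are assumptions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    gbk_files = sorted(gbk_dir.glob("*.gbk"))  # sorted for reproducible chunks
    written = []
    for index, chunk in enumerate(chunked(gbk_files, size)):
        combined = out_dir / f"chunk_{index:05d}.gbk"
        with combined.open("w") as handle:
            for gbk in chunk:
                text = gbk.read_text()
                # Make sure the "//" terminator sits on its own line
                # before the next record starts.
                handle.write(text if text.endswith("\n") else text + "\n")
        written.append(combined)
    return written
```

Each resulting chunk file can then be given to its own phold predict run (for example one job per GPU on a cluster), so a failure only costs the chunk that was running at the time.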

George
