Batch size for many input files #80

Open
Fazel-AVB opened this issue Jan 7, 2025 · 1 comment
Comments

Fazel-AVB commented Jan 7, 2025

Hi, thank you for developing this useful tool.
I have a large number of genomes (> 100k) along with their gbk files, and I want to annotate them with Phold. Would a large --batch_size (e.g. 128) help process the files faster? I ask because the documentation mentions that a batch size of 1 is usually faster.
More generally, should I combine all of my gbk files into a single input file, or can I pass different gbk files to phold predict in parallel?

bw

@gbouras13
Owner

Hi @Fazel-AVB ,

Really interesting.

In terms of the --batch_size, I found that a batch size of 1 was fastest on my hardware (RTX4090), although in principle larger batch sizes should be more efficient. I am finalising a 'production release' of Phold now, so I will look into it.

In terms of the gbk input, I would run them in chunks (of e.g. 1000 or 5000 genomes), which I have done in the past. I am not sure if you are running this in a cluster environment, but chunking would also let you distribute the work across multiple GPUs. I have found this to be the most efficient approach for cluster environments, and it is also generally more robust: running 100k genomes will take hours or days, and if there is an error you will lose the intermediate steps. You can't run different gbks in parallel (as it uses a single GPU).
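
As a rough illustration of the chunking idea, here is a minimal Python sketch that groups many gbk files into chunks and concatenates each chunk into one multi-record GenBank file. The helper names (`chunk_paths`, `write_chunks`) and the `chunk_NNNN.gbk` naming are hypothetical and not part of Phold. It also assumes that plain concatenation of your gbk files produces valid input for Phold, which is worth checking on a small chunk first.

```python
from pathlib import Path


def chunk_paths(paths, chunk_size):
    """Split a list of paths into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]


def write_chunks(gbk_paths, out_dir, chunk_size=1000):
    """Concatenate each chunk of GenBank files into one multi-record file.

    Hypothetical helper: output files are named chunk_0000.gbk, chunk_0001.gbk, ...
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for n, chunk in enumerate(chunk_paths(list(gbk_paths), chunk_size)):
        target = out_dir / f"chunk_{n:04d}.gbk"
        with target.open("w") as out:
            for path in chunk:
                text = Path(path).read_text()
                # Ensure one file's last line does not run into the next file.
                out.write(text if text.endswith("\n") else text + "\n")
        written.append(target)
    return written
```

For example, `write_chunks(sorted(Path("gbks").glob("*.gbk")), "chunks", chunk_size=1000)` (with `gbks` and `chunks` as placeholder directory names) would produce one file per 1000 genomes. Each chunk file could then go to its own phold predict run, for instance as separate cluster jobs, so that a failure only costs one chunk.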

George
