
Phold proteins-predict ends prematurely with large input files #79

Open
shiraz-shah opened this issue Dec 2, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@shiraz-shah

This might be related to an earlier issue I posted.

  • phold version: 0.2.0
  • Python version:
  • Operating System: Ubuntu Server 22.04 LTS

Description

When supplied with a sufficiently large input file, phold proteins-compare ends prematurely, leaving intermediate files such as foldseek_results.tsv instead of phold_per_cds_predictions.tsv.

It's as if the job never finished. Perhaps it ran out of memory.

The phold log file does not contain any error. The last line in the log file in these cases is:

2024-10-08 02:17:05.033 | INFO     | phold.results.topfunction:get_topfunctions:35 - Processing Foldseek output

Whereas when phold does succeed (if run on a sufficiently small input file), there are two additional lines in the log:

2024-10-02 17:24:07.868 | INFO     | phold.results.topfunction:get_topfunctions:35 - Processing Foldseek output
2024-10-02 17:27:37.435 | INFO     | phold.utils.util:end_phold:101 - phold proteins-compare has finished
2024-10-02 17:27:37.435 | INFO     | phold.utils.util:end_phold:102 - Elapsed time: 8111.45 seconds

What I Did

  • My machine is a 64 GB, 16-thread i7 with a 24 GB RTX 4090.
  • The above problem was encountered with input files larger than 500k proteins.
  • Input files smaller than 100k proteins ran fine to the end.
    Commands:
phold proteins-predict -i viruses.sampled.faa -o viruses.sampled.p_predict
phold proteins-compare -i viruses.sampled.faa --predictions_dir viruses.sampled.p_predict -o viruses.sampled.p_compare

Suggestions

  • Can we get a meaningful error in the log file when phold fails?
  • If it's a memory issue, can we have warnings ahead of time instead of waiting until failure?
  • Or are there options to "reduce batch size", and if so, are there trade-offs with sensitivity?
  • Is there a way to salvage a phold run that has almost completed? How can one run the last two steps on foldseek_results.tsv without having to rerun everything else from scratch? It seems the result is only minutes away.
@gbouras13
Owner

Hi @shiraz-shah ,

Sorry for the late reply. It seems as though you are running this locally (so no cluster time-limit issues). I think it simply takes a while to process the output (the speed of which I will try to improve in any case).

I am not sure it is a memory issue (but it could be, 500k proteins is a lot, not sure I have tested it on a set that large).

I think maybe a --restart flag would be a good idea to just run the processing step after the foldseek results are generated - how big is the foldseek_results.tsv file? (I imagine tens or hundreds of GBs). I'll try that out.

George

gbouras13 added the bug (Something isn't working) label on Jan 9, 2025
@shiraz-shah
Author

The process just dies, so waiting longer won't help. I think it must be a memory issue.

foldseek_results.tsv is 98GB.

What kind of processing are you doing to this file in Python in order to get the annotations per CDS? Maybe I could help and write a bash script that does the same thing, so you don't have to load the file into Pandas. That would both be faster and use less memory.

Let me know, George. I'm happy to help

@gbouras13
Owner

That'll do it, I'd think, @shiraz-shah!

Short of batching your proteins into smaller chunks from the start (which I would recommend, but it seems wasteful here given you have already generated the 3Di sequences), I think a 100 GB file might be tough for most non-bash/awk solutions. I have been intending to replace pandas with polars in phold and pharokka, but even polars might struggle with this.

A pure end-to-end bash/awk solution for this will be pretty convoluted and probably not worth your time, as there are quite a number of operations (the most expensive being merging in the PHROG categories for each hit, then choosing the top hit that isn't a hypothetical protein if there is one, and, if there are only hypothetical proteins, choosing the top hit regardless). If you really want to give it a go, most of the code is here: https://github.com/gbouras13/phold/blob/main/src/phold/results/topfunction.py
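Roughly, the core selection could look something like the following awk sketch (a simplified sketch, not phold's actual code; it assumes the query ID is in column 1 and the target PHROG in column 2 of foldseek_results.tsv, that hits come ranked best-first per query, and that a hypothetical two-column phrog_annot.tsv maps each target to its PHROG annotation):

awk -F'\t' '
NR==FNR { annot[$1]=$2; next }                         # first file: target -> PHROG annotation map
{
    q = $1; a = annot[$2]
    if (!(q in best)) best[q] = $0                     # remember the top hit per query as a fallback
    if (!(q in bestinf) && a != "" && a != "hypothetical protein")
        bestinf[q] = $0                                # first non-hypothetical hit wins
}
END { for (q in best) print ((q in bestinf) ? bestinf[q] : best[q]) }
' phrog_annot.tsv foldseek_results.tsv > per_cds_tophits.tsv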

What I suggest is twofold:

  1. Try to make foldseek_results.tsv a bit more manageable - most proteins will have thousands of hits that are not informative beyond the top hit (or top few if the top hit is hypothetical). I suggest filtering foldseek_results.tsv.

For example, you could run something like this to just get the top hit per query (or perhaps modify it slightly to get, say, the top 20/50 hits per protein based on the e-value; a sketch of that variant follows the list below):

cat foldseek_results.tsv | awk -F'\t' '{
    if (!seen[$1]++) {  # keep only the first row seen for each query (column 1), i.e. the top hit
        print $0;
    }
}' > tophit_foldseek_results.tsv
mv foldseek_results.tsv foldseek_results_backup.tsv
mv tophit_foldseek_results.tsv foldseek_results.tsv
  2. Wait for me to implement a --restart flag (I'm back to actively developing phold today, so it shouldn't be long) and then run that.
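
As for the top-N variant mentioned in point 1, a minimal sketch would be (again assuming the hits for each query are already ranked best-first in the file, as the one-liner above also assumes):

awk -F'\t' -v N=20 '++seen[$1] <= N' foldseek_results.tsv > topN_foldseek_results.tsv   # keep the first N hits per query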

Alternatively, you could try rerunning phold proteins-compare with --max_seqs 500, or something else to reduce the number of hits that pass the prefilter, but it will probably still be a massive TSV.
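That would be something like this (reusing your commands from above; the output directory name here is just an example, so the existing run isn't overwritten):

phold proteins-compare -i viruses.sampled.faa --predictions_dir viruses.sampled.p_predict -o viruses.sampled.p_compare_maxseqs500 --max_seqs 500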

George

@shiraz-shah
Author

OK, that makes sense. It's definitely doable in bash, since I've been doing something similar myself for years. But a more optimised Python solution would be cleaner, so I understand if you want to keep it within Python.

I've checked the TSV file, and some of the proteins have tens of thousands of hits, so there's lots of room for optimisation here. I think --max_seqs 500 would help a lot with the TSV size. Would it also make the foldseek search faster? If so, I think I'll try rerunning proteins-compare.

@shiraz-shah
Author

I can see now that the default value is supposed to be 10000, yet in my foldseek_results.tsv some proteins have up to 50k hits.
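
(A quick way to check this, assuming the query ID is in the first column of foldseek_results.tsv:)

cut -f1 foldseek_results.tsv | sort | uniq -c | sort -rn | head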

@gbouras13
Owner

--max_seqs doesn't cap the number of hits; it caps the number of sequences that pass the prefilter. I'm not an absolute expert, but I believe the number of hits can be higher because phold runs Foldseek in cluster search mode (so it can find more hits via the clusters afterwards). It should still reduce the number of hits overall if you lower it, though.

The other thing is that in the dev branch I have implemented --foldseek_extra_params, which should let you try other parameters if you wish (like -s).
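For example, something like the following (an untested sketch - I haven't double-checked the exact quoting, and -s here is Foldseek's sensitivity parameter with an arbitrary example value):

phold proteins-compare -i viruses.sampled.faa --predictions_dir viruses.sampled.p_predict -o viruses.sampled.p_compare_s --foldseek_extra_params "-s 7.5"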

George
