Phold proteins-predict ends prematurely with large input files #79
Comments
Hi @shiraz-shah, sorry for the late reply. It seems as though you are running this locally (so no cluster time-limit issues). I think it simply takes a while to process the output (the speed of which I will try to improve in any case). I am not sure it is a memory issue (but it could be; 500k proteins is a lot, and I'm not sure I have tested it on a set that large). I think maybe a

George
The process just dies, so waiting longer won't help. I think it must be memory.
What kind of processing are you doing to this file in Python in order to get the annotations per CDS? Maybe I could help and write a bash script that does the same thing, so you don't have to load the file into pandas. That would both be faster and use less memory. Let me know, George. I'm happy to help.
That'll do it, I'd think, @shiraz-shah! Short of batching your proteins into smaller chunks from the start (which I would usually recommend, but it seems wasteful here given you have already generated the 3Dis), I think a 100GB file might be tough for most non-bash/awk solutions. I have been intending to replace pandas with polars in phold and pharokka, but even polars might struggle with this. A pure end-to-end bash/awk solution for this will be pretty convoluted and probably not worth your time, as there are quite a number of operations (the most expensive being merging in the PHROG categories for each hit, then choosing the top hit that isn't a hypothetical protein, if there is one, and, if there are only hypothetical proteins, choosing the top hit regardless). If you really want to give it a go, most of the code is here (https://github.com/gbouras13/phold/blob/main/src/phold/results/topfunction.py). What I suggest is twofold:
for example, you could run something like this to just get the top hits per query (or perhaps modify it slightly to get, say, the top 20/50 hits per protein based on the e-value):
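A minimal sketch, assuming Foldseek's default tabular column order (query in field 1, e-value in field 11); those field numbers and the output file names are assumptions, so adjust them to match what your foldseek_results.tsv actually contains:

```bash
# Best hit per query: sort by query ID, then by ascending e-value
# (general-numeric sort handles scientific notation like 1e-30),
# and keep the first line seen for each query.
sort -t$'\t' -k1,1 -k11,11g foldseek_results.tsv \
    | awk -F'\t' '!seen[$1]++' > tophits.tsv

# Or keep the top 20 hits per query instead of just the best one:
sort -t$'\t' -k1,1 -k11,11g foldseek_results.tsv \
    | awk -F'\t' '++seen[$1] <= 20' > top20_hits.tsv
```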
Alternatively, you could also try rerunning

George
OK, that makes sense. It's definitely doable in bash, since I've been doing something similar myself for years. But a more optimised Python solution would be cleaner, so I understand if you want to keep it within Python. I've checked the TSV file, and some of the proteins have tens of thousands of hits, so there's lots of room for optimisation here. I think
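For instance, the selection rule described above (take the top non-hypothetical hit per query, and fall back to the top hit when every hit is hypothetical) fits in a single awk pass. This is a rough sketch only, not phold's actual logic: it assumes the hits are already sorted by query and ascending e-value (as with the sort command above), that the merged-in product annotation sits in column 3, and that the file names are placeholders:

```bash
# Rough sketch. Input: hits sorted by query (column 1) and ascending
# e-value, with the product annotation assumed to be in column 3 after
# merging in the PHROG table. For each query, print the best
# non-hypothetical hit if one exists, otherwise the best hit overall.
awk -F'\t' '
    function emit() {
        if (q != "") print (best_annot != "" ? best_annot : best_any)
    }
    $1 != q { emit(); q = $1; best_any = $0; best_annot = "" }
    best_annot == "" && $3 != "hypothetical protein" { best_annot = $0 }
    END { emit() }
' sorted_hits.tsv > per_query_tophits.tsv
```

The PHROG category merge itself could be bolted on beforehand with join over the sorted target IDs.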
I can see now that the default value is supposed to be 10000, but I can see in my
The other thing is in the

George
This might be related to an earlier issue I posted.
Description
When supplied with a sufficiently large input file, `phold proteins-compare` ends with intermediate files such as `foldseek_results.tsv` instead of `phold_per_cds_predictions.tsv`. It's as if it didn't finish the job. Maybe it ran out of memory.
The phold log file does not contain any error. The last line in the log file in these cases is:

Whereas when phold does succeed (if run on a sufficiently small input file), there are two additional lines in the log:
What I Did
Command:
Suggestions
Is there a way to resume from `foldseek_results.tsv` without having to run everything else from scratch? It seems the results are only minutes away.