Add import feature for user-provided regions and/or features #250
For my purposes (#216), the proposed design using GFF3 CDS features would appear to work well. While ORF callers are quite good these days, sometimes they miss features where we have strong biological evidence for their existence. |
After some further considerations, I decided to keep this simpler and go for a mere import ofo So, now, there's a new parameter
|
Attempting to use this, I find I'm wanting Bakta to search for my sequences instead of having to do it myself. Say I have several hundred genomes to annotate, and I know geneX exists in many of them, but geneX tends to not get annotated. If I understand correctly, under the current scheme I have to go find the coordinates for geneX in each of those genomes and then make a supplemental GFF for each genome, and then supply that GFF for --regions when I run Bakta. What I'd rather do is feed Bakta the sequence for geneX, and if a sufficiently homologous match is found it gets added as a user CDS in the way described above. |
Thanks @marade for the clarification. Now I see your point. However, I had read and understood your use case above and in #216 to mean that importing a priori-annotated CDS regions is important to allow for amended regional annotations in single genomes. This new feature now allows for such manual annotations. However, as you already mentioned, these coordinates must of course be provided for each single genome. Even in clonal genomes, gene positions can (and often will) differ slightly. So, if I understand your post correctly, you're interested in inferring CDS simply by homology without de novo prediction. This could also be done, but in general it should be handled with care, since you cannot know whether the hit is a proper functional gene. De novo gene prediction tools take further information into account, such as genetic neighborhood, ribosomal binding sites, etc. So, in principle, it's possible and not too complicated to implement and add such a feature, too. But there are several non-trivial questions arising from that:
Therefore, I'm reluctant to implement this, simply because there are so many different parameters to either anticipate as a default or ask from the user. But what about an external script that can be executed before Bakta? This could use One huge advantage would be that the various parameters required to adapt this to different use cases can be added without overcrowding Bakta's UI. |
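A minimal sketch of what such a pre-processing step might look like, under stated assumptions; it is not the script discussed in this thread. It assumes the homology search was already run separately and produced standard 12-column BLAST tabular output (`-outfmt 6`) with the genome contigs as subjects. The function name `blast_hits_to_gff3`, the thresholds, and the `homology_search` source label are hypothetical, chosen only for illustration.

```python
def blast_hits_to_gff3(tabular_text, min_identity=90.0, min_length=50):
    """Convert BLAST tabular (outfmt 6) hits into GFF3 CDS lines.

    Assumed column order: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    Thresholds are illustrative defaults, not recommendations.
    """
    lines = ["##gff-version 3"]
    for n, row in enumerate(tabular_text.strip().splitlines(), start=1):
        if row.startswith("#"):
            continue  # skip comment lines
        fields = row.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        query, contig = fields[0], fields[1]
        identity, aln_len = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if identity < min_identity or aln_len < min_length:
            continue  # hit too weak or too short
        # In BLAST tabular output, sstart > send marks a reverse-strand hit.
        strand = "+" if sstart <= send else "-"
        start, end = min(sstart, send), max(sstart, send)
        attrs = f"ID=user_cds_{n};Name={query}"
        lines.append("\t".join([
            contig, "homology_search", "CDS",
            str(start), str(end), ".", strand, "0", attrs,
        ]))
    return "\n".join(lines) + "\n"
```

Note that the hit coordinates only cover the aligned region, so the resulting feature may lack a proper start or stop codon and may not be a whole number of triplets. This is one of the reasons such hits need manual review before being supplied as regions.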
I don't love this solution, but here's a script to try. Please have a look when you get a chance. |
My use case is that I have a lot of NCBI-annotated genomes where, for consistency, I want to keep using the same locus tags and CDS coordinates as in the NCBI GFF files, but improve the hopelessly bad annotation using my own curated reference protein FASTA file. I will try the --regions option, which sounds great. Ideally, I would also like an option to disable the de novo CDS prediction by Pyrodigal; I can see when it could be useful, but in my case it is redundant. |
I tried the --regions option but got an error message.
I have tried to figure out what it means but haven't managed to solve it. I attach my runfile including the log file, and input fasta and gff file. 231124_bakta_reannot_HpGP_ncbi.sh.txt |
Hi @thorellk. Just change it to |
Though the homology-based automated lookup of user-provided features is still open, I consider the initial use case addressed and covered. Therefore, I'd like to close this for now. To follow up on the homology-based lookups, please use either #260 or #247. Thanks a lot for all these contributions! |
Hi @oschwengers
|
Hi @oschwengers, now I don't get that error anymore, but instead it's complaining about my GFF files. The GFF files are from the NCBI PGAP pipeline and should be fairly standard. Unfortunately, the error message is very general, so I don't know how to troubleshoot it. The files are still the ones that I attached above.
|
OK, after renaming the fsa header to
So there's a gene -[50878, 52099] which has a coding sequence which is not a multiple of 3 and thus causes this error:
After removing this CDS, there are more of these. As far as I know, a CDS should always consist of triplets. |
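To locate all such features up front rather than one error at a time, a small pre-check along the following lines could help. This is a sketch, not part of Bakta: the function name `find_non_triplet_cds` is hypothetical, it assumes GFF3's 1-based inclusive coordinates, and it treats every `CDS` line independently, so multi-segment CDS that share an ID would be reported per segment.

```python
def find_non_triplet_cds(gff3_text):
    """Return CDS features whose length is not a multiple of 3.

    Assumes GFF3 coordinates (1-based, inclusive), so the feature
    length is end - start + 1. Each CDS line is checked on its own.
    """
    offenders = []
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only well-formed CDS rows are of interest
        start, end = int(fields[3]), int(fields[4])
        length = end - start + 1
        if length % 3 != 0:
            offenders.append((fields[0], start, end, fields[6], length))
    return offenders
```

If the coordinates reported above are read as GFF3 coordinates (an assumption), the feature 50878..52099 spans 1222 bp, which leaves a remainder of 1 when divided by 3.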
Yes, one would definitely expect CDS to contain even triplets. This is official NCBI PGAP annotation and I checked the accompanying protein fasta file. The fasta header for that entry is |
Hi again @oschwengers. I am sorry to push for this, but do you think there is any way to work around this issue? We have several projects where we work with NCBI-annotated genomes and we want to keep the gene coordinates and locus tags but improve the functional annotation. If you don't think it will be possible with Bakta, do you have any other suggestion? I have tried Liftoff, for example, but it is not nearly as versatile. |
I guess this issue may have a similar solution as #262? |
Hey @thorellk, I'm very sorry for not having responded earlier - this just somehow slipped through. Just in case this is still of interest, I think we could skip the strict triplet checks for pseudogenes, as indicated in this case by the |
As this gets asked more & more often (#216 #245 #247 ), I'm thinking of adding this as a new larger feature to Bakta.
At first, this is a mere reservoir for ideas and requirements of this new feature - active early feedback is highly welcome!
So far, I cannot make any promises if and when this will be available.
Based on the feedback so far, currently, a first sketch looks like this:
- GFF3 or Genbank format:
  - --import-regions to import feature regions without annotations
  - --import-features to import entire features with annotations
- CDS features, only

Any thoughts, ideas, comments? Please, let us know what you think.