Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting a vOTU query #2

Open
AndAvia opened this issue May 12, 2024 · 1 comment
Open

Getting a vOTU query #2

AndAvia opened this issue May 12, 2024 · 1 comment

Comments

@AndAvia
Copy link

AndAvia commented May 12, 2024

hi,
I download the vOTU.fna (There are 5251 vOTUs in there.)data in the text and used the MGV pipeline(https://github.com/snayfach/MGV/tree/master/viral_detection_pipeline), like this

conda activate hmmer
cd ~/virus1/temp/MGV/mgv/viral_detection_pipeline
cp ~/virus/viwrap_result/paper/vOTUs.fna ./input/
#Call viral genes
./prodigal.linux -i input/vOTUs.fna -a input/vOTUs.faa -d input/vOTUs.ffn -p meta -f gff > input/vOTUs.gff #5251
#Output files include proteins (.faa), genes (.ffn), and gene coordinates (.gff). Genes called in metagenomic mode (-p meta).
 
#Run HMMER on imgvr and pfam databases
hmmsearch -Z 1 --cpu 16 --noali --tblout output/imgvr.out input/imgvr.hmm input/vOTUs.faa
hmmsearch -Z 1 --cut_tc --cpu 16 --noali  --tblout output/pfam.out input/pfam.hmm input/vOTUs.faa
#Here the -Z 1 flag is specified to make E-values comparable between databases and between samples. The IMG/VR database contains viral genes while the Pfam database contains non-viral genes.

#Count genes hitting viral and microbial marker genes
python count_hmm_hits.py input/vOTUs.fna input/vOTUs.faa output/imgvr.out output/pfam.out > output/hmm_hits.tsv
#Each gene assigned according to its best hit with E-value <1e-10. We found that several of the HMMs from IMG/VR HMMs are commonly found in non-viral genomes and several of the HMMs from Pfam are commonly found in viral genomes. Those HMMs are excluded from the analysis.

#Run VirFinder
Rscript virfinder.R input/vOTUs.fna output/virfinder.tsv
#VirFinder scores each contig using a machine-learning algorithm based on kmers https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0283-5

#Quantify strand switch rate of genes
python strand_switch.py input/vOTUs.fna input/vOTUs.faa > output/strand_switch.tsv
#Here the code scans the proteins from each genome in genomic order. It counts the nuber of strand switches (+ to - or - to +) and divides by the total number of genes.

#Create master table of sequence features
python master_table.py output/hmm_hits.tsv output/virfinder.tsv output/strand_switch.tsv > output/master_table.tsv

#Predict viral contigs
python viral_classify.py \
--features output/master_table.tsv \
--in_base input/vOTUs \
--out_base output/vOTUs #vOTUs.tsv  #最终得到2897个病毒

During the run no error was reported, all steps were 5251, after running the last step python viral_classify.py, finally only got 2897 viruses, in the literature the result was 4422, where is the problem, I see viral_classify.py used to a rules file classification_rules.tsv, is that file different from what you guys were using. Thanks for answering the questions!

@joacjo
Copy link
Collaborator

joacjo commented May 17, 2024

Hi Chuang Ma

Your result sounds quite different. You could run a tool like Vibrant to get closer to the set of 4422 vOTUs used in the analysis (see article "Furthermore, we also parsed vOTUs using VIBRANT (v.1.2.1, default settings) and found a 4240/4422 (96%) viral prediction agreement.").

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants