Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binning question, how to use vamb? #54

Open
ChaoXianSen opened this issue Oct 10, 2023 · 7 comments
Open

Binning question, how to use vamb? #54

ChaoXianSen opened this issue Oct 10, 2023 · 7 comments

Comments

@ChaoXianSen
Copy link

Hi !
Before operating phamb, i use vamb process binning,
Vamb rum mode: vamb --outdir output63
--fasta R63.contigs.fa.gz
--bamfiles R63_sort.bam
-o C

report err.log : Traceback (most recent call last):
File "/public/home/bioinfo_wang/00_software/miniconda3/envs/avamb/bin/vamb", line 33, in
sys.exit(load_entry_point('vamb', 'console_scripts', 'vamb')())
File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 1395, in main
run(
File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 834, in run
cluster(
File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 665, in cluster
clusternumber, ncontigs = vamb.vambtools.write_clusters(
File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 440, in write_clusters
for clustername, contigs in clusters:
File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 701, in binsplit
for newbinname, splitheaders in _split_bin(binname, headers, separator):
File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 676, in _split_bin
raise KeyError(f"Separator '{separator}' not in sequence label: '{header}'")
KeyError: "Separator 'C' not in sequence label: 'k141_84347'"

But, the reuslt contain ‘k141_84347 ’ :
‘ less contignames |grep "k141_84347" -A2 -B2 ' --> 'k141_512747
k141_170723
k141_84347
k141_170724
k141_512748'

the vamb operation result file contain :
'0 Oct 9 23:52 vae_clusters.tsv # why the file is empty?
7.7M Oct 9 23:52 contignames
2.6M Oct 9 23:52 lengths.npz
41K Oct 9 23:52 log.txt
77M Oct 9 23:52 latent.npz
815K Oct 9 23:51 model.pt
894 Oct 9 14:40 mask.npz
2.3M Oct 9 14:40 abundance.npz
252M Oct 9 14:38 composition.npz'

Thanks!

@joacjo
Copy link
Collaborator

joacjo commented Oct 10, 2023

Hi Chao

I see, this will not fly. It looks like there is an issue in the steps where you run vamb, your contignames has to be named with a strict format that makes you able to identify the original sample origin of each sample.
When that works and you get a non zero vae_clusters.tsv file, the downstream step using phamb should be straight forward.

Maybe @enryH - Can help you identify what went wrong.

@enryH
Copy link
Member

enryH commented Oct 10, 2023

KeyError: "Separator 'C' not in sequence label: 'k141_84347'"

So what is your separator? I guess you know VAMB better than me:)

@ChaoXianSen
Copy link
Author

the file R63.contigs.fa.gz format like this:

k141_514373 flag=1 multi=3.0000 len=356
CCATAAATCTGATTTTAGTCAAAAAAATATGCAGTTTTTCAAAAAGGGTGTATAATTCTTTCGTTACATGAAATATTTTGGAGGTGCTATTTTTATGAAAA
k141_0 flag=1 multi=2.0000 len=345
AGGTGAAGATGACCGAAGAGGAGATTAAGGCCCGTGAGTTTGCCAAAGCGGCGCAGAAGGAGAAGGAGGACCGTGAGGCCAAGAAAGCGCTGG

@joacjo
Copy link
Collaborator

joacjo commented Oct 24, 2023

Hi Chao - did you manage to run VAMB as described in the example on the VAMB Github repo? I think that will solve many of your issues posted here. I hope that you are trying to process more than just one sample of course.

  1. Concatenate assemblies from different samples
python concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta /path/to/assemblies/sample2/contigs.fasta  [ ... ]
  1. Do the mapping
minimap2 -d catalogue.mmi /path/to/catalogue.fna.gz; # make index
minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz | samtools view -F 3584 -b --threads 8 > /path/to/bam/sample1.bam

  1. run VAMB
vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C

It will be hard to help if the naming of the contigs in the fasta file and in the clusters.tsv file are not correctly formatted.

@ChaoXianSen
Copy link
Author

Hi Chao - did you manage to run VAMB as described in the example on the VAMB Github repo? I think that will solve many of your issues posted here. I hope that you are trying to process more than just one sample of course.

  1. Concatenate assemblies from different samples
python concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta /path/to/assemblies/sample2/contigs.fasta  [ ... ]
  1. Do the mapping
minimap2 -d catalogue.mmi /path/to/catalogue.fna.gz; # make index
minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz | samtools view -F 3584 -b --threads 8 > /path/to/bam/sample1.bam
  1. run VAMB
vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C

It will be hard to help if the naming of the contigs in the fasta file and in the clusters.tsv file are not correctly formatted.

Dear@joacjo
Thanks for your reply! I have solved this problem.
But, I have another question:
Run the RF model, the result file 'resultdir/vamb_bins/vamb_bins.1.fna' , how to cluster to get the vOTUs ?
which tools and databse can we use to get the vOTUs ?

Thanks a lot !

@joacjo
Copy link
Collaborator

joacjo commented Oct 25, 2023

Hi Chao

The most straight forward to way to get vOTUs is following the approach described by the CheckV people

Copied from https://bitbucket.org/berkeleylab/checkv/src/master/

Rapid genome clustering based on pairwise ANI

First, create a blast+ database:
makeblastdb -in <my_seqs.fna> -dbtype nucl -out <my_db>

Next, use megablast from blast+ package to perform all-vs-all blastn of sequences:
blastn -query <my_seqs.fna> -db <my_db> -outfmt '6 std qlen slen' -max_target_seqs 10000 -o <my_blast.tsv> -num_threads 32

Next, calculate pairwise ANI by combining local alignments between sequence pairs:
anicalc.py -i <my_blast.tsv> -o <my_ani.tsv>

Finally, perform UCLUST-like clustering using the MIUVIG recommended-parameters (95% ANI + 85% AF):
aniclust.py --fna <my_seqs.fna> --ani <my_ani.tsv> --out <my_clusters.tsv> --min_ani 95 --min_tcov 85 --min_qcov 0

Relevant ani-scripts can be found the checkv repo.

@ChaoXianSen
Copy link
Author

Dear@joacjo
Thanks a lot ! Thanks for your reminding. I'll go over the instructions of checkV in detail.

But i have another problem:
' hmmsearch --cpu 16
-E 1.0e-05
-o R1_output.txt
--tblout ./R1_annotations/all.hmmVOG.tbl
AllVOG.hmm ./R1_proteins.faa '

This step runs very slowly ,is there anything I can do to speed up ?

I provid 16CPUs and 240G memeoy. ( #SBATCH -n 16 #SBATCH --mem=240G )
the file 'AllVOG.hmm' is about 3.2G, file 'R1_proteins.faa' is about 630M;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants