Genotyping MUC1 coding-VNTR in ADTKD-MUC1 using short-read sequencing (SRS) data. The VNtyper pipeline embeds two different variant calling algorithms:
- Mapping-free genotyping using k-mer frequencies (Kestrel)
- Profile-HMM based method (code-adVNTR)
The tool can be downloaded by cloning the repository from its GitHub page:
# Make a directory in which to download VNtyper and move into it
mkdir vntyper
cd vntyper
git clone https://github.com/hassansaei/VNtyper.git
# Go to the directory containing the downloaded source code
cd VNtyper
The following command will automatically download and install all prerequisites:
chmod u+x install_prerequisites.sh
bash install_prerequisites.sh   # or: ./install_prerequisites.sh
The requirements are as follows:
- Python >= 3.9 and the following libraries:
  - pandas
    pip3 install pandas
  - numpy
    pip3 install numpy
  - regex
    pip3 install regex
  - biopython
    pip3 install biopython
  - setuptools==58
    pip3 install setuptools==58
  - pysam
    pip3 install pysam
- Install (BWA)
- Download the chr1.fa file from the (UCSC genome browser)
- Index the FASTA file with BWA
- Install Singularity
- Download (Kestrel)
- Build the Singularity image for code-adVNTR
- Download the (VNTR database) for code-adVNTR
- The MUC1 VNTR motif dictionary and index files are provided in the Files directory
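Before running the pipeline it can help to confirm that the requirements listed above are actually present. The following is a minimal sketch, not part of VNtyper: the function name `check_prerequisites` is hypothetical, `Bio` is the import name of biopython, and the executable names `bwa` and `singularity` are assumed to be how those tools appear on your PATH.

```python
import importlib.util
import shutil
import sys

# Names taken from the requirements list; "Bio" is the import name of biopython.
REQUIRED_MODULES = ["pandas", "numpy", "regex", "Bio", "setuptools", "pysam"]
# Assumed executable names for BWA and Singularity.
REQUIRED_TOOLS = ["bwa", "singularity"]


def check_prerequisites(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 9):
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        problems.append(f"Python >= 3.9 is required, found {found}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python module: {name}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"executable not found on PATH: {tool}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

The script prints one line per missing item and nothing when all checks pass. It does not verify that chr1.fa has been indexed or that the Kestrel and code-adVNTR files are in place.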
A Docker image is also provided and can be pulled from Docker Hub. You have to make a directory on the host machine to store both your inputs and outputs. Instructions for installing Docker on Linux can be found (here).
mkdir shared
sudo docker pull saei/vntyper:1.2.0
The image files can also be downloaded and loaded via:
sudo docker load -i Docker_VNtyper_1.2.0.tar
# Or use the command below:
cat Docker_vntyper_v1.2.0.tar | docker import - vntyper:1.2.0
Run Docker with only the k-mer method:
sudo docker run --rm -it -v /PATH_TO_SHARED_DIRECTORY/shared:/SOFT/shared saei/vntyper:1.2.0 \
-t 8 --bam -p /SOFT/VNtyper/ -ref /SOFT/VNtyper/Files/chr1.fa \
-ref_VNTR /SOFT/VNtyper/Files/MUC1-VNTR.fa \
-a /SOFT/shared/SAMPLE.bam -w /SOFT/shared/ -o SAMPLE_NAME --ignore_advntr
Run Docker with both methods:
sudo docker run --rm -it -v /PATH_TO_SHARED_DIRECTORY/shared:/SOFT/shared saei/vntyper:1.2.0 \
-t 8 --bam -p /SOFT/VNtyper/ -ref /SOFT/VNtyper/Files/chr1.fa \
-ref_VNTR /SOFT/VNtyper/Files/MUC1-VNTR.fa -m /SOFT/VNtyper/Files/hg19_genic_VNTRs.db \
-a /SOFT/shared/SAMPLE.bam -w /SOFT/shared/ -o SAMPLE_NAME
Use the following command to see the help for running the tool:
python3 VNtyper.py --help
usage: VNtyper_FV.py [-h] -ref Referense [-r1 FASTQ1] [-r2 FASTQ2] -o OUTPUT -ref_VNTR Referense [-t THREADS] -p TOOLS_PATH -w WORKING_DIR [-m REFERENCE_VNTR]
[--ignore_advntr] [--bam] [--fastq] [-a ALIGNMENT]
Given raw fastq files, this pipeline genotype MUC1-VNTR using kestrel (Mapping-free genotyping) and Code-adVNTR mathods
optional arguments:
-h, --help show this help message and exit
-ref Referense, --reference_file Referense
FASTA-formatted reference file and indexes
-r1 FASTQ1, --fastq1 FASTQ1
Fastq file first pair
-r2 FASTQ2, --fastq2 FASTQ2
Fastq file second pair
-o OUTPUT, --output OUTPUT
Output file name
-ref_VNTR Referense, --reference_VNTR Referense
MUC1-specific reference file
-t THREADS, --threads THREADS
Number of threads (CPU)
-p TOOLS_PATH, --tools_path TOOLS_PATH
Path to the VNtyper directory
-w WORKING_DIR, --working_dir WORKING_DIR
the path to the output
-m REFERENCE_VNTR, --reference_vntr REFERENCE_VNTR
adVNTR reference vntr database
--ignore_advntr Skip adVNTR genotyping of MUC1-VNTR
--bam BAM file as an input
--fastq Paired-end fastq files as an input
-a ALIGNMENT, --alignment ALIGNMENT
Alignment File (with an index file .bai)
[Note] Since the program uses the Python 3.9 logging system, it cannot be executed with earlier versions of Python.
Running only kmer-based genotyping:
python3 VNtyper.py --bam -ref Files/chr1.fa -a SAMPLE.bam -o SAMPLE_NAME -ref_VNTR Files/MUC1-VNTR.fa -t Threads -p VNtyper/ -w WORKING_DIRECTORY --ignore_advntr
[Note] This algorithm is much faster than the second method.
Running both genotyping methods:
python3 VNtyper.py --bam -ref Files/chr1.fa -a SAMPLE.bam -o SAMPLE_NAME -ref_VNTR Files/MUC1-VNTR.fa -t Threads -p VNtyper/ -w WORKING_DIRECTORY -m Files/vntr_data/hg19_genic_VNTRs.db
[Note] The second method (code-adVNTR) is much slower than the first.
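The two invocations above differ only in whether `--ignore_advntr` or `-m` is passed. As a rough sketch of running them over a folder of BAM files, the helper below assembles the argument list using only flags shown in the help text. The names `build_command` and `run_all` are hypothetical, the sample name is assumed to be the BAM file name without its extension, and `Files/chr1.fa` and `Files/MUC1-VNTR.fa` are assumed to live under the VNtyper directory as in the examples.

```python
import subprocess
from pathlib import Path


def build_command(bam, tools_path, working_dir, threads=8, advntr_db=None):
    """Build the VNtyper.py argument list for one BAM file.

    If advntr_db is None, only the k-mer (Kestrel) method is run;
    otherwise the adVNTR database path is passed with -m.
    """
    tools_path = Path(tools_path)
    cmd = [
        "python3", str(tools_path / "VNtyper.py"),
        "--bam",
        "-ref", str(tools_path / "Files" / "chr1.fa"),
        "-ref_VNTR", str(tools_path / "Files" / "MUC1-VNTR.fa"),
        "-a", str(bam),
        "-o", Path(bam).stem,          # assumption: sample name = file name without .bam
        "-t", str(threads),
        "-p", str(tools_path) + "/",   # trailing slash mirrors the examples above
        "-w", str(working_dir),
    ]
    if advntr_db is None:
        cmd.append("--ignore_advntr")
    else:
        cmd += ["-m", str(advntr_db)]
    return cmd


def run_all(bam_dir, tools_path, working_dir, **kwargs):
    """Run VNtyper on every *.bam file in bam_dir, stopping at the first failure."""
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        subprocess.run(build_command(bam, tools_path, working_dir, **kwargs), check=True)
```

For example, `run_all("bams", "VNtyper", "results")` would run the k-mer method only, while adding `advntr_db="VNtyper/Files/vntr_data/hg19_genic_VNTRs.db"` would run both methods.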
We analyzed the MUC1 region in 2300 samples from the 1000G 30X project. The results of this analysis can be found (here).
To compute the mean depth over the MUC1 region for each BAM file:
for f in *.bam; do samtools depth -b MUC_hg19.bed "$f" | awk '{sum+=$3} END { print sum/NR}' > "$f.coverage"; done
The BED file is provided. The -b MUC_hg19.bed option can also be replaced by: -r chr1:155160500-155162000
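The awk one-liner above averages the third column (depth) of the `samtools depth` output over the reported positions. A Python equivalent might look like the sketch below; the function names `mean_depth` and `bam_mean_depth` are hypothetical, and `samtools` is assumed to be on the PATH. Unlike the awk version, it returns 0.0 instead of failing when no positions are reported.

```python
import subprocess


def mean_depth(depth_lines):
    """Mean of the depth column (third tab-separated field) of `samtools depth` lines."""
    total = 0
    count = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += int(fields[2])
        count += 1
    return total / count if count else 0.0


def bam_mean_depth(bam, region="chr1:155160500-155162000"):
    """Run `samtools depth -r REGION` on one BAM file and return the mean depth."""
    result = subprocess.run(
        ["samtools", "depth", "-r", region, str(bam)],
        check=True, capture_output=True, text=True,
    )
    return mean_depth(result.stdout.splitlines())
```

Note that `samtools depth` omits zero-coverage positions unless `-a` is given, so both this sketch and the one-liner average over covered positions only.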
We provide five MUC1 8C-positive BAM files (example_1.bam to example_5.bam) for evaluation. Link to BAM files: (Bam). Examples 1 to 3 are from the NTI cohort, and examples 4 and 5 are from the renome cohort.
The tool creates a folder for each case in the working directory specified by the user. Inside the folder there is a directory for temporary files and log files, and the final output:
- Temp folder: Fastp QC report (.html) and log file for VNtyper
- The output of VNtyper '*_Final_result.tsv'
The Kestrel output is a VCF file, which is processed by VNtyper; the final result is stored in *_Final_result.tsv. The result file contains information on the motifs, variant types, position of the variant, and its corresponding depth. The output of code-adVNTR is a BED or VCF file with variant information and P-value.
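Because each case gets its own folder, collecting results across samples means finding every `*_Final_result.tsv` under the working directory. A minimal sketch follows; the function name `collect_results` is hypothetical, and it assumes the first line of each TSV is a header row. No column names are assumed.

```python
import csv
from pathlib import Path


def collect_results(working_dir):
    """Map each *_Final_result.tsv path under working_dir to its rows.

    Rows are dictionaries keyed by the header line of the file
    (assumption: the first line is a header).
    """
    results = {}
    for tsv in sorted(Path(working_dir).rglob("*_Final_result.tsv")):
        with tsv.open(newline="") as handle:
            results[tsv] = list(csv.DictReader(handle, delimiter="\t"))
    return results
```

Calling `collect_results("WORKING_DIRECTORY")` returns one entry per sample, which can then be inspected or merged as needed.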
##NOTE: This tool is for research use only.
##NOTE: When genotyping the MUC1 VNTR from WES data, clinically boosted WES data should be used.
- Saei H, Morinière V, Heidet L, Gribouval O, Lebbah S, Tores F, Mautret-Godefroy M, Knebelmann B, Burtey S, Vuiblet V, Antignac C, Nitschké P, Dorval G. VNtyper enables accurate alignment-free genotyping of MUC1 coding VNTR using short-read sequencing data in autosomal dominant tubulointerstitial kidney disease. iScience. 2023 Jun 17;26(7):107171. doi: 10.1016/j.isci.2023.107171. PMID: 37456840; PMCID: PMC10338300.
- Audano PA, Ravishankar S, Vannberg FO. Mapping-free variant calling using haplotype reconstruction from k-mer frequencies. Bioinformatics. 2018 May;34(10):1659-1665. doi: 10.1093/bioinformatics/btx753.
- Park J, Bakhtiari M, Popp B, Wiesener M, Bafna V. Detecting tandem repeat variants in coding regions using code-adVNTR. iScience. 2022 Jul 19;25(8):104785. doi: 10.1016/j.isci.2022.104785. PMID: 35982790; PMCID: PMC9379575.