FASTQ to BAM to VCF. A bioinformatics staple.
This workflow for sequence alignment and variant calling is implemented with Snakemake for ease-of-use and reproducibility. Handles technical and biological replicates following the rna-seq-star-deseq2 workflow example.
If this is your first time using a Snakemake workflow, you may find it helpful to run our tutorial first.
If you are using a work or lab server, ask your sysadmin if git and conda are installed already. If so, skip to STEP 2.
To download miniconda:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
The installer will ask you some questions to complete installation. Review and accept the license, accept or change home location, and answer yes to placing it in your path.
To finish configuring miniconda:
source $HOME/.bashrc
To install git:
conda install git
In the terminal, navigate to your preferred location and clone this repository.
git clone https://github.com/kramundson/staples.git
cd staples
Note: On some server setups, conda may crash if your home folder is not writable. These commands will fix this by telling conda to store the environment in the current folder.
# Recommended: prevent conda from crashing if home folder is not writable
conda config --add envs_dirs ./.conda/envs
conda config --add pkgs_dirs ./.conda/pkgs
When you build the conda environment, Conda obtains all the software listed in environment.yaml
.
# Build staples conda environment
conda env create -f environment.yaml
Finally, you will need to activate the environment.
source activate staples
You only need to build the environment once. However, you'll need to activate the environment each time you log in. To deactivate the environment, use the command conda deactivate
.
You'll need to place the reference genome assembly for your organism in data/genome/
.
- If you are downloading the assembly from the Internet, you can use
wget
to download it directly into the folder.
You will also need to place the reads into the repository.
- If you have
.fastq
or.fastq.gz
files already, you may put them directly indata/reads/
. - If you are downloading from NCBI Sequence Read Archive, you may place the
.sra
files indata/sra/
and the workflow will convert them to.fastq.gz
.
The config file config.yaml
allows you to customize the workflow to your needs. An example is included; you will need to edit the config file to run your own data.
The purpose of units.tsv
is to associate sample information with each of your fastq files. It is a tab-delimited file with four required columns:
sample
: biological sampleunit
: unique combination of biolgical sample, library and sequencing runfq1
: forward reads fastq filefq2
: reverse reads fastq file
An example units.tsv
file is included in the workflow.
To run the workflow with 24 cores and save the printed output to run_1.out
:
snakemake --cores 24 > run_1.out 2>&1