A Snakemake workflow for quick quality control of Illumina MiSeq paired-end data before sequencing on a high-throughput sequencing device (e.g. NextSeq).
Snakemake is best installed via the Mamba package manager (a drop-in replacement for Conda). If you have neither Conda nor Mamba, Mamba can be installed via Mambaforge. For other options see here.
Given that Mamba is installed, run
mamba create -c conda-forge -c bioconda --name snakemake snakemake
to install Snakemake in an isolated environment. For all following commands ensure that this environment is activated via
conda activate snakemake
First, create an appropriate project working directory on your system and enter it:
WORKDIR=path/to/project_workdir
mkdir -p "${WORKDIR}"
cd "${WORKDIR}"
In all following steps, we will assume that you are inside of that directory.
Second, clone the full workflow by running:
git clone https://github.com/josefawelling/QC_pre_NextSeq.git
To configure this workflow, modify config/config.yaml according to your needs, following the explanations provided in the file. It is especially recommended to provide the correct adapter sequences, so they can be trimmed appropriately.
The sample sheet contains all samples to be analyzed.
You can choose to automatically create a sample sheet containing all samples in a specified directory (configured in config/config.yaml). Only fastq.gz files are taken into account. Additionally, the sequencer's output FASTQ files can be renamed during this step, e.g. from sampleID_S40_L001_R1_001.fastq.gz to sampleID_R1.fastq.gz.
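The renaming pattern from the example above can be illustrated with a small shell sketch. It is not the workflow's own implementation: `short_name` is a hypothetical helper, and the sketch assumes file names of the form `<sampleID>_S<number>_L<lane>_R<1|2>_001.fastq.gz`.

```shell
# Hypothetical helper (not part of the workflow): derive the shortened
# file name from an Illumina-style FASTQ file name.
# Assumed input pattern: <sampleID>_S<number>_L<lane>_R<1|2>_001.fastq.gz
short_name() {
    printf '%s\n' "$1" \
        | sed -E 's/_S[0-9]+_L[0-9]+_(R[12])_001\.fastq\.gz$/_\1.fastq.gz/'
}

# Print the shortened names for one read pair.
short_name sampleID_S40_L001_R1_001.fastq.gz
short_name sampleID_S40_L001_R2_001.fastq.gz
```

File names that do not match the assumed pattern are printed unchanged, so the helper only previews names and does not rename anything on disk.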
To create the sample sheet and provide it for the workflow, run:
snakemake --cores all --use-conda create_sample_sheet
Samples to be analyzed can also be added manually to the sample sheet.
For each sample, a new line with the following content has to be added to config/pep/samples.csv:
- sample_name: name or identifier of sample
- fq1: path to read 1 in gzip FASTQ format
- fq2: path to read 2 in gzip FASTQ format
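As a rough illustration of the three fields listed above, the following sketch writes a small sample sheet to a scratch file and checks that each data row has exactly three comma-separated fields. The sample names and paths are made up, and the header row naming the columns is an assumption; compare with the samples.csv shipped in the repository before editing the real file.

```shell
# Write an illustrative sample sheet to a scratch file.
# Sample names and paths are hypothetical; the header row is an assumption.
# In the workflow itself, the sheet lives at config/pep/samples.csv.
sheet="$(mktemp)"
cat > "$sheet" <<'EOF'
sample_name,fq1,fq2
sampleA,data/sampleA_R1.fastq.gz,data/sampleA_R2.fastq.gz
sampleB,data/sampleB_R1.fastq.gz,data/sampleB_R2.fastq.gz
EOF

# Report any data row that does not have exactly three fields;
# awk exits non-zero if at least one such row is found.
awk -F',' 'NR > 1 && NF != 3 { bad = 1; print "line " NR ": expected 3 fields" }
           END { exit bad + 0 }' "$sheet"
```

The check is deliberately simple: it does not handle quoted fields containing commas, and it does not verify that the FASTQ files exist.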
Given that the workflow has been properly deployed and configured, it can be executed as follows.
To run the workflow while deploying any necessary software via Conda (using the Mamba package manager by default), run Snakemake with
snakemake --cores all --use-conda
Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow.
Note: By adding --dry-run (or -n) to the Snakemake command, you can see which steps would be executed without actually running them.
The usage of this workflow is described in the Snakemake Workflow Catalog.
%%{init: {
'theme':'base',
'themeVariables': {
'secondaryColor': '#fff',
'tertiaryColor': '#fff',
'tertiaryBorderColor' : '#fff'}
}}%%
flowchart TB;
subgraph " "
direction TB
%% Nodes
A[/Illumina paired end reads/]
B["Trimming and Filtering <br> fastp"]
C["Quality control <br> fastQC"]
D[/MultiQC report/]
E["Taxonomy Assignment <br> Kraken 2"]
F["Abundance Analysis <br> Bracken"]
G["Mapping against human genome <br> minimap2"]
H[/Snakemake report/]
%% input & output node design
classDef in_output fill:#fff,stroke:#cde498,stroke-width:4px
class A,D,H in_output
%% rule node design
classDef rule fill:#cde498,stroke:#000
class B,C,E,F,G rule
%% Node links
A --> B
B --> C
B ---- D
C ---- D
B --> E
E ---> F
B --> G
D --- H
G --- H
F --- H
end
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and its DOI (see above).