Skip to content

Workflow for short quality control of MiSeq sequencing data before sequencing on high throughput devices (e.g. NextSeq)

License

Notifications You must be signed in to change notification settings

IKIM-Essen/QC_pre_NextSeq

Repository files navigation

QC for NextSeq

Snakemake GitHub actions status

A Snakemake workflow for quick quality control of Illumina MiSeq paired end data before sequencing on high throughput sequencing device (e.g. NextSeq).

Usage

Step 1: Install Snakemake

Snakemake is best installed via the Mamba package manager (a drop-in replacement for conda). If you have neither Conda nor Mamba, it can be installed via Mambaforge. For other options see here.

Given that Mamba is installed, run

    mamba create -c conda-forge -c bioconda --name snakemake snakemake

to install Snakemake in an isolated environment. For all following commands ensure that this environment is activated via

    conda activate snakemake

Step 2: Clone workflow

First, create an appropriate project working directory on your system and enter it:

    WORKDIR=path/to/project_workdir
    mkdir -p ${WORKDIR}
    cd ${WORKDIR}

In all following steps, we will assume that you are inside of that directory.

Second, to clone the full workflow run:

    git clone https://github.com/josefawelling/QC_pre_NextSeq.git

Step 3: Configure workflow

Config file

To configure this workflow, modify config/config.yaml according to your needs, following the explanations provided in the file. It is especially recommended to provide the correct adapter sequences, so they can be trimmed appropriately.

Sample sheet

The sample sheet contains all samples to be analyzed.

Auto creation

You can choose to automatically create a sample sheet with all samples in a specified directory (modifications in config/config.yaml). Only fastq.gz files are taken into account. Additionally there is the option to rename the sequencers output FASTQ files during this step, e.g. from sampleID_S40_L001_R1_001.fastq.gz to sampleID_R1.fastq.gz.
To create the sample sheet and provide it for the workflow, run:

    snakemake --cores all --use-conda create_sample_sheet

Manual creation or editing

Samples to be analyzed can also be added manually to the sample sheet. For each sample, a new line in config/pep/samples.csv with the following content has to be defined:

  • sample_name: name or identifier of sample
  • fq1: path to read 1 in gzip FASTQ format
  • fq2: path to read 2 in gzip FASTQ format

Step 4: Run workflow

Given that the workflow has been properly deployed and configured, it can be executed as follows.

Fow running the workflow while deploying any necessary software via conda (using the Mamba package manager by default), run Snakemake with

    snakemake --cores all --use-conda

Snakemake will automatically detect the main Snakefile in the workflow subfolder and execute the workflow.

Note: By adding --dry-run or (-n) to the Snakemake command, you can see which steps shall be executed without actually running them.

The usage of this workflow is described in the Snakemake Workflow Catalog.

Workflow Overview

%%{init: {
   'theme':'base',
   'themeVariables': {
      'secondaryColor': '#fff',
      'tertiaryColor': '#fff',
      'tertiaryBorderColor' : '#fff'}
   }}%%

flowchart TB;

   subgraph " "
      direction TB

      %% Nodes
      A[/Illumina paired end reads/]
      B["Trimming and Filtering <br> fastp"]
      C["Quality control <br> fastQC"]
      D[/MultiQC report/]
      E["Taxonomy Assignment <br> Kraken 2"]
      F["Abundance Analysis <br> Bracken"]
      G["Mapping against human genome <br> minimap2"]
      H[/Snakemake report/]
      

      %% input & output node design
      classDef in_output fill:#fff,stroke:#cde498,stroke-width:4px
      class A,D,H in_output
      %% rule node design
      classDef rule fill:#cde498,stroke:#000
      class B,C,E,F,G rule

      %% Node links
      A --> B
      B --> C
      B ---- D
      C ---- D
      B --> E
      E ---> F
      B --> G
      D --- H
      G --- H
      F --- H

   end

Loading

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and its DOI (see above).

About

Workflow for short quality control of MiSeq sequencing data before sequencing on high throughput devices (e.g. NextSeq)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published