Young edited this page Jul 3, 2024 · 8 revisions

Welcome to the Cecret wiki!

Consensus Extraction and Contig Reconstruction using Enriched libraries against a Template (CECRET)

---
Cecret
---
flowchart LR
fastq --> cleaning
cleaning --> alignment
reference --> alignment
alignment --> A[primer trimming]
B[primer schema] --> A
A --> consensus

This workflow is intended for amplicon-based NGS libraries aligned to a chosen reference. There are options to skip primer removal, but there are no options to skip alignment to a reference.

Several references and primer schemes are supplied with this workflow; they are listed in their corresponding subspecies workflow. More can be added if the reference is small. Please submit an issue to let us know what else we should include.

The primer scheme and reference fasta file may also be supplied by the end user.

Introduction

Named after the beautiful Cecret lake

Location: 40.570°N 111.622°W, Elevation: 9,875 feet (3,010 m), Hiking level: easy

(Image credit: Intermountain Healthcare)

Cecret was originally developed by @erinyoung at the Utah Public Health Laboratory for SARS-CoV-2 sequencing with the artic/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. This nextflow workflow, however, is flexible for many additional organisms and primer schemes as long as the reference genome is "small" and "good enough." In 2022, @tives82 added in contributions for Monkeypox virus, including converting IDT's primer scheme to NC_063383.1 coordinates. We are grateful to everyone who has contributed to this repo.

The library preparation method greatly impacts which bioinformatic tools are recommended for creating a consensus sequence. For example, amplicon-based library preparation methods will require primer trimming and an elevated minimum depth for base-calling. Some bait-derived library preparation methods have a PCR amplification step, and PCR duplicates will need to be removed. This has added complexity and several (admittedly confusing) options to this workflow. Please submit an issue if/when you run into problems.
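To make the "minimum depth for base-calling" idea concrete, here is a minimal Python sketch of a majority-rule consensus that masks low-coverage positions with `N`. It is a toy illustration only, not how ivar or any other tool in this workflow is implemented; the function name `call_consensus`, the `pileup` input format, and the `min_depth=10` threshold are all assumptions made for the example.

```python
from collections import Counter


def call_consensus(pileup, min_depth=10):
    """Return a consensus string from per-position base observations.

    pileup: list of strings, one per reference position, each holding the
            bases observed at that position (hypothetical input format).
    min_depth: illustrative threshold; positions covered by fewer
            observations are masked with "N".
    """
    consensus = []
    for bases in pileup:
        if len(bases) < min_depth:
            # Too few reads to trust a call at this position.
            consensus.append("N")
            continue
        # Majority rule: take the most frequently observed base.
        base, _count = Counter(bases).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three positions: depth 12, depth 3 (masked), depth 12 with a minor allele.
pileup = ["A" * 12, "C" * 3, "G" * 9 + "T" * 3]
consensus = call_consensus(pileup, min_depth=10)
print(consensus)  # ANG
```

Raising the threshold trades completeness for confidence: more positions become `N`, but the remaining calls rest on more reads.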

It is possible to use this workflow to simply annotate fastas generated from any workflow or downloaded from GISAID or NCBI. There are also options for multiple sequence alignment (MSA) and phylogenetic tree creation from the fasta files.

Cecret is also part of the staphb-toolkit.

Dependencies

General Usage

The default usage of Cecret is to run on fastq files for SARS-CoV-2 sequencing.

nextflow run UPHL-BioNGS/Cecret -profile singularity --reads reads 
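If you launch runs from a script rather than by hand, the command above can be assembled programmatically. The sketch below is one possible way to do that in Python; the helper name `build_cecret_command` is hypothetical, and it only uses the `-profile` and `--reads` options shown in the command above.

```python
import shlex


def build_cecret_command(reads_dir, profile="singularity"):
    """Build the argument list for the default Cecret run (hypothetical helper)."""
    return [
        "nextflow", "run", "UPHL-BioNGS/Cecret",
        "-profile", profile,
        "--reads", str(reads_dir),
    ]


cmd = build_cecret_command("reads")
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
# To actually launch it (requires nextflow on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as one string, avoids shell-quoting problems when the reads directory contains spaces.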

There are, however, a lot of ways this workflow can be adjusted. Cecret does include 100+ parameters after all. There are also only so many words a typical end user is willing to read to understand how to adjust these parameters for their use case. We've divided this wiki into sections of reading that we think a typical user will be able to absorb, but please create an issue if something is unclear.

Typical use-cases with wiki pages:

A complete list of all params with their default values can be found in [Cecret/nextflow_schema.json](https://github.com/UPHL-BioNGS/Cecret/blob/master/nextflow_schema.json).
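For a quick overview of the schema without reading the raw JSON, something like the following Python sketch could print each parameter next to its default. It assumes the file follows the common Nextflow schema layout, with parameter groups under `definitions` (or `$defs`) and each group holding a `properties` object; that layout is an assumption, so check it against the actual file. The sample schema and its parameter names (`example_flag`, `example_path`) are made up for illustration and are not real Cecret params.

```python
import json


def list_param_defaults(schema):
    """Collect {parameter: default} from a Nextflow-style JSON schema dict.

    Assumes groups live under "definitions" or "$defs", each with a
    "properties" object; top-level "properties" are included as well.
    Parameters without a "default" key map to None.
    """
    defaults = {}
    groups = [schema]
    for key in ("definitions", "$defs"):
        groups.extend(schema.get(key, {}).values())
    for group in groups:
        for name, spec in group.get("properties", {}).items():
            defaults[name] = spec.get("default")
    return defaults


# Made-up schema fragment; the parameter names are hypothetical.
sample_schema = json.loads("""
{
  "definitions": {
    "example_group": {
      "properties": {
        "example_flag": {"type": "boolean", "default": false},
        "example_path": {"type": "string"}
      }
    }
  }
}
""")

param_defaults = list_param_defaults(sample_schema)
for name, default in sorted(param_defaults.items()):
    print(f"{name} = {default!r}")
```

To run it on the real file, replace `sample_schema` with the result of `json.load()` on a local copy of `nextflow_schema.json`.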

Cecret is a nextflow workflow that strings together a variety of tools, and would not be possible without them.

  • aci - for depth estimation over amplicons
  • artic network - for aligning and consensus creation of nanopore reads
  • bwa - for aligning reads to the reference
  • fastp - for cleaning reads; optional, faster alternative to seqyclean
  • fastqc - for QC metrics
  • freyja - for multiple SARS-CoV-2 lineage classifications
  • heatcluster - for visualization of a SNP matrix
  • igv-reports - for creating igv-reports for each suspected variant
  • iqtree2 - for phylogenetic tree generation (optional, relatedness must be set to "true")
  • ivar - for calling variants and creating a consensus fasta; optional primer trimmer
  • kraken2 - for read classification
  • mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
  • minimap2 - an alternative to bwa
  • multiqc - summary of results
  • nextalign - for multiple sequence alignment (optional, relatedness must be set to "true", and msa must be set to "nextalign")
  • nextclade - for SARS-CoV-2 clade classification
  • pango-aliasor - to identify parent pangolin lineages
  • pangolin - for SARS-CoV-2 lineage classification
  • pangolincollapse - to identify parent pangolin lineages
  • phytreeviz - for visualization of the phylogenetic tree
  • samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files; optional duplication marking
  • seqyclean - for cleaning reads
  • snp-dists - for relatedness determination (optional, relatedness must be set to "true")
  • vadr - for annotating fastas like NCBI
  • viridian - for primer detection and trimming
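
As a rough picture of what "relatedness determination" means in the list above, the Python sketch below counts pairwise SNP differences between sequences taken from a multiple sequence alignment. It is a toy version and not the implementation of snp-dists; the function name `snp_distance`, the sample names, and the choice to skip any position where either sequence is not A, C, G, or T are assumptions made for the example.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a non-ACGT character (N, gap, ...)
    are skipped; that is a simplification chosen for this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


# Made-up aligned sequences standing in for an MSA.
alignment = {
    "sample1": "ACGTACGT",
    "sample2": "ACGTTCGT",
    "sample3": "ACNTTCGA",
}

# Pairwise distance for every pair of samples.
matrix = {
    (x, y): snp_distance(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}
for (x, y), dist in matrix.items():
    print(x, y, dist)
```

A matrix like this is the kind of input that a heatmap or clustering step would then visualize.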