FAQ
TELL US ABOUT IT!!!
- Open a GitHub issue
- Message someone from UPHL on Slack
Be sure to include the command that was used, what config file was used, and what the nextflow error was.
The MultiQC report aggregates data across your samples into one file. Open the 'cecret/multiqc/multiqc_report.html' file with your preferred browser. Tables and graphs are generated there for 'General Statistics', 'Samtools stats', 'Samtools flagstats', 'FastQC', 'iVar', 'SeqyClean', 'Fastp', 'Pangolin', and 'Kraken2'.
In the history of this repository, there actually was an attempt to store fastq files here that the End User could use to test out this workflow. This made the repository very large and difficult to download.
There are several test profiles. These download fastq files from the ENA to use in the workflow. This does not always work due to local internet connectivity issues, but may work fine for everyone else.
nextflow run UPHL-BioNGS/Cecret -profile {docker or singularity},test
Another great resource is SARS-CoV-2 datasets, an effort of the CDC to provide a benchmark dataset for validating bioinformatic workflows. Fastq files from the nonvoivoc, voivoc, and failed projects were downloaded from the SRA and are run through this workflow locally before each new version is released.
The expected amount of time to run this workflow with 250 G RAM and 48 CPUs, 'params.maxcpus = 8', and 'params.medcpus = 4' is ~42 minutes. This corresponded with 25.8 CPU hours.
# for a collection of fastas
nextflow run UPHL-BioNGS/Cecret -profile singularity --fastas <directory with fastas>
# for a collection of fastas and multifastas
nextflow run UPHL-BioNGS/Cecret -profile singularity --fastas <directory with fastas> --multifastas <directory with multifastas>
The End User can run mafft, snpdists, and iqtree on a collection of fastas as well with
nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --fastas <directory with fastas> --multifastas <directory with multifastas>
Paired-end reads, single-end reads, and fastas can all be combined into one analysis.
nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --fastas <directory with fastas> --multifastas <directory with multifastas> --reads <directory with paired-end reads> --single_reads <directory with single-end reads>
The End User is more than welcome to look at an example here. Just remove the comments for the parameters that need to be adjusted and specify the file with -c.
To get a copy of the config file, the End User can use the following command. This creates edit_me.config in the current directory.
nextflow run UPHL-BioNGS/Cecret --config_file true
At UPHL, our config file is small enough to be put as a profile option, but the text of the config file would be as follows:
singularity.enabled = true
singularity.autoMounts = true
params {
reads = "Sequencing_reads/Raw"
kraken2 = true
kraken2_db = '/Volumes/IDGenomics_NAS/Data/kraken2_db/h+v'
vadr = false
}
And then run with
nextflow run UPHL-BioNGS/Cecret -c <path to custom config file>
There are two ways to get the depth for each amplicon.
cecret/aci has two files: amplicon_depth.csv and amplicon_depth.png. There is a row for each sample in 'amplicon_depth.csv', and a column for each amplicon in the amplicon bedfile. The values are the reads that map only within the region specified in the amplicon bedfile; reads that do not are excluded. A boxplot of these values is shown in amplicon_depth.png.
cecret/samtools_ampliconstats has a file for each sample.
Row number 126 (FDEPTH) has a column for each amplicon (also without a header). To get this row for all of the samples, grep the keyword "FDEPTH" from each sample.
grep "^FDEPTH" cecret/samtools_ampliconstats/* > samtools_ampliconstats_all.tsv
There are corresponding images in cecret/samtools_plot_ampliconstats for each sample.
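When grep is run across several files, each output line is prefixed with the file name. The following is a minimal sketch of one way to get a cleaner table, with the sample name in the first column. It runs on two mock files; the '_ampliconstats.txt' suffix and the shortened file contents are assumptions for illustration, so adjust them to match the real files.

```shell
#!/bin/sh
# Mock input: two tiny files standing in for cecret/samtools_ampliconstats/*.
# Real files have many more rows; only the FDEPTH rows matter here.
workdir=$(mktemp -d)
mkdir -p "$workdir/samtools_ampliconstats"
printf 'FREADS\t100\t200\t300\nFDEPTH\t10.5\t20.0\t30.2\n' \
  > "$workdir/samtools_ampliconstats/sampleA_ampliconstats.txt"
printf 'FREADS\t50\t60\t70\nFDEPTH\t5.0\t6.1\t7.9\n' \
  > "$workdir/samtools_ampliconstats/sampleB_ampliconstats.txt"

# Keep only the FDEPTH rows and replace the leading "FDEPTH" tag
# with a sample name derived from the file name.
awk -F'\t' -v OFS='\t' '
  /^FDEPTH/ {
    sample = FILENAME
    sub(/.*\//, "", sample)                  # drop the directory
    sub(/_ampliconstats\.txt$/, "", sample)  # drop the (assumed) suffix
    $1 = sample
    print
  }' "$workdir"/samtools_ampliconstats/* > "$workdir/fdepth_by_sample.tsv"

cat "$workdir/fdepth_by_sample.tsv"
```

The resulting file has one row per sample and one column per amplicon after the sample name, which is easier to open in a spreadsheet than the raw grep output.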
The primer bedfile is the file with the start and stop of each primer sequence.
$ head configs/artic_V3_nCoV-2019.primer.bed
MN908947.3 30 54 nCoV-2019_1_LEFT nCoV-2019_1 +
MN908947.3 385 410 nCoV-2019_1_RIGHT nCoV-2019_1 -
MN908947.3 320 342 nCoV-2019_2_LEFT nCoV-2019_2 +
MN908947.3 704 726 nCoV-2019_2_RIGHT nCoV-2019_2 -
MN908947.3 642 664 nCoV-2019_3_LEFT nCoV-2019_1 +
MN908947.3 1004 1028 nCoV-2019_3_RIGHT nCoV-2019_1 -
MN908947.3 943 965 nCoV-2019_4_LEFT nCoV-2019_2 +
MN908947.3 1312 1337 nCoV-2019_4_RIGHT nCoV-2019_2 -
MN908947.3 1242 1264 nCoV-2019_5_LEFT nCoV-2019_1 +
MN908947.3 1623 1651 nCoV-2019_5_RIGHT nCoV-2019_1 -
The amplicon bedfile is the file with the start and stop of each intended amplicon.
$ head configs/artic_V3_nCoV-2019.insert.bed
MN908947.3 54 385 1 1 +
MN908947.3 342 704 2 2 +
MN908947.3 664 1004 3 1 +
MN908947.3 965 1312 4 2 +
MN908947.3 1264 1623 5 1 +
MN908947.3 1595 1942 6 2 +
MN908947.3 1897 2242 7 1 +
MN908947.3 2205 2568 8 2 +
MN908947.3 2529 2880 9 1 +
MN908947.3 2850 3183 10 2 +
Due to the many varieties of primer bedfiles, it is best if the End User supplies this file for custom primer sequences.
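In the two excerpts above, each amplicon starts where its LEFT primer ends and stops where its RIGHT primer begins. The sketch below uses that relationship to draft an amplicon bedfile from a primer bedfile. It is only an illustration and not necessarily how the bedfiles in this repository were made. It assumes tab-separated columns, primer names ending in '_<number>_LEFT' or '_<number>_RIGHT', and no alternate primers, so check the result by hand before using it.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# First four primers from the primer bedfile excerpt, written tab-separated.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  MN908947.3 30 54 nCoV-2019_1_LEFT nCoV-2019_1 + \
  MN908947.3 385 410 nCoV-2019_1_RIGHT nCoV-2019_1 - \
  MN908947.3 320 342 nCoV-2019_2_LEFT nCoV-2019_2 + \
  MN908947.3 704 726 nCoV-2019_2_RIGHT nCoV-2019_2 - \
  > "$workdir/primer.bed"

awk -F'\t' -v OFS='\t' '
  {
    n = split($4, part, "_")   # nCoV-2019_1_LEFT -> "nCoV-2019", "1", "LEFT"
    side = part[n]
    amplicon = part[n - 1]
    m = split($5, p, "_")      # nCoV-2019_1 -> pool "1"
    pool = p[m]
  }
  side == "LEFT" {
    chrom[amplicon] = $1
    start[amplicon] = $3       # amplicon starts at the end of the LEFT primer
    pools[amplicon] = pool
    order[++count] = amplicon
  }
  side == "RIGHT" {
    stop[amplicon] = $2        # amplicon stops at the start of the RIGHT primer
  }
  END {
    for (i = 1; i <= count; i++) {
      a = order[i]
      if (a in stop) print chrom[a], start[a], stop[a], a, pools[a], "+"
    }
  }' "$workdir/primer.bed" > "$workdir/insert.bed"

cat "$workdir/insert.bed"
```

For the four primers used here, the output matches the first two rows of the amplicon bedfile excerpt (54 to 385 and 342 to 704).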
First of all, this is a great thing! Let us know if tools specific for your organism should be added to this workflow. There are already options for 'mpx' and 'other' species.
In a config file, change the following relevant parameters:
params.reference_genome
params.primer_bed
params.amplicon_bed #or set params.aci = false
params.gff_file #or set params.ivar_variants = false
And set
params.species = 'other'
params.pangolin = false
params.freyja = false
params.nextclade = false #or adjust nextclade_prep_options from '--name sars-cov-2' to the name of the relevant dataset
params.vadr = false #or configure the vadr container appropriately and params.vadr_reference
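As a rough sketch, the settings above could be collected into one config file and passed to the workflow with -c. The file name and every path below are hypothetical placeholders; only the parameter names come from the list above.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Write a small custom config. All paths are placeholders to be replaced.
cat > "$workdir/other_organism.config" <<'EOF'
// Hypothetical paths: replace with files for the organism of interest.
params {
  species          = 'other'
  reference_genome = '/path/to/reference.fasta'
  primer_bed       = '/path/to/primer.bed'
  amplicon_bed     = '/path/to/amplicon.bed'
  gff_file         = '/path/to/reference.gff'
  pangolin         = false
  freyja           = false
  nextclade        = false
  vadr             = false
}
EOF

cat "$workdir/other_organism.config"

# Then run the workflow with the custom config, for example:
# nextflow run UPHL-BioNGS/Cecret -profile singularity -c "$workdir/other_organism.config"
```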
Although not perfect, if 'params.filter = true', then only the reads that were mapped to the reference are returned. This should eliminate all human contamination (as long as human is not part of the supplied reference) and all "problematic" incidental findings.
This workflow has too many bells and whistles. I really only care about generating a consensus fasta. How do I get rid of all the extras?
Change the parameters in a config file and set most of them to false.
params.species = 'none'
params.fastqc = false
params.ivar_variants = false
params.samtools_stats = false
params.samtools_coverage = false
params.samtools_depth = false
params.samtools_flagstat = false
params.samtools_ampliconstats = false
params.samtools_plot_ampliconstats = false
params.aci = false
params.pangolin = false
params.freyja = false
params.nextclade = false
params.vadr = false
params.multiqc = false
And, yes, this means I added some bells and whistles so the End User could turn off the bells and whistles. /irony
No. Prior versions supported a tool called bamsnap, which had the potential to be great. Due to time constraints, this feature is no longer supported.
Never fear, the SAM files are still in nextflow's work directory if the End User really needs them. They are no longer included in publishDir because of size issues. The BAM files are still included in publishDir, and most analyses for SAM files can be done with BAM files.
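Nextflow keeps each task's files in hash-named subdirectories of the work directory, so a file search is one way to locate the SAM files. The sketch below runs against a mock directory tree; the sample and file names are made up for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Mock layout standing in for nextflow's work directory.
mkdir -p "$workdir/work/ab/123456" "$workdir/work/cd/789abc"
: > "$workdir/work/ab/123456/sampleA.sam"
: > "$workdir/work/cd/789abc/sampleB.sam"
: > "$workdir/work/cd/789abc/sampleB.sorted.bam"

# List every SAM file left behind in the work directory.
find "$workdir/work" -type f -name '*.sam' | sort > "$workdir/sam_files.txt"
cat "$workdir/sam_files.txt"

# Alternatively, a published BAM file can be converted back to SAM (needs samtools):
# samtools view -h -o sampleA.sam sampleA.sorted.bam
```

In a real run, replace "$workdir/work" with the work directory of the run. The search only finds files that have not been removed by a cleanup of that directory.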
Personally, we liked having stderr saved to a file because some of the tools used in this workflow print to stderr instead of stdout. We have found, however, that this puts all the error text into a file, which a lot of new-to-nextflow users had a hard time finding. It was easier to assist and troubleshoot with End Users when stderr was printed normally.
What is in the works to get added to 'Cecret'? (These are waiting for either versions to stabilize or a docker container to be available.)
- AWS compliant config file
- https://github.com/chrisruis/bammix
- https://github.com/lenaschimmel/sc2rf
- masking options for phylogenetic relatedness
Bedtools multicov was replaced by ACI due to processing times, but there are other processes that take longer.
Right now, the processes that take the most time are
- ivar trim
- freyja
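To see which processes take the most time on a particular system, one option is Nextflow's trace file, produced by adding -with-trace to the run command. The sketch below totals the run time per process from a mock trace file. It assumes a tab-separated file with 'name' and 'realtime' columns and times in milliseconds (what Nextflow writes when trace.raw = true is set in the config). The process names and values are invented for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
tab=$(printf '\t')

# Mock trace file: header plus three tasks, realtime in milliseconds.
{
  printf 'task_id\tname\tstatus\trealtime\n'
  printf '1\tivar_trim (sampleA)\tCOMPLETED\t600000\n'
  printf '2\tivar_trim (sampleB)\tCOMPLETED\t300000\n'
  printf '3\tfastqc (sampleA)\tCOMPLETED\t45000\n'
} > "$workdir/trace.txt"

# Sum realtime per process (ignoring the per-sample tag) and report seconds,
# slowest process first.
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  {
    proc = $(col["name"])
    sub(/ \(.*\)$/, "", proc)            # "ivar_trim (sampleA)" -> "ivar_trim"
    total[proc] += $(col["realtime"])
  }
  END { for (p in total) printf "%s\t%d\n", p, total[p] / 1000 }
' "$workdir/trace.txt" | sort -t "$tab" -k2,2nr > "$workdir/seconds_by_process.tsv"

cat "$workdir/seconds_by_process.tsv"
```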