Lab 08: Putting it all together
The goal of this lab is to use the skills we've built over the past two days to make a simple RNA-Seq analysis pipeline that performs QC and maps reads to a reference genome using the custom-built STAR image.
If you inspect exercises/08_putting_it_all_together, you'll see a Nextflow pipeline using the same organization we used in labs 4 and 6.
In main.nf, we first state that we're using DSL2, then import the modules containing all of the processes for a simplified RNA-Seq pipeline.
nextflow.enable.dsl=2
include { FASTQC } from "./modules/fastqc.nf"
include { MULTIQC } from "./modules/multiqc.nf"
include { STAR_INDEX } from "./modules/star.nf"
include { STAR_MAP } from "./modules/star.nf"
Now let's see our workflow.
workflow {
    if (params.fastq_seqs) {
        // Build the read-pair channel outside the QC block so STAR_MAP can still use it when QC is skipped
        ch_fastqs = Channel.fromFilePairs("${params.fastq_seqs}/*read{1,2}.fastq.gz", checkIfExists: true, flat: true)
        if (!params.skip_qc) {
            FASTQC(ch_fastqs)
            MULTIQC(FASTQC.out.ch_fastqc.collect())
        }
        STAR_INDEX(params.genome, params.annot)
        STAR_MAP(ch_fastqs, STAR_INDEX.out.star_idx)
    }
}
Our workflow first checks, in an if block, that the params.fastq_seqs variable is set. A second, nested if block makes the fastqc and multiqc quality control steps optional (toggled on or off with params.skip_qc). After the optional QC steps, STAR_INDEX creates an index, which is passed to the alignment step, STAR_MAP.
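To see how the channel shapes line up, here is a minimal sketch of what a module such as modules/fastqc.nf could look like. The lab does not show this file, so the process body is an assumption; only the input tuple (matching the flat output of fromFilePairs) and the emit name ch_fastqc are taken from the workflow above.

```groovy
// Hypothetical sketch of modules/fastqc.nf; not the lab's actual module.
process FASTQC {
    input:
    // Same shape as each item emitted by fromFilePairs(..., flat: true)
    tuple val(id), path(r1), path(r2)

    output:
    // Named output, so the workflow can refer to FASTQC.out.ch_fastqc
    path("*_fastqc.{zip,html}"), emit: ch_fastqc

    script:
    """
    fastqc ${r1} ${r2}
    """
}
```

Because FASTQC runs once per sample, MULTIQC is given FASTQC.out.ch_fastqc.collect(), which gathers every report into a single list so that MultiQC runs once over all of them.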
⭐ Take a moment to see if you can tell where every params parameter is coming from.
Look at the nextflow.config file.
params {
    fastq_seqs = false
    skip_qc = false
}
process {
    publish_dir = "${params.publish_dir}"
    withLabel: star {
        cpus = 2
    }
}
singularity {
    enabled = true
    cacheDir = "${HOME}/singularity/"
    autoMounts = true
}
report {
    enabled = true
    file = "${process.publish_dir}/summary/report.html"
}
timeline {
    enabled = true
    file = "${process.publish_dir}/summary/timeline.html"
}
⭐ Notice anything new?
process {
    publish_dir = "${params.publish_dir}"
    withLabel: star {
        cpus = 2
    }
}
Here we take the variable "publish_dir" from the params scope (which we supply on the command line) and make it available in the process scope. We've also added a withLabel block for a label called "star".
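As a rough sketch of where the remaining parameters could come from: anything used in main.nf but not given a default in nextflow.config (genome, annot, publish_dir) has to be supplied at run time, either as a command-line option such as --publish_dir results or through an extra config file passed with -c. The file name and every path below are placeholders, not values from the lab.

```groovy
// my_run.config (hypothetical file), used as: nextflow run main.nf -c my_run.config
// All paths are placeholders; substitute your own data locations.
params {
    fastq_seqs  = "/path/to/fastq_dir"       // directory holding *read{1,2}.fastq.gz
    genome      = "/path/to/genome.fa"       // reference passed to STAR_INDEX
    annot       = "/path/to/annotation.gtf"  // annotation passed to STAR_INDEX
    publish_dir = "results"                  // picked up by the process scope
    skip_qc     = true                       // overrides the default of false
}
```

Values given on the command line take precedence over those in config files, so a single run can still override any of these with an option such as --skip_qc false.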
Let's investigate what the label "star" from our configuration file actually does. In the modules/star.nf file, if we look at the STAR_MAP process, we see the label directive on the second line.
process STAR_MAP {
    label "star"
    publishDir(path: "${publish_dir}/star", mode: "symlink")

    input:
    tuple val(id), path(r1), path(r2)
    path(star_index)

    output:
    tuple val(id), path("*.bam"), emit: ch_bam

    script:
    """
    STAR \
        --genomeDir ${star_index} \
        --readFilesIn ${r1} ${r2} \
        --runThreadN ${task.cpus} \
        --readFilesCommand zcat \
        --outFileNamePrefix ${id} \
        --outSAMtype BAM Unsorted
    """
}
Notice that the script uses the variable task.cpus even though we don't appear to have defined it anywhere in the process. The label "star" associates the cpus value from our nextflow.config file with every process that carries the "star" label. The "publish_dir" variable we define in our config applies to ALL processes, but with labels we can easily apply specific settings to just a subset of our processes.
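The same mechanism extends to other groups of processes. The fragment below is a sketch only: the "qc" label and the resource values are assumptions made for illustration, and the lab's modules may not declare such a label.

```groovy
// Hypothetical extension of the process scope in nextflow.config.
process {
    // From the lab: every process declaring label "star" gets 2 CPUs
    withLabel: star {
        cpus = 2
    }
    // Assumed label: would apply to any process declaring label "qc"
    withLabel: qc {
        cpus = 1
    }
    // withName targets one process by name rather than by label
    withName: STAR_INDEX {
        memory = '8 GB'
    }
}
```

Inside each matching process, the chosen values are then available as task.cpus and task.memory, in the same way STAR_MAP reads task.cpus above.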