Running Nextflow Pipelines

MattHuff edited this page Nov 12, 2021 · 1 revision

Nextflow is a domain-specific language (DSL) for building scalable pipelines. In simple terms, rather than running a separate script for each step of an analysis, you chain the steps together and run them as a single executable pipeline.

We will make commonly run processes, such as a gene expression pipeline, available for use with Nextflow. These scripts will be located in /pickett_centaur/scripts/nextflow in Centaur and /sphinx_local/scripts/nextflow in Sphinx.

To make Nextflow available in your shell session, run spack load nextflow.
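A quick way to confirm that the load step worked is to check whether the nextflow executable is on your PATH. The sketch below is only an illustration; the function name check_nextflow is hypothetical and not part of the shared scripts.

```shell
#!/usr/bin/env bash
# check_nextflow: report whether the nextflow executable is on the current PATH.
# (Hypothetical helper for illustration only.)
check_nextflow() {
    local nf_path
    if nf_path="$(command -v nextflow)"; then
        echo "nextflow found: $nf_path"
    else
        echo "nextflow not found; try: spack load nextflow"
        return 1
    fi
}
```

Call check_nextflow after spack load nextflow; if it reports that Nextflow was not found, the module was not loaded in the current session.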

Nextflow Configuration File

The first time you run Nextflow, it sets up a directory in your home directory that holds the settings used for all future runs. This directory contains a config file:

~/.nextflow/config

This config file defines Nextflow's general settings, such as how many cores it may use at a time. If you need more cores than the default, edit this file to raise the limit.
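As a rough illustration of what a core-limit setting can look like, the sketch below writes an example config to a scratch file rather than touching your real one. It assumes jobs run with Nextflow's local executor, where the executor scope's cpus setting caps the CPUs used at once; the value 8 is arbitrary, and the settings already present in your own config file may be named differently.

```shell
#!/usr/bin/env bash
# Write an example config to a scratch file instead of overwriting ~/.nextflow/config.
example_config="$(mktemp)"
cat > "$example_config" <<'EOF'
// Assumed setting for the local executor: cap the CPUs Nextflow may use at once.
// The value 8 is only an example.
executor {
    cpus = 8
}
EOF

# Review the result before copying any of it into your real config file.
cat "$example_config"
```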

Running a Nextflow job

In this example, we will run the expression_pipeline_PE.nf script, which performs three steps on paired-end reads: trimming (Skewer), alignment to a reference genome (STAR), and counting (htseq-count). The script requires three parameters:

  • --input: The directory where your paired-end reads are located. Each sample must have two files: one containing the forward reads (<sample_name>_1.fq) and one containing the reverse reads (<sample_name>_2.fq). If you only have forward-read files, or your file names contain no read numbers at all, run the single-end script expression_pipeline_SE.nf instead.
  • --output: The directory where your output files will be located. The script will automatically create sub-directories within this directory corresponding to each step of the pipeline.
  • --ref_genome: The directory containing the indexed reference genome you will be aligning your reads to. At present, genome indexing will need to be run separately from the main pipeline.
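Before launching the paired-end pipeline, it can help to confirm that every sample in the input directory has both read files. The following is a minimal sketch that assumes the <sample_name>_1.fq / <sample_name>_2.fq naming described above; check_pairs is a hypothetical helper, not part of the pipeline.

```shell
#!/usr/bin/env bash
# check_pairs DIR: list each sample in DIR and flag any that lack a reverse-read file.
# Assumes the <sample_name>_1.fq / <sample_name>_2.fq naming convention.
check_pairs() {
    local dir="$1" status=0 fwd rev sample
    for fwd in "$dir"/*_1.fq; do
        # If the glob matched nothing, the loop sees the literal pattern.
        if [ ! -e "$fwd" ]; then
            echo "no *_1.fq files found in $dir"
            return 1
        fi
        rev="${fwd%_1.fq}_2.fq"
        sample="$(basename "${fwd%_1.fq}")"
        if [ -e "$rev" ]; then
            echo "paired: $sample"
        else
            echo "missing reverse read: $sample"
            status=1
        fi
    done
    return "$status"
}
```

Running check_pairs on your input directory returns a non-zero status if any sample is missing its _2.fq file, which suggests the single-end script may be the better fit for that data.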

With this in mind, the command to run the script is as follows:

nextflow -C ~/.nextflow/config run /pickett_centaur/scripts/nextflow/gene_expression/expression_pipeline_PE.nf --input <input_directory> --output <output_directory> --ref_genome <genome_directory>
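If you run this command often, a small wrapper can catch mistyped directories before Nextflow starts. The sketch below is an assumption-laden example: the function name run_expression_pe and the DRY_RUN variable are hypothetical, and it only checks that the input and genome directories exist.

```shell
#!/usr/bin/env bash
# Path to the paired-end pipeline on Centaur, as given above.
PIPELINE=/pickett_centaur/scripts/nextflow/gene_expression/expression_pipeline_PE.nf

# run_expression_pe INPUT_DIR OUTPUT_DIR GENOME_DIR
# Hypothetical wrapper: verifies the input and genome directories, then launches the pipeline.
run_expression_pe() {
    local input="$1" output="$2" genome="$3" dir
    for dir in "$input" "$genome"; do
        if [ ! -d "$dir" ]; then
            echo "error: not a directory: $dir" >&2
            return 1
        fi
    done
    # With DRY_RUN set to a non-empty value, the command is printed instead of executed.
    ${DRY_RUN:+echo} nextflow -C "$HOME/.nextflow/config" run "$PIPELINE" \
        --input "$input" --output "$output" --ref_genome "$genome"
}
```

For example, DRY_RUN=1 run_expression_pe reads results genome_index prints the full command for review without starting a job.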

Resuming a stopped job

If your Nextflow job is interrupted by an error or some other unforeseen issue, you can resume it where it left off by adding the -resume flag to the end of the command above. Steps that have already completed successfully are not rerun.
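As a sketch, the resumed command is the original command with -resume appended. The directory names below (reads, results, genome_index) are placeholders, not real paths. Note that -resume takes a single dash because it is an option of nextflow run, whereas the double-dash options are pipeline parameters. Resuming generally needs to happen from the same directory the job was first launched in, since that is where Nextflow keeps its cache and work files.

```shell
#!/usr/bin/env bash
# Placeholder paths for illustration; replace them with your own directories.
input_dir="reads"
output_dir="results"
genome_dir="genome_index"

resume_cmd=(
    nextflow -C "$HOME/.nextflow/config" run
    /pickett_centaur/scripts/nextflow/gene_expression/expression_pipeline_PE.nf
    --input "$input_dir" --output "$output_dir" --ref_genome "$genome_dir"
    -resume
)

# Print the command for review; to actually run it, use: "${resume_cmd[@]}"
echo "${resume_cmd[*]}"
```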