A centralized repository of Nextflow workflows that interact with Synapse.
The purpose of this repository is to provide a collection of Nextflow workflows that interact with Synapse by leveraging the Synapse Python Client. These workflows are primarily intended to be used in a Nextflow Tower environment, but they can also be executed using the Nextflow CLI on your local machine.
This repository is organized as follows:
- Individual process definitions, or modules, are stored in the `modules/` directory.
- Modules are then combined into workflows, stored in the `workflows/` directory. These workflows are intended to capture the entire process of an interaction with Synapse.
- Workflows are then represented in the `main.nf` script, which provides an entrypoint for each workflow.
Only one workflow can be used per `nf-synapse` run. The configuration for a workflow run will need to include which workflow you intend to use (indicated by specifying `-entry`), along with all of the parameters required for that workflow.
In the example below, we provide the `-entry` parameter `NF_SYNSTAGE` to indicate that we want to run the `NF_SYNSTAGE` workflow. We also provide the `--input` parameter, which is required for `NF_SYNSTAGE`.
```
nextflow run main.nf -profile docker -entry NF_SYNSTAGE --input path/to/input.csv
```
`NF_SYNAPSE` is designed to be used on either side of a general-purpose Nextflow workflow: stage input files from Synapse/Seven Bridges to an S3 bucket, run a workflow of your choosing, and then index the output files from the S3 bucket back into Synapse.
```mermaid
flowchart LR;
    A[NF_SYNAPSE:NF_SYNSTAGE]-->B[WORKFLOW];
    B-->C[NF_SYNAPSE:NF_SYNINDEX];
```
See `demo.py` in Sage-Bionetworks-Workflows/py-orca for an example of accomplishing this goal with Python code.
For Nextflow Tower runs, you can configure your secrets using the Tower CLI or the Tower Web UI. If you are running the workflow locally, you can configure your secrets using the Nextflow CLI.

All included workflows require a `SYNAPSE_AUTH_TOKEN` secret. You can generate a Synapse personal access token using this dashboard.
Current profiles included in this repository are:

- `docker`: Indicates that you want to use Docker for running process containers.
- `conda`: Indicates that you want to use a `conda` environment for running processes.
The purpose of this workflow is to automate the process of staging Synapse and SevenBridges files to a Nextflow Tower-accessible location (e.g. an S3 bucket). In turn, these staged files can be used as input for a general-purpose (e.g. nf-core) workflow that doesn't contain platform-specific steps for staging data. This workflow is intended to be run first in preparation for other data processing workflows.
`NF_SYNSTAGE` performs the following steps:
- Extract all Synapse and Seven Bridges URIs (e.g. `syn://syn28521174` or `sbg://63b717559fd1ad5d228550a0`) from a given text file.
- Download the corresponding files from both platforms in parallel.
- Replace the URIs in the text file with their staged locations.
- Output the updated text file so it can serve as input for another workflow.
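The extract-and-replace steps above can be sketched roughly as follows. This is an illustrative sketch, not the workflow's actual implementation; the regular expression and function names are assumptions based on the URI formats shown above:

```python
import re

# Matches Synapse URIs (syn:// + a Synapse ID) and
# Seven Bridges URIs (sbg:// + a hexadecimal file ID).
URI_PATTERN = re.compile(r"(syn://syn\d+|sbg://[0-9a-f]+)")

def extract_uris(text: str) -> list[str]:
    """Return every Synapse and Seven Bridges URI found in the text."""
    return URI_PATTERN.findall(text)

def replace_uris(text: str, staged: dict[str, str]) -> str:
    """Replace each URI with its staged location, leaving unmapped URIs intact."""
    return URI_PATTERN.sub(lambda m: staged.get(m.group(0), m.group(0)), text)
```

Because the replacement operates on the raw text, the input file can be any format (CSV, TSV, or a plain list of URIs), as long as the URIs appear verbatim.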
The examples below demonstrate how you would stage Synapse files in an S3 bucket called `example-bucket`, but they can be adapted for other storage backends.
- Prepare your input file containing the Synapse URIs. For example, the following CSV file follows the format required for running the `nf-core/rnaseq` workflow.

  Example: Uploaded to `s3://example-bucket/input.csv`

  ```
  sample,fastq_1,fastq_2,strandedness
  foobar,syn://syn28521174,syn://syn28521175,unstranded
  ```
- Launch the workflow using the Nextflow CLI, the Tower CLI, or the Tower web UI.

  Example: Launched using the Nextflow CLI

  ```
  nextflow run main.nf -profile docker -entry NF_SYNSTAGE --input path/to/input.csv
  ```
- Retrieve the output file, which by default is stored in a `synstage/` subfolder within the parent directory of the input file. The Synapse and/or Seven Bridges URIs have been replaced with their staged locations. This file can now be used as the input for other workflows.

  Example: Downloaded from `s3://example-bucket/synstage/input.csv`

  ```
  sample,fastq_1,fastq_2,strandedness
  foobar,s3://example-bucket/synstage/syn28521174/foobar.R1.fastq.gz,s3://example-bucket/synstage/syn28521175/foobar.R2.fastq.gz,unstranded
  ```
If you are staging Seven Bridges files, there are a few differences that you will want to incorporate in your Nextflow run.
- You will need to configure `SB_AUTH_TOKEN` and `SB_API_ENDPOINT` secrets.
- You can generate an authentication token and retrieve your API endpoint by logging in to the Seven Bridges portal you intend to stage files from, such as Seven Bridges CGC. From there, click on the "Developer" dropdown and then click "Authentication Token". A full list of Seven Bridges API endpoints can be found here.
- When adding your URIs to your input file, Seven Bridges file URIs should have the prefix `sbg://`.
- There are two ways to get the ID of a file in Seven Bridges:
  - The first way involves logging into a Seven Bridges portal, such as Seven Bridges CGC, navigating to the file, and copying the ID from the URL. For example, your URL might look like this: "https://cgc.sbgenomics.com/u/user_name/project/63b717559fd1ad5d228550a0/". From this URL, you would copy the "63b717559fd1ad5d228550a0" piece and combine it with the `sbg://` prefix to form the complete URI `sbg://63b717559fd1ad5d228550a0`.
  - The second way involves using the SBG CLI. To get the ID numbers that you need, run the `sb files list` command and specify the project that you are downloading files from. A list of all files in the project will be returned, and you will combine the ID number with the prefix for each file that you want to stage.
Note: `NF_SYNSTAGE` can handle either or both types of URIs in a single input file.
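The first method above amounts to taking the final path segment of the portal URL. A small sketch, where `sbg_uri_from_url` is a hypothetical helper (not part of this repository):

```python
def sbg_uri_from_url(url: str) -> str:
    """Build an sbg:// URI from the file ID at the end of a Seven Bridges portal URL."""
    # Drop any trailing slash, then take the last path segment as the file ID.
    file_id = url.rstrip("/").split("/")[-1]
    return f"sbg://{file_id}"
```

For example, the portal URL shown above yields `sbg://63b717559fd1ad5d228550a0`.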
Check out the Quickstart section for example parameter values.
- `input`: (Required) A text file containing Synapse URIs (e.g. `syn://syn28521174`). The text file can have any format (e.g. a single column of Synapse URIs, a CSV/TSV sample sheet for an nf-core workflow).
- `outdir`: (Optional) An output location where the Synapse files will be staged. Currently, this location must be an S3 prefix for Nextflow Tower runs. If not provided, this will default to the parent directory of the input file.
- `save_strategy`: (Optional) A string indicating where to stage the files within the `outdir`. Options include:
  - `id_folders`: Files will be staged in child folders named after the Synapse or Seven Bridges ID of the file. This is the default behavior.
  - `flat`: Files will be staged in the top level of the `outdir`.
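The difference between the two `save_strategy` options can be illustrated with a sketch. `staged_location` is a hypothetical helper, and the exact path layout is an assumption based on the staging examples above (with `outdir` already pointing at the `synstage/` prefix):

```python
import posixpath

def staged_location(outdir: str, file_id: str, filename: str,
                    save_strategy: str = "id_folders") -> str:
    """Compute where a file would land under each save_strategy (illustrative only)."""
    if save_strategy == "id_folders":
        # Default: each file lands in a child folder named after its ID.
        return posixpath.join(outdir, file_id, filename)
    if save_strategy == "flat":
        # All files land directly in the top level of outdir.
        return posixpath.join(outdir, filename)
    raise ValueError(f"unknown save_strategy: {save_strategy}")
```

Note that `flat` can overwrite files if two staged files share the same name, which is why ID-named folders are the default.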
- The only way for the workflow to download Synapse files is by listing Synapse URIs in a file. You cannot provide a list of Synapse IDs or URIs to a parameter.
- The workflow doesn't check if newer versions exist for the files associated with the Synapse URIs. If you need to force-download a newer version, you should manually delete the staged version.
The purpose of this workflow is to automate the process of indexing files in an S3 bucket into Synapse. `NF_SYNINDEX` is intended to be used after a general-purpose (e.g. nf-core) workflow that doesn't contain platform-specific steps for uploading/indexing data.
`NF_SYNINDEX` performs the following steps:
- Gets the Synapse user ID for the account that provided the `SYNAPSE_AUTH_TOKEN` secret.
- Updates or creates the `owner.txt` file in the S3 bucket to make the current user an owner.
- Registers the S3 bucket as an external storage location for Synapse.
- Generates a list of all of the objects in the S3 bucket to be indexed.
- Recreates the folder structure of the S3 bucket in the Synapse project.
- Indexes the files in the S3 bucket into the Synapse project.
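The "recreates the folder structure" step essentially derives every folder implied by the bucket's object keys. A minimal sketch of that derivation (illustrative, not the workflow's actual code):

```python
def folders_from_keys(keys: list[str]) -> list[str]:
    """Return every folder path implied by a list of S3 object keys, sorted so
    that parent folders always precede their children."""
    folders = set()
    for key in keys:
        parts = key.split("/")[:-1]  # drop the file name itself
        for depth in range(1, len(parts) + 1):
            folders.add("/".join(parts[:depth]))
    return sorted(folders)
```

Creating the folders in sorted order guarantees that each Synapse folder's parent already exists when it is created.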
The examples below demonstrate how you would index files from an S3 bucket called `example-bucket` into Synapse.
- Prepare your S3 bucket by setting the output directory of your general-purpose workflow to a Nextflow Tower S3 bucket. Ideally, you want this S3 bucket to be persistent (not a `scratch` bucket) so that your files will remain accessible indefinitely.

  Example: `s3://example-bucket` file structure:

  ```
  example-dev-project-tower-bucket
  ├── child_folder
  │   ├── test.txt
  │   ├── child_child_folder
  │   │   └── test2.txt
  │   └── test1.txt
  ```
- Launch the workflow using the Nextflow CLI, the Tower CLI, or the Tower web UI.

  Example: Launched using the Nextflow CLI

  ```
  nextflow run main.nf -profile docker -entry NF_SYNINDEX --s3_prefix s3://example-bucket --parent_id syn12345678
  ```
- Retrieve the output file, which by default is stored in a `synindex/` subfolder of the S3 bucket (`s3://example-bucket/synindex/under-syn12345678/` in our example). This folder will contain a mapping of Synapse URIs to their indexed Synapse IDs.
Check out the Quickstart section for example parameter values.
- `s3_prefix`: (Required) The S3 URI of the S3 bucket that contains the files to be indexed.
- `parent_id`: (Required) The Synapse ID of the Synapse project or folder that the files will be indexed into.
- `filename_string`: (Optional) A string that will be matched against the names of the files in the S3 bucket. If provided, only files whose names contain the string will be indexed.
At present, it is not possible to run `NF_SYNINDEX` outside of Nextflow Tower. This is due to AWS permissions complications. Future work will include enabling the workflow to run on local machines and in virtual machines.