Merge pull request #127 from phac-nml/dev

Doc and Readme updates

mattheww95 authored Oct 4, 2024
2 parents f65fbd0 + 304bab3 commit 1560548
Showing 17 changed files with 1,561 additions and 166 deletions.
6 changes: 6 additions & 0 deletions .wordlist.txt
@@ -169,3 +169,9 @@ dependentRequired
errorMessage
Samplesheet
TSeemann's
RASUSA
downsampling
Christy
Marinier
Petkau

6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -5,10 +5,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

### `Changed`

- Added RASUSA for down sampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)

### `Updated`

- Documentation and the workflow diagram have been updated. [PR 123](https://github.com/phac-nml/mikrokondo/pull/123)

- Documentation and the Readme have been updated. [PR 126](https://github.com/phac-nml/mikrokondo/pull/126)

## [0.4.2] - 2024-09-25

### `Fixed`
25 changes: 8 additions & 17 deletions README.md
@@ -53,7 +53,7 @@ This workflow will detect what pathogen(s) is present and apply the applicable m

This software (currently unpublished) can be cited as:

- - Wells, M. "mikrokondo" Github <https://github.com/phac-nml/mikrokondo/>
+ - Matthew Wells, James Robertson, Aaron Petkau, Christy-Lynn Peterson, Eric Marinier. "mikrokondo" Github <https://github.com/phac-nml/mikrokondo/>

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

@@ -103,33 +103,24 @@ The above downloadable resources must be updated in the following places in your

```
// Bakta db path, note the quotation marks
-bakta {
-    db = "/PATH/TO/BAKTA/DB"
-}
+bakta_db = "/PATH/TO/BAKTA/DB"
// Decontamination minimap2 index, note the quotation marks
-r_contaminants {
-    mega_mm2_idx = "/PATH/TO/DECONTAMINATION/INDEX"
-}
+dehosting_idx = "/PATH/TO/DECONTAMINATION/INDEX"
// kraken db path, note the quotation marks
-kraken {
-    db = "/PATH/TO/KRAKEN/DATABASE/"
-}
+kraken2_db = "/PATH/TO/KRAKEN/DATABASE/"
// GTDB Mash sketch, note the quotation marks
-mash {
-    mash_sketch = "/PATH/TO/MASH/SKETCH/"
-}
+mash_sketch = "/PATH/TO/MASH/SKETCH/"
// STARAMR database path, note the quotation marks
// Passing in a StarAMR database is optional; if one is not specified, the database in the container will be used. Leave the db option as null if you do not wish to pass one.
-staramr {
-    db = "/PATH/TO/STARAMR/DB"
-}
+staramr_db = "/PATH/TO/STARAMR/DB"
```

If these values are not set in the `nextflow.config` file, the same parameters can instead be passed to the pipeline as command-line arguments.
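
For example, a minimal sketch of such an invocation, assuming a container profile named `singularity` and omitting other required options such as the sample sheet:

```
nextflow run phac-nml/mikrokondo -profile singularity \
    --bakta_db /PATH/TO/BAKTA/DB \
    --dehosting_idx /PATH/TO/DECONTAMINATION/INDEX \
    --kraken2_db /PATH/TO/KRAKEN/DATABASE/ \
    --mash_sketch /PATH/TO/MASH/SKETCH/ \
    --staramr_db /PATH/TO/STARAMR/DB
```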

# Getting Started
## Usage

14 changes: 14 additions & 0 deletions conf/modules.config
@@ -332,6 +332,20 @@ process {
        ]
    }

    withName: RASUSA {
        ext.args = ""
        ext.parameters = params.rasusa
        publishDir = [
            [
                path: { [ "${task.read_downsampled_directory_name}", "Rasusa" ].join(File.separator) },
                mode: params.publish_dir_mode,
                pattern: "*${params.rasusa.reads_ext}",
                saveAs: { filename ->
                    filename.equals('versions.yml') ? null : reformat_output(filename, "reads", "rasusa.sample", meta) }
            ]
        ]
    }


    withName: SEQTK_SIZE {
        ext.args = ""
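For context, RASUSA subsamples a read set to a target coverage given a genome size. A minimal sketch of the kind of command the RASUSA module above wraps, assuming rasusa's v2 `reads` subcommand; the coverage, genome size, and file names are illustrative, not the pipeline's actual settings:

```
# Illustrative values: subsample long reads to ~100x of an estimated 5 Mbp genome
rasusa reads --coverage 100 --genome-size 5mb -o downsampled.fastq.gz reads.fastq.gz
```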
98 changes: 49 additions & 49 deletions docs/subworkflows/clean_reads.md
# Read Quality Control

## subworkflows/local/clean_reads

## Steps
1. **Reads are decontaminated** using [minimap2](https://github.com/lh3/minimap2) against a 'sequencing off-target' index. This index contains:
    - Reads associated with humans (de-hosting)
    - Known sequencing controls (phiX)

2. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp)
    - Currently no adapters are specified when FastP is run; adapter auto-detection is used instead.
    - FastP parameters can be altered within the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file.
    - Long read data is also run through FastP to gather summary data; however, long read (un-paired read) trimming is not performed and only summary metrics are generated. [Chopper](https://github.com/wdecoster/chopper) is currently integrated in MikroKondo but has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request.

3. **Genome size estimation** is performed using a [Mash](https://github.com/marbl/Mash) sketch of the reads, and the estimated genome size is output.

4. **Read down sampling** (OPTIONAL): an estimated depth threshold can be specified to down sample large read sets. This step can improve genome assembly quality and is also found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable down sampling, add `--skip_depth_sampling true` to your command line. A sketch of the depth arithmetic follows this list.
    - Depth is estimated using the genome size estimate output from [Mash](https://github.com/marbl/Mash)
    - Total base pairs are taken from [FastP](https://github.com/OpenGene/fastp)
    - Read down sampling is then performed using [Seqtk](https://github.com/lh3/seqtk) (Illumina) or [Rasusa](https://github.com/mbhall88/rasusa) (Nanopore or Pacbio).

5. **Metagenomic assessment** uses a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module. This step assesses how many bacterial genera are present in a sample with greater than 90% identity according to Mash (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present). When more than one taxon is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 will be run on metagenomic assemblies and contigs will be binned at a defined taxonomic level (default level: genus).

6. **Nanopore ID screening**: duplicate Nanopore read IDs have been known to cause issues downstream in the pipeline. To bypass this issue, an option can be toggled so that a script reads in the Nanopore reads and appends a unique ID to each header. This process can be slow, so the default setting is to skip it (`--skip_ont_header_cleaning true`).
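
To make the depth arithmetic in step 4 concrete, here is a minimal Groovy sketch; the numbers and the threshold are illustrative assumptions, not mikrokondo's actual code or defaults:

```
// Illustrative sketch of the down-sampling decision -- not mikrokondo's actual code.
def totalBases  = 750_000_000   // total base pairs reported by FastP (example value)
def genomeSize  = 5_000_000     // genome size estimated by Mash (example value)
def targetDepth = 100           // hypothetical depth threshold

def estimatedDepth = totalBases / genomeSize      // 150x in this example
if (estimatedDepth > targetDepth) {
    // Seqtk samples by fraction; Rasusa takes the target coverage directly.
    def keepFraction = targetDepth / estimatedDepth   // two thirds of reads kept here
    println "seqtk fraction: ${keepFraction}; rasusa coverage: ${targetDepth}x"
}
```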

## Input
- Next generation sequencing reads:
+ Short read - Illumina
+ Long read:
* Nanopore
* Pacbio
- User submitted sample sheet


## Outputs
- Reads
- FinalReads
- SAMPLE
- Processing
- Dehosting
- Trimmed
- FastP
- Seqtk
- MashSketches
- Quality
- RawReadQuality
- Trimmed
- FastP
- MashScreen
120 changes: 56 additions & 64 deletions docs/usage/installation.md
# Installation

## Dependencies
- Python (>=3.10)
- Nextflow (>=22.10.1)
- Container service (Docker, Singularity, and Apptainer have been tested)
- The source code: `git clone https://github.com/phac-nml/mikrokondo.git`

**Dependencies can be installed with Conda (e.g. Nextflow and Python)**.

## To install mikrokondo
Once all dependencies are installed (see below for instructions), download the pipeline by running:

`git clone https://github.com/phac-nml/mikrokondo.git`

## Installing Nextflow
Nextflow, which requires Linux, is needed to run mikrokondo; instructions for its installation can be found at [Nextflow Home](https://www.nextflow.io/) or in the [Nextflow Documentation](https://www.nextflow.io/docs/latest/getstarted.html#installation).
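
A minimal sketch of the documented install commands (assuming `~/bin` is on your `PATH`):

```
curl -s https://get.nextflow.io | bash   # download the nextflow launcher
chmod +x nextflow                        # make it executable
mv nextflow ~/bin/                       # move it onto your PATH
nextflow -version                        # confirm the install works
```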

## Container Engine
Nextflow and Mikrokondo require a container engine to run the pipeline, such as Docker, Singularity (now Apptainer), Podman, Gitpod, Shifter, or Charliecloud.

> **NOTE:** Singularity was adopted by the Linux Foundation and is now called Apptainer. Singularity still exists; however, newer installs will likely use Apptainer.

## Docker or Singularity?
Docker requires root privileges, which can make it a hassle to install on computing clusters; while there are workarounds, Apptainer/Singularity does not require root. Therefore, Apptainer/Singularity is the recommended method for running the mikrokondo pipeline.

### Issues
Containers are not perfect; below is a list of some issues you may face when using containers in mikrokondo. Fixes for each issue will be detailed here as they are identified.

- **Exit code 137** usually means the Docker container used too much memory.
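
If you hit exit code 137, one common remedy is to raise the memory available to the failing process through a custom Nextflow config passed with `-c`. A minimal sketch, where the process name and sizes are illustrative assumptions rather than mikrokondo defaults:

```
// custom.config -- illustrative values only
process {
    memory = '16 GB'                // raise the default ceiling for every process

    withName: 'ASSEMBLY_PROCESS' {  // hypothetical name; match the process that failed
        memory = '64 GB'
    }
}
```

The pipeline would then be launched with `nextflow run ... -c custom.config`.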

## Resources to download
- [GTDB Mash Sketch](https://zenodo.org/record/8408361): required for speciation and for determining when a sample is metagenomic
- [Decontamination Index](https://zenodo.org/record/8408557): required for decontamination of reads (this is a minimap2 index)
- [Kraken2 std database](https://benlangmead.github.io/aws-indexes/k2): required for binning of metagenomic data; an alternative to using Mash for speciation
- [Bakta database](https://zenodo.org/record/7669534): running Bakta is optional and there is a light database option; however, the full one is recommended. You will have to unzip and un-tar the database for usage.

### Fields to update with resources
It is recommended to store the above resources within the `databases` folder in the mikrokondo folder; this allows the database names in `nextflow.config` to be updated without needing full path descriptions.

Below shows where to update database resources in the `params` section of the `nextflow.config` file:

```
// Bakta db path, note the quotation marks
bakta_db = "/PATH/TO/BAKTA/DB"
// Decontamination minimap2 index, note the quotation marks
dehosting_idx = "/PATH/TO/DECONTAMINATION/INDEX"
// kraken db path, note the quotation marks
kraken2_db = "/PATH/TO/KRAKEN/DATABASE/"
// GTDB Mash sketch, note the quotation marks
mash_sketch = "/PATH/TO/MASH/SKETCH/"
```
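
For example, if the downloads are stored in the `databases` folder as recommended, the same fields can use short relative paths; the file names below are illustrative placeholders, not the actual archive names:

```
// Illustrative names -- substitute the resources you actually downloaded
bakta_db = "databases/bakta_db"
dehosting_idx = "databases/dehosting_index.mmi"
kraken2_db = "databases/k2_standard"
mash_sketch = "databases/gtdb_sketch.msh"
```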