Merge pull request #127 from phac-nml/dev
Doc and Readme updates
Showing 17 changed files with 1,561 additions and 166 deletions.
```
@@ -169,3 +169,9 @@ dependentRequired
errorMessage
Samplesheet
TSeemann's
RASUSA
downsampling
Christy
Marinier
Petkau
```
# Read Quality Control

## subworkflows/local/clean_reads

## Steps
1. **Reads are decontaminated** using [minimap2](https://github.com/lh3/minimap2) against a 'sequencing off-target' index. This index contains:
    - Reads associated with humans (de-hosting)
    - Known sequencing controls (phiX)
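The decontamination idea can be sketched with a toy SAM parser: after mapping reads to the off-target index, only the reads that did *not* map are kept. This is an illustration of the concept, not mikrokondo's actual code; real pipelines do this with minimap2 plus samtools.

```python
# Toy sketch of read decontamination: keep reads that failed to map
# to the off-target index. SAM FLAG bit 0x4 means "read is unmapped".
# (Illustrative only; mikrokondo's implementation differs.)

def unmapped_read_names(sam_lines):
    keep = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & 0x4:                    # unmapped -> not host/control, keep it
            keep.append(name)
    return keep

sam = [
    "@SQ\tSN:chr1\tLN:248956422",
    "readA\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t!!!!",  # mapped (off-target) -> drop
    "readB\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!",         # unmapped -> keep
]
kept = unmapped_read_names(sam)  # ["readB"]
```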

2. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp).
    - Currently no adapters are specified when FastP is run; adapter auto-detection is used.
    - FastP parameters can be altered within the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file.
    - Long-read data is also run through FastP to gather summary data; however, long (unpaired) reads are not trimmed and only summary metrics are generated. [Chopper](https://github.com/wdecoster/chopper) is currently integrated in MikroKondo but has been removed from this workflow due to a lack of interest in quality trimming of long-read data. It may be reintroduced in the future upon request.

3. **Genome size estimation** is performed using a [Mash](https://github.com/marbl/Mash) sketch of the reads, and the estimated genome size is output.

4. **Read downsampling** (OPTIONAL): an estimated depth threshold can be specified to downsample large read sets. This step can improve genome assembly quality and is also found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable downsampling, add `--skip_depth_sampling true` to your command line.
    - Depth is estimated using the estimated genome size output by [Mash](https://github.com/marbl/Mash)
    - Total base pairs are taken from [FastP](https://github.com/OpenGene/fastp)
    - Downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk) (Illumina) or [Rasusa](https://github.com/mbhall88/rasusa) (Nanopore or Pacbio).
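The depth arithmetic behind this step can be sketched as follows; the function name and numbers are illustrative, not mikrokondo's actual code:

```python
# Illustrative depth-based downsampling decision (hypothetical helper):
# estimated depth = total bases sequenced / estimated genome size,
# and reads are kept in proportion target_depth / estimated_depth.

def downsample_fraction(total_bp: int, est_genome_size: int, target_depth: float) -> float:
    """Fraction of reads to keep; 1.0 if depth is already at or below target."""
    est_depth = total_bp / est_genome_size
    if est_depth <= target_depth:
        return 1.0                       # no downsampling needed
    return target_depth / est_depth

# Example: 1.25 Gbp sequenced over a ~5 Mbp genome -> ~250x estimated depth,
# so targeting 100x keeps 100/250 = 0.4 of the reads.
frac = downsample_fraction(1_250_000_000, 5_000_000, 100)
```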

5. **Metagenomic assessment** using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module. This step assesses how many bacterial genera are present in a sample with greater than 90% identity according to Mash (e.g. a contaminated or metagenomic sample may contain more than one genus of bacteria). When more than one taxon is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 is run on metagenomic assemblies and contigs are binned at a defined taxonomic level (default level: genus).

6. **Nanopore ID screening**: duplicate Nanopore read IDs have been known to cause issues downstream in the pipeline. To bypass this issue, an option can be toggled so that a script reads in the Nanopore reads and appends a unique ID to each header. This process can be slow, so the default is to skip it (`--skip_ont_header_cleaning true`).
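The header-uniquifying idea can be sketched as below; the function and suffix scheme are hypothetical, not the pipeline's actual script:

```python
# Sketch: make FASTQ read IDs unique by appending a random suffix to each
# header line (every 4th line, starting at line 0). Hypothetical helper.
import uuid

def uniquify_fastq_ids(lines):
    out = []
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.startswith("@"):        # FASTQ header line
            read_id, _, rest = line[1:].partition(" ")
            out.append(f"@{read_id}_{uuid.uuid4().hex} {rest}".rstrip())
        else:
            out.append(line)
    return out

# Two records sharing the duplicate ID "read1" get distinct headers:
records = ["@read1 ch=1", "ACGT", "+", "!!!!", "@read1 ch=2", "TTTT", "+", "!!!!"]
fixed = uniquify_fastq_ids(records)
```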

## Input
- Next generation sequencing reads:
    + Short read - Illumina
    + Long read:
        * Nanopore
        * Pacbio
- User submitted sample sheet

## Outputs
- Reads
    - FinalReads
        - SAMPLE
    - Processing
        - Dehosting
        - Trimmed
            - FastP
            - Seqtk
        - MashSketches
    - Quality
        - RawReadQuality
        - Trimmed
            - FastP
        - MashScreen
# Installation

## Dependencies
- Python (>=3.10)
- Nextflow (>=22.10.1)
- Container service (Docker, Singularity and Apptainer have been tested)
- The source code: `git clone https://github.com/phac-nml/mikrokondo.git`

**Dependencies can be installed with Conda (e.g. Nextflow and Python).**

## To install mikrokondo
Once all dependencies are installed (see below for instructions), download the pipeline with:

`git clone https://github.com/phac-nml/mikrokondo.git`

## Installing Nextflow
Nextflow is required to run mikrokondo (requires Linux); instructions for its installation can be found at either [Nextflow Home](https://www.nextflow.io/) or the [Nextflow Documentation](https://www.nextflow.io/docs/latest/getstarted.html#installation).

## Container Engine
Nextflow and mikrokondo require a container engine to run the pipeline, such as Docker, Singularity (now Apptainer), Podman, gitpod, sifter or charliecloud.

> **NOTE:** Singularity was adopted by the Linux Foundation and is now called Apptainer. Singularity still exists, however newer installs will likely use Apptainer.

## Docker or Singularity?
Docker requires root privileges, which can make it a hassle to install on computing clusters; while there are workarounds, Apptainer/Singularity does not have this requirement. Therefore, Apptainer/Singularity is the recommended way to run the mikrokondo pipeline.

### Issues
Containers are not perfect; below is a list of some issues you may face using containers in mikrokondo. Fixes for each issue will be detailed here as they are identified.

- **Exit code 137** usually means the container used too much memory.
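The 137 itself decodes mechanically: shells and container runtimes report a process killed by a signal as 128 plus the signal number, and signal 9 is SIGKILL, which the kernel's OOM killer sends when a memory limit is exceeded. A small illustrative check (the helper name is ours, not part of any tool):

```python
# Decode ">128" exit codes: code - 128 is the fatal signal number.
# 137 = 128 + 9 -> SIGKILL, typical of the kernel OOM killer.
import signal

def killing_signal(exit_code: int):
    """Return the signal behind a >128 exit code, else None (hypothetical helper)."""
    return signal.Signals(exit_code - 128) if exit_code > 128 else None

sig = killing_signal(137)  # signal.SIGKILL
```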

## Resources to download
- [GTDB Mash Sketch](https://zenodo.org/record/8408361): required for speciation and for determining when a sample is metagenomic
- [Decontamination Index](https://zenodo.org/record/8408557): required for decontamination of reads (this is a minimap2 index)
- [Kraken2 std database](https://benlangmead.github.io/aws-indexes/k2): required for binning of metagenomic data; an alternative to using Mash for speciation
- [Bakta database](https://zenodo.org/record/7669534): running Bakta is optional and there is a light database option, however the full one is recommended. You will have to unzip and un-tar the database for usage.

### Fields to update with resources
It is recommended to store the above resources within the `databases` folder in the mikrokondo folder; this allows a simple update to the database names in `nextflow.config` rather than requiring a full path description.

Below shows where to update database resources in the `params` section of the `nextflow.config` file:

```
// Bakta db path, note the quotation marks
bakta_db = "/PATH/TO/BAKTA/DB"
// Decontamination minimap2 index, note the quotation marks
dehosting_idx = "/PATH/TO/DECONTAMINATION/INDEX"
// Kraken2 db path, note the quotation marks
kraken2_db = "/PATH/TO/KRAKEN/DATABASE/"
// GTDB Mash sketch, note the quotation marks
mash_sketch = "/PATH/TO/MASH/SKETCH/"
```