diff --git a/CHANGELOG.md b/CHANGELOG.md index d22522fb..7b2e33bc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,7 +3,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## v1.0dev - [date] +## v1.0.0 - Black Labrador [2024-10-28] Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co.re/) template. @@ -66,3 +66,12 @@ Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co ### `Dependencies` ### `Deprecated` + +### `Contributors` + +[Louis Le Nezet](https://github.com/LouisLeNezet) +[Anabella Trigila](https://github.com/atrigila) +[Eugenia Fontecha](https://github.com/eugeniafontecha) +[Maxime U Garcia](https://github.com/maxulysse) +[Matias Romero Victorica](https://github.com/mrvictorica) +[Nicolas Schcolnicov](https://github.com/nschcolnicov) diff --git a/CITATIONS.md b/CITATIONS.md index 7cba9027..6cef95ce 100644 --- a/CITATIONS.md +++ b/CITATIONS.md @@ -18,6 +18,14 @@ > Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126. +- [GLIMPSE2](https://doi.org/10.1038/s41588-023-01438-3) + +> Rubinacci, S., Hofmeister, R. J., Sousa da Mota, B., & Delaneau, O. (2023). Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nature genetics 55, 1088–1090. + +- [STITCH](https://doi.org/10.1038/ng.3594) + +> Davies, R. W., Flint, J., Myers, S., & Mott, R.(2016). Rapid genotype imputation from sequence without reference panels. Nature genetics 48, 965–969. + - [Shapeit](https://odelaneau.github.io/shapeit5/) > Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. 
Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w diff --git a/README.md b/README.md index 34fb0552..2febed25 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ -**Multi-steps pipeline dedicated to genetic imputation from simulation to validation** +**Multi-step pipeline dedicated to genetic imputation from simulation to validation** [![GitHub Actions CI Status](https://github.com/nf-core/phaseimpute/actions/workflows/ci.yml/badge.svg)](https://github.com/nf-core/phaseimpute/actions/workflows/ci.yml) [![GitHub Actions Linting Status](https://github.com/nf-core/phaseimpute/actions/workflows/linting.yml/badge.svg)](https://github.com/nf-core/phaseimpute/actions/workflows/linting.yml)[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/phaseimpute/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX) @@ -20,11 +20,11 @@ ## Introduction -**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. Different steps are available each corresponding to a dedicated modes. +**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. Different steps are available, each corresponding to a dedicated mode. 
### Main steps of the pipeline -The **phaseimpute** pipeline is constituted of 5 main steps: +The **phaseimpute** pipeline consists of 5 main steps: | Metro map | Modes | | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -45,7 +45,7 @@ sample,file,index SAMPLE_1X,/path/to/.,/path/to/. ``` -Each row represents a bam or a cram file with its index file. All input files need to be of the same extension. +Each row represents a BAM or CRAM file along with its index file. All input files need to be of the same extension. For some tools and steps, you will also need to submit a samplesheet with the reference panel. A final samplesheet file for the reference panel may look something like the one below. This is for 3 chromosomes. @@ -80,18 +80,17 @@ For more details and further functionality, please refer to the [usage documenta Here is a short description of the different steps of the pipeline. For more information please refer to the [documentation](https://nf-core.github.io/phaseimpute/usage/). 
-| steps | Flow chart | Description | -| --------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| **--panelprep** | Panel preparation | The preprocessing mode is responsible to the preparation of the multiple input file that will be used by the phasing process.
The main processes are :
- **Haplotypes phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/).
- **Normalize** the reference panel to select only the necessary variants.
- **Chunking the reference panel** in a subset of region for all the chromosomes.
- **Extract** the positions where to perform the imputation. | -| **--impute** | Impute target | The imputation mode is the core mode of this pipeline.
It is constituted of 3 main steps:
- **Imputation**: Impute the target dataset on the reference panel using either:
  - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html): It's come with the necessety to compute the genotype likelihoods of the target dataset (done using [BCFTOOLS_mpileup](https://samtools.github.io/bcftools/bcftools.html#mpileup)).
  - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html)
  - [**Stitch**](https://github.com/rwdavies/stitch) This steps does not require a reference panel but needs to merge the samples.
  - [**Quilt**](https://github.com/rwdavies/QUILT)
- **Ligation**: all the different chunks are merged together then all chromosomes are reunited to output one VCF per sample. | -| **--simulate** | simulate_metro | The simulation mode is used to create artificial low informative genetic information from high density data. This allow to compare the imputed result to a _truth_ and therefore evaluate the quality of the imputation.
For the moment it is possible to simulate:
- Low-pass data by **downsample** BAM or CRAM using [SAMTOOLS_VIEW -s](https://www.htslib.org/doc/samtools-view.html) at different depth. | -| **--validate** | concordance_metro | This mode compare two vcf together to compute a summary of the differences between them.
This step use [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) concordance process. | +| steps | Flow chart | Description | +| --------------- | -------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **--panelprep** | Panel preparation | The preprocessing mode is responsible for preparing multiple input files that will be used by the phasing and imputation process.
The main processes are:
- **Haplotype phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/).
- **Normalize** the reference panel to select only the necessary variants.
- **Chunking the reference panel** into a subset of regions for all the chromosomes.
- **Extract** the positions at which to perform the imputation. | +| **--impute** | Impute target | The imputation mode is the core mode of this pipeline.
It consists of 3 main steps:
- **Imputation**: Impute the target dataset on the reference panel using either:
  - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html): It requires the genotype likelihoods of the target dataset to be computed first (done using [BCFTOOLS_mpileup](https://samtools.github.io/bcftools/bcftools.html#mpileup)).
  - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html)
  - [**Stitch**](https://github.com/rwdavies/stitch): This step does not require a reference panel but needs to merge the samples.
  - [**Quilt**](https://github.com/rwdavies/QUILT)
- **Ligation**: all the chunks are merged, then all chromosomes are combined to output one VCF per sample. | +| **--simulate** | simulate_metro | The simulation mode creates artificial low-information genetic data from high-density data. This allows the imputed result to be compared to a _truth_, and therefore the quality of the imputation to be evaluated.
For the moment it is possible to simulate:
- Low-pass data by **downsampling** BAM or CRAM files using [SAMTOOLS_VIEW -s](https://www.htslib.org/doc/samtools-view.html) to different depths. | +| **--validate** | concordance_metro | This mode compares two VCF files to compute a summary of the differences between them.
This step uses [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) concordance process. | ## Pipeline output To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page. -For more details about the output files and reports, please refer to the -[output documentation](https://nf-co.re/phaseimpute/output). +For more details on the output files and reports, please refer to the [output documentation](https://nf-co.re/phaseimpute/output). ## Credits @@ -112,11 +111,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `# ## Citations - - +If you use nf-core/phaseimpute for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows: +An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. + +You can cite the main imputation methods as follows: + +[`QUILT`](https://github.com/rwdavies/QUILT): > **Rapid genotype imputation from sequence with reference panels.** > @@ -124,29 +126,27 @@ You can cite one of the main imputation methods ([`QUILT`](https://github.com/rw > > _Nature genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0) -You can cite one of the main imputation methods ([`GLIMPSE`](https://github.com/odelaneau/GLIMPSE)) as follows: +[`GLIMPSE`](https://github.com/odelaneau/GLIMPSE): > **Efficient phasing and imputation of low-coverage sequencing data using large reference panels.** > > Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. > -> _Nature genetics_ 2021. doi:[]() +> _Nature genetics_ 2021. 
doi:[10.1038/s41588-020-00756-0](https://doi.org/10.1038/s41588-020-00756-0) > **Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes** > > Rubinacci, S., Hofmeister, R. J., Sousa da Mota, B., & Delaneau, O. > -> _Nature genetics_ 2023. doi:[]() +> _Nature genetics_ 2023. doi:[10.1038/s41588-023-01438-3](https://doi.org/10.1038/s41588-023-01438-3) -You can cite one of the main imputation methods ([`STITCH`](https://github.com/rwdavies/STITCH)) as follows: +[`STITCH`](https://github.com/rwdavies/STITCH): > **Rapid genotype imputation from sequence without reference panels.** > > Davies, R. W., Flint, J., Myers, S., & Mott, R. > -> _Nature genetics_ 2016 . doi: [](). - -An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. +> _Nature genetics_ 2016 . doi: [10.1038/ng.3594](https://doi.org/10.1038/ng.3594). You can cite the `nf-core` publication as follows: diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml index 9bf18992..033877b8 100644 --- a/assets/multiqc_config.yml +++ b/assets/multiqc_config.yml @@ -1,7 +1,7 @@ report_comment: > - This report has been generated by the nf-core/phaseimpute + This report has been generated by the nf-core/phaseimpute analysis pipeline. For information about how to interpret these results, please see the - documentation. + documentation. report_section_order: "nf-core-phaseimpute-methods-description": order: -1000 diff --git a/conf/base.config b/conf/base.config index eda62680..2e591419 100644 --- a/conf/base.config +++ b/conf/base.config @@ -10,7 +10,6 @@ process { - // TODO nf-core: Check the defaults for all processes cpus = { 1 * task.attempt } memory = { 6.GB * task.attempt } time = { 4.h * task.attempt } @@ -24,7 +23,6 @@ process { // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. 
// If possible, it would be nice to keep the same label naming convention when // adding in your local modules too. - // TODO nf-core: Customise requirements for specific processes. // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors withLabel:process_single { cpus = { 1 } diff --git a/docs/development.md b/docs/development.md index a00dc1bd..62c5e17a 100644 --- a/docs/development.md +++ b/docs/development.md @@ -1,5 +1,9 @@ # Development +## Style + +Names of releases are composed of a color + a dog breed. + ## Features and tasks - [x] Add automatic detection of chromosome name to create a renaming file for the vcf files diff --git a/docs/output.md b/docs/output.md index 9ecaafec..6d3b1160 100644 --- a/docs/output.md +++ b/docs/output.md @@ -6,8 +6,6 @@ This document describes the output produced by the pipeline. Most of the plots a The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. - - ## Pipeline overview ## Panel preparation outputs `--steps panelprep` diff --git a/docs/usage.md b/docs/usage.md index b2c51bbf..4d515747 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -6,8 +6,6 @@ ## Introduction - - ## Samplesheet input You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. diff --git a/nextflow.config b/nextflow.config index ad6fcd64..34c76224 100644 --- a/nextflow.config +++ b/nextflow.config @@ -212,8 +212,7 @@ profiles { includeConfig !System.getenv('NXF_OFFLINE') && params.custom_config_base ? "${params.custom_config_base}/nfcore_custom.config" : "/dev/null" // Load nf-core/phaseimpute custom profiles from different institutions. 
-// TODO nf-core: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs -// includeConfig !System.getenv('NXF_OFFLINE') && params.custom_config_base ? "${params.custom_config_base}/pipeline/phaseimpute.config" : "/dev/null" +includeConfig !System.getenv('NXF_OFFLINE') && params.custom_config_base ? "${params.custom_config_base}/pipeline/phaseimpute.config" : "/dev/null" // Set default registry for Apptainer, Docker, Podman, Charliecloud and Singularity independent of -profile // Will not be used unless Apptainer / Docker / Podman / Charliecloud / Singularity are enabled @@ -276,7 +275,7 @@ manifest { description = """Phasing and imputation pipeline""" mainScript = 'main.nf' nextflowVersion = '!>=24.04.2' - version = '1.0dev' + version = '1.0.0' doi = '' } diff --git a/subworkflows/local/utils_nfcore_phaseimpute_pipeline/main.nf b/subworkflows/local/utils_nfcore_phaseimpute_pipeline/main.nf index ee064a39..ac2c462c 100644 --- a/subworkflows/local/utils_nfcore_phaseimpute_pipeline/main.nf +++ b/subworkflows/local/utils_nfcore_phaseimpute_pipeline/main.nf @@ -539,27 +539,40 @@ def genomeExistsError() { // Generate methods description for MultiQC // def toolCitationText() { - // TODO nf-core: Optionally add in-text citation tools to this list. // Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "Tool (Foo et al. 2023)" : "", // Uncomment function in methodsDescriptionText to render in MultiQC report def citation_text = [ - "Tools used in the workflow included:", - "FastQC (Andrews 2010),", - "MultiQC (Ewels et al. 2016)", - "." - ].join(' ').trim() + "Tools used in the workflow included:", + "BCFtools (Danecek et al. 2021),", + params.tools ? params.tools.split(',').contains("glimpse") ? "GLIMPSE (Rubinacci et al. 2021)," : "" : "", + params.tools ? params.tools.split(',').contains("glimpse2") ? "GLIMPSE2 (Rubinacci et al. 2023)," : "": "", + params.tools ?
params.tools.split(',').contains("quilt") ? "QUILT (Davies et al. 2021)," : "": "", + "SAMtools (Li et al. 2009),", + params.tools ? params.phased ? "SHAPEIT5 (Hofmeister et al. 2023)," : "": "", + params.tools ? params.phased ? "BEDtools (Quinlan and Hall 2010)," : "": "", + params.tools ? params.tools.split(',').contains("stitch") ? "STITCH (Davies et al. 2016)," : "": "", + "Tabix (Li et al. 2011),", + params.tools ? params.compute_freq ? "VCFlib (Garrison et al. 2022)," : "": "", + "." + ].join(' ').trim() return citation_text } def toolBibliographyText() { - // TODO nf-core: Optionally add bibliographic entries to this list. // Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "<li>Author (2023) Pub name, Journal, DOI</li>" : "", // Uncomment function in methodsDescriptionText to render in MultiQC report def reference_text = [ - "<li>Andrews S, (2010) FastQC, URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).</li>", - "<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>" - ].join(' ').trim() + params.phased ? "<li>Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi:10.1093/bioinformatics/btq033.</li>": "", + "<li>Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi:10.1093/bioinformatics/btp352.</li>", + "<li>Li H. (2011). Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011 Mar 1;27(5):718-9. doi:10.1093/bioinformatics/btq671.</li>", + params.tools ? params.tools.split(',').contains("quilt") ? "<li>Davies RW, Kucka M, Su D, Shi S, Flanagan M, Cunniff CM, Chan YF, & Myers S. (2021). Rapid genotype imputation from sequence with reference panels. Nature Genetics. doi:10.1038/s41588-021-00877-0.</li>" : "": "", + params.tools ? params.tools.split(',').contains("glimpse") ? "<li>Rubinacci S, Ribeiro DM, Hofmeister RJ, & Delaneau O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics. doi:10.1038/s41588-020-00756-0.</li>" : "": "", + params.tools ? params.tools.split(',').contains("glimpse2") ? "<li>Rubinacci S, Hofmeister RJ, Sousa da Mota B, & Delaneau O. (2023). Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nature Genetics. doi:10.1038/s41588-023-01438-3.</li>" : "": "", + params.phased ? "<li>Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet. 2023 Jul;55(7):1243-1249. doi:10.1038/s41588-023-01415-w.</li>" : "", + params.tools ? params.tools.split(',').contains("stitch") ? "<li>Davies RW, Flint J, Myers S, & Mott R. (2016). Rapid genotype imputation from sequence without reference panels. Nature Genetics.</li>" : "": "", + params.compute_freq ? "<li>Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. (2022). A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput Biol 18(5).</li>" : "", + ].join(' ').trim() return reference_text } @@ -588,9 +601,8 @@ def methodsDescriptionText(mqc_methods_yaml) { meta["tool_citations"] = "" meta["tool_bibliography"] = "" - // TODO nf-core: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled! - // meta["tool_citations"] = toolCitationText().replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".") - // meta["tool_bibliography"] = toolBibliographyText() + meta["tool_citations"] = toolCitationText().replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".") + meta["tool_bibliography"] = toolBibliographyText() def methods_text = mqc_methods_yaml.text
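
The citation sentence in `toolCitationText()` is built by joining a list in which unselected tools are empty strings, then stripping a dangling `, .` with `replaceAll`. An empty entry still contributes a separator when the list is joined, so a trailing comma can end up followed by two spaces rather than the single space the cleanup pattern expects. The following is a minimal sketch of one alternative, in Python rather than the pipeline's Groovy: drop unselected entries before joining so that no cleanup is needed. The function name `tool_citation_text` and its arguments are hypothetical stand-ins for `params.tools`, `params.phased` and `params.compute_freq`, not part of the pipeline.

```python
from typing import List, Optional, Tuple


def tool_citation_text(tools: Optional[str], phased: bool = False,
                       compute_freq: bool = False) -> str:
    """Return one sentence citing only the tools that were selected.

    Hypothetical helper: `tools` stands in for the comma-separated
    params.tools string used in the diff above.
    """
    selected = set(tools.split(",")) if tools else set()
    has_tools = bool(selected)
    # (label, include?) pairs. The conditions mirror the ternaries in the diff;
    # GLIMPSE is dated 2021 to match its bibliography entry.
    entries: List[Tuple[str, bool]] = [
        ("BCFtools (Danecek et al. 2021)", True),
        ("GLIMPSE (Rubinacci et al. 2021)", "glimpse" in selected),
        ("GLIMPSE2 (Rubinacci et al. 2023)", "glimpse2" in selected),
        ("QUILT (Davies et al. 2021)", "quilt" in selected),
        ("SAMtools (Li et al. 2009)", True),
        ("SHAPEIT5 (Hofmeister et al. 2023)", has_tools and phased),
        ("BEDtools (Quinlan and Hall 2010)", has_tools and phased),
        ("STITCH (Davies et al. 2016)", "stitch" in selected),
        ("Tabix (Li et al. 2011)", True),
        ("VCFlib (Garrison et al. 2022)", has_tools and compute_freq),
    ]
    # Filter first, then join: no empty items, so no stray commas or spaces.
    cited = [label for label, include in entries if include]
    return "Tools used in the workflow included: " + ", ".join(cited) + "."


print(tool_citation_text("glimpse2,quilt"))
```

Filtering before joining keeps the separator logic in one place. In Groovy, the same idea could be expressed by removing empty entries from the list before calling `join`, though that variant is not shown or tested here.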