Commit

Merge pull request #1106 from nf-core/kallisto_quant
Kallisto quantification

drpatelh authored Nov 15, 2023
2 parents f189c95 + 5ad263a commit ab9df9a
Showing 41 changed files with 1,361 additions and 494 deletions.
17 changes: 10 additions & 7 deletions .github/workflows/ci.yml
@@ -232,16 +232,19 @@ jobs:
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --aligner hisat2 ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
- salmon:
-   name: Test Salmon with workflow parameters
+ pseudo:
+   name: Test Pseudoaligners with workflow parameters
if: ${{ (github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/rnaseq')) && !contains(github.event.head_commit.message, '[ci fast]') }}
runs-on: ubuntu-latest
strategy:
matrix:
parameters:
- - "--skip_qc"
- - "--skip_alignment --skip_pseudo_alignment"
- - "--salmon_index false --transcript_fasta false"
+ - "--pseudo_aligner salmon --skip_qc"
+ - "--pseudo_aligner salmon --skip_alignment --skip_pseudo_alignment"
+ - "--pseudo_aligner salmon --salmon_index false --transcript_fasta false"
+ - "--pseudo_aligner kallisto --skip_qc"
+ - "--pseudo_aligner kallisto --skip_alignment --skip_pseudo_alignment"
+ - "--pseudo_aligner kallisto --kallisto_index false --transcript_fasta false"
steps:
- name: Check out pipeline code
uses: actions/checkout@v2
@@ -280,6 +283,6 @@ jobs:
wget -qO- get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
- - name: Run pipeline with Salmon and various parameters
+ - name: Run pipeline with Salmon or Kallisto and various parameters
run: |
- nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --pseudo_aligner salmon ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
+ nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
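For illustration, the expanded matrix can be enumerated in Python to see which pipeline invocations CI now exercises — three Salmon and three Kallisto configurations, each run through the single parameterised `nextflow run` step. This is only a sketch of what the Actions runner does with `strategy.matrix`, not part of the workflow itself:

```python
# Illustrative expansion of the CI matrix above: each entry becomes one
# pipeline run. The Actions runner does this internally; this sketch only
# shows the commands that result.
matrix = [
    "--pseudo_aligner salmon --skip_qc",
    "--pseudo_aligner salmon --skip_alignment --skip_pseudo_alignment",
    "--pseudo_aligner salmon --salmon_index false --transcript_fasta false",
    "--pseudo_aligner kallisto --skip_qc",
    "--pseudo_aligner kallisto --skip_alignment --skip_pseudo_alignment",
    "--pseudo_aligner kallisto --kallisto_index false --transcript_fasta false",
]

commands = [
    f"nextflow run . -profile test_cache,docker {params} --outdir ./results"
    for params in matrix
]

for cmd in commands:
    print(cmd)
```

Because the pseudoaligner choice now lives in the matrix entries rather than in the run command, adding another tool later only requires new matrix strings.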
10 changes: 7 additions & 3 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
Special thanks to the following for their contributions to the release:

- [Adam Talbot](https://github.com/adamrtalbot)
+ - [Jonathan Manning](https://github.com/pinin4fjords)
- [Júlia Mir Pedrol](https://github.com/mirpedrol)
- [Matthias Zepper](https://github.com/MatthiasZepper)
- [Maxime Garcia](https://github.com/maxulysse)
@@ -28,13 +29,16 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
- [PR #1083](https://github.com/nf-core/rnaseq/pull/1083) - Move local modules and subworkflows to subfolders
- [PR #1088](https://github.com/nf-core/rnaseq/pull/1088) - Updates contributing and code of conduct documents with nf-core template 2.10
- [PR #1091](https://github.com/nf-core/rnaseq/pull/1091) - Reorganise parameters in schema for better usability
+ - [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - Kallisto quantification
+ - [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - MultiQC [version bump](https://github.com/nf-core/rnaseq/pull/1106/commits/aebad067a10a45510a2b421da852cb436ae65fd8)
- [#1050](https://github.com/nf-core/rnaseq/issues/1050) - Provide custom prefix/suffix for summary files to avoid overwriting

### Software dependencies

| Dependency | Old version | New version |
| ----------------------- | ----------- | ----------- |
| `fastqc` | 0.11.9 | 0.12.1 |
- | `multiqc` | 1.14 | 1.15 |
+ | `multiqc` | 1.14 | 1.17 |
| `ucsc-bedgraphtobigwig` | 377 | 445 |

> **NB:** Dependency has been **updated** if both old and new version information is present.
@@ -61,7 +65,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
### Enhancements & fixes

- [[#1011](https://github.com/nf-core/rnaseq/issues/1011)] - FastQ files from UMI-tools not being passed to fastp
- - [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudo-alignment to only run pre-processing QC steps.
+ - [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudoalignment to only run pre-processing QC steps.
- [PR #1016](https://github.com/nf-core/rnaseq/pull/1016) - Updated pipeline template to [nf-core/tools 2.8](https://github.com/nf-core/tools/releases/tag/2.8)
- [PR #1025](https://github.com/nf-core/fetchngs/pull/1025) - Add `public_aws_ecr.config` to source mulled containers when using `public.ecr.aws` Docker Biocontainer registry
- [PR #1038](https://github.com/nf-core/rnaseq/pull/1038) - Updated error log for count values when supplying `--additional_fasta`
@@ -809,7 +813,7 @@ Major novel changes include:
- Added options to skip several steps
- Skip trimming using `--skipTrimming`
- Skip BiotypeQC using `--skipBiotypeQC`
- - Skip Alignment using `--skipAlignment` to only use pseudo-alignment using Salmon
+ - Skip Alignment using `--skipAlignment` to only use pseudoalignment using Salmon

### Documentation updates

2 changes: 1 addition & 1 deletion README.md
@@ -39,7 +39,7 @@
3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
- 15. Pseudo-alignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/); _optional_)
+ 15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or [`Kallisto`](https://pachterlab.github.io/kallisto/); _optional_)
16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

> **Note**
2 changes: 2 additions & 0 deletions assets/multiqc_config.yml
@@ -23,6 +23,7 @@ run_modules:
- hisat2
- rsem
- salmon
+ - kallisto
- samtools
- picard
- preseq
@@ -66,6 +67,7 @@ extra_fn_clean_exts:
- ".umi_dedup"
- "_val"
- ".markdup"
+ - "_primary"

# Customise the module search patterns to speed up execution time
# - Skip module sub-tools that we are not interested in
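The `_primary` entry added to `extra_fn_clean_exts` lets MultiQC trim that suffix from filenames, so metrics derived from files like `SAMPLE1_primary.flagstat` report under the bare sample name and line up with outputs from other tools. A rough sketch of the idea, using a hypothetical `clean_sample_name` helper — this is a conceptual illustration, not MultiQC's actual implementation:

```python
# Conceptual illustration of MultiQC's filename cleaning: each configured
# pattern is searched for in the filename, and the name is truncated at the
# first match so different tools' outputs collapse onto one sample name.
# (clean_sample_name is a hypothetical helper, not a MultiQC API.)
def clean_sample_name(filename, clean_exts):
    name = filename
    for ext in clean_exts:
        idx = name.find(ext)
        if idx != -1:
            name = name[:idx]
    return name

exts = [".umi_dedup", "_val", ".markdup", "_primary"]
print(clean_sample_name("SAMPLE1_primary.flagstat", exts))  # SAMPLE1
```

Without the new pattern, `SAMPLE1_primary` and `SAMPLE1` would appear as two different samples in the report.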
89 changes: 0 additions & 89 deletions bin/salmon_tx2gene.py

This file was deleted.

@@ -2,18 +2,18 @@

library(SummarizedExperiment)

- ## Create SummarizedExperiment (se) object from Salmon counts
+ ## Create SummarizedExperiment (se) object from counts

args <- commandArgs(trailingOnly = TRUE)
- if (length(args) < 2) {
-     stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
+ if (length(args) < 3) {
+     stop("Usage: summarizedexperiment.r <coldata> <counts> <tpm> <tx2gene>", call. = FALSE)
}

coldata <- args[1]
counts_fn <- args[2]
tpm_fn <- args[3]
+ tx2gene <- args[4]

- tx2gene <- "salmon_tx2gene.tsv"
info <- file.info(tx2gene)
if (info$size == 0) {
tx2gene <- NULL
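The script now takes the tx2gene path as a fourth positional argument instead of hardcoding `salmon_tx2gene.tsv`, while keeping the existing guard that treats an empty file as "no mapping supplied". Transposed to Python for illustration (a sketch of the same guard, not part of the pipeline):

```python
import os
import tempfile

def tx2gene_or_none(path):
    # Mirror of the R guard: file.info(tx2gene)$size == 0 -> tx2gene <- NULL
    return None if os.path.getsize(path) == 0 else path

# Demo with throwaway files
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("tx1\tg1\tGENE1\n")
populated = tx2gene_or_none(f.name)

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    pass
empty = tx2gene_or_none(f.name)

print(populated is not None, empty)
```

Passing the path explicitly is what lets the same script serve both Salmon and Kallisto output.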
160 changes: 160 additions & 0 deletions bin/tx2gene.py
@@ -0,0 +1,160 @@
#!/usr/bin/env python
import logging
import argparse
import glob
import os
from collections import Counter, defaultdict, OrderedDict
from collections.abc import Set
from typing import Dict

# Configure logging
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
"""
Read the top 100 transcripts from the quantification file.
Parameters:
quant_dir (str): Directory where quantification files are located.
file_pattern (str): Pattern to match quantification files.
Returns:
set: A set containing the top 100 transcripts.
"""
try:
# Find the quantification file within the directory
quant_file_path = glob.glob(os.path.join(quant_dir, "*", file_pattern))[0]
with open(quant_file_path, "r") as file_handle:
# Read the file and extract the top 100 transcripts
return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
except IndexError:
# Log an error and raise a FileNotFoundError if the quant file does not exist
logger.error("No quantification files found.")
raise FileNotFoundError("Quantification file not found.")


def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
"""
Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.
Parameters:
gtf_file (str): Path to the GTF file.
transcripts (Set[str]): A set of transcripts to match in the GTF file.
Returns:
str: The attribute name that corresponds to transcripts in the GTF file.
"""
votes = Counter()
with open(gtf_file) as inh:
# Read GTF file, skipping header lines
for line in filter(lambda x: not x.startswith("#"), inh):
cols = line.split("\t")
# Parse attribute column and update votes for each attribute found
attributes = dict(item.strip().split(" ", 1) for item in cols[8].split(";") if item.strip())
votes.update(key for key, value in attributes.items() if value.strip('"') in transcripts)

if not votes:
# Log a warning if no matching attribute is found
logger.warning("No attribute in GTF matching transcripts")
return ""

# Check if 'transcript_id' is among the attributes with the highest votes
if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
logger.info("Attribute 'transcript_id' corresponds to transcripts.")
return "transcript_id"

# If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
attribute, _ = votes.most_common(1)[0]
logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
return attribute


def parse_attributes(attributes_text: str) -> Dict[str, str]:
"""
Parse the attributes column of a GTF file.
:param attributes_text: The attributes column as a string.
:return: A dictionary of the attributes.
"""
# Split the attributes string by semicolon and strip whitespace
attributes = attributes_text.strip().split(";")
attr_dict = OrderedDict()

# Iterate over each attribute pair
for attribute in attributes:
# Split the attribute into key and value, ensuring there are two parts
parts = attribute.strip().split(" ", 1)
if len(parts) == 2:
key, value = parts
# Remove any double quotes from the value
value = value.replace('"', "")
attr_dict[key] = value

return attr_dict


def map_transcripts_to_gene(
quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
) -> bool:
"""
Map transcripts to gene names and write the output to a file.
Parameters:
quant_type (str): The quantification method used (e.g., 'salmon').
gtf_file (str): Path to the GTF file.
quant_dir (str): Directory where quantification files are located.
gene_id (str): The gene ID attribute in the GTF file.
extra_id_field (str): Additional ID field in the GTF file.
output_file (str): The output file path.
Returns:
bool: True if the operation was successful, False otherwise.
"""
# Read the top transcripts based on quantification type
transcripts = read_top_transcripts(quant_dir, "quant.sf" if quant_type == "salmon" else "abundance.tsv")
# Discover the attribute that corresponds to transcripts in the GTF
transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)

if not transcript_attribute:
# If no attribute is found, return False
return False

# Open GTF and output file to write the mappings
# Initialize the set to track seen combinations
seen = set()

with open(gtf_file) as inh, open(output_file, "w") as output_handle:
# Parse each line of the GTF, mapping transcripts to genes
for line in filter(lambda x: not x.startswith("#"), inh):
cols = line.split("\t")
attr_dict = parse_attributes(cols[8])
if gene_id in attr_dict and transcript_attribute in attr_dict:
# Create a unique identifier for the transcript-gene combination
transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])

# Check if the combination has already been seen
if transcript_gene_pair not in seen:
# If it's a new combination, write it to the output and add to the seen set
extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
seen.add(transcript_gene_pair)

return True


# Main function to parse arguments and call the mapping function
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
parser.add_argument("--gtf", type=str, help="GTF file", required=True)
parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")

args = parser.parse_args()
if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
logger.error("Failed to map transcripts to genes.")
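The mapping logic above can be sanity-checked in isolation. The following sketch re-implements `parse_attributes` from the new script, runs it on a toy GTF attribute string (the `tx1`/`g1`/`GENE1` identifiers are made up), and builds the three-column row format that `map_transcripts_to_gene` writes to `tx2gene.tsv`:

```python
from collections import OrderedDict

# Minimal round-trip of the attribute parsing in bin/tx2gene.py: split the
# GTF attribute column on ';', then each 'key "value"' pair on the first
# space, stripping quotes from values.
def parse_attributes(attributes_text):
    attr_dict = OrderedDict()
    for attribute in attributes_text.strip().split(";"):
        parts = attribute.strip().split(" ", 1)
        if len(parts) == 2:
            key, value = parts
            attr_dict[key] = value.replace('"', "")
    return attr_dict

# Toy attribute column (made-up identifiers)
gtf_attrs = 'gene_id "g1"; transcript_id "tx1"; gene_name "GENE1";'
attrs = parse_attributes(gtf_attrs)

# tx2gene row: <transcript> <gene_id> <extra field, falling back to gene_id>,
# matching the write in map_transcripts_to_gene
extra = attrs.get("gene_name", attrs["gene_id"])
row = f'{attrs["transcript_id"]}\t{attrs["gene_id"]}\t{extra}'
print(row)
```

The fallback to `gene_id` in the third column mirrors `attr_dict.get(extra_id_field, attr_dict[gene_id])` above, so GTFs without a `gene_name` attribute still yield a complete three-column table.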