Skip to content
Florence Thirion edited this page Nov 19, 2024 · 21 revisions

Introduction

Meteor2 is a plateform for quantitative metagenomics profiling of complex ecosystems. Meteor2 relies on genes catalogue to perform species-level taxonomic profiling, functional analysis and strain-level population structure inference of metagenomic shotgun sequencing data.

Dependencies

Besides python packages dependencies, Meteor requires:

Installation

Meteor is available with conda which includes all its dependencies:

conda create --name meteor -c conda-forge -c bioconda meteor

Or with pip with a recent Python 3.10+:

pip install meteor

You can test the installation of meteor with:

meteor test

Getting started

Typically, the process will follow these steps:

  1. Download or build a reference catalogue with meteor download
  2. Organize the fastq into a specific directory structure with meteor fastq
  3. Generate the gene count table with meteor mapping
  4. Compute taxonomical and/or functional abundances with meteor profile
  5. Merge sample profiles with meteor merge
  6. Generate strain profiles with meteor strain
  7. Build phylogenetic trees with meteor tree

To run following commands, you need to create a working directory:

mkdir $WORKING_DIR
cd $WORKING_DIR

1. Download your catalogue


Meteor2 requires to download locally a microbial gene catalogue. There are currently 10 catalogues available through Meteor2, each one corresponding to a specific ecosystem. Type meteor download --help to get the list of available ecosystems. All catalogues are structured in exactly the same way and contain everything you need to get the whole process done: fasta files of the genes, Metagenomic Species (MSP) definition, functional and taxonomic annotation, etc. You can learn more about catalogue files and structure here.

For a given ecosystem, two versions of the catalogue coexist: full and fast. The fast version is a streamlined version of the full version, reduced to the 100 core genes of each MSP. It uses less time and memory to process samples, with a slight loss of specificity. Of course, functional profiling, which demands accessory genes detection, estimation and annotation, cannot be perform in fast mode.

Here, for demonstration purpose, we will take advantage of the fast mode, and download the catalogue associated with the human gut ecosytem (1.2 G, ~2 minutes to get the .tar.gz file and extract it). Don't worry about the output directories, Meteor2 will create them if they do not exist:

meteor download -i human_gut --fast -o catalogue/

Don't find your favourite ecosystem? Feel free to open an issue so that we can add it in a future release!

If you have already built your own catalogue, and you want to use it with Meteor2, we explain here how you can achieve this.

2. Organize your sample directory


Meteor2 needs fastq files to be in a specific architecture (basically, one directory for one sample containing all its fastq, associated with specific json files). You will find comparable json files in all Meteor2 output. They are easily recognized thanks to their intruiguing suffix: _census_stage_N.json. In fastq directories, N = 0; in mapping directories, N = 1, etc. Don't overlook them, they are full of highly valuable information, such as "when? what parameters? which version?" or metrics directly computed on samples (examples will come later).

In this tutorial, we will work on samples from PRJEB42906, where healthy volunteers consumed a resistant dextrine or a control product and collected their feces at baseline and X weeks later. For saving time, we have selected 8 individuals from the resistant dextrine groups, and rarefied their samples (n = 16) to 100,000 reads each (200M in total). To download them and extract them in fastq/ directory, type:

wget

All the fastq files are in the same directory. Fortunately, Meteor2 will help you in creating the correct directory structure (here in sample/):

meteor fastq -i fastq/ -o sample/

Note that you do not need to create sample/; Meteor2 will create it if it does not exist (as all Meteor2 output directories).

Take a few seconds to navigate through sample/. You will notice that Meteor2 did not delete the files from fastq/ nor did it copied them: only symbolic links were created. So don't remove or delete the initial files! Have also a look at the json file that was created. It contains only minimal set of information needed for Meteor2, but if you need to keep more staff about your sample (for example, the sequencing perform), you can add a field 'additional_info' and put anything you want in it.

In this case, we had only one fastq per sample. If you have multiple fastq for one file, use the option -m so that Meteor2 knows they come from the same sample and should be treated as one unique fastq. For example, if filenames contain the string 'SAMPLE_01' or 'SAMPLE_02', you can specify to Meteor2 to create two directories SAMPLE_01/ and SAMPLE_02/ with the command -m SAMPLE_\\d+.

3. Generate the gene count table


Since you have your catalogue and your samples in the correct structure, you can finally start the actual sample processing! Mapping the fastq files against the gene catalogue is a crucial step, as the mapped files (in cram format) really are the central piece of the whole Meteor2 machinery. They will be used to generate the gene count table from which species and functions abundance arise, but also to call SNPs and profile your sample at strain level.

In short, Meteor2 calls bowtie2 to map reads against the catalogue genes; aligned reads are then filtered to keep only those above a customizable identity threshold (--id). When using a full catalogue, we will generally choose 95% identity. With the fast version, to decrease false positives number, we recommend to use 97% identity. These thresholds are used by Meteor2, which automatically adapts its default parameter to the kind of catalogue you specified. Bowtie2 output is a cram file: based on this, Meteor2 will count how many reads mapped on each gene and generate the gene count table. Different counting strategies are currently implemented (total, unique, or smart shared - the default one). Go to this page if you want to find more about counting stategies. By default, the cram file generated during the mapping step is deleted at the end of the task. If you are only interested in profiling at species and functional level, that's not an issue. However, if you want to perform strain analysis, you need to specify Meteor2 to keep this file with the option --kf. More precisely, this option ensures that you keep only the part of the cram fime required for strain profiling - i.e., reads mapping on the 100 core genes of an MSP.

To map reads from the sample ERS12377136 on your catalogue and keep the filtered cram file for further strain analysis, simply run the following command:

meteor mapping -i sample/ERS12377136 -r catalogue/hs_10_4_gut_taxo/ -o mapping/ --kf -t 8

ATTENTION: As you noticed, meteor mapping only accepts one sample directory as input! To run it on several samples, we have to make a loop (~ 5 minutes to run)...

for SUBDIR in sample/*/; do [ -d "$SUBDIR" ] && meteor mapping -i "$SUBDIR" -o mapping/ -r catalogue/hs_10_4_gut_taxo/ --kf -t 8; done

...or create a workflow!

You can now inspect the contents of mapping. Even if you specified the same output directory for all samples, Meteor2 created for each sample a specific subfolder, containing the (1) the gene count table, (2) the filtered cram file (and its index), and (3) a census json file. Check the json file to know which identity threshold was used by Meteor2! And most importantly, in the json file you will find information about mapping rate, i.e., the percentage of reads that found a match in the gene catalogue. You absolutely need this metrics to know if the catalogue is representative enough of your sample. A correct mapping rate shoud be above 60% (at minimum). If not, it means that your sample contains a significant amount of species that are not in the catalogue, and that a large part of the fastq information was missed. In such a case, a solution is to use another catalogue or create one, more adapted to your project.

If you take a look at the json file, you will see two different fields: the 'overall_alignment_rate' (number of reads that pass bowtie2 threshold divided by total read count) and 'counted_reads' (the actual number of reads that pass the identity threshold filter). In our example, these figures are very low: since Meteor2 fast mode uses only core genes, mapping rate metrics are not relevant to determine if the catalogue is appropriate.

Note: Meteor2 doesn't filter reads before mapping on catalogue. We thus recommend to first filter out reads with low-quality, length < 60nt or belonging to the host. If you want to perform rarefaction, you should also consider doing it prior to Meteor2.

4. Compute species and functions abundance


Now that we have the gene count table, we can compute species abundances. As a reminder, genes from the catalogue are clustered into Metagenomic Species Pangenomes based on co-abundance thanks to MSPminer. Into a single cluster, genes are classified as 'core' or 'accessory'. To correctly assess an MSP abundance, we simply need to compute the mean abundance over 100 core genes of the MSP - that's why we can use 'fast' catalogues.

To reduce false positive species that could arise from mismapped reads, abundance of the species without enough core genes detected is forced to 0. With the full catalogue, 'enough' core genes usually means 10% of the 100 core genes used for species profiling. In fast mode, the default threshold is set to 20%.

However, to be comparable between genes, counts must first be normalized for gene length. Two normalization methods are implemented in Meteor2: coverage or FPKM. For more details about the calculation, see here. Meteor2 also enable to perform rarefaction directly on the gene count table, to ensure same sequencing depth between samples. This technique is limited to single-read samples since it could bias paired-end samples. Anyway, if you want to rarefy your samples to a common sequencing depth, we recommend you to do prior to running Meteor2.

As for meteor mapping, meteor profile takes a single sample as input:

meteor profile -i mapping/ERS12377136 -r catalogue/hs_10_4_gut_taxo/ -o profiles/ -n coverage

To run meteor profile on all samples, run the following command:

for SUBDIR in mapping/*/; do [ -d "$SUBDIR" ] && meteor profile -i "$SUBDIR" -o profiles/ -r catalogue/hs_10_4_gut_taxo/ -n coverage; done

Have a look at the profiles/ directory. Again, Meteor2 created the same architecture, with all results of a sample in one subdirectory. In each subdirectory, the census_stage_2.json file has extended the census_stage_1.json to add new fields related to profiling. For example, we can now check that Meteor2 forced MSP with less than 20% of core genes detected at 0 (field 'msp_filter'). There are three other outputs: the gene raw count table (a symbolic link towards mapping output), the normalized gene count table, and the MSP table.

In the case of full catalogue, the same command would have created all functional tables along with the species tables, without any further parameter. More detais about how functions are computed can be found here.

5. Merge profile tables

As you have noticed, taxonomic profiles of different samples lie in different subdirectories. To gather them into a single file that will be more convenient for downstream analysis, you can run

To merge output from different samples into a single table, use the following command:

meteor merge -i /profiles -r catalogue/hs_10_4_gut_taxo/ -o merging/ -p demo

Meteor2 has created the directory merging/ (if it did not exist) and merged all sample profiles into a single tsv file. Of note, MSP with 0 abundance in all merged samples are removed from the merged tsv file. In the same time, Meteor2 also generated a taxonomy file (containing taxonomy of the detected MSP) ad a report file, regrouping all information from the census_state_2.json files. All files are prefixed with 'demo' as we specified with the parameter -p. Thus, a single merging folder can contain merging output from several runs, if you specify a different prefix. If not, the previous output will simply be deleted. By default, Meteor2 doesn't merge gene tables - to save time and memory. If you need them, for example to inspect accessory genes content of an MSP, add the parameter -g/

We now have the MSP abundance table that can be analyzed in Rserver or directly uploaded in shaman for example. According the the PRJEB42906 paper, Parabacteroides distasonis was increased by the resistant dextrin. Is it still the case with this sample selection? To find out, first download metadata:

wget
library(dplyr)
library(ggplot2)
library(ggpubr)

demo_metadata = read.delim("metadata/demo_metadata.tsv", stringsAsFactors = FALSE)

demo_msp = read.delim("merging/demo_msp.tsv",
                      stringsAsFactors = FALSE, row.names = 1)

all_metadata = demo_metadata %>%
  left_join(tibble::rownames_to_column(as.data.frame(t(demo_msp)), var = "sample"),
            by = join_by(sample))

ggplot(all_metadata, aes(timepoint, msp_0012)) +
  geom_boxplot() +
  geom_line(aes(group = individual_id), colour = "lightgrey", linetype = "dashed") +
  geom_point() +
  stat_compare_means(paired = T) +
  theme_pubr()

image

6. Profile strain


We observed that P. distasonis is strongly increased by resistant dextrin consumption - but not in all individuals! Is there possibly a strain effect?

We can answer that question by profiling strains in each sample. From the filtered cram file that we kept at the mapping step, Meteor2 will run freebayes to perform SNP calling and generate a consensus fasta for each MSP - at least, for each MSP with sufficient signal. A consensus fasta consists in the concatenated sequences of the MSP 100 core genes, where the original bases been replaced by alternative bases according to freebayes output. Positions where information is lacking (i.e., not enough reads mapped) are simply replaced by N. We usually consider MSP with at least 80% of the core genes covered at a minimum of 50%. For the purpose of demonstration, since sample depth was rarefied to 100,000 reads, we can lower these threshold to get some signal. WARNING: with such low thresholds results may not be meaningful, this is for demonstration only!

To run meteor strain on one sample:

meteor strain -i mapping/ERS12377136 -o strain/ -r catalogue/hs_10_4_gut_taxo/ -m 10 -c 0.1 --kc; done

To run meteor strain on all samples (takes ~30 minutes to run):

for SUBDIR in mapping/*/; do [ -d "$SUBDIR" ] && meteor strain -i "$SUBDIR" -o strain/ -r catalogue/hs_10_4_gut_taxo/ -m 10 -c 0.1 --kc; done

7. Build trees

Using the different MSP consensus fasta generated for each sample where signal was sufficient, Meteor2 computes mutation rates and build trees between strains from samples using a GTR+GAMMA model with the following command:

meteor tree -i strain/ -o tree/ -g 1 -t 10

For each MSP with more than 4 samples to follow, Meteor2 generated a tree (.tree file) and a mutation rate matrix (.tsv) file.

Citation

If you used Meteor2 in your work, please cite:

Ghlozlane et al. Meteor2: accurate profiling of microbial communities for shotgun metagenomics. Manuscript in preparation

The METEOR team

The main contributors to METEOR:

  • Franck Gauthier
  • Amine Ghozlane
  • Florian Plaza Oñate
  • Nicolas Pons
  • Florence Thirion

Acknowledgements

Special thanks to the following people:

  • Mathieu Almeida
  • Emmanuelle Le Chatelier
Clone this wiki locally