Skip to content

Preprocessing script to reduce noise in orthology inference

Notifications You must be signed in to change notification settings

vflorelo/OrthoPrep

Repository files navigation

OrthoPrep

TL/DR

OrthoPrep can increase the sensitivity of OrthoFinder by decreasing the MCL inflation coefficient and removing spurious BLAST hits that come from comparing sequences of significantly different sizes, and/or that may contain low complexity regions (LCRs) that may result in false positive hits.

Example run

Given a directory /home/vflorelo/sequences containing the following files:

  • Plasmodium_falciparum_3D7.faa
  • Plasmodium_reichenowi_CDC.faa
  • Toxoplasma_gondii_ME49.faa
  • Toxoplasma_gondii_RH88.faa
make_control_file.sh \
  --fasta_dir /home/vflorelo/sequences \
  --same_species 0.35 \
  --same_genus   0.40 \
  --diff_genus   0.45 \
  --decrement    0.05 > control_file.tsv

OrthoPrep.sh \
  --fasta_dir     /home/vflorelo/sequences \
  --masking       TRUE \
  --eff_len       TRUE \
  --search_method diamond \
  --blast_filter  TRUE \
  --control_file  control_file.tsv \
  --threads 16

orthofinder.py -b OrthoPrep-2024-08-01

Dependencies

Implementation

OrthoPrep takes a directory containing fasta files with the proteomes of the organisms to be studied, it then calculates the effective length of each protein sequence by excluding low complexity regions. The hard-masked sequences are then subjected to all-versus-all comparisons using diamond but the reciprocal best hits are constrained by applying a size filter.

For any protein a length-interval is calculated, if the length-interval of the corresponding hit overlaps with that of the query, the pair is kept as a potential reciprocal best hit, if the intervals don't overlap, the pair is excluded.

Once the potential reciprocal best hits are determined, the sequences are reverted to remove the hard-masking and the OrthoFinder analysis can resume normaly.

The length intervals can be expressed as a fixed fraction of the protein length, or they can be adjusted based on specific comparisons:

Species 1 Species 2 Size filter
Toxoplasma gondii ME49 Toxoplasma gondii RH88 Strict
Plasmodium falciparum 3D7 Plasmodium reichenowi CDC Medium
Plasmodium falciparum 3D7 Toxoplasma gondii RH88 Relaxed

In the example above, proteins coming from the same species (but different strains) are not expected to vary too much in length, thus a strict filter (up to 35% of length tolerance) can be used to keep only meaningful reciprocal best hits. For proteins coming from the same genus but different species, a medium filter might be useful (up to 40% of length tolerance), and for species of different genera, a relaxed filter is recommended (up to 45% of length tolerance).

Options

Given a directory /home/vflorelo/sequences containing the following files:

  • Plasmodium_falciparum_3D7.faa
  • Plasmodium_reichenowi_CDC.faa
  • Toxoplasma_gondii_ME49.faa
  • Toxoplasma_gondii_RH88.faa

The user can run orthoprep in several ways

General preprocessing:

It is assumed that orthologous proteins can vary in size due to insertion/deletion events as well as contraction/expansion of repetitive elements. Thus, if the proteomes of two closely related organisms are compared, it is not expected that their proteins vary too much in length if they indeed come from the same ancestor. However, if the organisms are more distantly related, their proteins might display more variance in size.

OrthoPrep requires a control file specifying the tolerance in protein length difference, this file can be prepared with the ancilliary script make_control_file.sh

make_control_file.sh \
  --fasta_dir /home/vflorelo/sequences \
  --same_species 0.35 \
  --same_genus   0.40 \
  --diff_genus   0.45 \
  --decrement    0.05 > control_file.tsv

The contents of the resulting tsv file will be:

Plasmodium_falciparum_3D7.faa Plasmodium_falciparum_3D7.faa 0.35 0.3
Plasmodium_falciparum_3D7.faa Plasmodium_reichenowi_CDC.faa 0.40 0.35
Plasmodium_falciparum_3D7.faa Toxoplasma_gondii_ME49.faa 0.45 0.4
Plasmodium_falciparum_3D7.faa Toxoplasma_gondii_RH88.faa 0.45 0.4
Plasmodium_reichenowi_CDC.faa Plasmodium_reichenowi_CDC.faa 0.35 0.3
Plasmodium_reichenowi_CDC.faa Toxoplasma_gondii_ME49.faa 0.45 0.4
Plasmodium_reichenowi_CDC.faa Toxoplasma_gondii_RH88.faa 0.45 0.4
Toxoplasma_gondii_ME49.faa Toxoplasma_gondii_ME49.faa 0.35 0.3
Toxoplasma_gondii_ME49.faa Toxoplasma_gondii_RH88.faa 0.40 0.35
Toxoplasma_gondii_RH88.faa Toxoplasma_gondii_RH88.faa 0.35 0.3

Indicating the accepted intervals for keeping potential RBHs when comparing the specified proteomes:

  • When two proteomes of the same species are compared, the tolerance for length differences is at most 35% of the length of the compared proteins
  • When two proteomes of the same genus but different species are compared, the tolerance for length differences is at most 40% of the length of the compared proteins
  • When the compared proteomes come from different genera, the tolerance for length differences is at most 45% of the length of the compared proteins

The user can also specify a fixed tolerance for all comparisons, but that would be fairly unrealistic, specially when species of a broad phylogenetic span are included in the analysis. Once the control file is ready, the user can apply OrthoPrep to the sequences in the study.

Full run mode:

In this mode, OrthoPrep masks the sequences by detecting LCRs with seg and fLPS2, then calculates the effective lengths based on the extent of LCRs. BLAST hits are detected using diamond (or any valid option present in the OrthoFinder configuration) using the masked sequences. Once the BLAST comparisons have finished, the hit table is filtered with the length constrains specified in control_file.tsv.

OrthoPrep.sh \
  --fasta_dir     /home/vflorelo/sequences \
  --masking       TRUE \
  --eff_len       TRUE \
  --search_method diamond \
  --blast_filter  TRUE \
  --control_file  control_file.tsv \
  --threads 16

LCR masking but not filtering:

In this mode, the sequences are masked for LCRs, but no filter is applied to the RBH detection. This setting is useful to estimate the amount of noise introduced by LCRs.

OrthoPrep.sh \
  --fasta_dir     /home/vflorelo/sequences \
  --masking       TRUE \
  --eff_len       FALSE \
  --search_method diamond \
  --blast_filter  FALSE \
  --threads 16

Size filtering without masking:

In this mode, the program calculates the effective length of each protein by detecting LCRs, however, the sequences remain unmasked for the RBH detection. Once the BLAST comparisons have finished, the hit table is filtered with the length constrains specified in control_file.tsv. This setting might be useful to detect orthologs in which the LCRs are actually a functional part of the proteins and they may truly reflect speciation events

OrthoPrep.sh \
  --fasta_dir     /home/vflorelo/sequences \
  --masking       FALSE \
  --eff_len       TRUE \
  --search_method diamond \
  --blast_filter  TRUE \
  --control_file  control_file.tsv \
  --threads 16

Size filtering without detection of LCRs:

In this mode, the program does not detect LCRs, thus the size filtering is rather strict and might lead to overclustered orthogroups, might be useful to analyse strain from closely related species

OrthoPrep.sh \
  --fasta_dir     /home/vflorelo/sequences \
  --masking       FALSE \
  --eff_len       FALSE \
  --search_method diamond \
  --blast_filter  TRUE \
  --control_file  control_file.tsv \
  --threads 16

Output

OrthoPrep produces a directory that contains filtered diamond/BLAST files which can be directly passed to OrthoFinder with the option -b

Background

The detection of orthologs is always challenging and the results may vary depending on the employed software, the number and diversity of the organisms included in a given analysis. There are many available programs for orthology inference, and their performance, accuracy and recall are evaluated by the Quest for Orthologs (QfO) consortium.

From the current QfO assessment, OrthoFinder appears to be one of the most popular ref and accurate programs ref. While the default options for OrthoFinder work generally well for any set of organisms, the intrinsic properties of the studied proteins can often result in problematic orthology assessment

A general issue in homology detection is the presence of false matches resulting from unrelated proteins sharing repetitive regions or regions with very biased amino acid composition, the so-called Low Complexity Regions (LCRs). This issue has been addressed previously in softwares such as InParanoid ref, however, newer versions of InParanoid have removed the LCR filtering step due to significantly increased run times ref.

Due to the local alignments used by diamond or BLAST, it might be possible that two proteins are placed in the same orthogroup despite being of considerable different sizes. The difference in protein size can arise from fused genes, large insertions/deletions and/or expansion/contraction of LCRs and repetitive elements. It has been proposed that protein function can be influenced by protein size, thus, two proteins of significantly different sizes might not be likely to perform the same function.

In order to address the presence of LCRs, as well as the differences in protein length, and their potential effects on orthology inference, we developed OrthoPrep, an OrthoFinder preprocessor that limits the number of significant matches based on sequence composition and effective search space.

The full pipeline includes:

  • Detection of LCRs using seg and fLPS2
  • Adjustment of the proteins' effective length based on the presence of LCRs
  • Masking of sequences before running diamond/BLAST
  • Selecting only pairs of proteins of comparable size with a taxonomy-based tolerance

By default, OrthoFinder uses diamond with a permissive e-value (1e-3) and soft-masking of common motifs with the --more-sensitive option. Genes belonging to the same orthogroup are connected by a network approach using MCL with an inflation coefficient of 1.5.

Initial assessment of OrthoPrep

Apicomplexan dataset

For the different tests we ran, we used the reference proteomes of the following organisms:

Organism Protein count Source Organism Protein count Source
Babesia bovis 3974 PiroplasmaDB Cardiosporidium cionae 4566 RefSeq
Cryptosporidium parvum 3944 CryptoDB Cyclospora cayetanensis 7153 ToxoDB
Eimeria tenella 7268 ToxoDB Haemoproteus tartakovskyi 4860 PlasmoDB
Hepatocystis 5341 PlasmoDB Neospora caninum 7364 ToxoDB
Nephromyces 10628 RefSeq Plasmodium falciparum 5389 PlasmoDB
Plasmodium gaboni 5356 PlasmoDB Plasmodium gallinaceum 5286 PlasmoDB
Plasmodium knowlesi 5328 PlasmoDB Plasmodium reichenowi 5699 PlasmoDB
Plasmodium vivax 6523 PlasmoDB Sarcocystis neurona 6965 ToxoDB
Theileria annulata 3795 PiroplasmaDB Toxoplasma gondii 8322 ToxoDB

Diamond/BLAST, MCL and size filtering parameters

To assess the impact of OrthoPrep, we ran OrthoFinder under several conditions:

  • We used two e-value thresholds: 1e-3 and 1e-6
  • We used diamond with and without the --more-sensitive option
  • We used two inflation coefficient values for MCL: 1.5 and 1.1

We compared the orthogroups obtained using all the combinations of the above parameters with and without preprocessing the sequences with OrthoPrep

In the following table we include the number of genes found in the orthogroup containing PfEMP1, a well characterised gene in Plasmodium falciparum

PfEMP1

Test e-value diamond I.C. OrthoPrep B. bov C. cio C. par C. cay E. ten H. tar Hep N. can Nep P. fal P. gab P. gal P. kno P. rei P. viv S. neu T. ann T. gon Total
1 1e-3 more_sensitive 1.5 × 0 0 0 0 0 0 0 0 0 56 0 0 0 0 0 0 0 0 56
2 1e-3 more_sensitive 1.5 0 0 0 0 0 0 0 0 0 47 0 0 0 1 0 0 0 0 48
3 1e-3 more_sensitive 1.1 × 1 1 0 4 5 6 9 5 4 80 114 12 11 107 20 5 1 6 391
4 1e-3 more_sensitive 1.1 0 0 0 0 0 0 0 0 0 58 0 0 0 60 0 0 0 0 118
5 1e-3 N/A 1.5 × 0 0 0 0 0 0 0 0 0 53 0 0 0 0 0 0 0 0 53
6 1e-3 N/A 1.5 0 0 0 0 0 0 0 0 0 47 0 0 0 0 0 0 0 0 47
7 1e-3 N/A 1.1 × 0 0 0 1 1 1 4 1 2 72 104 1 4 99 3 1 0 1 295
8 1e-3 N/A 1.1 0 0 0 0 0 0 0 0 0 61 6 0 0 61 0 0 0 0 128
9 1e-6 more_sensitive 1.5 × 0 0 0 0 0 0 0 0 0 56 0 0 0 0 0 0 0 0 56
10 1e-6 more_sensitive 1.5 0 0 0 0 0 0 0 0 0 47 0 0 0 1 0 0 0 0 48
11 1e-6 more_sensitive 1.1 × 1 1 0 4 5 6 9 5 4 75 107 3 6 102 5 6 1 6 346
12 1e-6 more_sensitive 1.1 0 0 0 0 0 0 6 0 0 39 41 6 60 40 99 0 0 0 291
13 1e-6 N/A 1.5 × 0 0 0 0 0 0 0 0 0 51 0 0 0 0 0 0 0 0 51
14 1e-6 N/A 1.5 0 0 0 0 0 0 0 0 0 46 0 0 0 0 0 0 0 0 46
15 1e-6 N/A 1.1 × 0 0 0 1 1 1 3 1 2 72 102 1 4 99 3 1 0 1 292
16 1e-6 N/A 1.1 0 0 0 0 0 0 0 0 0 61 7 0 0 63 0 0 0 0 131

PfEMP1 is known to be present only in Plasmodium species belonging to the Laverania group, P. falciparum (~60 copies, ref), P. reichenowi (>50 copies, ref ) and to a lesser extent in P. gaboni (ref). When we used the default settings, we were not able to capture the PfEMP1 orthologs in P. reichenowi (1). By decreasing the inflation coefficient to 1.1 we were able to identify orthologs in P. reichenowi, however, by doing this without filtering the results, several false-positive matches were obtained (3,7,11,15), leading to the wrong conclusion that PfEMP1 has homologues in species outside of the Laverania group, and even outside of the Plasmodium genus. However, when we preprocessed the sequences with OrthoPrep (4,8,12,16) we obtained cleaner results that are in line with previous findings about PfEMP1 (ref). The e-value threshold did not impact significantly the distribution of PfEMP1, however, the --more-sensitive option of diamond produced unexpected results, even when using OrthoPrep (4 vs 8, 12 vs 16).

Distribution of orthogroups

After using PfEMP1 as an initial proxy of the OrthoPrep parameters, we then proceeded to assess the effectiveness of OrthoPrep by comparing the inferred orthogroups with those inferred using OrthoFinder without any filters applied to the sequences. The most evident effect was a significant increase in the number of orthogroups with only two species, and a slight decrease in the number of orthogroups containing all 18 species in the assessment. Interestingly, the number of orthogroups containing 3,4,6,7,9-15 species increased by running OrthoPrep on the sequences.

of_vs_op

Metrics of the inferred orthogroups

Orthogroup protein count

By applying a protein-length filter, OrthoPrep effectively changed the number of orthogroups in the Apicomplexan dataset, however, the number of proteins per orthogroup was not significantly different between the orthogroups inferred using OrthoPrep+OrthoFinder, and the ones inferred using OrthoFinder alone.

apis_count
Distribution of protein counts per orthogroup under different conditions. Red bars correspond to orthogroups inferred with OrthoFinder alone, blue bars correspond to orthogroups inferred with OrthoPrep+OrthoFinder. The total number of orthogroups is shown in the plot legends
Protein median length

OrthoPrep limits the number of candidates for the detection of reciprocal best hits, as a consequence, there was a slight shift in the protein median length per orthogroup in all the tested conditions.

apis_median
Distribution of protein median length per orthogroup under different conditions. Red bars correspond to orthogroups inferred with OrthoFinder alone, blue bars correspond to orthogroups inferred with OrthoPrep+OrthoFinder. The total number of orthogroups is shown in the plot legends
Protein length median absolute deviation (MAD)

One interesting metric to assess was the protein length median absolute deviation. Values approaching 0 reflect the uniformity of protein lengths, whereas higher values correspond to orthogroups containing proteins with highly variable protein lengths. In all the tested conditions, OrthoPrep significantly shifted the MAD per orthogroup, regardless of the e-value or inflation coefficient, which in turn results in multiple sequence alignments of higher quality and phylogenetic signal (ref).

apis_mad
Distribution of protein length median absolute deviation per orthogroup under different conditions. Red bars correspond to orthogroups inferred with OrthoFinder alone, blue bars correspond to orthogroups inferred with OrthoPrep+OrthoFinder. The total number of orthogroups is shown in the plot legends

Orthobench

Once we explored the Apicomplexan dataset, we wanted to evaluate the impact of OrthoPrep by benchmarking different parameters of OrthoPrep and OrthoFinder against a collection of expert-curated reference orhtogroups (ref).

The benchmarks include a set of 12 reference proteomes:

Caenorhabditis elegans Canis familiaris Ciona intestinalis
Danio rerio Drosophila melanogaster Gallus gallus
Homo sapiens Monodelphis domestica Mus musculus
Pan troglodytes Rattus norvegicus Tetraodon nigroviridis

We used the same combinations of parameters as above, but we also included two types of filtering: strict filtering where the difference in sizes between protein pairs can be up to 45% in the case of proteins coming from different genera, and a relaxed filter where the differences in size can be up to 60% (see table below)

Query Subject Strict filter Relaxed filter
Homo sapiens Homo sapiens [0.30,0.35] [0.25,0.35]
Drosophila simulans Drosophila melanogaster [0.35,0.40] [0.40,0.50]
Homo sapiens Drosophila melanogaster [0.40,0.45] [0.50,0.60]

In total we had 24 combinations, 16 corresponding to OrthoPrep+OrthoFinder, and 8 corresponding to Orthofinder alone (see figure below)

Precision and recall

precision_recall

In all cases OrthoPrep increased the precision of the assigned orthogroups, and although there was a decrease in the recall, the difference in precision was larger than the decrement in recall. The use of a smaller inflation coefficient resulted in a higher recall but the precision was severely affected. The relaxed filter appear to give better results than the strict filter, but the effectiveness of the filter will always depend on the phylogenetic scope of the organisms included in the study. There was no significant difference between using the --more-sensitive option for diamond.

Number of recovered genes and correctly assigned orthogroups

The benchmarks from orthobench include the number of orthogroups that were correctly identified when compared to the reference orthogroups, that is, all the expected genes in a given orthogroup were identified in a single orthogroup that did not contain any additional genes. Whereas the number of correctly identified orthogroups using OrthoPrep+OrthoFinder is well on par with those identified with OrthoFinder alone, the total number of genes that were assigned to orthogroups was at least 5% lower when using OrthoPrep+OrthoFinder.

exacts_rescue

Metrics of the inferred orthogroups

As it was the case with the apicomplexan dataset, OrthoPrep slightly increased the number of orthogroups in the bilateria dataset, the median length was slightly decreased and the protein length median absolute deviation was significantly decreased.

Orthogroup protein count
orthobench_counts
Distribution of protein counts per orthogroup under different conditions. Red bars correspond to orthogroups inferred with OrthoFinder alone, blue bars correspond to orthogroups inferred with OrthoPrep+OrthoFinder. The total number of orthogroups is shown in the plot legends
Protein median length
orthobench_median
Distribution of protein median length per orthogroup under different conditions. Red bars correspond to orthogroups inferred with OrthoFinder alone, blue bars correspond to orthogroups inferred with OrthoPrep+OrthoFinder. The total number of orthogroups is shown in the plot legends
Protein length median absolute deviation (MAD)
orthobench_median
Distribution of protein length median absolute deviation per orthogroup under different conditions. Red bars correspond to orthogroups inferred with OrthoFinder alone, blue bars correspond to orthogroups inferred with OrthoPrep+OrthoFinder. The total number of orthogroups is shown in the plot legends

Unassigned proteins

One concern about the use of OrthoPrep is the proportion of proteins that were classified as unassigned. It is expected from gene content alone, that several genes and proteins are classified as unique or exclusive to specific taxa. In some cases, OrthoPrep classified up to 45% of the total genes as exclusive, this measure however, depends extensively on the datasets included and the filters the user sets on different taxonomic levels, for instance, in the Bilateria dataset, using a relaxed filter resulted in ~1% more of genes assigned to orthogroups.

missing_counts
Counts of proteins classified as unassigned under different settings. Cells highlighted in pink correspond to the default settings of OrthoFinder

Lengths of unassigned proteins

There was a slight difference in length distribution of the unassigned proteins only in the Apicomplexan dataset, this could be solved by adjusting the limits for OrthoPrep in the control file.

apis_missing_lengths
Length distribution of proteins classified as unassigned using either OrthoPrep+OrthoFinder (blue bars) or OrthoFinder alone (red bars)

Discussion

Orthogroup assignment remains a difficult, yet an essential task for comparative genomics and evolutionary biology. To this day there is not a one tool to rule them all, and on top of that, results may vary depending on the number of organisms included in the study, the origin, quality and completeness of the starting data. In this work we incorporate some basic assumptions to the process of orthogroup assignment to increase the precision of the inferred orthogroups, we used two datasets and several conditions for orthogroup assignment and we have demonstrated that incorporating a composition-bias size-exclusion filter, we can increase the precision of the inferred orthogroups using OrthoFinder. The tradeoff is an increase in the number of proteins classified as unassigned to any orthogroup. However, this increase can be marginal, as shown in the bilateria dataset. In comparison, SonicParanoid2 is able to capture as much as 80% of the Apicomplexa dataset, yet it is unable to classify important proteins such as PfEMP1 in a defined orthogroup, despite relaxing the inflation coefficient to 1.2.

About

Preprocessing script to reduce noise in orthology inference

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published