OrthoPrep can increase the sensitivity of OrthoFinder by decreasing the MCL inflation coefficient and removing spurious BLAST hits that come from comparing sequences of significantly different sizes, and/or that may contain low complexity regions (LCRs) that may result in false positive hits.
Given a directory /home/vflorelo/sequences
containing the following files:
- Plasmodium_falciparum_3D7.faa
- Plasmodium_reichenowi_CDC.faa
- Toxoplasma_gondii_ME49.faa
- Toxoplasma_gondii_RH88.faa
make_control_file.sh \
--fasta_dir /home/vflorelo/sequences \
--same_species 0.35 \
--same_genus 0.40 \
--diff_genus 0.45 \
--decrement 0.05 > control_file.tsv
OrthoPrep.sh \
--fasta_dir /home/vflorelo/sequences \
--masking TRUE \
--eff_len TRUE \
--search_method diamond \
--blast_filter TRUE \
--control_file control_file.tsv \
--threads 16
orthofinder.py -b OrthoPrep-2024-08-01
- seg
- fLPS2
- OrthoFinder and dependencies therein
- bedTools
- jq
- GNU parallel
OrthoPrep takes a directory containing fasta files with the proteomes of the organisms to be studied, it then calculates the effective length of each protein sequence by excluding low complexity regions. The hard-masked sequences are then subjected to all-versus-all comparisons using diamond but the reciprocal best hits are constrained by applying a size filter.
For any protein a length-interval is calculated, if the length-interval of the corresponding hit overlaps with that of the query, the pair is kept as a potential reciprocal best hit, if the intervals don't overlap, the pair is excluded.
Once the potential reciprocal best hits are determined, the sequences are reverted to remove the hard-masking and the OrthoFinder analysis can resume normaly.
The length intervals can be expressed as a fixed fraction of the protein length, or they can be adjusted based on specific comparisons:
Species 1 | Species 2 | Size filter |
---|---|---|
Toxoplasma gondii ME49 | Toxoplasma gondii RH88 | Strict |
Plasmodium falciparum 3D7 | Plasmodium reichenowi CDC | Medium |
Plasmodium falciparum 3D7 | Toxoplasma gondii RH88 | Relaxed |
In the example above, proteins coming from the same species (but different strains) are not expected to vary too much in length, thus a strict filter (up to 35% of length tolerance) can be used to keep only meaningful reciprocal best hits. For proteins coming from the same genus but different species, a medium filter might be useful (up to 40% of length tolerance), and for species of different genera, a relaxed filter is recommended (up to 45% of length tolerance).
Given a directory /home/vflorelo/sequences
containing the following files:
- Plasmodium_falciparum_3D7.faa
- Plasmodium_reichenowi_CDC.faa
- Toxoplasma_gondii_ME49.faa
- Toxoplasma_gondii_RH88.faa
The user can run orthoprep in several ways
It is assumed that orthologous proteins can vary in size due to insertion/deletion events as well as contraction/expansion of repetitive elements. Thus, if the proteomes of two closely related organisms are compared, it is not expected that their proteins vary too much in length if they indeed come from the same ancestor. However, if the organisms are more distantly related, their proteins might display more variance in size.
OrthoPrep requires a control file specifying the tolerance in protein length difference, this file can be prepared with the ancilliary script make_control_file.sh
make_control_file.sh \
--fasta_dir /home/vflorelo/sequences \
--same_species 0.35 \
--same_genus 0.40 \
--diff_genus 0.45 \
--decrement 0.05 > control_file.tsv
The contents of the resulting tsv file will be:
Plasmodium_falciparum_3D7.faa | Plasmodium_falciparum_3D7.faa | 0.35 | 0.3 |
Plasmodium_falciparum_3D7.faa | Plasmodium_reichenowi_CDC.faa | 0.40 | 0.35 |
Plasmodium_falciparum_3D7.faa | Toxoplasma_gondii_ME49.faa | 0.45 | 0.4 |
Plasmodium_falciparum_3D7.faa | Toxoplasma_gondii_RH88.faa | 0.45 | 0.4 |
Plasmodium_reichenowi_CDC.faa | Plasmodium_reichenowi_CDC.faa | 0.35 | 0.3 |
Plasmodium_reichenowi_CDC.faa | Toxoplasma_gondii_ME49.faa | 0.45 | 0.4 |
Plasmodium_reichenowi_CDC.faa | Toxoplasma_gondii_RH88.faa | 0.45 | 0.4 |
Toxoplasma_gondii_ME49.faa | Toxoplasma_gondii_ME49.faa | 0.35 | 0.3 |
Toxoplasma_gondii_ME49.faa | Toxoplasma_gondii_RH88.faa | 0.40 | 0.35 |
Toxoplasma_gondii_RH88.faa | Toxoplasma_gondii_RH88.faa | 0.35 | 0.3 |
Indicating the accepted intervals for keeping potential RBHs when comparing the specified proteomes:
- When two proteomes of the same species are compared, the tolerance for length differences is at most 35% of the length of the compared proteins
- When two proteomes of the same genus but different species are compared, the tolerance for length differences is at most 40% of the length of the compared proteins
- When the compared proteomes come from different genera, the tolerance for length differences is at most 45% of the length of the compared proteins
The user can also specify a fixed tolerance for all comparisons, but that would be fairly unrealistic, specially when species of a broad phylogenetic span are included in the analysis. Once the control file is ready, the user can apply OrthoPrep to the sequences in the study.
In this mode, OrthoPrep masks the sequences by detecting LCRs with seg and fLPS2, then calculates the effective lengths based on the extent of LCRs.
BLAST hits are detected using diamond (or any valid option present in the OrthoFinder configuration) using the masked sequences.
Once the BLAST comparisons have finished, the hit table is filtered with the length constrains specified in control_file.tsv
.
OrthoPrep.sh \
--fasta_dir /home/vflorelo/sequences \
--masking TRUE \
--eff_len TRUE \
--search_method diamond \
--blast_filter TRUE \
--control_file control_file.tsv \
--threads 16
In this mode, the sequences are masked for LCRs, but no filter is applied to the RBH detection. This setting is useful to estimate the amount of noise introduced by LCRs.
OrthoPrep.sh \
--fasta_dir /home/vflorelo/sequences \
--masking TRUE \
--eff_len FALSE \
--search_method diamond \
--blast_filter FALSE \
--threads 16
In this mode, the program calculates the effective length of each protein by detecting LCRs, however, the sequences remain unmasked for the RBH detection. Once the BLAST comparisons have finished, the hit table is filtered with the length constrains specified in control_file.tsv
. This setting might be useful to detect orthologs in which the LCRs are actually a functional part of the proteins and they may truly reflect speciation events
OrthoPrep.sh \
--fasta_dir /home/vflorelo/sequences \
--masking FALSE \
--eff_len TRUE \
--search_method diamond \
--blast_filter TRUE \
--control_file control_file.tsv \
--threads 16
In this mode, the program does not detect LCRs, thus the size filtering is rather strict and might lead to overclustered orthogroups, might be useful to analyse strain from closely related species
OrthoPrep.sh \
--fasta_dir /home/vflorelo/sequences \
--masking FALSE \
--eff_len FALSE \
--search_method diamond \
--blast_filter TRUE \
--control_file control_file.tsv \
--threads 16
OrthoPrep produces a directory that contains filtered diamond/BLAST files which can be directly passed to OrthoFinder with the option -b
The detection of orthologs is always challenging and the results may vary depending on the employed software, the number and diversity of the organisms included in a given analysis. There are many available programs for orthology inference, and their performance, accuracy and recall are evaluated by the Quest for Orthologs (QfO) consortium.
From the current QfO assessment, OrthoFinder appears to be one of the most popular ref and accurate programs ref. While the default options for OrthoFinder work generally well for any set of organisms, the intrinsic properties of the studied proteins can often result in problematic orthology assessment
A general issue in homology detection is the presence of false matches resulting from unrelated proteins sharing repetitive regions or regions with very biased amino acid composition, the so-called Low Complexity Regions (LCRs). This issue has been addressed previously in softwares such as InParanoid ref, however, newer versions of InParanoid have removed the LCR filtering step due to significantly increased run times ref.
Due to the local alignments used by diamond or BLAST, it might be possible that two proteins are placed in the same orthogroup despite being of considerable different sizes. The difference in protein size can arise from fused genes, large insertions/deletions and/or expansion/contraction of LCRs and repetitive elements. It has been proposed that protein function can be influenced by protein size, thus, two proteins of significantly different sizes might not be likely to perform the same function.
In order to address the presence of LCRs, as well as the differences in protein length, and their potential effects on orthology inference, we developed OrthoPrep, an OrthoFinder preprocessor that limits the number of significant matches based on sequence composition and effective search space.
The full pipeline includes:
- Detection of LCRs using seg and fLPS2
- Adjustment of the proteins' effective length based on the presence of LCRs
- Masking of sequences before running diamond/BLAST
- Selecting only pairs of proteins of comparable size with a taxonomy-based tolerance
By default, OrthoFinder uses diamond with a permissive e-value (1e-3) and soft-masking of common motifs with the --more-sensitive
option. Genes belonging to the same orthogroup are connected by a network approach using MCL with an inflation coefficient of 1.5.
For the different tests we ran, we used the reference proteomes of the following organisms:
Organism | Protein count | Source | Organism | Protein count | Source |
---|---|---|---|---|---|
Babesia bovis | 3974 | PiroplasmaDB | Cardiosporidium cionae | 4566 | RefSeq |
Cryptosporidium parvum | 3944 | CryptoDB | Cyclospora cayetanensis | 7153 | ToxoDB |
Eimeria tenella | 7268 | ToxoDB | Haemoproteus tartakovskyi | 4860 | PlasmoDB |
Hepatocystis | 5341 | PlasmoDB | Neospora caninum | 7364 | ToxoDB |
Nephromyces | 10628 | RefSeq | Plasmodium falciparum | 5389 | PlasmoDB |
Plasmodium gaboni | 5356 | PlasmoDB | Plasmodium gallinaceum | 5286 | PlasmoDB |
Plasmodium knowlesi | 5328 | PlasmoDB | Plasmodium reichenowi | 5699 | PlasmoDB |
Plasmodium vivax | 6523 | PlasmoDB | Sarcocystis neurona | 6965 | ToxoDB |
Theileria annulata | 3795 | PiroplasmaDB | Toxoplasma gondii | 8322 | ToxoDB |
To assess the impact of OrthoPrep, we ran OrthoFinder under several conditions:
- We used two e-value thresholds: 1e-3 and 1e-6
- We used diamond with and without the
--more-sensitive
option - We used two inflation coefficient values for MCL: 1.5 and 1.1
We compared the orthogroups obtained using all the combinations of the above parameters with and without preprocessing the sequences with OrthoPrep
In the following table we include the number of genes found in the orthogroup containing PfEMP1, a well characterised gene in Plasmodium falciparum
Test | e-value | diamond | I.C. | OrthoPrep | B. bov | C. cio | C. par | C. cay | E. ten | H. tar | Hep | N. can | Nep | P. fal | P. gab | P. gal | P. kno | P. rei | P. viv | S. neu | T. ann | T. gon | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1e-3 | more_sensitive | 1.5 | × | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 |
2 | 1e-3 | more_sensitive | 1.5 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 48 |
3 | 1e-3 | more_sensitive | 1.1 | × | 1 | 1 | 0 | 4 | 5 | 6 | 9 | 5 | 4 | 80 | 114 | 12 | 11 | 107 | 20 | 5 | 1 | 6 | 391 |
4 | 1e-3 | more_sensitive | 1.1 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | 0 | 0 | 0 | 60 | 0 | 0 | 0 | 0 | 118 |
5 | 1e-3 | N/A | 1.5 | × | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53 |
6 | 1e-3 | N/A | 1.5 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47 |
7 | 1e-3 | N/A | 1.1 | × | 0 | 0 | 0 | 1 | 1 | 1 | 4 | 1 | 2 | 72 | 104 | 1 | 4 | 99 | 3 | 1 | 0 | 1 | 295 |
8 | 1e-3 | N/A | 1.1 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 61 | 6 | 0 | 0 | 61 | 0 | 0 | 0 | 0 | 128 |
9 | 1e-6 | more_sensitive | 1.5 | × | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 |
10 | 1e-6 | more_sensitive | 1.5 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 48 |
11 | 1e-6 | more_sensitive | 1.1 | × | 1 | 1 | 0 | 4 | 5 | 6 | 9 | 5 | 4 | 75 | 107 | 3 | 6 | 102 | 5 | 6 | 1 | 6 | 346 |
12 | 1e-6 | more_sensitive | 1.1 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 39 | 41 | 6 | 60 | 40 | 99 | 0 | 0 | 0 | 291 |
13 | 1e-6 | N/A | 1.5 | × | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51 |
14 | 1e-6 | N/A | 1.5 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46 |
15 | 1e-6 | N/A | 1.1 | × | 0 | 0 | 0 | 1 | 1 | 1 | 3 | 1 | 2 | 72 | 102 | 1 | 4 | 99 | 3 | 1 | 0 | 1 | 292 |
16 | 1e-6 | N/A | 1.1 | ✓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 61 | 7 | 0 | 0 | 63 | 0 | 0 | 0 | 0 | 131 |
PfEMP1 is known to be present only in Plasmodium species belonging to the Laverania group, P. falciparum (~60 copies, ref), P. reichenowi (>50 copies, ref ) and to a lesser extent in P. gaboni (ref). When we used the default settings, we were not able to capture the PfEMP1 orthologs in P. reichenowi (1). By decreasing the inflation coefficient to 1.1 we were able to identify orthologs in P. reichenowi, however, by doing this without filtering the results, several false-positive matches were obtained (3,7,11,15), leading to the wrong conclusion that PfEMP1 has homologues in species outside of the Laverania group, and even outside of the Plasmodium genus. However, when we preprocessed the sequences with OrthoPrep (4,8,12,16) we obtained cleaner results that are in line with previous findings about PfEMP1 (ref). The e-value threshold did not impact significantly the distribution of PfEMP1, however, the --more-sensitive
option of diamond produced unexpected results, even when using OrthoPrep (4 vs 8, 12 vs 16).
After using PfEMP1 as an initial proxy of the OrthoPrep parameters, we then proceeded to assess the effectiveness of OrthoPrep by comparing the inferred orthogroups with those inferred using OrthoFinder without any filters applied to the sequences. The most evident effect was a significant increase in the number of orthogroups with only two species, and a slight decrease in the number of orthogroups containing all 18 species in the assessment. Interestingly, the number of orthogroups containing 3,4,6,7,9-15 species increased by running OrthoPrep on the sequences.
By applying a protein-length filter, OrthoPrep effectively changed the number of orthogroups in the Apicomplexan dataset, however, the number of proteins per orthogroup was not significantly different between the orthogroups inferred using OrthoPrep+OrthoFinder, and the ones inferred using OrthoFinder alone.
OrthoPrep limits the number of candidates for the detection of reciprocal best hits, as a consequence, there was a slight shift in the protein median length per orthogroup in all the tested conditions.
One interesting metric to assess was the protein length median absolute deviation. Values approaching 0 reflect the uniformity of protein lengths, whereas higher values correspond to orthogroups containing proteins with highly variable protein lengths. In all the tested conditions, OrthoPrep significantly shifted the MAD per orthogroup, regardless of the e-value or inflation coefficient, which in turn results in multiple sequence alignments of higher quality and phylogenetic signal (ref).
Once we explored the Apicomplexan dataset, we wanted to evaluate the impact of OrthoPrep by benchmarking different parameters of OrthoPrep and OrthoFinder against a collection of expert-curated reference orhtogroups (ref).
The benchmarks include a set of 12 reference proteomes:
Caenorhabditis elegans | Canis familiaris | Ciona intestinalis |
Danio rerio | Drosophila melanogaster | Gallus gallus |
Homo sapiens | Monodelphis domestica | Mus musculus |
Pan troglodytes | Rattus norvegicus | Tetraodon nigroviridis |
We used the same combinations of parameters as above, but we also included two types of filtering: strict filtering where the difference in sizes between protein pairs can be up to 45% in the case of proteins coming from different genera, and a relaxed filter where the differences in size can be up to 60% (see table below)
Query | Subject | Strict filter | Relaxed filter |
---|---|---|---|
Homo sapiens | Homo sapiens | [0.30,0.35] | [0.25,0.35] |
Drosophila simulans | Drosophila melanogaster | [0.35,0.40] | [0.40,0.50] |
Homo sapiens | Drosophila melanogaster | [0.40,0.45] | [0.50,0.60] |
In total we had 24 combinations, 16 corresponding to OrthoPrep+OrthoFinder, and 8 corresponding to Orthofinder alone (see figure below)
In all cases OrthoPrep increased the precision of the assigned orthogroups, and although there was a decrease in the recall, the difference in precision was larger than the decrement in recall. The use of a smaller inflation coefficient resulted in a higher recall but the precision was severely affected. The relaxed filter appear to give better results than the strict filter, but the effectiveness of the filter will always depend on the phylogenetic scope of the organisms included in the study. There was no significant difference between using the --more-sensitive
option for diamond.
The benchmarks from orthobench include the number of orthogroups that were correctly identified when compared to the reference orthogroups, that is, all the expected genes in a given orthogroup were identified in a single orthogroup that did not contain any additional genes. Whereas the number of correctly identified orthogroups using OrthoPrep+OrthoFinder is well on par with those identified with OrthoFinder alone, the total number of genes that were assigned to orthogroups was at least 5% lower when using OrthoPrep+OrthoFinder.
As it was the case with the apicomplexan dataset, OrthoPrep slightly increased the number of orthogroups in the bilateria dataset, the median length was slightly decreased and the protein length median absolute deviation was significantly decreased.
One concern about the use of OrthoPrep is the proportion of proteins that were classified as unassigned. It is expected from gene content alone, that several genes and proteins are classified as unique or exclusive to specific taxa. In some cases, OrthoPrep classified up to 45% of the total genes as exclusive, this measure however, depends extensively on the datasets included and the filters the user sets on different taxonomic levels, for instance, in the Bilateria dataset, using a relaxed filter resulted in ~1% more of genes assigned to orthogroups.
Counts of proteins classified as unassigned under different settings. Cells highlighted in pink correspond to the default settings of OrthoFinder |
There was a slight difference in length distribution of the unassigned proteins only in the Apicomplexan dataset, this could be solved by adjusting the limits for OrthoPrep in the control file.
Length distribution of proteins classified as unassigned using either OrthoPrep+OrthoFinder (blue bars) or OrthoFinder alone (red bars) |
Orthogroup assignment remains a difficult, yet an essential task for comparative genomics and evolutionary biology. To this day there is not a one tool to rule them all, and on top of that, results may vary depending on the number of organisms included in the study, the origin, quality and completeness of the starting data. In this work we incorporate some basic assumptions to the process of orthogroup assignment to increase the precision of the inferred orthogroups, we used two datasets and several conditions for orthogroup assignment and we have demonstrated that incorporating a composition-bias size-exclusion filter, we can increase the precision of the inferred orthogroups using OrthoFinder. The tradeoff is an increase in the number of proteins classified as unassigned to any orthogroup. However, this increase can be marginal, as shown in the bilateria dataset. In comparison, SonicParanoid2 is able to capture as much as 80% of the Apicomplexa dataset, yet it is unable to classify important proteins such as PfEMP1 in a defined orthogroup, despite relaxing the inflation coefficient to 1.2.