Proportion of sites under selection #1772

mridna · 2024-12-04T14:09:57Z

I wanted your advice on how best to calculate the proportion of sites under positive selection across the entire phylogeny (~30 species) for ~2000 genes. Our intention is to compare different functional categories and see if there are differences in how many sites are under positive selection. We are not concerned about what specific sites are under selection. What we are ultimately trying to check is even if the whole gene is not evolving under positive selection, are there still some sites under positive selection and does this proportion vary based on function. With this in mind, how would you approach calculating proportion of positively selected sites? We are currently thinking about this in two ways:

Proportion of sites with omega > 1 from BUSTED. Here we are a little stumped on how to proceed. We have the model averaged p-values for each alignment, but should the proportion be taken from the best fitting model, since it seems to be different based on BUSTED settings (with/without SRV and MH)? And if there is no significant p-value for positive selection, should we consider that there are no sites with omega > 1 even if BUSTED shows some codons are evolving in this rate class.
Use number of significant positively selected sites from MEME (+MH) and normalise it for number of sites in the alignment to arrive at a significant proportion of sites under positive selection.

Hope this was explained clearly!

Best regards,
Mridula

spond · 2024-12-06T22:23:09Z

Dear @mridna,

Well, a principled (but not something that you could easily) method would be use your functional class (categorical) as a part of the phylogenetic fit. In other words, ω and its weight would be functions of the functional class (e.g., as a random effect), and all 2000 genes would be fit jointly to directly estimate the effect.

Both approaches you have are sensible but ad hoc.

For BUSTED, I would recommend the following
- Use the estimate of the weight allocated to ω > 1 directly. You can model average the weight the same way you average the p-value if you already have multiple models.
- Consider some measure of ω value as well. For example, a gene with ω = 10 and weight of 1% is probably more "selected" than a gene with ω = 2 and weight of 1%. Use some variance stabilized transform of ω e.g. log (&omega). For example you could compute something like log (ω) * weight (ω) for ω > 1 and take that as a measure of "selection"
- I would not filter by p-value.
For MEME, you need to select a cutoff p-value to decide what sites are selected. Obviously this is going to have a major effect on the estimates of what fraction of the gene is subject to selection. Also, if you want a more comparable estimate to BUSTED, you could weight each site by what fraction of the site has ω > 1 (MEME outputs this as p⁺).

As an example, here are 10 randomly selected genes from https://pubmed.ncbi.nlm.nih.gov/30620335/ using different measures of "selected weight". I ran BUSTED-E (https://www.biorxiv.org/content/10.1101/2024.11.13.620707v1) and MEME. The alignments, snakefile to run the tools, and a simple Python script to compute various results are in the attached archive.

Mostly, just be aware that there will be a number of hard-to-quantify factors influencing the estimates. You should also explore how the definition of the "fraction of alignment" influences your downstream analyses. If you get very different results based on what you select, then probably there's not much sense in placing too much faith in any associations you find.

Best,
Sergei

issue1772.zip

mridna · 2024-12-09T16:07:41Z

Dear Sergei,

Thank you so much for your detailed response and suggestions, I really appreciate it!

Best regards,
Mridula

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proportion of sites under selection #1772

Proportion of sites under selection #1772

mridna commented Dec 4, 2024

spond commented Dec 6, 2024

mridna commented Dec 9, 2024

Proportion of sites under selection #1772

Proportion of sites under selection #1772

Comments

mridna commented Dec 4, 2024

spond commented Dec 6, 2024

mridna commented Dec 9, 2024