Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

omega from internal branches #1755

Open
evandiego83 opened this issue Oct 30, 2024 · 10 comments
Open

omega from internal branches #1755

evandiego83 opened this issue Oct 30, 2024 · 10 comments

Comments

@evandiego83
Copy link

Hello,

I am trying to estimate omega from internal branches of the tree. While I can label the tree I would like to estimate what is the ω of the internal branches. I have tried FitMG94 as it quick but it does seem to recognize hyphy labels in the tree unless I am missing something.

Alternatively, I could use BUSTED/abSREL and specify the branches that way but is there a better way to estimate ω for internal branches using the MG94xREV model

Many thanks,
Evan

@spond
Copy link
Member

spond commented Nov 4, 2024

Dear @evandiego83,

If all you want is a single ω per branch, then FitMG94.bf is the way to go.
By default, HyPhy will label internal braches that have no labels as NodeXXX, where XXX is some unique integer.
If you want to override this behavior use notation like so (you can obviously use any names that conform to the Newick standard).

((a,b)anc_ab, (c,d)anc_cd

For example (using the attached file)

hyphy /path/to/FitMG94.bf --alignment /path/to/CD2.txt --type local

....

### Estimating confidence intervals for dN/dS along each branch

|            Branch            |     Length     |     dN/dS      |Approximate dN/dS CI|
|:----------------------------:|:--------------:|:--------------:|:------------------:|
|             PIG              |     0.192      |     1.344      |   0.966 - 1.809    |
|             COW              |     0.253      |     1.920      |   1.456 - 2.480    |
|           PIG_COW            |     0.103      |     1.484      |   0.838 - 2.308    |
|            HORSE             |     0.209      |     1.244      |   0.926 - 1.634    |
|             CAT              |     0.276      |     1.601      |   1.240 - 2.037    |
|      PIG_COW_HORSE_CAT       |     0.066      |     0.663      |   0.275 - 1.201    |
|           RHMONKEY           |     0.004      |10000000000.0...|0.000 - 10000.000...|
|            BABOON            |     0.002      |     0.000      |   0.000 - 0.686    |
|       RHMONKEY_BABOON        |     0.026      |     0.401      |   0.139 - 0.817    |
|            HUMAN             |     0.000      |     1.000      |0.000 - 10000.000...|
|            CHIMP             |     0.002      |10000000000.0...|0.000 - 10000.000...|
|         HUMAN_CHIMP          |     0.018      |     0.368      |   0.052 - 0.923    |
|            Node10            |     0.110      |     1.914      |   1.270 - 2.727    |
|            Node3             |     0.290      |     0.432      |   0.317 - 0.573    |
|             RAT              |     0.066      |     1.090      |   0.610 - 1.719    |
|            MOUSE             |     0.122      |     0.525      |   0.343 - 0.755    |


And in the corresponding json file, these estimates will be recorded as follows (under the Confidence Intervals key)

  "branch attributes": {
    "0": {
      "BABOON": {
        "Confidence Intervals": {
          "LB": 0,
          "MLE": 0,
          "UB": 0.6856172937043802
        },
        "Nucleotide GTR": 0.001678804018170922,
        "Standard MG94": 0.001816952143408297,
        "dN": 1.267559344528279e-10,
        "dS": 0.008607790066156536,
        "nonsynonymous": 1e-10,
        "original name": "BABOON",
        "synonymous": 0.001816952143408297
      },
...

Best,
Sergei
CD2.txt

@evandiego83
Copy link
Author

Thanks @spond

Sorry fi I am not clear. If I had multiple sequences per strain and different strains within the alignment what would be best the best strategy to estimate ω for internal branches for each strain using the MG94xREV model?

Similar to your recent influenza preprint https://doi.org/10.1101/2024.08.19.606826 where ω: mean estimate on 2.3.4.4b clade internal branches was measured.

Many thanks
Evan

@spond
Copy link
Member

spond commented Nov 5, 2024

Dear @evandiego83,

Are you interested in estimatign mean ω for specific clades? For example, in the tree below, there'd be three ω which branches sharing the same color also sharing the same ω?

image

Best,
Sergei

@evandiego83
Copy link
Author

Hi @spond

Yes exactly as illustrated but not for terminal branches of the clades.

Evan

@spond
Copy link
Member

spond commented Nov 5, 2024

Dear @evandiego83,

Got it. One more question: how are your nodes labeled? Can you infer which clade a sequence is from its name?

Best,
Sergei

@evandiego83
Copy link
Author

Hi @spond,

Yes, the label can be inferred from the sequence name!

Evan

@spond
Copy link
Member

spond commented Nov 6, 2024

Dear @evandiego83,

OK, I need to make a small change to FitMG94.bf. Stay tuned. The preprint uses estimates from FEL or MEME, which are done as a part of the pre-fit of the more complex model. This is an overkill.

Best,
Sergei

@spond
Copy link
Member

spond commented Nov 11, 2024

Dear @evandiego83,

Here's an example using HA.fas and HA.nwk which are attached; this is a small HA alignment with 4 H5N1 lineages. The main task is to label the tree. I do this using https://github.com/veg/hyphy-analyses/tree/master/LabelTrees

You will need to adjust the regular expressions and labels as needed depending on what your data look like.

  1. Label 2.3.3.4b internal branches. Do not label leaves.
hyphy label-tree.bf --tree HA.nwk --regexp '2_3_4_4b' --label '2.3.4.4b' --leaf-nodes Skip --output HA-labeled.nwk 
image
  1. Label 2.3.2.1c internal branches
hyphy label-tree.bf --tree HA-labeled.nwk --regexp '2_3_2_1c' --label '2.3.2.1c' --leaf-nodes Skip --output HA-labeled.nwk
image
  1. Label 2.3.2.1a internal branches
hyphy label-tree.bf --tree HA-labeled.nwk --regexp '2_3_2_1a' --label '2.3.2.1a' --leaf-nodes Skip --output HA-labeled.nwk
image
  1. Label 2.2.1 internal branches
hyphy label-tree.bf --tree HA-labeled.nwk --regexp '2_2_1' --label '2.2.1' --leaf-nodes Skip --output HA-labeled.nwk
image
  1. Run FitMG94.bf using partitioned mode.
hyphy FitMG94.bf --alignment HA.fas --tree HA-labeled.nwk --type partitioned 
...

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model

>kill-zero-lengths => Yes
* Log(L) = -4602.61, AIC-c =  9271.31 (33 estimated parameters)
* 1 partition. Total tree length by partition (subs/site)  0.259

### Fitting Standard MG94
* Log(L) = -4369.41, AIC-c =  8827.31 (44 estimated parameters)
* non-synonymous/synonymous rate ratio for ** =   0.1864 (95% profile CI   0.1429-  0.2378)
* non-synonymous/synonymous rate ratio for *2.2.1* =   0.3782 (95% profile CI   0.1945-  0.6468)
* non-synonymous/synonymous rate ratio for *2.3.2.1a* =   0.2534 (95% profile CI   0.1213-  0.4541)
* non-synonymous/synonymous rate ratio for *2.3.2.1c* =   0.0461 (95% profile CI   0.0155-  0.1003)
* non-synonymous/synonymous rate ratio for *2.3.4.4b* =   0.1213 (95% profile CI   0.0842-  0.1678)

The estimates are also available in the JSON file HA.fas.FITTER.json,

image

Best,
Sergei
Archive.zip

@evandiego83
Copy link
Author

evandiego83 commented Dec 4, 2024

Dear @spond

Many thanks for this. This works great. My only question is it possible to output the number of synonymous sites and non-synonymous sites for each lineage in addition to the ratio as already reported??

The reason I ask is I want to calculate the relative ratio to see if there is any effect due to an excess of non-synonymous changes. Assuming that S is the number of synonymous sites, N is the number of non-synonymous sites and ω is the dN/dS ratio. If the proportional contribution to the overall rate from synonymous sites is s, then the proportional contribution to the overall rate from non-synonymous sites is equal to (N/S)(ω)s. Then the relative rate expected in the outbreak sequences compared to the reference sequences is

image

Evan

@spond
Copy link
Member

spond commented Dec 4, 2024

Dear @evandiego83,

Synonymous and non-synonymous sites are really fictional quantities. There was a lot of debate about how to define them in the early years of codon models (1990s). You can estimate S and N for existing sequences, simply by counting. However you cannot estimate them for any branches that include unobserved sequences. What we usually do in HyPhy, if someone need S and N, is to compute them from the equilibrium frequency distribution (which is assumed by the model anyway). See box on page 14 of https://hyphy.org/resources/hyphybook2007.pdf

When S and N are computed that way they are the same for all branches, which may not be useful for you.

I suppose you could infer ancestral sequences and compute S and N for those nodes as well. Let me know what you think.

Best,
Sergei

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants