
DataMonkey Error Message: "- had 7747 sites" #1752

Open
hannahg3009 opened this issue Oct 25, 2024 · 4 comments

Comments

@hannahg3009

Hi! I'm a Master's Student using Aliview and DataMonkey to do my thesis, but I keep running into the same issue when I input my data into DataMonkey. Based on a set of instructions from a PhD student, I made a phylogenetic tree on 10k Trees and combined it in a file with my sequences from Aliview. I added other information based on the instructions like the number of characters, taxa, etc. I've already gone through and removed the stop codons, but I keep getting the same attached error message; this is my first time using any of these databases, so it's very likely I'm making a small mistake, but any help would be appreciated! I'm attaching the error message I keep getting and also my .txt file which includes my tree and my sequences. Thanks!
analysis vdr.txt
Screen Shot 2024-10-25 at 1 28 10 PM

spond (Member) commented Oct 28, 2024

Dear @hannahg3009,

Are you able to export in anything other than NEXUS? I've come across a similar issue with another user before, and the problem is that this specific site generates broken NEXUS. For example,

  1. The sequences in your alignment have 1281 characters each, but the NEXUS header declares NCHAR = 7686.
  2. The block which tells you how to label the tree
1 Papio_anubis,
2 Pongo_abelli,
3 Gorilla_gorilla_gorilla,
4 Homo_sapiens,
5 Pan_paniscus,
6 Pan_troglodyytes_troglodytes;

does not match the list of sequence names in the DATA block. I fixed the alignment for you, so you can submit to Datamonkey, but unless you are able to get valid NEXUS exported, you will have similar issues with other alignments.
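Before uploading, mismatches like this can be caught with a small script. The sketch below is a rough sanity check, assuming a simple non-interleaved NEXUS matrix with one `name sequence` pair per line; real NEXUS files can be more complex, so treat it as illustrative rather than a full parser:

```python
import re

def check_nexus_nchar(text):
    """Compare the declared NCHAR against the actual length of each
    sequence in a simple NEXUS MATRIX block. Returns the declared
    NCHAR (or None) and a list of (name, actual_length) mismatches.
    Assumes one sequence per line and no interleaving."""
    m = re.search(r"NCHAR\s*=\s*(\d+)", text, re.IGNORECASE)
    declared = int(m.group(1)) if m else None
    mismatches = []
    matrix = re.search(r"MATRIX(.*?);", text, re.DOTALL | re.IGNORECASE)
    if matrix and declared is not None:
        for line in matrix.group(1).strip().splitlines():
            parts = line.split()
            if len(parts) == 2:
                name, seq = parts
                if len(seq) != declared:
                    mismatches.append((name, len(seq)))
    return declared, mismatches
```

For the file in this thread, a check like this would have flagged every sequence immediately (1281 characters each against a declared NCHAR of 7686).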

Best,
Sergei

vdr-fixed.txt

@hannahg3009 (Author)


Hi,

Thank you so much!! I'm not sure how the formatting would go for files other than NEXUS, since that was the only reference I was given. For my second analysis .txt file I fixed the tree labels like you mentioned, and for the VDR file I was confused about what NCHAR meant (I thought it was the overall total, which is why I put 7686 instead of the number for each sequence). My second set of sequences has a different number of alleles for each, but I cut them to the same length (2404) to see if that would make a difference. I tried with my updated file and it still isn't working, so I was wondering if you could take a look at it? I can attach my unedited sequences with the different numbers of alleles for reference as well. Thank you so much for your help, your notes on the VDR file really came in handy!

adam33_analysis.txt
combined_adam33 (original).txt

spond (Member) commented Nov 13, 2024

Dear @hannahg3009,

The issue with the attached file is not NEXUS per se, but the fact that you do not have a valid codon-aware alignment.

 hyphy % hyphy busted --alignment /Users/sergei/Downloads/adam33_analysis.txt                                                   

Analysis Description
--------------------
BUSTED (branch-site unrestricted statistical test of episodic
diversification) uses a random effects branch-site model fitted jointly
to all or a subset of tree branches in order to test for alignment-wide
evidence of episodic diversifying selection. Assuming there is evidence
of positive selection (i.e. there is an omega > 1), BUSTED will also
perform a quick evidence-ratio style analysis to explore which
individual sites may have been subject to selection. v2.0 adds support
for synonymous rate variation, and relaxes the test statistic to 0.5
(chi^2_0 + chi^2_2). Version 2.1 adds a grid search for the initial
starting point. Version 2.2 changes the grid search to LHC, and adds an
initial search phase to use adaptive Nedler-Mead. Version 3.0 implements
the option for branch-site variation in synonymous substitution rates.
Version 3.1 adds HMM auto-correlation option for SRV, and binds SRV
distributions for multiple branch sets. Version 4.0 adds support for
multiple hits (MH), ancestral state reconstruction saved to JSON, and
profiling of branch-site level support for selection / multiple hits.
Version 4.2 adds calculation of MH-attributable fractions of
substitutions. Version 4.5 adds an 'error absorption' component
[experimental] 

- __Requirements__: in-frame codon alignment and a phylogenetic tree (optionally annotated
with {})

- __Citation__: *Gene-wide identification of episodic selection*, Mol Biol Evol.
32(5):1365-71, *Synonymous Site-to-Site Substitution Rate Variation
Dramatically Inflates False Positive Rates of Selection Analyses: Ignore
at Your Own Peril*, Mol Biol Evol. 37(8):2430-2439

- __Written by__: Sergei L Kosakovsky Pond

- __Contact Information__: [email protected]

- __Analysis Version__: 4.5


>code => Universal
*** PROBLEM WITH SEQUENCE ' HOMO_SAPIENS' (2404 nt long, stop codons shown in capital letters)

atgggctggaggccccggagagctcgggggaccccgttgctgctgctgctactactgctgctgctctggccagtgccaggcgccggggtgcttcaaggacatatccctgggcagccagtcaccccgcactgggtcctggatggacaaccctggcgcaccgtcagcctggaggagccggtctcgaagccagacatggggctggtggccctggaggctgaaggccaggagctcctgcttgagctggagaagaaccacaggctgctggccccaggatacatagaaacccactacggcccagatgggcagccagtggtgctggcccccaaccacacggatcattgccactaccaagggcgagtaaggggcttccccgactcctgggtagtcctctgcacctgctctgggatgagtggcctgatcaccctcagcaggaatgccagctattatctgcgtccctggccaccccggggctccaaggacttctcaacccacgagatctttcggatggagcagctgctcacctggaaaggaacctgtggccacagggatcctgggaacaaagcgggcatgaccagccttcctggtggtccccagagcaggggcaggcgagaagcgcgcaggacccggaagtacctggaactgtacattgtggcagaccacaccctgttcttgactcggcaccgaaacttgaaccacaccaaacagcgtctcctggaagtcgccaactacgtggaccagcttctcaggactctggacattcaggtggcgctgaccggcctggaggtgtggaccgagcgggaccgcagccgcgtcacgcaggacgccaacgccacgctctgggccttcctgcagtggcgccgggggctgtgggcgcagcggccccacgactccgcgcagctgctcacgggccgcgccttccagggcgccacagtgggcctggcgcccgtcgagggcatgtgccgcgccgagagctcgggaggcgtgagcacggaccactcggagctccccatcggcgccgcagccaccatggcccatgagatcggccacagcctcggcctcagccacgaccccgacggctgctgcgtggaggctgcggccgagtccggaggctgcgtcatggctgcggccaccgggcacccgtttccgcgcgtgttcagcgcctgcagccgccgccagctgcgcgccttcttccgcaaggggggcggcgcttgcctctccaatgccccggaccccggactcccggtgccgccggcgctctgcgggaacggcttcgtggaagcgggcgaggagtgtgactgcggccctggccaggagtgccgcgacctctgctgctttgctcacaactgctcgctgcgcccgggggcccagtgcgcccacggggactgctgcgtgcgctgcctgctgaagccggctggagcgctgtgccgccaggccatgggtgactgtgacctccctgagttttgcacgggcacctcctcccactgtcccccagacgtttacctactggacggctcaccctgtgccaggggcagtggctactgctgggatggcgcatgtcccacgctggagcagcagtgccagcagctctgggggcctggctcccacccagctcccgaggcctgtttccaggtggtgaactctgcgggagatgctcatggaaactgcggccaggacagcgagggccacttcctgccctgtgcagggagggatgccctgtgtgggaagctgcagtgccagggtggaaagcccagcctgctcgcaccgcacatggtgccagtggactctaccgttcacctagatggccaggaagtgacttgtcggggagccttggcactccccagtgcccagctggacctgcttggcctgggcctggtagagccaggcacccagtgtggacctagaatggtgtgccagagcaggcgctgcaggaagaatgccttccaggagcttcagcgctgcctgactgcctgccacagccacggggtttgcaatagcaacca
taactgccactgtgctccaggctgggctccacccttctgtgacaagccaggctttggtggcagcatggacagtggccctgtgcaggctgaaaaccatgacaccttcctgctggccatgctcctcagcgtcctgctgcctctgctcccaggggccggcctggcctggtgttgctaccgactcccaggagcccatctgcagcgatgcagctggggctgcagaagggaccctgcgtgcagtggccccaaagatggcccacacagggaccaccccctgggcggcgttcaccccatggagttgggccccacagccactggacagccctggcccctggaccctgagaactctcatgagcccagcagccaccctgagaagcctctgccagcagtctcgcctgacccccaaG

A possible solution is to use https://github.com/veg/hyphy-analyses/tree/master/codon-msa
I attach the result of doing this on your data.

$hyphy ~/Development/hyphy-analyses/codon-msa/pre-msa.bf --input /Users/sergei/Downloads/adam33_analysis.txt
$mafft /Users/sergei/Downloads/adam33_analysis.txt_protein.fas > /Users/sergei/Downloads/adam33_analysis.txt_protein.msa
$hyphy ~/Development/hyphy-analyses/codon-msa/post-msa.bf --protein-msa /Users/sergei/Downloads/adam33_analysis.txt_protein.msa  --nucleotide-sequences /Users/sergei/Downloads/adam33_analysis.txt_nuc.fas --output /Users/sergei/Downloads/adam33_analysis.msa
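After running the pipeline, it can be worth verifying that the output really is an in-frame codon alignment before submitting it again. Below is a minimal, hypothetical check (not part of the codon-msa tools) assuming the universal genetic code and that alignment gaps occur in multiples of three, as a codon-aware aligner produces:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code

def premature_stops(seq):
    """Return 0-based codon indices of internal stop codons.
    Assumes the sequence starts in frame. Gaps are stripped first,
    which preserves the frame only when they come in multiples of 3
    (true for codon-aware alignments). The final codon may
    legitimately be a stop, so it is excluded from the check."""
    seq = seq.upper().replace("-", "")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
```

A non-empty result for any sequence (or a length not divisible by 3) means the alignment would trigger the same kind of error shown above.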

Best,
Sergei

adam33_analysis.msa.zip

hannahg3009 (Author) commented Dec 10, 2024 via email
