This repository contains all the scripts and data to reproduce the results of:
D. K. Sydykova, C. O. Wilke (2018). Theory of measurement for site-specific evolutionary rates in amino-acid sequences
- `analytical_rates` contains rates that were calculated using analytical derivations. The files in this directory are:
  - `all_sites_aa.csv` contains site-wise rates for every site in egg white lysozyme (PDB ID: 132L), calculated for different times (column `site` in `all_sites_aa.csv` corresponds directly to column `SITE` in `132L_A_foldx_ddG.txt`). These rates were calculated assuming that the true model is a mutation-selection model and the inference model is Jukes-Cantor (equations 3-5). This file was generated with the command `python analytical_rate_aa.py -m 125 -q q_matrix/amino_acid/ -o all_sites_aa.csv`.
  - `ten_sites_aa.csv` contains site-wise rates for the first ten sites in egg white lysozyme (PDB ID: 132L), calculated for different times (column `site` in `ten_sites_aa.csv` corresponds directly to column `SITE` in `132L_A_foldx_ddG.txt`). These rates were calculated assuming that the true model is a mutation-selection model and the inference model is Jukes-Cantor (equations 3-5). This file was generated with the command `python analytical_rate_aa.py -m 10 -q q_matrix/amino_acid/ -o ten_sites_aa.csv`.
  - `ten_sites_aa_true_JC.csv` contains site-wise rates for ten sites, calculated under the assumption that both the true model and the inference model are Jukes-Cantor. This file was generated with the command `python analytical_rate_aa_true_JC.py -m 10 -o ten_sites_aa_true_JC.csv`.
  - `ten_sites_aa_QM.csv` contains site-wise rates measured with an arbitrary QM matrix and site-wise rates measured with a Jukes-Cantor matrix (equation 1).
  - `ten_sites_codon.csv` contains site-wise rates for the first ten sites in egg white lysozyme (PDB ID: 132L), calculated for different times (column `site` in `ten_sites_codon.csv` corresponds directly to column `SITE` in `132L_A_foldx_ddG.txt`). These rates were calculated assuming that the true model is a codon mutation-selection model and the inference model is an amino acid Jukes-Cantor model (equation 6 and equations 22S and 24S). This file was generated with the command `python analytical_rate_codon.py -m 10 -q q_matrix/codon/ -o ten_sites_codon.csv`.
- `inferred_rates` contains files with site-wise rates inferred with HyPhy. There are two directories in `inferred_rates`, `raw_rates` and `processed_rates`. `raw_rates` contains one file per simulated alignment, and `processed_rates` contains the concatenated files from `raw_rates`. The following list describes the directories contained in `raw_rates`:
  - `JC`, inferred rates when the true and the inference models are both Jukes-Cantor-like.
  - `all_sites`, inferred rates when the true model is MutSel and the inference model is either Jukes-Cantor-like (JC), WAG, JTT, or LG.
  - `site_dupl`, rates inferred with JC for alignments with different numbers of site duplicates.
  - `ten_sites`, inferred rates when the true model is MutSel and the inference model is JC.
  - `translated`, inferred rates when the true model is a codon MutSel model and the inference model is amino acid JC.
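The step from `raw_rates` (one file per alignment) to `processed_rates` (concatenated files) could look roughly like the sketch below. It is not the repository's own script. The source does not describe the format of the raw files, so the sketch assumes they are CSV files with identical header rows; the function name `concatenate_csv` and the added `source_file` column are hypothetical.

```python
import csv


def concatenate_csv(in_paths, out_path):
    """Write the rows of all input CSV files into one output CSV file.

    Assumption: every input file starts with the same header row.
    A `source_file` column is appended so each row can be traced back
    to the alignment file it came from.
    """
    header = None
    with open(out_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for path in in_paths:
            with open(path, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    # The first non-empty file defines the output header.
                    header = file_header
                    writer.writerow(header + ["source_file"])
                elif file_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row + [path])
```

A call would pass the list of per-alignment files from one `raw_rates` subdirectory and the name of the concatenated file to write.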
- `q_matrix` contains site-wise substitution matrices Q used for simulating alignments and for calculating site-wise rates. There are two directories in `q_matrix`, `amino_acid` and `codon`. `amino_acid` contains amino acid substitution matrices, and `codon` contains codon substitution matrices.

  Files that start with `132L_A` indicate substitution matrices that were calculated using data from Echave et al. (2015) for egg white lysozyme (PDB ID: 132L). For example, the file `132L_A_site79_q_matrix.npy` corresponds to the substitution matrix calculated for site 79. The site positions here correspond to the site positions given by the file `132L_A_foldx_ddG.txt`, which was directly copied from the git repository for Echave et al. (2015), https://github.com/wilkelab/therm_constraints_rate_variation. These matrices were calculated according to the mutation-selection (MutSel) theory by Halpern and Bruno (1998). The script to calculate amino acid MutSel matrices is `src/calculate_aa_mutsel_Q.py`, and the script to calculate codon MutSel matrices is `src/calculate_codon_mutsel_Q.py`.

  Files that start with `site0_JC` indicate substitution matrices Q defined as Q = r(k)QJC. Here, QJC is the Jukes-Cantor-like matrix, and r(k) is the true rate at site k. True rates were generated by the script `src/analytical_rate_aa_true_JC.py` and were stored in `analytical_rates/ten_sites_aa_true_JC.csv`. The script to calculate amino acid substitution matrices Q defined as Q = r(k)QJC is `src/calculate_aa_true_JC_Q.py`.
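To make the relation Q = r(k)QJC concrete, here is a minimal sketch in plain Python. It is not the code in `src/calculate_aa_true_JC_Q.py`. The 20-state matrix with equal off-diagonal rates of 1/19 is an assumed form and normalization of the Jukes-Cantor-like matrix, and the function names and the example rate of 0.5 are made up for illustration.

```python
# Sketch only: the form and normalization of Q_JC below are assumptions,
# not taken from src/calculate_aa_true_JC_Q.py.

N_STATES = 20  # number of amino acids


def jukes_cantor_like(n_states=N_STATES):
    """Return Q_JC with equal off-diagonal rates and rows that sum to zero.

    Each off-diagonal entry is 1 / (n_states - 1), so every state is left
    at a total rate of 1 (an assumed normalization).
    """
    off_diagonal = 1.0 / (n_states - 1)
    return [
        [-1.0 if i == j else off_diagonal for j in range(n_states)]
        for i in range(n_states)
    ]


def scale_matrix(q_jc, rate):
    """Return Q = r(k) * Q_JC for one site k with true rate `rate`."""
    return [[rate * entry for entry in row] for row in q_jc]


q_jc = jukes_cantor_like()
q_site = scale_matrix(q_jc, rate=0.5)  # 0.5 is an arbitrary example rate

print(q_site[0][0])  # diagonal entry of the scaled matrix: -0.5
```

Scaling every entry by the same factor keeps each row sum at zero, so the result is still a valid rate matrix; only the overall speed of substitution at that site changes.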
- `trees` contains tree files that were used for simulating alignments. Each file name stores the number of branches (`n`) in the tree and the branch lengths (`bl`). For example, the file `n2_bl0.005.tre` describes a tree with 2 branches of length 0.005 each.
- `hyphy` contains all scripts and files to run HyPhy.
- `plots` contains plots generated for the manuscript.
- `src` contains code to run the analysis.
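The tree file naming convention described for `trees` can be decoded programmatically. The sketch below is an illustration, not part of the repository. The function name `parse_tree_name` is hypothetical, and the pattern assumes that branch lengths are always written as plain decimal numbers, as in `n2_bl0.005.tre`.

```python
import re

# Pattern for names like "n2_bl0.005.tre": n<branches>_bl<branch length>.tre
# Assumption: the branch length is written as a plain decimal number.
TREE_NAME = re.compile(r"^n(?P<n>\d+)_bl(?P<bl>\d+(?:\.\d+)?)\.tre$")


def parse_tree_name(file_name):
    """Return (number_of_branches, branch_length) encoded in a tree file name."""
    match = TREE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized tree file name: {file_name!r}")
    return int(match.group("n")), float(match.group("bl"))


print(parse_tree_name("n2_bl0.005.tre"))  # (2, 0.005)
```

Names that do not follow the pattern raise a `ValueError` rather than returning a guess, which makes unexpected files in the directory easy to spot.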