Commit

changed some nomenclature
percolator committed Sep 2, 2024
1 parent 4845d63 commit 2f7e636
Showing 1 changed file with 46 additions and 14 deletions.
60 changes: 46 additions & 14 deletions bibook/protein/matrix.ipynb
@@ -211,21 +211,21 @@
"\n",
"#### Background Frequencies ($ q_i $)\n",
"\n",
"The probability $ q_i $ represents the background frequency of amino acid $ i $ across a set of alignments or within a single alignment, depending on the context. It is calculated by counting the occurrence of each amino acid $ i $ in all alignments and then dividing by the total number of amino acid occurrences.\n",
"The probability $ q_i $ represents the background frequency of amino acid $ i $ across a set of alignments or within a single alignment, depending on the context. It is estimated by counting the occurrence of each amino acid $ i $ in all alignments and then dividing by the total number of amino acid occurrences.\n",
"\n",
"##### Calculation of $ q_i $:\n",
"##### Estimation of $ \\hat{q_i} $:\n",
"\n",
"$ q_i = \\frac{n_i}{N} $\n",
"$ \\hat{q_i} = \\frac{n_i}{N} $\n",
"\n",
"where $ n_i $ is the number of times amino acid $ i $ appears in the alignments, and $ N $ is the total number of amino acid residues in all alignments.\n",
"\n",
"#### Joint Probabilities ($ p_{ij} $)\n",
"\n",
"The probability $ p_{ij} $ quantifies how often a particular substitution between amino acids $ i $ and $ j $ occurs. For symmetric matrices, the probabilities $ p_{ij} $ and $ p_{ji} $ are averaged since the direction of substitution does not impact the score.\n",
"\n",
"##### Calculation of $ p_{ij} $:\n",
"##### Calculation of $ \\hat{p_{ij}} $:\n",
"\n",
"$ p_{ij} = \\frac{f_{ij} + f_{ji}}{2T} $\n",
"$ \\hat{p_{ij}} = \\frac{f_{ij} + f_{ji}}{2T} $\n",
"\n",
"where $ f_{ij} $ and $ f_{ji} $ are the frequencies of observing amino acids $ i $ and $ j $ substituted for one another in the alignments, and $ T $ is the total number of substitutions observed.\n",
"\n",
@@ -234,11 +234,10 @@
"1. **Collect Alignment Data**: Aggregate multiple protein sequence alignments that are representative of the evolutionary distances or functional similarities of interest.\n",
"\n",
"2. **Count Occurrences and Substitutions**:\n",
" - Count each amino acid's occurrence to calculate $ p_i $.\n",
" - Count each amino acid's occurrence to estimate the background frequency $ \\hat{q_i} $.\n",
" - Count each substitution pair (both $ ij $ and $ ji $) to determine $ f_{ij} $ and $ f_{ji} $.\n",
"\n",
"3. **Normalize and Average**:\n",
" - Normalize these counts to ensure they sum up to 1, yielding $ p_i $.\n",
" - Average the counts of $ f_{ij} $ and $ f_{ji} $ to reflect the symmetric nature of substitutions.\n",
"\n",
"#### Example in Python\n",
@@ -248,15 +247,39 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Background Frequencies (q_i): {'A': 0.13636363636363635, 'D': 0.18181818181818182, 'E': 0.09090909090909091, 'I': 0.13636363636363635, 'L': 0.045454545454545456, 'M': 0.09090909090909091, 'P': 0.18181818181818182, 'T': 0.045454545454545456, 'V': 0.045454545454545456, 'W': 0.045454545454545456}\n",
"Joint Probabilities (p_ij): {('I', 'L'): 0.09090909090909091, ('A', 'A'): 0.09090909090909091, ('M', 'M'): 0.09090909090909091, ('A', 'V'): 0.09090909090909091, ('P', 'P'): 0.18181818181818182, ('D', 'E'): 0.18181818181818182, ('T', 'W'): 0.09090909090909091, ('I', 'I'): 0.09090909090909091, ('D', 'D'): 0.09090909090909091}\n"
"Background Frequencies (q_i):\n",
"A: 0.136\n",
"D: 0.182\n",
"E: 0.091\n",
"I: 0.136\n",
"L: 0.045\n",
"M: 0.091\n",
"P: 0.182\n",
"T: 0.045\n",
"V: 0.045\n",
"W: 0.045\n",
"\n",
"Joint Probabilities (p_ij):\n",
"('I', 'L'): 0.045\n",
"('A', 'A'): 0.091\n",
"('M', 'M'): 0.091\n",
"('A', 'V'): 0.045\n",
"('P', 'P'): 0.182\n",
"('D', 'E'): 0.091\n",
"('T', 'W'): 0.045\n",
"('I', 'I'): 0.091\n",
"('D', 'D'): 0.091\n",
"('L', 'I'): 0.045\n",
"('V', 'A'): 0.045\n",
"('E', 'D'): 0.091\n",
"('W', 'T'): 0.045\n"
]
}
],
@@ -282,14 +305,23 @@
" for i in range(min(len(seq1), len(seq2))): # Ensure index is within the shortest sequence\n",
" pair = tuple(sorted([seq1[i], seq2[i]])) # Sort the pair to handle symmetry\n",
" if pair in pair_counts:\n",
" pair_counts[pair] += 1\n",
" pair_counts[pair] += 1.0\n",
" else:\n",
" pair_counts[pair] = 1\n",
" pair_counts[pair] = 1.0\n",
"# Enforce symmetry: split each off-diagonal count evenly between (a1, a2) and (a2, a1)\n",
"for (a1, a2), count in list(pair_counts.items()):\n",
" if a1 != a2:\n",
" reversed_pair = (a2, a1)\n",
" avg_count = pair_counts[(a1, a2)] / 2.0\n",
" pair_counts[(a1, a2)] = avg_count\n",
" pair_counts[reversed_pair] = avg_count\n",
"\n",
"\n",
"p_ij = {pair: count / sum(pair_counts.values()) for pair, count in pair_counts.items()}\n",
"\n",
"print(\"Background Frequencies (q_i):\", q_i)\n",
"print(\"Joint Probabilities (p_ij):\", p_ij)\n"
"print(\"Background Frequencies (q_i):\\n\" + \"\\n\".join(f\"{key}: {value:.3f}\" for key, value in q_i.items()))\n",
"print()\n",
"print(\"Joint Probabilities (p_ij):\\n\" + \"\\n\".join(f\"{key}: {value:.3f}\" for key, value in p_ij.items()))\n"
]
},
{
Expand Down
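
The estimation that this commit's notebook cell performs can be sketched in a self-contained form as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function name `estimate_frequencies` and the toy alignments are hypothetical, the input is assumed to be gap-free aligned sequence pairs, and the background counts are taken over aligned columns only (the notebook may count residues differently). Each observed column contributes half a count to $(i, j)$ and half to $(j, i)$, which matches $ \hat{p_{ij}} = \frac{f_{ij} + f_{ji}}{2T} $.

```python
from collections import Counter


def estimate_frequencies(aligned_pairs):
    """Estimate background frequencies q_i and symmetric joint
    probabilities p_ij from gap-free aligned sequence pairs."""
    residue_counts = Counter()
    pair_counts = Counter()
    for seq1, seq2 in aligned_pairs:
        for a, b in zip(seq1, seq2):  # zip stops at the shorter sequence
            residue_counts[a] += 1
            residue_counts[b] += 1
            # Split each observation evenly between (a, b) and (b, a).
            # When a == b, both halves land on the same key.
            pair_counts[(a, b)] += 0.5
            pair_counts[(b, a)] += 0.5

    n_total = sum(residue_counts.values())  # N: total residues counted
    t_total = sum(pair_counts.values())     # T: total aligned columns
    q_hat = {a: n / n_total for a, n in residue_counts.items()}
    p_hat = {pair: c / t_total for pair, c in pair_counts.items()}
    return q_hat, p_hat


# Hypothetical toy alignments, not the data used in the notebook.
toy_pairs = [("AIL", "AVL"), ("DE", "ED")]
q_hat, p_hat = estimate_frequencies(toy_pairs)

print("Background Frequencies (q_i):")
for residue, value in sorted(q_hat.items()):
    print(f"{residue}: {value:.3f}")
print()
print("Joint Probabilities (p_ij):")
for pair, value in sorted(p_hat.items()):
    print(f"{pair}: {value:.3f}")
```

Under these assumptions the joint probabilities sum to 1, the table is symmetric, and summing $ \hat{p_{ij}} $ over $ j $ recovers $ \hat{q_i} $, which gives a quick consistency check on the estimates.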
