changed some nomenclature

statisticalbiotechnology · Sep 2, 2024 · 2f7e636
1 parent 4845d63
commit 2f7e636
Showing 1 changed file with 46 additions and 14 deletions.
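
Reviewer note: the two estimators this commit renames, $ \hat{q_i} = \frac{n_i}{N} $ and $ \hat{p_{ij}} = \frac{f_{ij} + f_{ji}}{2T} $, can be sketched independently of the notebook cell as follows. This is a minimal illustration, not the notebook's own code. The function name `estimate_frequencies` and the toy `alignments` list are hypothetical, and the sketch assumes ungapped, equal-length sequence pairs and takes $ T $ to be the number of aligned positions, identical pairs included.

```python
from collections import Counter


def estimate_frequencies(alignments):
    """Return (q_hat, p_hat) for a list of (seq1, seq2) aligned pairs.

    q_hat[i]      = n_i / N                 (background frequency estimate)
    p_hat[(i, j)] = (f_ij + f_ji) / (2 T)   (symmetric joint probability estimate)
    """
    residue_counts = Counter()  # n_i
    pair_counts = Counter()     # f_ij, ordered as (residue in seq1, residue in seq2)

    for seq1, seq2 in alignments:
        residue_counts.update(seq1)
        residue_counts.update(seq2)
        # zip stops at the shorter sequence, so only aligned positions are counted
        pair_counts.update(zip(seq1, seq2))

    N = sum(residue_counts.values())  # total residues
    T = sum(pair_counts.values())     # total aligned positions (assumption, see note above)

    q_hat = {a: n / N for a, n in residue_counts.items()}

    p_hat = {}
    for (a, b), f_ab in pair_counts.items():
        f_ba = pair_counts.get((b, a), 0)
        value = (f_ab + f_ba) / (2 * T)
        p_hat[(a, b)] = value
        p_hat[(b, a)] = value  # store both orderings so the table is symmetric
    return q_hat, p_hat


# Hypothetical toy data, not the alignments used in the notebook
alignments = [("AD", "AE"), ("IL", "II")]
q_hat, p_hat = estimate_frequencies(alignments)
```

Storing both orderings of each pair mirrors the new output in this commit, where `('I', 'L')` and `('L', 'I')` each carry half of the off-diagonal mass, so the entries of `p_hat` still sum to 1.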
diff --git a/bibook/protein/matrix.ipynb b/bibook/protein/matrix.ipynb
@@ -211,21 +211,21 @@
     "\n",
     "#### Background Frequencies ($ q_i $)\n",
     "\n",
-    "The probability $ q_i $ represents the background frequency of amino acid $ i $ across a set of alignments or within a single alignment, depending on the context. It is calculated by counting the occurrence of each amino acid $ i $ in all alignments and then dividing by the total number of amino acid occurrences.\n",
+    "The probability $ q_i $ represents the background frequency of amino acid $ i $ across a set of alignments or within a single alignment, depending on the context. It is estimated by counting the occurrences of each amino acid $ i $ in all alignments and then dividing by the total number of amino acid residues.\n",
     "\n",
-    "##### Calculation of $ q_i $:\n",
+    "##### Estimation of $ \\hat{q_i} $:\n",
     "\n",
-    "$ q_i = \\frac{n_i}{N} $\n",
+    "$ \\hat{q_i} = \\frac{n_i}{N} $\n",
     "\n",
     "where $ n_i $ is the number of times amino acid $ i $ appears in the alignments, and $ N $ is the total number of amino acid residues in all alignments.\n",
     "\n",
     "#### Joint Probabilities ($ p_{ij} $)\n",
     "\n",
     "The probability $ p_{ij} $ quantifies how often a particular substitution between amino acids $ i $ and $ j $ occurs. For symmetric matrices, the probabilities $ p_{ij} $ and $ p_{ji} $ are averaged since the direction of substitution does not impact the score.\n",
     "\n",
-    "##### Calculation of $ p_{ij} $:\n",
+    "##### Estimation of $ \\hat{p_{ij}} $:\n",
     "\n",
-    "$ p_{ij} = \\frac{f_{ij} + f_{ji}}{2T} $\n",
+    "$ \\hat{p_{ij}} = \\frac{f_{ij} + f_{ji}}{2T} $\n",
     "\n",
     "where $ f_{ij} $ and $ f_{ji} $ are the frequencies of observing amino acids $ i $ and $ j $ substituted for one another in the alignments, and $ T $ is the total number of substitutions observed.\n",
     "\n",
@@ -234,11 +234,10 @@
     "1. **Collect Alignment Data**: Aggregate multiple protein sequence alignments that are representative of the evolutionary distances or functional similarities of interest.\n",
     "\n",
     "2. **Count Occurrences and Substitutions**:\n",
-    "   - Count each amino acid's occurrence to calculate $ p_i $.\n",
+    "   - Count each amino acid's occurrences to estimate $ \\hat{q_i} $.\n",
     "   - Count each substitution pair (both $ ij $ and $ ji $) to determine $ f_{ij} $ and $ f_{ji} $.\n",
     "\n",
     "3. **Normalize and Average**:\n",
-    "   - Normalize these counts to ensure they sum up to 1, yielding $ p_i $.\n",
     "   - Average the counts of $ f_{ij} $ and $ f_{ji} $ to reflect the symmetric nature of substitutions.\n",
     "\n",
     "#### Example in Python\n",
@@ -248,15 +247,39 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Background Frequencies (q_i): {'A': 0.13636363636363635, 'D': 0.18181818181818182, 'E': 0.09090909090909091, 'I': 0.13636363636363635, 'L': 0.045454545454545456, 'M': 0.09090909090909091, 'P': 0.18181818181818182, 'T': 0.045454545454545456, 'V': 0.045454545454545456, 'W': 0.045454545454545456}\n",
-      "Joint Probabilities (p_ij): {('I', 'L'): 0.09090909090909091, ('A', 'A'): 0.09090909090909091, ('M', 'M'): 0.09090909090909091, ('A', 'V'): 0.09090909090909091, ('P', 'P'): 0.18181818181818182, ('D', 'E'): 0.18181818181818182, ('T', 'W'): 0.09090909090909091, ('I', 'I'): 0.09090909090909091, ('D', 'D'): 0.09090909090909091}\n"
+      "Background Frequencies (q_i):\n",
+      "A: 0.136\n",
+      "D: 0.182\n",
+      "E: 0.091\n",
+      "I: 0.136\n",
+      "L: 0.045\n",
+      "M: 0.091\n",
+      "P: 0.182\n",
+      "T: 0.045\n",
+      "V: 0.045\n",
+      "W: 0.045\n",
+      "\n",
+      "Joint Probabilities (p_ij):\n",
+      "('I', 'L'): 0.045\n",
+      "('A', 'A'): 0.091\n",
+      "('M', 'M'): 0.091\n",
+      "('A', 'V'): 0.045\n",
+      "('P', 'P'): 0.182\n",
+      "('D', 'E'): 0.091\n",
+      "('T', 'W'): 0.045\n",
+      "('I', 'I'): 0.091\n",
+      "('D', 'D'): 0.091\n",
+      "('L', 'I'): 0.045\n",
+      "('V', 'A'): 0.045\n",
+      "('E', 'D'): 0.091\n",
+      "('W', 'T'): 0.045\n"
      ]
     }
    ],
@@ -282,14 +305,23 @@
     "    for i in range(min(len(seq1), len(seq2))):  # Ensure index is within the shortest sequence\n",
     "        pair = tuple(sorted([seq1[i], seq2[i]]))  # Sort the pair to handle symmetry\n",
     "        if pair in pair_counts:\n",
-    "            pair_counts[pair] += 1\n",
+    "            pair_counts[pair] += 1.0\n",
     "        else:\n",
-    "            pair_counts[pair] = 1\n",
+    "            pair_counts[pair] = 1.0\n",
+    "# Split each off-diagonal count evenly between (a1, a2) and (a2, a1) so both orderings are present\n",
+    "for (a1, a2), count in list(pair_counts.items()):\n",
+    "    if a1 != a2:\n",
+    "        reversed_pair = (a2, a1)\n",
+    "        avg_count = pair_counts[(a1, a2)] / 2.0\n",
+    "        pair_counts[(a1, a2)] = avg_count\n",
+    "        pair_counts[reversed_pair] = avg_count\n",
+    "\n",
     "\n",
     "p_ij = {pair: count / sum(pair_counts.values()) for pair, count in pair_counts.items()}\n",
     "\n",
-    "print(\"Background Frequencies (q_i):\", q_i)\n",
-    "print(\"Joint Probabilities (p_ij):\", p_ij)\n"
+    "print(\"Background Frequencies (q_i):\\n\" + \"\\n\".join(f\"{key}: {value:.3f}\" for key, value in q_i.items()))\n",
+    "print()\n",
+    "print(\"Joint Probabilities (p_ij):\\n\" + \"\\n\".join(f\"{key}: {value:.3f}\" for key, value in p_ij.items()))\n"
    ]
   },
   {