diff --git a/docs/lab02.html b/docs/lab02.html
index 3d58ef6..4738de2 100644
--- a/docs/lab02.html
+++ b/docs/lab02.html
@@ -378,7 +378,7 @@ <h2 class="anchored" data-anchor-id="packages">Package(s)</h2>
 <section id="schedule" class="level2">
 <h2 class="anchored" data-anchor-id="schedule">Schedule</h2>
 <ul>
-<li>08.00 - 08.15: <a href="https://raw.githack.com/r4bds/r4bds.github.io/main/pre_course_questionnaire_summary.html">Pre-course Survey Walk-through</a></li>
+<li>08.00 - 08.15: <a href="https://raw.githack.com/r4bds/r4bds.github.io/main/pre_course_questionnaire_summary.html">Pre-course Anonymous Questionnaire Walk-through</a></li>
 <li>08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session)</li>
 <li>08.30 - 09.00: <a href="https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab02.html">Lecture</a></li>
 <li>09.00 - 09.15: Break</li>
diff --git a/docs/lab05.html b/docs/lab05.html
index b74c249..6b230a9 100644
--- a/docs/lab05.html
+++ b/docs/lab05.html
@@ -568,18 +568,18 @@ <h4 class="anchored" data-anchor-id="the-subject-meta-data">The Subject Meta Dat
 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 30
-   Experiment Subject `Cell Type` `Target Type` Cohort          Age Gender Race 
-   &lt;chr&gt;        &lt;dbl&gt; &lt;chr&gt;       &lt;chr&gt;         &lt;chr&gt;         &lt;dbl&gt; &lt;chr&gt;  &lt;chr&gt;
- 1 eMR12         1770 PBMC        C19_cI        COVID-19-Con…    NA &lt;NA&gt;   &lt;NA&gt; 
- 2 eHO132          26 PBMC        C19_cI        COVID-19-Con…    65 F      White
- 3 eQD109        1349 PBMC        C19_cI        COVID-19-Con…    61 M      &lt;NA&gt; 
- 4 eEE240       20795 naive_CD8   C19_cI        Healthy (No …    23 M      White
- 5 eQD131        2267 PBMC        C19_cI        COVID-19-Exp…    NA &lt;NA&gt;   &lt;NA&gt; 
- 6 ePD80         1027 PBMC        C19_cI        COVID-19-Con…    67 M      &lt;NA&gt; 
- 7 eJL154          83 PBMC        C19_cI        COVID-19-Exp…    35 F      Nati…
- 8 eNL187        2686 B-CD8-_PBMC C19_cII       COVID-19-Con…    NA &lt;NA&gt;   &lt;NA&gt; 
- 9 eOX54        10881 naive_CD8   C19_cI        Healthy (No …    39 F      Afri…
-10 eOX49        10943 naive_CD8   C19_cI        Healthy (No …    21 M      White
+   Experiment Subject `Cell Type`       `Target Type` Cohort    Age Gender Race 
+   &lt;chr&gt;        &lt;dbl&gt; &lt;chr&gt;             &lt;chr&gt;         &lt;chr&gt;   &lt;dbl&gt; &lt;chr&gt;  &lt;chr&gt;
+ 1 eXL27        19830 naive_CD8         C19_cI        Health…    24 M      White
+ 2 eHO141        3238 PBMC              C19_cI        COVID-…    NA &lt;NA&gt;   &lt;NA&gt; 
+ 3 eHH175       20300 naive_CD8         C19_cI        Health…    28 M      White
+ 4 eHO124        3819 PBMC              C19_cI        Health…    62 M      &lt;NA&gt; 
+ 5 ePD100        1811 PBMC              C19_cI        COVID-…    66 M      &lt;NA&gt; 
+ 6 ePD85         5869 naive_CD8         C19_cI        Health…    27 F      &lt;NA&gt; 
+ 7 eJL161     1005703 PBMC              C19_cI        COVID-…    31 F      White
+ 8 eHH173       19829 naive_CD8         C19_cI        Health…    50 M      White
+ 9 eLH59         1770 B- depleted PBMCs C19_cII       COVID-…    NA &lt;NA&gt;   &lt;NA&gt; 
+10 ePD73         4423 naive_CD8         minigene_Set1 Health…    37 F      White
 # ℹ 22 more variables: `HLA-A...9` &lt;chr&gt;, `HLA-A...10` &lt;chr&gt;,
 #   `HLA-B...11` &lt;chr&gt;, `HLA-B...12` &lt;chr&gt;, `HLA-C...13` &lt;chr&gt;,
 #   `HLA-C...14` &lt;chr&gt;, DPA1...15 &lt;chr&gt;, DPA1...16 &lt;chr&gt;, DPB1...17 &lt;chr&gt;,
@@ -643,16 +643,16 @@ <h5 class="anchored" data-anchor-id="stop-make-sure-you-handled-how-nas-are-deno
 <pre><code># A tibble: 10 × 27
    Experiment Cohort      Age Gender Race  `HLA-A...9` `HLA-A...10` `HLA-B...11`
    &lt;chr&gt;      &lt;chr&gt;     &lt;dbl&gt; &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;       &lt;chr&gt;        &lt;chr&gt;       
- 1 eHO126     COVID-19…    37 F      &lt;NA&gt;  "A*01:01:0… "A*24:02:01" "B*07:02:01"
- 2 eJL160     COVID-19…    52 F      Afri… "A*01:01:0… "A*02:01:01" "B*44:02:01"
- 3 eNL192     COVID-19…    NA &lt;NA&gt;   &lt;NA&gt;  ""          ""           ""          
- 4 eHO130     Healthy …    28 F      White "A*02:01"   "A*03:01"    "B*07:02"   
- 5 eLH45      COVID-19…    53 M      &lt;NA&gt;  "A*02:01:0… "A*03:01:01" "B*07:02:01"
- 6 eQD134     COVID-19…    NA &lt;NA&gt;   &lt;NA&gt;  "A*24:07:0… "A*34:01:01" "B*15:02:01"
- 7 eJL161     COVID-19…    31 F      White "A*01:01:0… "A*02:01:01" "B*08:01:01"
- 8 eQD116     COVID-19…    66 F      &lt;NA&gt;  "A*03:01:0… "A*11:01:01" "B*35:01:01"
- 9 ePD86      COVID-19…    58 M      White "A*02:01:0… "A*26:01:01" "B*44:27:01"
-10 eLH47      COVID-19…    35 F      White "A*01:01:0… "A*02:01:01" "B*07:02:01"
+ 1 eJL154     COVID-19…    35 F      Nati… "A*02:01:0… "A*29:02:01" "B*15:02:01"
+ 2 eXL31      Healthy …    28 M      White "A*02:01"   "A*29:02"    "B*07:02"   
+ 3 ePD90      COVID-19…    29 M      &lt;NA&gt;  ""          ""           ""          
+ 4 eQD116     COVID-19…    66 F      &lt;NA&gt;  "A*03:01:0… "A*11:01:01" "B*35:01:01"
+ 5 eMR23      COVID-19…    22 F      &lt;NA&gt;  ""          ""           ""          
+ 6 eJL158     COVID-19…    33 M      White "A*02:01:0… "A*24:02:01" "B*15:01:01"
+ 7 eAV93      Healthy …    41 M      White "A*11:01"   "A*68:01"    "B*35:01"   
+ 8 eLH45      COVID-19…    53 M      &lt;NA&gt;  "A*02:01:0… "A*03:01:01" "B*07:02:01"
+ 9 eLH59      COVID-19…    NA &lt;NA&gt;   &lt;NA&gt;  "A*01:01:0… "A*02:01:01" "B*40:01:02"
+10 eNL192     COVID-19…    NA &lt;NA&gt;   &lt;NA&gt;  ""          ""           ""          
 # ℹ 19 more variables: `HLA-B...12` &lt;chr&gt;, `HLA-C...13` &lt;chr&gt;,
 #   `HLA-C...14` &lt;chr&gt;, DPA1...15 &lt;chr&gt;, DPA1...16 &lt;chr&gt;, DPB1...17 &lt;chr&gt;,
 #   DPB1...18 &lt;chr&gt;, DQA1...19 &lt;chr&gt;, DQA1...20 &lt;chr&gt;, DQB1...21 &lt;chr&gt;,
@@ -867,16 +867,16 @@ <h5 class="anchored" data-anchor-id="stop-make-sure-you-handled-how-nas-are-deno
 <pre><code># A tibble: 10 × 11
    Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
    &lt;chr&gt;      &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
- 1 eQD114     COVID-19-C…    73 M      &lt;NA&gt;  "A*0… "A*2… "B*0… "B*4… "C*0… "C*1…
- 2 eNL189     COVID-19-E…    NA &lt;NA&gt;   &lt;NA&gt;  ""    ""    ""    ""    ""    ""   
- 3 eQD112     COVID-19-C…    65 M      &lt;NA&gt;  "A*2… "A*2… "B*0… "B*3… "C*0… "C*0…
- 4 eLH46      COVID-19-C…    57 F      White "A*0… "A*2… "B*3… "B*5… "C*1… "C*1…
- 5 eAV88      Healthy (N…    24 M      White "A*0… "A*0… "B*2… "B*4… "C*0… "C*0…
- 6 eNL187     COVID-19-C…    NA &lt;NA&gt;   &lt;NA&gt;  ""    ""    ""    ""    ""    ""   
- 7 eJL154     COVID-19-E…    35 F      Nati… "A*0… "A*2… "B*1… "B*4… "C*0… "C*1…
- 8 eQD124     COVID-19-B…    40 F      White "A*0… "A*0… "B*1… "B*5… "C*0… "C*0…
- 9 eQD134     COVID-19-C…    NA &lt;NA&gt;   &lt;NA&gt;  "A*2… "A*3… "B*1… "B*1… "C*0… "C*0…
-10 ePD83      Healthy (N…    29 F      Asian "A*0… "A*0… "B*1… "B*4… "C*0… "C*0…</code></pre>
+ 1 eLH54      COVID-19-C…    NA &lt;NA&gt;   &lt;NA&gt;  "A*0… "A*0… "B*0… "B*4… "C*0… "C*0…
+ 2 eHH173     Healthy (N…    50 M      White "A*0… "A*0… "B*3… "B*4… "C*0… "C*0…
+ 3 eMR25      COVID-19-C…    21 F      &lt;NA&gt;  ""    ""    ""    ""    ""    ""   
+ 4 ePD91      COVID-19-C…    52 M      White ""    ""    ""    ""    ""    ""   
+ 5 eDH105     COVID-19-C…    32 F      &lt;NA&gt;  "A*2… "A*2… "B*4… "B*4… "C*0… "C*0…
+ 6 ePD87      COVID-19-C…    47 M      White "A*0… "A*2… "B*0… "B*0… "C*0… "C*0…
+ 7 eAM23      COVID-19-C…    48 M      &lt;NA&gt;  "A*1… "A*2… "B*1… "B*5… "C*0… "C*1…
+ 8 eOX54      Healthy (N…    39 F      Afri… "A*0… "A*2… "B*1… "B*5… "C*0… "C*1…
+ 9 eMR12      COVID-19-C…    NA &lt;NA&gt;   &lt;NA&gt;  "A*0… "A*0… "B*4… "B*5… "C*0… "C*1…
+10 eQD131     COVID-19-E…    NA &lt;NA&gt;   &lt;NA&gt;  "A*0… "A*3… "B*1… "B*5… "C*0… "C*0…</code></pre>
 </div>
 </div>
 <p>Now we have a beautiful <code>tidy</code> dataset. Recall that this means each row is an observation, each column is a variable, and each cell holds one value.</p>
@@ -892,16 +892,16 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <pre><code># A tibble: 10 × 7
    `TCR BioIdentity`            TCR Nucleotide Seque…¹ Experiment `ORF Coverage`
    &lt;chr&gt;                        &lt;chr&gt;                  &lt;chr&gt;      &lt;chr&gt;         
- 1 CSGQQGYEQYF+TCRBV29-01+TCRB… ACTCTGACTGTGAGCAACATG… eXL30      surface glyco…
- 2 CASSPRTTPAPQHF+TCRBV19-01+T… GTGACATCGGCCCAAAAGAAC… eEE224     membrane glyc…
- 3 CASSEVGTLEAFF+TCRBV25-01+TC… ACCCTGGAGTCTGCCAGGCCC… eEE228     membrane glyc…
- 4 CSARLGQGSYEQYF+TCRBV20-X+TC… GTGACCAGTGCCCATCCTGAA… eXL30      surface glyco…
- 5 CASSEGLGGYEQYF+TCRBV06-01+T… NNNNTGTCGGCTGCTCCCTCC… eEE240     ORF3a         
- 6 CASSHLDRGSYNEQFF+TCRBV04-01… GCCCTGCAGCCAGAAGACTCA… eEE226     surface glyco…
- 7 CASSERDPRQETQYF+TCRBV27-01+… GAGTCGCCCAGCCCCAACCAG… eQD114     ORF1ab        
- 8 CASSVGGRSYEQYF+TCRBV09-01+T… CTGAGCTCTCTGGAGCTGGGG… ePD82      ORF3a         
- 9 CASSPAPIAYEQYF+TCRBV06-05+T… NNNNTGTCGGCTGCTCCCTCC… eQD131     surface glyco…
-10 CASSQETANTGELFF+TCRBV04-02+… CACACCCTGCAGCCAGAAGAC… eLH51      nucleocapsid …
+ 1 CASSEAPGLEFGNTIYF+TCRBV02-0… ACAAAGCTGGAGGACTCAGCC… eXL30      ORF1ab        
+ 2 CASSHEDRGRPGELFF+TCRBV03-01… TCCCTGGAGCTTGGTGACTCT… eMR17      ORF1ab        
+ 3 CASSPPTDTQYF+TCRBV27-01+TCR… CTGATCCTGGAGTCGCCCAGC… eXL31      surface glyco…
+ 4 CASSLGLAGEQYF+TCRBV07-02+TC… ACGATCCAGCGCACAGAGCAG… eDH113     ORF1ab        
+ 5 CASSYWNEQFF+TCRBV06-05+TCRB… NNNNNNNNNNNNNTGTCGGCT… eOX52      ORF1ab        
+ 6 CASSLVGGDPSTDTQYF+TCRBV13-0… TTGGAGCTGGGGGACTCAGCC… eEE226     ORF1ab        
+ 7 CASSIGVGRAYEQYF+TCRBV19-01+… ACATCGGCCCAAAAGAACCCG… ePD83      ORF3a         
+ 8 CSALGQGNVQFF+TCRBV29-01+TCR… CTGACTGTGAGCAACATGAGC… eEE226     ORF6          
+ 9 CASSQLRYTEAFF+TCRBV04-03+TC… CACCTACACACCCTGCAGCCA… eHO124     ORF1ab        
+10 CASSLFGRGPTYNEQFF+TCRBV27-0… CCCAGCCCCAACCAGACCTCT… eLH43      ORF1ab        
 # ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`
 # ℹ 3 more variables: `Amino Acids` &lt;chr&gt;, `Start Index in Genome` &lt;dbl&gt;,
 #   `End Index in Genome` &lt;dbl&gt;</code></pre>
@@ -937,18 +937,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 3
-   Experiment `TCR BioIdentity`                           `Amino Acids`         
-   &lt;chr&gt;      &lt;chr&gt;                                       &lt;chr&gt;                 
- 1 eAV93      CASSILLAGGTDTQYF+TCRBV27-01+TCRBJ02-03      FLWLLWPVT,FLWLLWPVTL,…
- 2 eMR15      CASSLIQGANTEAFF+TCRBV07-09+TCRBJ01-01       CPDGVKHVY,DGVKHVYQL,F…
- 3 eQD123     CASSPQGAGSLYEQYF+TCRBV04-01+TCRBJ02-07      FLQSINFVR,FLQSINFVRI,…
- 4 eOX54      CSADTQYF+TCRBV20-01+TCRBJ02-03              FIASFRLFA,SYFIASFRLF,…
- 5 eOX52      CASSQDAGLANEQYF+TCRBV03-01/03-02+TCRBJ02-07 ELYSPIFLI,LYSPIFLIV,Q…
- 6 eAV93      CASSLVATGELFF+TCRBV05-04+TCRBJ02-02         AFPFTIYSL,GYINVFAFPF,…
- 7 eLH44      CASSLNPGEGPQNIQYF+TCRBV28-01+TCRBJ02-04     AFPFTIYSL,GYINVFAFPF,…
- 8 eAV93      CASSSRTSGWYNEQFF+TCRBV11-02+TCRBJ02-01      AIPTNFTISV,AYSNNSIAIP…
- 9 eEE240     CASSPPGAPMGQPQHF+TCRBV27-01+TCRBJ01-05      FLNGSCGSV             
-10 eOX54      CATSDLPSTGTEVTGELFF+TCRBV24-01+TCRBJ02-02   QYIKWPWYI,YEQYIKWPW,Y…</code></pre>
+   Experiment `TCR BioIdentity`                       `Amino Acids`             
+   &lt;chr&gt;      &lt;chr&gt;                                   &lt;chr&gt;                     
+ 1 eEE226     CASSQDSDGGGNTIYF+TCRBV04-02+TCRBJ01-03  AEAELAKNVSL,AELAKNVSLDNVL 
+ 2 eOX54      CASSTVGGPFQPQHF+TCRBV12-X+TCRBJ01-05    MMISAGFSL                 
+ 3 eXL27      CASRKTTDTQYF+TCRBV27-01+TCRBJ02-03      AFLLFLVLI,FLAFLLFLV,FYLCF…
+ 4 eHO135     CAWRRGGKLFF+TCRBV30-01+TCRBJ01-04       AYKTFPPTEPK,KTFPPTEPK     
+ 5 eXL36      CASSVAAAVSYNEQFF+TCRBV09-01+TCRBJ02-01  VDDPCPIHFY,VVDDPCPIHFY,YV…
+ 6 eEE228     CASSFPQNTQYF+TCRBV07-09+TCRBJ02-03      FLWLLWPVT,FLWLLWPVTL,LWLL…
+ 7 eEE226     CASSSRTEGSTDTQYF+TCRBV11-02+TCRBJ02-03  EEHVQIHTI                 
+ 8 eOX52      CASSVEGTVNEKLFF+TCRBV09-01+TCRBJ01-04   FVDGVPFVV                 
+ 9 eQD137     CSVVSGISYNEQFF+TCRBV29-01+TCRBJ02-01    AFLLFLVLI,FLAFLLFLV,FYLCF…
+10 ePD83      CASSIGLGLAEYNEQFF+TCRBV19-01+TCRBJ02-01 SEHDYQIGGYTEKW,YQIGGYTEK,…</code></pre>
 </div>
 </div>
 <ul>
@@ -963,18 +963,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="at">size =</span> <span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 5
-   Experiment CDR3b              V_gene     J_gene     `Amino Acids`            
-   &lt;chr&gt;      &lt;chr&gt;              &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;                    
- 1 eEE240     CASSPYGGTEAFF      TCRBV07-06 TCRBJ01-01 YLNTLTLAV                
- 2 eXL30      CATPPRGGTGELFF     TCRBV07-09 TCRBJ02-02 AFLLFLVLI,FLAFLLFLV,FYLC…
- 3 eMR16      CASSLVWGAKNIQYF    TCRBV07-08 TCRBJ02-04 AFPFTIYSL,GYINVFAFPF,INV…
- 4 eQD111     CASSMTSSRDEQYF     TCRBV27-01 TCRBJ02-07 HTTDPSFLGRY              
- 5 eEE224     CASSFGLGSDPFF      TCRBV27-01 TCRBJ02-01 AFPFTIYSL,GYINVFAFPF,INV…
- 6 eXL32      CASSPPTPAGWANEKLFF TCRBV28-01 TCRBJ01-04 TVLSFCAFA,VLSFCAFAV      
- 7 eQD126     CASRPPDGGIYEQYF    TCRBV06-05 TCRBJ02-07 HTTDPSFLGRY              
- 8 eQD128     CASSELAGPQETQYF    TCRBV06-01 TCRBJ02-05 GYQPYRVVVL,PYRVVVLSF,QPY…
- 9 eEE228     CSVLQGTEAFF        TCRBV29-01 TCRBJ01-01 AEAELAKNVSL,AELAKNVSLDNVL
-10 eQD113     CASSSTSGGNEQFF     TCRBV07-08 TCRBJ02-01 IGAGICASY,IPIGAGICASY    </code></pre>
+   Experiment CDR3b              V_gene           J_gene     `Amino Acids`      
+   &lt;chr&gt;      &lt;chr&gt;              &lt;chr&gt;            &lt;chr&gt;      &lt;chr&gt;              
+ 1 ePD76      CASSPRPGLAGGRDTQYF TCRBV07-06       TCRBJ02-03 SSANNCTFEY,VYSSANN…
+ 2 eEE226     CASSEASMNTEAFF     TCRBV06-01       TCRBJ01-01 LPAADLDDF          
+ 3 eOX49      CASSRQTEAFF        TCRBV03-01/03-02 TCRBJ01-01 FLNGSCGSV          
+ 4 eOX43      CASSLRGTGESEFF     TCRBV12-X        TCRBJ02-01 AFPFTIYSL,GYINVFAF…
+ 5 eOX43      CASSHAASRSYEQYF    TCRBV04-01       TCRBJ02-07 APKEIIFL,KEIIFLEGE…
+ 6 eEE226     CASSHWSVAEETQYF    TCRBV03-01/03-02 TCRBJ02-05 KLSYGIATV          
+ 7 eJL164     CSASERSTTLGQTTQYF  TCRBV20-01       TCRBJ02-03 KLWAQCVQL          
+ 8 eEE228     CAISDRLISGSTGELFF  TCRBV10-03       TCRBJ02-02 FPNITNLCPF,QPTESIV…
+ 9 eQD132     CASSSRTKGYEQYF     TCRBV06-05       TCRBJ02-07 STQDLFLPFF,TQDLFLP…
+10 eOX52      CASSIGPLDSYGYTF    TCRBV19-01       TCRBJ01-02 KLSYGIATV          </code></pre>
 </div>
 </div>
 <details>
@@ -1004,18 +1004,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="at">size =</span> <span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 6
-   Experiment CDR3b            V_gene     J_gene     `Amino Acids`    n_peptides
-   &lt;chr&gt;      &lt;chr&gt;            &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;                 &lt;dbl&gt;
- 1 eEE226     CATQLPSTDTQYF    TCRBV06-05 TCRBJ02-03 GRLQSLQTY,LITGR…          3
- 2 eEE226     CASSSPGAGTGELFF  TCRBV27-01 TCRBJ02-02 HTTDPSFLGRY               1
- 3 eXL31      CASSLGGEQYF      TCRBV27-01 TCRBJ02-07 TEKSNIIRGW                1
- 4 eEE228     CASRYSEAYEQYF    TCRBV27-01 TCRBJ02-07 FPPTSFGPL                 1
- 5 eXL27      CAPRRGAGVSEAFF   TCRBV28-01 TCRBJ01-01 DFLEYHDVR,EDFLE…          5
- 6 eHO141     CASSVSGTGDADTQYF TCRBV06-01 TCRBJ02-03 YLQPRTFL,YLQPRT…          3
- 7 eAV91      CASSPERLGYTF     TCRBV28-01 TCRBJ01-02 ITDVFYKENSY,SEY…          2
- 8 eEE240     CASSPDRLAGEQYF   TCRBV04-02 TCRBJ02-07 KLSYGIATV                 1
- 9 eAV93      CSVRDFLYNEQFF    TCRBV29-01 TCRBJ02-01 AFPFTIYSL,GYINV…          7
-10 eEE226     CASSHYNGNQPQHF   TCRBV27-01 TCRBJ01-05 LLDDFVEII,LLLDD…          2</code></pre>
+   Experiment CDR3b               V_gene         J_gene `Amino Acids` n_peptides
+   &lt;chr&gt;      &lt;chr&gt;               &lt;chr&gt;          &lt;chr&gt;  &lt;chr&gt;              &lt;dbl&gt;
+ 1 ePD91      CASSIGLTEAFF        TCRBV19-01     TCRBJ… ILGTVSWNL,SN…          2
+ 2 eEE240     CATSRPMNTEAFF       TCRBV15-01     TCRBJ… AFLLFLVLI,FL…         11
+ 3 eHH175     CASSDGPGYEQYF       TCRBV12-03/12… TCRBJ… KMKDLSPRW              1
+ 4 eHO135     CASSLAGAQPQHF       TCRBV05-01     TCRBJ… LSPRWYFYY,SP…          2
+ 5 eEE228     CASSPTSGINNEQFF     TCRBV18-01     TCRBJ… KLSYGIATV              1
+ 6 eXL31      CASSKATGEGGNYEQYF   TCRBV21-01     TCRBJ… FLQSINFVR,FL…         13
+ 7 eOX49      CASSYGLAGGEETQYF    TCRBV07-03     TCRBJ… KLWAQCVQL              1
+ 8 eLH48      CASSQDVGRGVQETQYF   TCRBV03-01/03… TCRBJ… RIRGGDGKM,RI…          2
+ 9 eXL31      CASSHGTGGELFF       TCRBV27-01     TCRBJ… QLMCQPILL,QL…          2
+10 eEE226     CASSQDRVVAGGQGDTQYF TCRBV03-01/03… TCRBJ… APKEIIFL,KEI…          2</code></pre>
 </div>
 </div>
 <ul>
@@ -1064,16 +1064,16 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <pre><code># A tibble: 10 × 18
    Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4
    &lt;chr&gt;      &lt;chr&gt;        &lt;chr&gt;  &lt;chr&gt;  &lt;chr&gt;     &lt;chr&gt;     &lt;chr&gt;     &lt;chr&gt;    
- 1 eXL37      CASSLGQGADY… TCRBV… TCRBJ… VLWAHGFEL &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
- 2 eAV93      CASVGLAMDNE… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF
- 3 eOX54      CASSVADMNTE… TCRBV… TCRBJ… FVDGVPFVV &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
- 4 eOX52      CASRGLAKSSY… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…
- 5 eHO138     CASSQGFPGGV… TCRBV… TCRBJ… ALNTPKDHI ATEGALNT… &lt;NA&gt;      &lt;NA&gt;     
- 6 eAV88      CASSQNLNEKL… TCRBV… TCRBJ… LLDDFVEII LLLDDFVEI &lt;NA&gt;      &lt;NA&gt;     
- 7 eOX46      CASSLGGQGSN… TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF
- 8 eOX52      CSVDQDGIGEL… TCRBV… TCRBJ… KLSYGIATV &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
- 9 eOX54      CASSLVPSSGP… TCRBV… TCRBJ… FVDGVPFVV &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
-10 eQD114     CASSYRLATYE… TCRBV… TCRBJ… NSSPDDQI… NTNSSPDD… SSPDDQIGY SSPDDQIG…
+ 1 eOX52      CASSPPISYEQ… TCRBV… TCRBJ… ILGTVSWNL SNEKQEIL… &lt;NA&gt;      &lt;NA&gt;     
+ 2 eEE226     CASPFPGQGHE… TCRBV… TCRBJ… KLSYGIATV &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
+ 3 eQD124     CASRTLGAGEL… TCRBV… TCRBJ… HTTDPSFL… &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
+ 4 eAV88      CASSPGLDYNE… TCRBV… TCRBJ… KPLEFGAT… &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
+ 5 eEE224     CSVEGLPGRET… TCRBV… TCRBJ… APAHISTI  LIVNSVLL… LLFLAFVV… SVLLFLAFV
+ 6 eQD108     CASSATGALAS… TCRBV… TCRBJ… FIASFRLFA SYFIASFR… YFIASFRLF YFIASFRL…
+ 7 eXL27      CASSLGDSNTE… TCRBV… TCRBJ… ELYSPIFLI LYSPIFLIV QELYSPIFL VQELYSPIF
+ 8 eOX49      CASSLNHLGDR… TCRBV… TCRBJ… FVCNLLLL… LLFVTVYS… TVYSHLLLV &lt;NA&gt;     
+ 9 eAV93      CATTEGTANTE… TCRBV… TCRBJ… SSANNCTF… VYSSANNC… &lt;NA&gt;      &lt;NA&gt;     
+10 eOX46      CASSSSTAGEQ… TCRBV… TCRBJ… FPPTSFGPL &lt;NA&gt;      &lt;NA&gt;      &lt;NA&gt;     
 # ℹ 10 more variables: peptide_5 &lt;chr&gt;, peptide_6 &lt;chr&gt;, peptide_7 &lt;chr&gt;,
 #   peptide_8 &lt;chr&gt;, peptide_9 &lt;chr&gt;, peptide_10 &lt;chr&gt;, peptide_11 &lt;chr&gt;,
 #   peptide_12 &lt;chr&gt;, peptide_13 &lt;chr&gt;, n_peptides &lt;dbl&gt;</code></pre>
@@ -1112,18 +1112,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 7
-   Experiment CDR3b                V_gene    J_gene n_peptides peptide_n peptide
-   &lt;chr&gt;      &lt;chr&gt;                &lt;chr&gt;     &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;     &lt;chr&gt;  
- 1 eXL30      CASSYPLGYPEAFF       TCRBV06-… TCRBJ…          1 peptide_… &lt;NA&gt;   
- 2 eMR15      CASSLLTGGPVAKNIQYF   TCRBV27-… TCRBJ…          2 peptide_5 &lt;NA&gt;   
- 3 eMR26      CASSLAGVEQYF         TCRBV05-… TCRBJ…          2 peptide_7 &lt;NA&gt;   
- 4 eXL31      CASSSSHRDPEQYF       TCRBV27-… TCRBJ…          1 peptide_6 &lt;NA&gt;   
- 5 eEE243     CASSPSPARLAGGPSNEQFF TCRBV06-… TCRBJ…          1 peptide_5 &lt;NA&gt;   
- 6 eOX52      CASSPSTGGISYNEQFF    TCRBV07-… TCRBJ…          4 peptide_… &lt;NA&gt;   
- 7 eHH175     CASSLAGAYEQYF        TCRBV05-… TCRBJ…          2 peptide_4 &lt;NA&gt;   
- 8 eEE224     CAGQGWGQETQYF        TCRBV05-… TCRBJ…          3 peptide_3 KPFERD…
- 9 eOX52      CASSEIRAGPNQPQHF     TCRBV02-… TCRBJ…          1 peptide_8 &lt;NA&gt;   
-10 eOX49      CASSLVAVLTEAFF       TCRBV07-… TCRBJ…         11 peptide_… YLCFLA…</code></pre>
+   Experiment CDR3b                 V_gene   J_gene n_peptides peptide_n peptide
+   &lt;chr&gt;      &lt;chr&gt;                 &lt;chr&gt;    &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;     &lt;chr&gt;  
+ 1 eOX49      CASSRTNEQFF           TCRBV28… TCRBJ…          7 peptide_5 TLACFV…
+ 2 eHO130     CASSQATGALGYGYTF      TCRBV04… TCRBJ…         10 peptide_8 SIWNLD…
+ 3 ePD83      CASSTGQGLGYEQYF       TCRBV19… TCRBJ…          3 peptide_… &lt;NA&gt;   
+ 4 eLH47      CASSYPSEGASYNEQFF     TCRBV06… TCRBJ…          1 peptide_2 &lt;NA&gt;   
+ 5 eEE228     CSATDLAGVGEQYF        TCRBV20… TCRBJ…          7 peptide_9 &lt;NA&gt;   
+ 6 eEE226     CASSYSIHSLLGTGGTGELFF TCRBV06… TCRBJ…          1 peptide_… &lt;NA&gt;   
+ 7 eEE228     CASSPRGPSGIQETQYF     TCRBV07… TCRBJ…         11 peptide_1 AFLLFL…
+ 8 eEE240     CASSYFGGWGANVLTF      TCRBV11… TCRBJ…          1 peptide_7 &lt;NA&gt;   
+ 9 eQD111     CASSEGSGVVQPQHF       TCRBV10… TCRBJ…          1 peptide_3 &lt;NA&gt;   
+10 eQD125     CASRHSEGGVYDNEQFF     TCRBV05… TCRBJ…          1 peptide_9 &lt;NA&gt;   </code></pre>
 </div>
 </div>
 <ul>
@@ -1139,18 +1139,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 5
-   Experiment CDR3b           V_gene           J_gene     peptide    
-   &lt;chr&gt;      &lt;chr&gt;           &lt;chr&gt;            &lt;chr&gt;      &lt;chr&gt;      
- 1 eXL27      CASSFGGNEQFF    TCRBV07-08       TCRBJ02-01 LLFLVLIML  
- 2 eOX43      CASSFGTDTQYF    TCRBV27-01       TCRBJ02-03 YLCFLAFLL  
- 3 eOX43      CASRGAGYSSYEQYF TCRBV25-01       TCRBJ02-07 MIELSLIDFY 
- 4 eEE228     CASTSGPEQYF     TCRBV12-03/12-04 TCRBJ02-07 AFPFTIYSL  
- 5 eDH96      CASGFGLGDNEQFF  TCRBV05-01       TCRBJ02-01 YLCFLAFLL  
- 6 eAV93      CASSLLGAYEQYF   TCRBV27-01       TCRBJ02-07 QPYRVVVLSF 
- 7 eOX46      CASSPPTGAPYEQYF TCRBV18-01       TCRBJ02-07 VQPTESIVRF 
- 8 eQD111     CASSLSEGVTDTQYF TCRBV27-01       TCRBJ02-03 HTTDPSFLGRY
- 9 eOX52      CSAREVGRIEQYF   TCRBV20-X        TCRBJ02-07 IELSLIDFYL 
-10 eXL27      CASTFGGLAANEQFF TCRBV12-X        TCRBJ02-01 YINVFAFPF  </code></pre>
+   Experiment CDR3b               V_gene           J_gene     peptide   
+   &lt;chr&gt;      &lt;chr&gt;               &lt;chr&gt;            &lt;chr&gt;      &lt;chr&gt;     
+ 1 eOX54      CSATANTGELFF        TCRBV20-X        TCRBJ02-02 AFPFTIYSL 
+ 2 eOX52      CASSSGLASAYEQYF     TCRBV12-X        TCRBJ02-07 LLDDFVEII 
+ 3 eEE240     CATSEGSGANVLTF      TCRBV24-01       TCRBJ02-06 FLQSINFVR 
+ 4 eEE226     CSVEQASYEQYF        TCRBV29-01       TCRBJ02-07 SLIDFYLCFL
+ 5 eHH175     CASSQLNTGELFF       TCRBV03-01/03-02 TCRBJ02-02 FLYIIKLIFL
+ 6 eEE226     CSAWTSGETQYF        TCRBV20-X        TCRBJ02-05 FYLCFLAFLL
+ 7 eEE228     CASSHGPESGLGRNQPQHF TCRBV06-X        TCRBJ01-05 NVFAFPFTI 
+ 8 eOX43      CASSETGTGYEQYF      TCRBV02-01       TCRBJ02-07 SLIDFYLCFL
+ 9 eXL31      RASSLRRGAEQYF       TCRBV07-03       TCRBJ02-07 TLACFVLAAV
+10 eEE226     CASSLLTAEPEAFF      TCRBV07-09       TCRBJ01-01 TFKVSIWNL </code></pre>
 </div>
 </div>
 <ul>
@@ -1190,18 +1190,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb24-2"><a href="#cb24-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 7
-   Experiment CDR3b              V_gene     J_gene     peptide k_CDR3b k_peptide
-   &lt;chr&gt;      &lt;chr&gt;              &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;     &lt;int&gt;     &lt;int&gt;
- 1 ePD87      CASSRTHRQGRNTDTQYF TCRBV18-01 TCRBJ02-03 LSPRWY…      18         9
- 2 eEE240     CSASTQSETQYF       TCRBV20-01 TCRBJ02-05 LLFLVL…      12         9
- 3 eOX52      CASSLYGQLYQETQYF   TCRBV27-01 TCRBJ02-05 TLVPQE…      16         9
- 4 eEE226     CASSTRGNTIYF       TCRBV13-01 TCRBJ01-03 LWPVTL…      12         9
- 5 eAV91      CASSVVGSSYEQYF     TCRBV09-01 TCRBJ02-07 SEYKGP…      14        12
- 6 eOX54      CASRSPGSYNEQFF     TCRBV27-01 TCRBJ02-01 ILLIIM…      14        10
- 7 ePD76      CASSEARLAGEYEQYF   TCRBV02-01 TCRBJ02-07 TPSGTW…      16         9
- 8 eLH48      CASNLMNTEAFF       TCRBV05-06 TCRBJ01-01 FIASFR…      12         9
- 9 eEE228     CAILDRVVNTEAFF     TCRBV06-05 TCRBJ01-01 FLQSIN…      14         9
-10 eEE228     CASSFSETGELFF      TCRBV05-01 TCRBJ02-02 TLACFV…      13        10</code></pre>
+   Experiment CDR3b            V_gene           J_gene peptide k_CDR3b k_peptide
+   &lt;chr&gt;      &lt;chr&gt;            &lt;chr&gt;            &lt;chr&gt;  &lt;chr&gt;     &lt;int&gt;     &lt;int&gt;
+ 1 eXL31      CASSQVTGELFF     TCRBV03-01/03-02 TCRBJ… FLAFLL…      12         9
+ 2 eEE224     CASVAAGELFF      TCRBV06-06       TCRBJ… INFVRI…      11         9
+ 3 eQD114     CASSWEGAGDTDTQYF TCRBV06-06       TCRBJ… HTTDPS…      16        11
+ 4 eHO140     CASSPPDRGNYGYTF  TCRBV18-01       TCRBJ… QSINFV…      15         9
+ 5 eXL30      CSVDTGGGTEAFF    TCRBV29-01       TCRBJ… YLCFLA…      13         9
+ 6 eHO140     CASSLSLPIGDEQYF  TCRBV27-01       TCRBJ… LWPVTL…      15         9
+ 7 eMR12      CASSFENTGELFF    TCRBV28-01       TCRBJ… HTTDPS…      13        11
+ 8 eXL30      CASSPGGDTQYF     TCRBV06-05       TCRBJ… NVFAFP…      12         9
+ 9 eEE226     CASSPAGTPTF      TCRBV12-03/12-04 TCRBJ… GYINVF…      11        10
+10 eOX52      CASSWGLAGADTQYF  TCRBV07-03       TCRBJ… LYSPIF…      15         9</code></pre>
 </div>
 </div>
 <ul>
@@ -1230,18 +1230,18 @@ <h4 class="anchored" data-anchor-id="the-peptide-details-data">The Peptide Detai
 <span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 7
-   Experiment CDR3b              V_gene         J_gene peptide k_CDR3b k_peptide
-   &lt;chr&gt;      &lt;chr&gt;              &lt;chr&gt;          &lt;chr&gt;  &lt;chr&gt;     &lt;int&gt;     &lt;int&gt;
- 1 eAV93      CASSLAVGTGYYEQYF   TCRBV05-04     TCRBJ… LTDEMI…      16         9
- 2 eAV88      CASSSRSGSANTGELFF  TCRBV28-01     TCRBJ… GEIPVA…      17        12
- 3 eXL30      CASSLSTANPSTDTQYF  TCRBV04-01     TCRBJ… YLCFLA…      17         9
- 4 eEE228     CASYPTGTGFFGYYGYTF TCRBV27-01     TCRBJ… VQPTES…      18        10
- 5 eOX54      CASTYESLYGYTF      TCRBV12-03/12… TCRBJ… NVFAFP…      13         9
- 6 eAV88      CASSSTPLTGGTYEQYF  TCRBV07-09     TCRBJ… RQLLFV…      17         9
- 7 eHH175     CATVGETYEQYF       TCRBV28-01     TCRBJ… MPASWV…      12         9
- 8 eLH41      CASSTGPGSEKLFF     TCRBV06-05     TCRBJ… KTFPPT…      14         9
- 9 eAV93      CASSIDYSSNQPQHF    TCRBV19-01     TCRBJ… SINFVR…      15        10
-10 eXL31      CSATPGGLEQFF       TCRBV20-X      TCRBJ… FLWLLW…      12         9</code></pre>
+   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide
+   &lt;chr&gt;      &lt;chr&gt;            &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;       &lt;int&gt;     &lt;int&gt;
+ 1 eHH175     CASSLSIAGTGGEQYF TCRBV27-01 TCRBJ02-07 MIELSLID…      16        10
+ 2 eQD125     CASSFRGPQETQYF   TCRBV27-01 TCRBJ02-05 HTTDPSFL…      14        11
+ 3 eQD125     CASRLTGTVAYEQYF  TCRBV28-01 TCRBJ02-07 VLHSYFTS…      15        10
+ 4 eEE240     CASRRTTDTQYF     TCRBV27-01 TCRBJ02-03 LIDFYLCFL      12         9
+ 5 eEE226     CASSNPGQGEETQYF  TCRBV12-X  TCRBJ02-05 LPFNDGVYF      15         9
+ 6 eAV93      CASSSGTGGDTEAFF  TCRBV07-09 TCRBJ01-01 RSVASQSII      15         9
+ 7 eHO135     CAITPGSTGELFF    TCRBV10-03 TCRBJ02-02 LWPVTLACF      13         9
+ 8 eMR16      CASSLGRLAQLNEQFF TCRBV11-02 TCRBJ02-01 FLYLYALV…      16        10
+ 9 eXL30      CASRLTGTGELFF    TCRBV06-05 TCRBJ02-02 GYINVFAF…      13        10
+10 eXL27      CSAEREGLYNEQFF   TCRBV20-X  TCRBJ02-01 IELSLIDF…      14        10</code></pre>
 </div>
 </div>
 </section>
@@ -1255,16 +1255,16 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <pre><code># A tibble: 10 × 11
    Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
    &lt;chr&gt;      &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
- 1 eHO130     Healthy (N…    28 F      White A*02… A*03… B*07… B*08… C*07… C*07…
- 2 eLH48      COVID-19-C…    28 M      White A*03… A*24… B*08… B*14… C*03… C*08…
- 3 eOX43      Healthy (N…    24 M      White A*02… A*03… B*27… B*40… C*03… C*07…
- 4 ePD76      Healthy (N…    33 M      White A*02… A*03… B*35… B*40… C*03… C*03…
- 5 eQD115     COVID-19-C…    48 M      &lt;NA&gt;  A*02… A*03… B*07… B*44… C*05… C*07…
- 6 eAV100     COVID-19-C…    29 F      &lt;NA&gt;  A*02… A*68… B*07… B*40… C*03… C*07…
- 7 eQD109     COVID-19-C…    61 M      &lt;NA&gt;  A*03… A*69… B*07… B*07… C*07… C*07…
- 8 eLH59      COVID-19-C…    NA &lt;NA&gt;   &lt;NA&gt;  A*01… A*02… B*40… B*52… C*03… C*16…
- 9 eMR15      COVID-19-C…    NA &lt;NA&gt;   &lt;NA&gt;  A*03… A*32… B*07… B*07… C*07… C*07…
-10 eAM23      COVID-19-C…    48 M      &lt;NA&gt;  A*11… A*24… B*15… B*52… C*04… C*12…</code></pre>
+ 1 eHO125     COVID-19-C…    52 M      &lt;NA&gt;  A*02… A*02… B*39… B*44… C*07… C*07…
+ 2 eHH169     Healthy (N…    24 F      Blac… A*02… A*74… B*35… B*35… C*04… C*04…
+ 3 eHO129     COVID-19-C…    66 F      Asian A*24… A*24… B*15… B*40… C*08… C*15…
+ 4 eLH47      COVID-19-C…    35 F      White A*01… A*02… B*07… B*08… C*07… C*07…
+ 5 ePD85      Healthy (N…    27 F      &lt;NA&gt;  A*02… A*29… B*07… B*18… C*07… C*15…
+ 6 ePD80      COVID-19-C…    67 M      &lt;NA&gt;  A*02… A*66… B*15… B*41… C*03… C*17…
+ 7 eJL149     COVID-19-C…    60 F      &lt;NA&gt;  A*02… A*02… B*44… B*50… C*06… C*16…
+ 8 eQD109     COVID-19-C…    61 M      &lt;NA&gt;  A*03… A*69… B*07… B*07… C*07… C*07…
+ 9 eEE226     Healthy (N…    21 F      White A*01… A*02… B*35… B*39… C*04… C*07…
+10 eJL154     COVID-19-E…    35 F      Nati… A*02… A*29… B*15… B*44… C*04… C*16…</code></pre>
 </div>
 </div>
 <p>Remember you can scroll in the data.</p>
@@ -1285,16 +1285,16 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <pre><code># A tibble: 10 × 7
    Experiment Cohort                        Age Gender Race         Gene  Allele
    &lt;chr&gt;      &lt;chr&gt;                       &lt;dbl&gt; &lt;chr&gt;  &lt;chr&gt;        &lt;chr&gt; &lt;chr&gt; 
- 1 eLH47      COVID-19-Convalescent          35 F      White        A2    "A*02…
- 2 eJL160     COVID-19-Acute                 52 F      African Ame… B2    "B*81…
- 3 eAV100     COVID-19-Convalescent          29 F      &lt;NA&gt;         C2    "C*07…
- 4 eLH51      COVID-19-Convalescent          55 M      Asian        A1    "A*24…
- 5 eMR17      COVID-19-Convalescent          NA &lt;NA&gt;   &lt;NA&gt;         B2    "B*57…
- 6 eQD121     COVID-19-Convalescent          38 M      &lt;NA&gt;         C2    "C*07…
- 7 eNL192     COVID-19-Convalescent          NA &lt;NA&gt;   &lt;NA&gt;         C1    ""    
- 8 eMR23      COVID-19-Convalescent          22 F      &lt;NA&gt;         A1    ""    
- 9 eOX43      Healthy (No known exposure)    24 M      White        B1    "B*27…
-10 eLH42      COVID-19-Convalescent          63 M      &lt;NA&gt;         B1    "B*07…</code></pre>
+ 1 eDH107     COVID-19-Convalescent          72 F      &lt;NA&gt;         A2    "A*03…
+ 2 eQD117     COVID-19-Convalescent          70 F      &lt;NA&gt;         B1    "B*35…
+ 3 eAV100     COVID-19-Convalescent          29 F      &lt;NA&gt;         C1    "C*03…
+ 4 eHO138     COVID-19-B-Non-Acute           NA &lt;NA&gt;   &lt;NA&gt;         A2    ""    
+ 5 eJL149     COVID-19-Convalescent          60 F      &lt;NA&gt;         C1    "C*06…
+ 6 eMR25      COVID-19-Convalescent          21 F      &lt;NA&gt;         C1    ""    
+ 7 eQD113     COVID-19-Convalescent          36 M      &lt;NA&gt;         A1    "A*03…
+ 8 eHH169     Healthy (No known exposure)    24 F      Black or Af… A1    "A*02…
+ 9 eOX43      Healthy (No known exposure)    24 M      White        A1    "A*02…
+10 eQD127     COVID-19-Convalescent          61 F      &lt;NA&gt;         C1    "C*02…</code></pre>
 </div>
 </div>
 <p>Remember, what we are aiming for here is to create one data set from two. So:</p>
@@ -1310,18 +1310,18 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <span id="cb32-2"><a href="#cb32-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 2
-   Experiment Allele      
-   &lt;chr&gt;      &lt;chr&gt;       
- 1 eJL157     "C*07:01:01"
- 2 eHO135     "B*07:02:01"
- 3 eHO141     ""          
- 4 eJL153     "A*03:01:01"
- 5 eMR20      "C*07:02:01"
- 6 eJL154     "B*15:02:01"
- 7 ePD82      "A*26:02:01"
- 8 eQD111     "A*01:01:01"
- 9 eMR22      "C*07:18:01"
-10 eJL146     "A*02:01"   </code></pre>
+   Experiment Allele    
+   &lt;chr&gt;      &lt;chr&gt;     
+ 1 eQD108     A*68:01:02
+ 2 eHO130     B*08:01   
+ 3 ePD82      C*14:03:01
+ 4 ePD83      C*03:04   
+ 5 eQD116     C*04:01:01
+ 6 eQD123     A*02:01:01
+ 7 eQD112     C*07:02:01
+ 8 eOX43      C*03:04   
+ 9 eHO134     C*07:01:01
+10 eLH45      A*02:01:01</code></pre>
 </div>
 </div>
 <p>Use the <code>View()</code> function again to look at the <code>meta_data</code>. Notice something? Some alleles are e.g.&nbsp;<code>A*11:01</code>, whereas others are <code>B*51:01:02</code>. You can find out why by visiting <a href="http://hla.alleles.org/nomenclature/naming.html">Nomenclature for Factors of the HLA System</a>.</p>
@@ -1347,16 +1347,16 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <pre><code># A tibble: 10 × 3
    Experiment Allele     Allele_F_1_2
    &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;       
- 1 eJL157     C*07:02:01 C*07:02     
- 2 eQD118     C*03:04:01 C*03:04     
- 3 eMR13      C*07:01:01 C*07:01     
- 4 eLH54      B*40:02:01 B*40:02     
- 5 ePD82      C*08:01:01 C*08:01     
- 6 eQD121     B*57:01:01 B*57:01     
- 7 eAV88      C*07:04    C*07:04     
- 8 eDH105     B*40:01:02 B*40:01     
- 9 ePD86      C*14:02:01 C*14:02     
-10 eOX46      C*04:01    C*04:01     </code></pre>
+ 1 eQD128     B*39:01:01 B*39:01     
+ 2 eOX46      A*02:01    A*02:01     
+ 3 eLH45      C*12:03:01 C*12:03     
+ 4 eQD120     A*31:01:02 A*31:01     
+ 5 ePD81      B*40:02:01 B*40:02     
+ 6 eXL27      C*07:04    C*07:04     
+ 7 ePD79      B*07:02:01 B*07:02     
+ 8 eDH105     A*24:02:01 A*24:02     
+ 9 eAV91      C*05:01    C*05:01     
+10 eEE240     B*40:01    B*40:01     </code></pre>
 </div>
 </div>
 <p>The asterisk, i.e.&nbsp;<code>*</code>, is a rather annoying character because it is ambiguous (it also has a special meaning in regular expressions), so:</p>
@@ -1373,16 +1373,16 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <pre><code># A tibble: 10 × 2
    Experiment Allele
    &lt;chr&gt;      &lt;chr&gt; 
- 1 eDH96      A02:01
- 2 eQD127     C02:02
- 3 eJL162     B55:01
- 4 eLH51      C12:04
- 5 eHO133     A32:01
- 6 eJL157     B18:01
- 7 eQD123     C07:02
- 8 eLH59      A02:01
- 9 eXL32      A01:01
-10 eLH45      C12:03</code></pre>
+ 1 eLH43      B44:03
+ 2 eJL147     A11:01
+ 3 eHH169     A02:01
+ 4 eJL154     C16:01
+ 5 eQD119     C07:01
+ 6 eJL143     C08:02
+ 7 eHH169     B35:01
+ 8 eHO125     C07:01
+ 9 eOX52      A02:01
+10 eLH48      B08:01</code></pre>
 </div>
 </div>
 <details>
@@ -1407,18 +1407,18 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 7
-   Experiment CDR3b                  V_gene     J_gene peptide k_CDR3b k_peptide
-   &lt;chr&gt;      &lt;chr&gt;                  &lt;chr&gt;      &lt;chr&gt;  &lt;chr&gt;     &lt;int&gt;     &lt;int&gt;
- 1 eEE240     CASSQRSNTGELFF         TCRBV28-01 TCRBJ… AFLLFL…      14         9
- 2 eAV93      CATSDPPGWGQGAAYSNQPQHF TCRBV24-01 TCRBJ… TLACFV…      22        10
- 3 eOX54      CSASKLDSNNEQFF         TCRBV20-01 TCRBJ… SLIDFY…      14        10
- 4 eXL27      CASSPSGAGEQFF          TCRBV27-01 TCRBJ… FLWLLW…      13         9
- 5 eXL27      CASSDPFSGFYEQYF        TCRBV05-01 TCRBJ… VYFLQS…      15         9
- 6 eOX49      CASSGAGSNQPQHF         TCRBV09-01 TCRBJ… LLLDDF…      14         9
- 7 eEE228     CASRTGGSSYNEQFF        TCRBV19-01 TCRBJ… IELSLI…      15        10
- 8 eHO135     CASSLRSNQPQHF          TCRBV27-01 TCRBJ… ITLATC…      13         9
- 9 eEE226     CASSFSDYEQYF           TCRBV05-06 TCRBJ… FLNGSC…      12         9
-10 eHO124     CATSEALQETQYF          TCRBV24-01 TCRBJ… KVFRSS…      13         9</code></pre>
+   Experiment CDR3b             V_gene     J_gene     peptide  k_CDR3b k_peptide
+   &lt;chr&gt;      &lt;chr&gt;             &lt;chr&gt;      &lt;chr&gt;      &lt;chr&gt;      &lt;int&gt;     &lt;int&gt;
+ 1 eXL30      CASSLEISYEQYF     TCRBV05-01 TCRBJ02-07 VPHVGEI…      13        11
+ 2 eOX54      CASSASMSDTQYF     TCRBV09-01 TCRBJ02-03 KLSYGIA…      13         9
+ 3 eQD111     CASSELAGADTQYF    TCRBV06-01 TCRBJ02-03 HTTDPSF…      14        11
+ 4 eOX49      CSAHFPGQGFGEQFF   TCRBV20-X  TCRBJ02-01 YLCFLAF…      15         9
+ 5 eHO128     CASSLQSPSSAGNEQFF TCRBV27-01 TCRBJ02-01 QSINFVR…      17         9
+ 6 eOX49      CASSLWGDNEQFF     TCRBV27-01 TCRBJ02-01 FYLCFLA…      13         9
+ 7 eEE240     CASSFYSSGGAEGEQFF TCRBV27-01 TCRBJ02-01 LEYHDVR…      17         9
+ 8 eEE228     CASSTKGRTNTGELFF  TCRBV27-01 TCRBJ02-02 LIVNSVL…      16        10
+ 9 eOX43      CASRGLAGDNSYEQYF  TCRBV25-01 TCRBJ02-07 SLIDFYL…      16        10
+10 eOX52      CASSRGTGSEQYF     TCRBV19-01 TCRBJ02-07 FLQSINF…      13         9</code></pre>
 </div>
 </div>
 <ul>
@@ -1448,18 +1448,18 @@ <h4 class="anchored" data-anchor-id="creating-one-data-set-from-two-data-sets">C
 <span id="cb40-2"><a href="#cb40-2" aria-hidden="true" tabindex="-1"></a>  <span class="fu">sample_n</span>(<span class="dv">10</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
 <pre><code># A tibble: 10 × 8
-   Experiment CDR3b              V_gene  J_gene peptide k_CDR3b k_peptide Allele
-   &lt;chr&gt;      &lt;chr&gt;              &lt;chr&gt;   &lt;chr&gt;  &lt;chr&gt;     &lt;int&gt;     &lt;int&gt; &lt;chr&gt; 
- 1 eHO134     CASSDRTPQETQYF     TCRBV2… TCRBJ… HTTDPS…      14        11 A24:02
- 2 eEE224     CSALGLEVNEQYF      TCRBV2… TCRBJ… FYLCFL…      13         9 A02:01
- 3 eEE226     CASSLGPDGYNEQFF    TCRBV0… TCRBJ… ITEEVG…      15        14 A02:01
- 4 eEE224     CASSFSGLSYEQYF     TCRBV0… TCRBJ… IDFYLC…      14        10 C07:04
- 5 eHH175     CASSQVQGVRSGANVLTF TCRBV0… TCRBJ… IPTNFT…      18         9 C07:02
- 6 eDH105     CASSRGTSRNTEAFF    TCRBV1… TCRBJ… LALLLL…      15         9 B48:01
- 7 eEE226     CSVVGTSGGHEQYF     TCRBV2… TCRBJ… DGVYFA…      14        10 B35:02
- 8 eQD128     CASSSPSGGINEQFF    TCRBV1… TCRBJ… YLCFLA…      15         9 A02:10
- 9 eOX43      CASSLRGTSYGYTF     TCRBV1… TCRBJ… FLPRVF…      14         9 C07:04
-10 eOX46      CASSWDNYNEQFF      TCRBV0… TCRBJ… YIIKLI…      13         9 A02:01</code></pre>
+   Experiment CDR3b             V_gene   J_gene peptide k_CDR3b k_peptide Allele
+   &lt;chr&gt;      &lt;chr&gt;             &lt;chr&gt;    &lt;chr&gt;  &lt;chr&gt;     &lt;int&gt;     &lt;int&gt; &lt;chr&gt; 
+ 1 eEE240     CASSYGQGTPLHF     TCRBV06… TCRBJ… FPQSAP…      13         9 A02:01
+ 2 eOX52      CATSDFSGSNTGELFF  TCRBV24… TCRBJ… LWPVTL…      16         9 B40:01
+ 3 eEE224     CASSTQGSGELFF     TCRBV27… TCRBJ… IELSLI…      13        10 C07:04
+ 4 eQD124     CASRTGSNQPQHF     TCRBV06… TCRBJ… ASAFFG…      13         9 B51:01
+ 5 eOX49      CASSQDSKLTGSYEQYF TCRBV14… TCRBJ… YLYALV…      17         9 A02:01
+ 6 eLH47      CASDGAGGYTF       TCRBV07… TCRBJ… WLLWPV…      11         9 C07:02
+ 7 eOX52      CASSLVQGAYNEQFF   TCRBV05… TCRBJ… LLFLVL…      15         9 B15:17
+ 8 eEE226     CASSLDGTPGNTIYF   TCRBV11… TCRBJ… DGVYFA…      15        10 C07:02
+ 9 eXL30      CASNFFPGLDNEQFF   TCRBV02… TCRBJ… APKEII…      15         8 B35:02
+10 eAV93      CASSFTGLSYEQYF    TCRBV05… TCRBJ… VLPFND…      14        10 C04:01</code></pre>
 </div>
 </div>
 </section>
diff --git a/docs/lab06_files/figure-html/unnamed-chunk-27-1.png b/docs/lab06_files/figure-html/unnamed-chunk-27-1.png
index dd7d219..22c2ef6 100644
Binary files a/docs/lab06_files/figure-html/unnamed-chunk-27-1.png and b/docs/lab06_files/figure-html/unnamed-chunk-27-1.png differ
diff --git a/docs/primer_on_linear_models_in_r.html b/docs/primer_on_linear_models_in_r.html
index f482fb7..3e610c8 100644
--- a/docs/primer_on_linear_models_in_r.html
+++ b/docs/primer_on_linear_models_in_r.html
@@ -395,7 +395,7 @@ <h3 class="anchored" data-anchor-id="data">Data</h3>
 <div class="cell">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">run_simulation</span>(<span class="at">temp =</span> <span class="fu">c</span>(<span class="dv">15</span>, <span class="dv">20</span>, <span class="dv">25</span>, <span class="dv">30</span>, <span class="dv">35</span>))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
-<pre><code>[1] 26.90906 42.74464 50.93029 69.31215 71.14476</code></pre>
+<pre><code>[1] 30.52223 38.13767 54.47297 63.80020 72.71315</code></pre>
 </div>
 </div>
 <p>Let’s go ahead and create some data we can work with. For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:</p>
diff --git a/docs/search.json b/docs/search.json
index 0bc310d..5745535 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -116,7 +116,7 @@
     "href": "lab02.html#schedule",
     "title": "Lab 2: Data Visualisation I",
     "section": "Schedule",
-    "text": "Schedule\n\n08.00 - 08.15: Pre-course Survey Walk-through\n08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session)\n08.30 - 09.00: Lecture\n09.00 - 09.15: Break\n09.00 - 12.00: Exercises"
+    "text": "Schedule\n\n08.00 - 08.15: pre-course anonymous questionnaire Walk-through\n08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session)\n08.30 - 09.00: Lecture\n09.00 - 09.15: Break\n09.00 - 12.00: Exercises"
   },
   {
     "objectID": "lab02.html#learning-materials",
@@ -340,7 +340,7 @@
     "href": "lab05.html#creating-the-micro-report",
     "title": "Lab 5: Data Wrangling II",
     "section": "Creating the Micro-Report",
-    "text": "Creating the Micro-Report\n\nBackground\nFeel free to copy paste the one stated in the background-section above\n\n\nAim\nState the aim of the micro-report, i.e. what are the questions you are addressing?\n\n\nLoad Libraries\n\n\n\nLoad the libraries needed\n\n\nLoad Data\nRead the two data sets into variables peptide_data and meta_data.\n\n\n\nClick here for hint\n\n\nThink about which Tidyverse package deals with reading data and what are the file types we want to read here?\n\n\n\n\n\n\nData Description\nIt is customary to include a description of the data, helping the reader if the report, i.e. your stakeholder, to get an easy overview\n\nThe Subject Meta Data\nLet’s take a look at the meta data:\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 30\n   Experiment Subject `Cell Type` `Target Type` Cohort          Age Gender Race \n   <chr>        <dbl> <chr>       <chr>         <chr>         <dbl> <chr>  <chr>\n 1 eMR12         1770 PBMC        C19_cI        COVID-19-Con…    NA <NA>   <NA> \n 2 eHO132          26 PBMC        C19_cI        COVID-19-Con…    65 F      White\n 3 eQD109        1349 PBMC        C19_cI        COVID-19-Con…    61 M      <NA> \n 4 eEE240       20795 naive_CD8   C19_cI        Healthy (No …    23 M      White\n 5 eQD131        2267 PBMC        C19_cI        COVID-19-Exp…    NA <NA>   <NA> \n 6 ePD80         1027 PBMC        C19_cI        COVID-19-Con…    67 M      <NA> \n 7 eJL154          83 PBMC        C19_cI        COVID-19-Exp…    35 F      Nati…\n 8 eNL187        2686 B-CD8-_PBMC C19_cII       COVID-19-Con…    NA <NA>   <NA> \n 9 eOX54        10881 naive_CD8   C19_cI        Healthy (No …    39 F      Afri…\n10 eOX49        10943 naive_CD8   C19_cI        Healthy (No …    21 M      White\n# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,\n#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,\n#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,\n#   DPB1...18 <chr>, DQA1...19 
<chr>, DQA1...20 <chr>, DQB1...21 <chr>,\n#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,\n#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,\n#   DRB5...30 <chr>\n\n\n\nQ1: How many observations of how many variables are in the data?\nQ2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?\nT1: Re-create this plot\n\nRead this first:\n\nThink about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some counting. Recall, that we can stick together a dplyr pipeline with a call to ggplot, so here we will have to count of Cohort and Gender before plotting\n\n\n\n\n\n\nDoes your plot look different somehow? Consider peeking at the hint…\n\n\n\nClick here for hint\n\n\nPerhaps not everyone agrees on how to denote NAs in data. I have seen -99, -11, _ and so on… Perhaps this can be dealt with in the instance we read the data from the file? I.e. in the actual function call to your read() function. Recall, how can we get information on the parameters of a ?function\n\n\nT2: Re-create this plot\n\n\n\n\n\n\n\n\n\nClick here for hint\n\n\nPerhaps there is a function, which can cut continuous observations into a set of bins?\n\n\nSTOP! Make sure you handled how NAs are denoted in the data before proceeding, see hint below T1\n\nT3: Look at the data and create yet another plot as you see fit. 
Also skip the redundant variables Subject, Cell Type and Target Type\n\n\n\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 27\n   Experiment Cohort      Age Gender Race  `HLA-A...9` `HLA-A...10` `HLA-B...11`\n   <chr>      <chr>     <dbl> <chr>  <chr> <chr>       <chr>        <chr>       \n 1 eHO126     COVID-19…    37 F      <NA>  \"A*01:01:0… \"A*24:02:01\" \"B*07:02:01\"\n 2 eJL160     COVID-19…    52 F      Afri… \"A*01:01:0… \"A*02:01:01\" \"B*44:02:01\"\n 3 eNL192     COVID-19…    NA <NA>   <NA>  \"\"          \"\"           \"\"          \n 4 eHO130     Healthy …    28 F      White \"A*02:01\"   \"A*03:01\"    \"B*07:02\"   \n 5 eLH45      COVID-19…    53 M      <NA>  \"A*02:01:0… \"A*03:01:01\" \"B*07:02:01\"\n 6 eQD134     COVID-19…    NA <NA>   <NA>  \"A*24:07:0… \"A*34:01:01\" \"B*15:02:01\"\n 7 eJL161     COVID-19…    31 F      White \"A*01:01:0… \"A*02:01:01\" \"B*08:01:01\"\n 8 eQD116     COVID-19…    66 F      <NA>  \"A*03:01:0… \"A*11:01:01\" \"B*35:01:01\"\n 9 ePD86      COVID-19…    58 M      White \"A*02:01:0… \"A*26:01:01\" \"B*44:27:01\"\n10 eLH47      COVID-19…    35 F      White \"A*01:01:0… \"A*02:01:01\" \"B*07:02:01\"\n# ℹ 19 more variables: `HLA-B...12` <chr>, `HLA-C...13` <chr>,\n#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,\n#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,\n#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,\n#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,\n#   DRB5...30 <chr>\n\n\nNow, a classic way of describing a cohort, i.e. 
the group of subjects used for the study, is the so-called table1 and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to “cheat” and use an R-package, like so:\nNB!: This may look a bit odd initially, but if you render your document, you should be all good!\n\nlibrary(\"table1\") # <= Yes, this should normally go at the beginning!\nmeta_data |>\n  mutate(Gender = factor(Gender),\n         Cohort = factor(Cohort)) |>\n  table1(x = formula(~ Gender + Age + Race | Cohort),\n         data = _)\n\n\n\n\n\n\nCOVID-19-Acute(N=4)\nCOVID-19-B-Non-Acute(N=8)\nCOVID-19-Convalescent(N=90)\nCOVID-19-Exposed(N=3)\nHealthy (No known exposure)(N=39)\nOverall(N=144)\n\n\n\n\nGender\n\n\n\n\n\n\n\n\nF\n1 (25.0%)\n4 (50.0%)\n33 (36.7%)\n1 (33.3%)\n17 (43.6%)\n56 (38.9%)\n\n\nM\n2 (50.0%)\n3 (37.5%)\n36 (40.0%)\n0 (0%)\n21 (53.8%)\n62 (43.1%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n1 (2.6%)\n26 (18.1%)\n\n\nAge\n\n\n\n\n\n\n\n\nMean (SD)\n50.7 (17.0)\n43.7 (7.74)\n51.5 (15.3)\n35.0 (NA)\n33.3 (9.93)\n44.9 (15.7)\n\n\nMedian [Min, Max]\n52.0 [33.0, 67.0]\n42.0 [33.0, 53.0]\n53.0 [21.0, 79.0]\n35.0 [35.0, 35.0]\n31.0 [21.0, 62.0]\n42.0 [21.0, 79.0]\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n0 (0%)\n25 (17.4%)\n\n\nRace\n\n\n\n\n\n\n\n\nAfrican American\n1 (25.0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n2 (1.4%)\n\n\nWhite\n2 (50.0%)\n7 (87.5%)\n13 (14.4%)\n0 (0%)\n28 (71.8%)\n50 (34.7%)\n\n\nAsian\n0 (0%)\n0 (0%)\n3 (3.3%)\n0 (0%)\n2 (5.1%)\n5 (3.5%)\n\n\nHispanic or Latino/a\n0 (0%)\n0 (0%)\n1 (1.1%)\n0 (0%)\n0 (0%)\n1 (0.7%)\n\n\nNative Hawaiian or Other Pacific Islander\n0 (0%)\n0 (0%)\n0 (0%)\n1 (33.3%)\n0 (0%)\n1 (0.7%)\n\n\nBlack or African American\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n3 (7.7%)\n3 (2.1%)\n\n\nMixed Race\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n1 (0.7%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n73 (81.1%)\n2 (66.7%)\n4 (10.3%)\n81 (56.3%)\n\n\n\n\n\n\nNote how good this looks! 
If you have ever done a “Table 1” before, you know how painful they can be, especially if something changes in your cohort - Dynamic reporting to the rescue!\nLastly, before we proceed, the meta_data contains HLA data for both class I and class II (see background), but here we are only interested in class I, recall these are denoted HLA-A, HLA-B and HLA-C, so make sure to remove any non-class I, i.e. the ones after, denoted D-something.\n\nT4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 and HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2 (Think: How can we rename variables? And here, just do it “manually” per variable). Remember to assign this new data to the same meta_data variable\n\n\n\n\nClick here for hint\n\n\nWhich tidyverse function subsets variables? Perhaps there is a function, which somehow matches a set of variables? And perhaps for the initiated this is compatible with regular expressions (If you don’t know what this means - No worries! 
If you do, see if you utilise this to simplify your variable selection)\n\n\n\n\nBefore we proceed, this is the data we will carry on with:\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 11\n   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   \n   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n 1 eQD114     COVID-19-C…    73 M      <NA>  \"A*0… \"A*2… \"B*0… \"B*4… \"C*0… \"C*1…\n 2 eNL189     COVID-19-E…    NA <NA>   <NA>  \"\"    \"\"    \"\"    \"\"    \"\"    \"\"   \n 3 eQD112     COVID-19-C…    65 M      <NA>  \"A*2… \"A*2… \"B*0… \"B*3… \"C*0… \"C*0…\n 4 eLH46      COVID-19-C…    57 F      White \"A*0… \"A*2… \"B*3… \"B*5… \"C*1… \"C*1…\n 5 eAV88      Healthy (N…    24 M      White \"A*0… \"A*0… \"B*2… \"B*4… \"C*0… \"C*0…\n 6 eNL187     COVID-19-C…    NA <NA>   <NA>  \"\"    \"\"    \"\"    \"\"    \"\"    \"\"   \n 7 eJL154     COVID-19-E…    35 F      Nati… \"A*0… \"A*2… \"B*1… \"B*4… \"C*0… \"C*1…\n 8 eQD124     COVID-19-B…    40 F      White \"A*0… \"A*0… \"B*1… \"B*5… \"C*0… \"C*0…\n 9 eQD134     COVID-19-C…    NA <NA>   <NA>  \"A*2… \"A*3… \"B*1… \"B*1… \"C*0… \"C*0…\n10 ePD83      Healthy (N…    29 F      Asian \"A*0… \"A*0… \"B*1… \"B*4… \"C*0… \"C*0…\n\n\nNow, we have a beautiful tidy dataset, recall that this entails, that each row is an observation, each column is a variable and each cell holds one value.\n\n\n\nThe Peptide Details Data\nLet’s start with simply having a look see:\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   `TCR BioIdentity`            TCR Nucleotide Seque…¹ Experiment `ORF Coverage`\n   <chr>                        <chr>                  <chr>      <chr>         \n 1 CSGQQGYEQYF+TCRBV29-01+TCRB… ACTCTGACTGTGAGCAACATG… eXL30      surface glyco…\n 2 CASSPRTTPAPQHF+TCRBV19-01+T… GTGACATCGGCCCAAAAGAAC… eEE224     membrane glyc…\n 3 CASSEVGTLEAFF+TCRBV25-01+TC… ACCCTGGAGTCTGCCAGGCCC… eEE228     membrane glyc…\n 4 CSARLGQGSYEQYF+TCRBV20-X+TC… 
GTGACCAGTGCCCATCCTGAA… eXL30      surface glyco…\n 5 CASSEGLGGYEQYF+TCRBV06-01+T… NNNNTGTCGGCTGCTCCCTCC… eEE240     ORF3a         \n 6 CASSHLDRGSYNEQFF+TCRBV04-01… GCCCTGCAGCCAGAAGACTCA… eEE226     surface glyco…\n 7 CASSERDPRQETQYF+TCRBV27-01+… GAGTCGCCCAGCCCCAACCAG… eQD114     ORF1ab        \n 8 CASSVGGRSYEQYF+TCRBV09-01+T… CTGAGCTCTCTGGAGCTGGGG… ePD82      ORF3a         \n 9 CASSPAPIAYEQYF+TCRBV06-05+T… NNNNTGTCGGCTGCTCCCTCC… eQD131     surface glyco…\n10 CASSQETANTGELFF+TCRBV04-02+… CACACCCTGCAGCCAGAAGAC… eLH51      nucleocapsid …\n# ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`\n# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,\n#   `End Index in Genome` <dbl>\n\n\n\nQ3: How many observations of how many variables are in the data?\n\nThis is a rather big data set, so let us start with two “tricks” to handle this, first:\n\nWrite the data back into your data folder, using the filename peptide-detail-ci.csv.gz, note the appending of .gz, which is automatically recognised and results in gz-compression\nNow, check in your data folder, that you have two files peptide-detail-ci.csv and peptide-detail-ci.csv.gz, delete the former\nAdjust your reading-the-data-code in the “Load Data”-section, to now read in the peptide-detail-ci.csv.gz file\n\n\n\n\nClick here for hint\n\n\nJust as you can read a file, you can of course also write a file. Note the filetype we want to write here is csv. If you in the console type e.g. readr::wr and then hit the Tab key, you will see the different functions for writing different filetypes\n\nThen:\n\nT5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids. Remember to assign this new data to the same peptide_data variable to avoid cluttering your environment with redundant variables. 
Bonus: Did you know you can click the Environment pane and see which variables you have?\n\n\n\n\nOnce again, before we proceed, this is the data we will carry on with:\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 3\n   Experiment `TCR BioIdentity`                           `Amino Acids`         \n   <chr>      <chr>                                       <chr>                 \n 1 eAV93      CASSILLAGGTDTQYF+TCRBV27-01+TCRBJ02-03      FLWLLWPVT,FLWLLWPVTL,…\n 2 eMR15      CASSLIQGANTEAFF+TCRBV07-09+TCRBJ01-01       CPDGVKHVY,DGVKHVYQL,F…\n 3 eQD123     CASSPQGAGSLYEQYF+TCRBV04-01+TCRBJ02-07      FLQSINFVR,FLQSINFVRI,…\n 4 eOX54      CSADTQYF+TCRBV20-01+TCRBJ02-03              FIASFRLFA,SYFIASFRLF,…\n 5 eOX52      CASSQDAGLANEQYF+TCRBV03-01/03-02+TCRBJ02-07 ELYSPIFLI,LYSPIFLIV,Q…\n 6 eAV93      CASSLVATGELFF+TCRBV05-04+TCRBJ02-02         AFPFTIYSL,GYINVFAFPF,…\n 7 eLH44      CASSLNPGEGPQNIQYF+TCRBV28-01+TCRBJ02-04     AFPFTIYSL,GYINVFAFPF,…\n 8 eAV93      CASSSRTSGWYNEQFF+TCRBV11-02+TCRBJ02-01      AIPTNFTISV,AYSNNSIAIP…\n 9 eEE240     CASSPPGAPMGQPQHF+TCRBV27-01+TCRBJ01-05      FLNGSCGSV             \n10 eOX54      CATSDLPSTGTEVTGELFF+TCRBV24-01+TCRBJ02-02   QYIKWPWYI,YEQYIKWPW,Y…\n\n\n\nQ4: Is this tidy data? 
Why/why not?\nT6: See if you can find a way to create the below data, from the above\n\n\n\n\n\npeptide_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 5\n   Experiment CDR3b              V_gene     J_gene     `Amino Acids`            \n   <chr>      <chr>              <chr>      <chr>      <chr>                    \n 1 eEE240     CASSPYGGTEAFF      TCRBV07-06 TCRBJ01-01 YLNTLTLAV                \n 2 eXL30      CATPPRGGTGELFF     TCRBV07-09 TCRBJ02-02 AFLLFLVLI,FLAFLLFLV,FYLC…\n 3 eMR16      CASSLVWGAKNIQYF    TCRBV07-08 TCRBJ02-04 AFPFTIYSL,GYINVFAFPF,INV…\n 4 eQD111     CASSMTSSRDEQYF     TCRBV27-01 TCRBJ02-07 HTTDPSFLGRY              \n 5 eEE224     CASSFGLGSDPFF      TCRBV27-01 TCRBJ02-01 AFPFTIYSL,GYINVFAFPF,INV…\n 6 eXL32      CASSPPTPAGWANEKLFF TCRBV28-01 TCRBJ01-04 TVLSFCAFA,VLSFCAFAV      \n 7 eQD126     CASRPPDGGIYEQYF    TCRBV06-05 TCRBJ02-07 HTTDPSFLGRY              \n 8 eQD128     CASSELAGPQETQYF    TCRBV06-01 TCRBJ02-05 GYQPYRVVVL,PYRVVVLSF,QPY…\n 9 eEE228     CSVLQGTEAFF        TCRBV29-01 TCRBJ01-01 AEAELAKNVSL,AELAKNVSLDNVL\n10 eQD113     CASSSTSGGNEQFF     TCRBV07-08 TCRBJ02-01 IGAGICASY,IPIGAGICASY    \n\n\n\n\n\nClick here for hint\n\n\nFirst: Compare the two datasets and identify what happened? Did any variables “disappear” and did any “appear”? Ok, so this is a bit tricky, but perhaps there is a function to separate a composite (untidy) column into a set of new variables based on a separator? But what is a separator? Just like when you read a file with Comma Separated Values, a separator denotes how a composite string is divided into fields. So, look for such a repeated value, which seem to indeed separate such fields. Also, be aware, that character, which can mean more than one thing, may need to be “escaped” using an initial two backslashed, i.e. 
“\\x”, where x denotes the character needing to be “escaped”\n\n\nT7: Add a variable, which counts how many peptides are in each observation of Amino Acids\n\n\n\n\n\n\n\nClick here for hint\n\n\nWe have been working with the stringr package, perhaps it contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. stringr::str_ and then hit the Tab key to see relevant functions\n\n\npeptide_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 6\n   Experiment CDR3b            V_gene     J_gene     `Amino Acids`    n_peptides\n   <chr>      <chr>            <chr>      <chr>      <chr>                 <dbl>\n 1 eEE226     CATQLPSTDTQYF    TCRBV06-05 TCRBJ02-03 GRLQSLQTY,LITGR…          3\n 2 eEE226     CASSSPGAGTGELFF  TCRBV27-01 TCRBJ02-02 HTTDPSFLGRY               1\n 3 eXL31      CASSLGGEQYF      TCRBV27-01 TCRBJ02-07 TEKSNIIRGW                1\n 4 eEE228     CASRYSEAYEQYF    TCRBV27-01 TCRBJ02-07 FPPTSFGPL                 1\n 5 eXL27      CAPRRGAGVSEAFF   TCRBV28-01 TCRBJ01-01 DFLEYHDVR,EDFLE…          5\n 6 eHO141     CASSVSGTGDADTQYF TCRBV06-01 TCRBJ02-03 YLQPRTFL,YLQPRT…          3\n 7 eAV91      CASSPERLGYTF     TCRBV28-01 TCRBJ01-02 ITDVFYKENSY,SEY…          2\n 8 eEE240     CASSPDRLAGEQYF   TCRBV04-02 TCRBJ02-07 KLSYGIATV                 1\n 9 eAV93      CSVRDFLYNEQFF    TCRBV29-01 TCRBJ02-01 AFPFTIYSL,GYINV…          7\n10 eEE226     CASSHYNGNQPQHF   TCRBV27-01 TCRBJ01-05 LLDDFVEII,LLLDD…          2\n\n\n\nT8: Re-create the following plot\n\n\n\n\n\n\n\nQ4: What is the maximum number of peptides assigned to one observation?\nT9: Using the str_c() and the seq() functions, re-create the below\n\n\n\n[1] \"peptide_1\" \"peptide_2\" \"peptide_3\" \"peptide_4\" \"peptide_5\"\n\n\n\n\n\nClick here for hint\n\n\nIf you’re uncertain on how a function works, try going into the console and in this case e.g. 
type str_c(\"a\", \"b\") and seq(from = 1, to = 3) and see if you can combine these?\n\n\nT10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data\n\n\n\n\n\n\n\nClick here for hint\n\n\nIn the console, write ?separate and think about how you used it earlier. Perhaps you can not only specify a vector to separate into, but also specify a function, which returns a vector?\n\n\npeptide_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 18\n   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4\n   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    \n 1 eXL37      CASSLGQGADY… TCRBV… TCRBJ… VLWAHGFEL <NA>      <NA>      <NA>     \n 2 eAV93      CASVGLAMDNE… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF\n 3 eOX54      CASSVADMNTE… TCRBV… TCRBJ… FVDGVPFVV <NA>      <NA>      <NA>     \n 4 eOX52      CASRGLAKSSY… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…\n 5 eHO138     CASSQGFPGGV… TCRBV… TCRBJ… ALNTPKDHI ATEGALNT… <NA>      <NA>     \n 6 eAV88      CASSQNLNEKL… TCRBV… TCRBJ… LLDDFVEII LLLDDFVEI <NA>      <NA>     \n 7 eOX46      CASSLGGQGSN… TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF\n 8 eOX52      CSVDQDGIGEL… TCRBV… TCRBJ… KLSYGIATV <NA>      <NA>      <NA>     \n 9 eOX54      CASSLVPSSGP… TCRBV… TCRBJ… FVDGVPFVV <NA>      <NA>      <NA>     \n10 eQD114     CASSYRLATYE… TCRBV… TCRBJ… NSSPDDQI… NTNSSPDD… SSPDDQIGY SSPDDQIG…\n# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,\n#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,\n#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>\n\n\n\nQ5: Now, presumably you got a warning, discuss in your group why that is?\nQ6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?\n\nNow, finally we will use what we prepared for today, data pivoting. 
There are two functions, namely pivot_wider() and pivot_longer(). Also, now, we will use a trick when developing one’s data pipeline, while working with new functions, that one might not be completely comfortable with. You have seen the sample_n() function several times above and we can use that to randomly sample n observations from data. This we can utilise to work with a smaller data set in the development phase and once we are ready, we can increase this n gradually to see if everything continues to work as anticipated.\n\nT11: Using the peptide_data, run a few sample_n() calls with varying values of n to make sure, that you get a feeling for what is going on\nT12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data pivoting functions. Remember to start initially with sampling a smaller data set and then work on that first! Also, once you’re sure you’re good to go, reuse the peptide_data variable as we don’t want huge redundant data sets floating around in our environment\n\n\n\n\n\n\n\nClick here for hint\n\n\nIf the pivoting is not clear at all, then do what I do, create some example data:\n\nmy_data <- tibble(\n  id = str_c(\"id_\", 1:10),\n  var_1 = round(rnorm(10),1),\n  var_2 = round(rnorm(10),1),\n  var_3 = round(rnorm(10),1))\n\n…and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using pivot_wider()/pivot_longer(). 
Use View() to inspect and get a better overview of the results of pivoting.\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b                V_gene    J_gene n_peptides peptide_n peptide\n   <chr>      <chr>                <chr>     <chr>       <dbl> <chr>     <chr>  \n 1 eXL30      CASSYPLGYPEAFF       TCRBV06-… TCRBJ…          1 peptide_… <NA>   \n 2 eMR15      CASSLLTGGPVAKNIQYF   TCRBV27-… TCRBJ…          2 peptide_5 <NA>   \n 3 eMR26      CASSLAGVEQYF         TCRBV05-… TCRBJ…          2 peptide_7 <NA>   \n 4 eXL31      CASSSSHRDPEQYF       TCRBV27-… TCRBJ…          1 peptide_6 <NA>   \n 5 eEE243     CASSPSPARLAGGPSNEQFF TCRBV06-… TCRBJ…          1 peptide_5 <NA>   \n 6 eOX52      CASSPSTGGISYNEQFF    TCRBV07-… TCRBJ…          4 peptide_… <NA>   \n 7 eHH175     CASSLAGAYEQYF        TCRBV05-… TCRBJ…          2 peptide_4 <NA>   \n 8 eEE224     CAGQGWGQETQYF        TCRBV05-… TCRBJ…          3 peptide_3 KPFERD…\n 9 eOX52      CASSEIRAGPNQPQHF     TCRBV02-… TCRBJ…          1 peptide_8 <NA>   \n10 eOX49      CASSLVAVLTEAFF       TCRBV07-… TCRBJ…         11 peptide_… YLCFLA…\n\n\n\nQ7: You will see some NAs in the peptide variable, discuss in your group from where these arise?\nQ8: How many rows and columns now and how does this compare with Q3? Discuss why/why not it is different?\nT13: Now, lose the redundant variables n_peptides and peptide_n, get rid of the NAs in the peptide column, and make sure that we only have unique observations (i.e. 
there are no repeated rows/observations).\n\n\n\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 5\n   Experiment CDR3b           V_gene           J_gene     peptide    \n   <chr>      <chr>           <chr>            <chr>      <chr>      \n 1 eXL27      CASSFGGNEQFF    TCRBV07-08       TCRBJ02-01 LLFLVLIML  \n 2 eOX43      CASSFGTDTQYF    TCRBV27-01       TCRBJ02-03 YLCFLAFLL  \n 3 eOX43      CASRGAGYSSYEQYF TCRBV25-01       TCRBJ02-07 MIELSLIDFY \n 4 eEE228     CASTSGPEQYF     TCRBV12-03/12-04 TCRBJ02-07 AFPFTIYSL  \n 5 eDH96      CASGFGLGDNEQFF  TCRBV05-01       TCRBJ02-01 YLCFLAFLL  \n 6 eAV93      CASSLLGAYEQYF   TCRBV27-01       TCRBJ02-07 QPYRVVVLSF \n 7 eOX46      CASSPPTGAPYEQYF TCRBV18-01       TCRBJ02-07 VQPTESIVRF \n 8 eQD111     CASSLSEGVTDTQYF TCRBV27-01       TCRBJ02-03 HTTDPSFLGRY\n 9 eOX52      CSAREVGRIEQYF   TCRBV20-X        TCRBJ02-07 IELSLIDFYL \n10 eXL27      CASTFGGLAANEQFF TCRBV12-X        TCRBJ02-01 YINVFAFPF  \n\n\n\nQ8: Now how many rows and columns and is this data tidy? Discuss in your group why/why not?\n\nAgain, we turn to the stringr package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 proteinogenic amino acids, which we symbolise using ARNDCQEGHILKMFPSTWYV.\n\nT14: Use the str_detect() function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate parameter to see what happens\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, try to play a bit around with the function in the console, type e.g. str_detect(string = \"ARND\", pattern = \"A\") and str_detect(string = \"ARND\", pattern = \"C\") and then recall, that the filter() function requires a logical vector, i.e. 
a vector of TRUE and FALSE to filter the rows\n\n\nT15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, we’re working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b              V_gene     J_gene     peptide k_CDR3b k_peptide\n   <chr>      <chr>              <chr>      <chr>      <chr>     <int>     <int>\n 1 ePD87      CASSRTHRQGRNTDTQYF TCRBV18-01 TCRBJ02-03 LSPRWY…      18         9\n 2 eEE240     CSASTQSETQYF       TCRBV20-01 TCRBJ02-05 LLFLVL…      12         9\n 3 eOX52      CASSLYGQLYQETQYF   TCRBV27-01 TCRBJ02-05 TLVPQE…      16         9\n 4 eEE226     CASSTRGNTIYF       TCRBV13-01 TCRBJ01-03 LWPVTL…      12         9\n 5 eAV91      CASSVVGSSYEQYF     TCRBV09-01 TCRBJ02-07 SEYKGP…      14        12\n 6 eOX54      CASRSPGSYNEQFF     TCRBV27-01 TCRBJ02-01 ILLIIM…      14        10\n 7 ePD76      CASSEARLAGEYEQYF   TCRBV02-01 TCRBJ02-07 TPSGTW…      16         9\n 8 eLH48      CASNLMNTEAFF       TCRBV05-06 TCRBJ01-01 FIASFR…      12         9\n 9 eEE228     CAILDRVVNTEAFF     TCRBV06-05 TCRBJ01-01 FLQSIN…      14         9\n10 eEE228     CASSFSETGELFF      TCRBV05-01 TCRBJ02-02 TLACFV…      13        10\n\n\n\nT16: Re-create this plot\n\n\n\n\n\n\n\nQ9: What is the most predominant length of the CDR3b-sequences?\nT17: Re-create this plot\n\n\n\n\n\n\n\nQ10: What is the most predominant length of the peptide-sequences?\nQ11: Discuss in your group, if this data set is tidy or not?\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b              V_gene         J_gene peptide k_CDR3b k_peptide\n   <chr>      <chr>              <chr>          <chr>  <chr>     <int>     <int>\n 1 eAV93      CASSLAVGTGYYEQYF   TCRBV05-04     TCRBJ… LTDEMI…      16       
  9\n 2 eAV88      CASSSRSGSANTGELFF  TCRBV28-01     TCRBJ… GEIPVA…      17        12\n 3 eXL30      CASSLSTANPSTDTQYF  TCRBV04-01     TCRBJ… YLCFLA…      17         9\n 4 eEE228     CASYPTGTGFFGYYGYTF TCRBV27-01     TCRBJ… VQPTES…      18        10\n 5 eOX54      CASTYESLYGYTF      TCRBV12-03/12… TCRBJ… NVFAFP…      13         9\n 6 eAV88      CASSSTPLTGGTYEQYF  TCRBV07-09     TCRBJ… RQLLFV…      17         9\n 7 eHH175     CATVGETYEQYF       TCRBV28-01     TCRBJ… MPASWV…      12         9\n 8 eLH41      CASSTGPGSEKLFF     TCRBV06-05     TCRBJ… KTFPPT…      14         9\n 9 eAV93      CASSIDYSSNQPQHF    TCRBV19-01     TCRBJ… SINFVR…      15        10\n10 eXL31      CSATPGGLEQFF       TCRBV20-X      TCRBJ… FLWLLW…      12         9\n\n\n\n\nCreating one data set from two data sets\nBefore we move onto using the family of *_join() functions you prepared for today, we will just take a quick peek at the meta data again:\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 11\n   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   \n   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n 1 eHO130     Healthy (N…    28 F      White A*02… A*03… B*07… B*08… C*07… C*07…\n 2 eLH48      COVID-19-C…    28 M      White A*03… A*24… B*08… B*14… C*03… C*08…\n 3 eOX43      Healthy (N…    24 M      White A*02… A*03… B*27… B*40… C*03… C*07…\n 4 ePD76      Healthy (N…    33 M      White A*02… A*03… B*35… B*40… C*03… C*03…\n 5 eQD115     COVID-19-C…    48 M      <NA>  A*02… A*03… B*07… B*44… C*05… C*07…\n 6 eAV100     COVID-19-C…    29 F      <NA>  A*02… A*68… B*07… B*40… C*03… C*07…\n 7 eQD109     COVID-19-C…    61 M      <NA>  A*03… A*69… B*07… B*07… C*07… C*07…\n 8 eLH59      COVID-19-C…    NA <NA>   <NA>  A*01… A*02… B*40… B*52… C*03… C*16…\n 9 eMR15      COVID-19-C…    NA <NA>   <NA>  A*03… A*32… B*07… B*07… C*07… C*07…\n10 eAM23      COVID-19-C…    48 M      <NA>  A*11… A*24… B*15… B*52… C*04… C*12…\n\n\nRemember you can scroll 
in the data.\n\nQ12: Discuss in your group, if this data with respect to the A1, A2, B1, B2, C1 and C2 variables is a wide or a long data format?\n\nAs with the peptide_data, we will now have to use data pivoting again. I.e.:\n\nT18: use either pivot_wider() or pivot_longer() to create the following data:\n\n\n\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment Cohort                        Age Gender Race         Gene  Allele\n   <chr>      <chr>                       <dbl> <chr>  <chr>        <chr> <chr> \n 1 eLH47      COVID-19-Convalescent          35 F      White        A2    \"A*02…\n 2 eJL160     COVID-19-Acute                 52 F      African Ame… B2    \"B*81…\n 3 eAV100     COVID-19-Convalescent          29 F      <NA>         C2    \"C*07…\n 4 eLH51      COVID-19-Convalescent          55 M      Asian        A1    \"A*24…\n 5 eMR17      COVID-19-Convalescent          NA <NA>   <NA>         B2    \"B*57…\n 6 eQD121     COVID-19-Convalescent          38 M      <NA>         C2    \"C*07…\n 7 eNL192     COVID-19-Convalescent          NA <NA>   <NA>         C1    \"\"    \n 8 eMR23      COVID-19-Convalescent          22 F      <NA>         A1    \"\"    \n 9 eOX43      Healthy (No known exposure)    24 M      White        B1    \"B*27…\n10 eLH42      COVID-19-Convalescent          63 M      <NA>         B1    \"B*07…\n\n\nRemember, what we are aiming for here, is to create one data set from two. So:\n\nQ13: Discuss in your group, which variable(s?) 
define the same observations between the peptide_data and the meta_data?\n\nOnce you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:\n\n\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 2\n   Experiment Allele      \n   <chr>      <chr>       \n 1 eJL157     \"C*07:01:01\"\n 2 eHO135     \"B*07:02:01\"\n 3 eHO141     \"\"          \n 4 eJL153     \"A*03:01:01\"\n 5 eMR20      \"C*07:02:01\"\n 6 eJL154     \"B*15:02:01\"\n 7 ePD82      \"A*26:02:01\"\n 8 eQD111     \"A*01:01:01\"\n 9 eMR22      \"C*07:18:01\"\n10 eJL146     \"A*02:01\"   \n\n\nUse the View() function again, to look at the meta_data. Notice something? Some alleles are e.g. A*11:01, whereas others are B*51:01:02. You can find information on why, by visiting Nomenclature for Factors of the HLA System.\nLong story short, we only want to include Field 1 (allele group) and Field 2 (Specific HLA protein). You have prepared the stringr package for today. See if you can find a way to reduce e.g. B*51:01:02 to B*51:01 and then create a new variable Allele_F_1_2 accordingly, while also removing the ...x (where x is a number) subscripts from the Gene variable (It is an artifact from having the data in a wide format, where you cannot have two variables with the same name) and also, remove any NAs and \"\"s, denoting empty entries.\n\n\n\nClick here for hint\n\n\nThere are several ways this can be achieved, the easiest being to consider if perhaps a part of the string based on indices could be of interest. This term “a part of a string” is called a substring, perhaps the stringr package contains a function to work with substrings? In the console, type stringr:: and hit the Tab key. This will display the functions available in the stringr package. 
Scroll down and find the functions starting with str_ and look for one, which might be relevant and remember you can use ?function_name to get more information on how a given function works.\n\n\n\n\n\nT19: Create the following data, according to specifications above:\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 3\n   Experiment Allele     Allele_F_1_2\n   <chr>      <chr>      <chr>       \n 1 eJL157     C*07:02:01 C*07:02     \n 2 eQD118     C*03:04:01 C*03:04     \n 3 eMR13      C*07:01:01 C*07:01     \n 4 eLH54      B*40:02:01 B*40:02     \n 5 ePD82      C*08:01:01 C*08:01     \n 6 eQD121     B*57:01:01 B*57:01     \n 7 eAV88      C*07:04    C*07:04     \n 8 eDH105     B*40:01:02 B*40:01     \n 9 ePD86      C*14:02:01 C*14:02     \n10 eOX46      C*04:01    C*04:01     \n\n\nThe asterisk, i.e. * is a rather annoying character because of ambiguity, so:\n\nT20: Clean the data a bit more, by removing the asterisk and redundant variables:\n\n\n\n\n\nmeta_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 2\n   Experiment Allele\n   <chr>      <chr> \n 1 eDH96      A02:01\n 2 eQD127     C02:02\n 3 eJL162     B55:01\n 4 eLH51      C12:04\n 5 eHO133     A32:01\n 6 eJL157     B18:01\n 7 eQD123     C07:02\n 8 eLH59      A02:01\n 9 eXL32      A01:01\n10 eLH45      C12:03\n\n\n\n\n\nClick here for hint 1\n\n\nAgain, the stringr package may come in handy. Perhaps there is a function to remove one or more such pesky characters?\n\n\n\n\nClick here for hint 2\n\n\nGetting a weird error? 
Recall, that character ambiguity needs to be “escaped”, you did this somehow earlier on…\n\nRecall the peptide_data?\n\npeptide_data |>\n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b                  V_gene     J_gene peptide k_CDR3b k_peptide\n   <chr>      <chr>                  <chr>      <chr>  <chr>     <int>     <int>\n 1 eEE240     CASSQRSNTGELFF         TCRBV28-01 TCRBJ… AFLLFL…      14         9\n 2 eAV93      CATSDPPGWGQGAAYSNQPQHF TCRBV24-01 TCRBJ… TLACFV…      22        10\n 3 eOX54      CSASKLDSNNEQFF         TCRBV20-01 TCRBJ… SLIDFY…      14        10\n 4 eXL27      CASSPSGAGEQFF          TCRBV27-01 TCRBJ… FLWLLW…      13         9\n 5 eXL27      CASSDPFSGFYEQYF        TCRBV05-01 TCRBJ… VYFLQS…      15         9\n 6 eOX49      CASSGAGSNQPQHF         TCRBV09-01 TCRBJ… LLLDDF…      14         9\n 7 eEE228     CASRTGGSSYNEQFF        TCRBV19-01 TCRBJ… IELSLI…      15        10\n 8 eHO135     CASSLRSNQPQHF          TCRBV27-01 TCRBJ… ITLATC…      13         9\n 9 eEE226     CASSFSDYEQYF           TCRBV05-06 TCRBJ… FLNGSC…      12         9\n10 eHO124     CATSEALQETQYF          TCRBV24-01 TCRBJ… KVFRSS…      13         9\n\n\n\nT21: Create a dplyr pipeline, starting with the peptide_data, which joins it with the meta_data and remember to make sure that you get only unique observations of rows. Save this data into a new variable named peptide_meta_data (If you get a warning, discuss in your group what it means?)\n\n\n\n\n\n\n\nClick here for hint 1\n\n\nWhich family of functions do we use to join data? Also, perhaps here it would be prudent to start with working on a smaller data set, recall we could sample a number of rows yielding a smaller development data set\n\n\n\n\nClick here for hint 2\n\n\nYou should get a data set of around 3,000,000+ rows, take a moment to consider how that would have been to work with in Excel? Also, in case the servers are not liking this, you can consider subsetting the peptide_data prior to joining to e.g. 
100,000 or 10,000 rows.\n\n\npeptide_meta_data |>\n  sample_n(10)\n\n# A tibble: 10 × 8\n   Experiment CDR3b              V_gene  J_gene peptide k_CDR3b k_peptide Allele\n   <chr>      <chr>              <chr>   <chr>  <chr>     <int>     <int> <chr> \n 1 eHO134     CASSDRTPQETQYF     TCRBV2… TCRBJ… HTTDPS…      14        11 A24:02\n 2 eEE224     CSALGLEVNEQYF      TCRBV2… TCRBJ… FYLCFL…      13         9 A02:01\n 3 eEE226     CASSLGPDGYNEQFF    TCRBV0… TCRBJ… ITEEVG…      15        14 A02:01\n 4 eEE224     CASSFSGLSYEQYF     TCRBV0… TCRBJ… IDFYLC…      14        10 C07:04\n 5 eHH175     CASSQVQGVRSGANVLTF TCRBV0… TCRBJ… IPTNFT…      18         9 C07:02\n 6 eDH105     CASSRGTSRNTEAFF    TCRBV1… TCRBJ… LALLLL…      15         9 B48:01\n 7 eEE226     CSVVGTSGGHEQYF     TCRBV2… TCRBJ… DGVYFA…      14        10 B35:02\n 8 eQD128     CASSSPSGGINEQFF    TCRBV1… TCRBJ… YLCFLA…      15         9 A02:10\n 9 eOX43      CASSLRGTSYGYTF     TCRBV1… TCRBJ… FLPRVF…      14         9 C07:04\n10 eOX46      CASSWDNYNEQFF      TCRBV0… TCRBJ… YIIKLI…      13         9 A02:01\n\n\n\n\n\nAnalysis\nNow, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:\n\nWhat characterises the peptides binding to the HLAs?\nWhat characterises T-cell Receptors binding to the pMHC-complexes?\n\n\nPeptides binding to HLA\nAs we have touched upon multiple times, R is very flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package ggseqlogo (More info here).\n\nT22: Subset the final peptide_meta_data data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo\n\n\n\n\nClick here for hint\n\n\nYou can pipe a vector of peptides into ggseqlogo, but perhaps you first need to pull that vector from the relevant variable in your tibble? 
Also, consider before that, that you’ll need to make sure, you are only looking at peptides of length 9\n\n\n\n\n\n\n\n\n\n\n\nT23: Repeat for e.g. B07:02 or another of your favourite alleles\n\nNow, let’s take a closer look at the sequence logo:\n\nQ14: Which positions in the peptide determine binding to HLA?\n\n\n\n\nClick here for hint\n\n\nRecall your Introduction to Bioinformatics course? And/or perhaps ask your fellow group members if they know?\n\n\n\nCDR3b-sequences binding to pMHC\n\nT24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:\n\n\n\n\n\n\n\n\n\n\n\nQ15: In your group, discuss what you see?\nT25: Play around with other combinations of k_CDR3b, Allele, and peptide and inspect how the logo changes\n\nDisclaimer: In this data set, we only get: A given CDR3b was found to recognise a given peptide in a given subject and that subject had a given haplotype - Something’s missing… Perhaps if you have had immunology, then you can spot it? There is a trick to get around this missing information, but that’s beyond the scope of what we’re working with here."
+    "text": "Creating the Micro-Report\n\nBackground\nFeel free to copy paste the one stated in the background-section above\n\n\nAim\nState the aim of the micro-report, i.e. what are the questions you are addressing?\n\n\nLoad Libraries\n\n\n\nLoad the libraries needed\n\n\nLoad Data\nRead the two data sets into variables peptide_data and meta_data.\n\n\n\nClick here for hint\n\n\nThink about which Tidyverse package deals with reading data and what are the file types we want to read here?\n\n\n\n\n\n\nData Description\nIt is customary to include a description of the data, helping the reader if the report, i.e. your stakeholder, to get an easy overview\n\nThe Subject Meta Data\nLet’s take a look at the meta data:\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 30\n   Experiment Subject `Cell Type`       `Target Type` Cohort    Age Gender Race \n   <chr>        <dbl> <chr>             <chr>         <chr>   <dbl> <chr>  <chr>\n 1 eXL27        19830 naive_CD8         C19_cI        Health…    24 M      White\n 2 eHO141        3238 PBMC              C19_cI        COVID-…    NA <NA>   <NA> \n 3 eHH175       20300 naive_CD8         C19_cI        Health…    28 M      White\n 4 eHO124        3819 PBMC              C19_cI        Health…    62 M      <NA> \n 5 ePD100        1811 PBMC              C19_cI        COVID-…    66 M      <NA> \n 6 ePD85         5869 naive_CD8         C19_cI        Health…    27 F      <NA> \n 7 eJL161     1005703 PBMC              C19_cI        COVID-…    31 F      White\n 8 eHH173       19829 naive_CD8         C19_cI        Health…    50 M      White\n 9 eLH59         1770 B- depleted PBMCs C19_cII       COVID-…    NA <NA>   <NA> \n10 ePD73         4423 naive_CD8         minigene_Set1 Health…    37 F      White\n# ℹ 22 more variables: `HLA-A...9` <chr>, `HLA-A...10` <chr>,\n#   `HLA-B...11` <chr>, `HLA-B...12` <chr>, `HLA-C...13` <chr>,\n#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,\n#   DPB1...18 <chr>, DQA1...19 
<chr>, DQA1...20 <chr>, DQB1...21 <chr>,\n#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,\n#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,\n#   DRB5...30 <chr>\n\n\n\nQ1: How many observations of how many variables are in the data?\nQ2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?\nT1: Re-create this plot\n\nRead this first:\n\nThink about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some counting. Recall, that we can stick together a dplyr pipeline with a call to ggplot, so here we will have to count of Cohort and Gender before plotting\n\n\n\n\n\n\nDoes your plot look different somehow? Consider peeking at the hint…\n\n\n\nClick here for hint\n\n\nPerhaps not everyone agrees on how to denote NAs in data. I have seen -99, -11, _ and so on… Perhaps this can be dealt with in the instance we read the data from the file? I.e. in the actual function call to your read() function. Recall, how can we get information on the parameters of a ?function\n\n\nT2: Re-create this plot\n\n\n\n\n\n\n\n\n\nClick here for hint\n\n\nPerhaps there is a function, which can cut continuous observations into a set of bins?\n\n\nSTOP! Make sure you handled how NAs are denoted in the data before proceeding, see hint below T1\n\nT3: Look at the data and create yet another plot as you see fit. 
Also skip the redundant variables Subject, Cell Type and Target Type\n\n\n\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 27\n   Experiment Cohort      Age Gender Race  `HLA-A...9` `HLA-A...10` `HLA-B...11`\n   <chr>      <chr>     <dbl> <chr>  <chr> <chr>       <chr>        <chr>       \n 1 eJL154     COVID-19…    35 F      Nati… \"A*02:01:0… \"A*29:02:01\" \"B*15:02:01\"\n 2 eXL31      Healthy …    28 M      White \"A*02:01\"   \"A*29:02\"    \"B*07:02\"   \n 3 ePD90      COVID-19…    29 M      <NA>  \"\"          \"\"           \"\"          \n 4 eQD116     COVID-19…    66 F      <NA>  \"A*03:01:0… \"A*11:01:01\" \"B*35:01:01\"\n 5 eMR23      COVID-19…    22 F      <NA>  \"\"          \"\"           \"\"          \n 6 eJL158     COVID-19…    33 M      White \"A*02:01:0… \"A*24:02:01\" \"B*15:01:01\"\n 7 eAV93      Healthy …    41 M      White \"A*11:01\"   \"A*68:01\"    \"B*35:01\"   \n 8 eLH45      COVID-19…    53 M      <NA>  \"A*02:01:0… \"A*03:01:01\" \"B*07:02:01\"\n 9 eLH59      COVID-19…    NA <NA>   <NA>  \"A*01:01:0… \"A*02:01:01\" \"B*40:01:02\"\n10 eNL192     COVID-19…    NA <NA>   <NA>  \"\"          \"\"           \"\"          \n# ℹ 19 more variables: `HLA-B...12` <chr>, `HLA-C...13` <chr>,\n#   `HLA-C...14` <chr>, DPA1...15 <chr>, DPA1...16 <chr>, DPB1...17 <chr>,\n#   DPB1...18 <chr>, DQA1...19 <chr>, DQA1...20 <chr>, DQB1...21 <chr>,\n#   DQB1...22 <chr>, DRB1...23 <chr>, DRB1...24 <chr>, DRB3...25 <chr>,\n#   DRB3...26 <chr>, DRB4...27 <chr>, DRB4...28 <chr>, DRB5...29 <chr>,\n#   DRB5...30 <chr>\n\n\nNow, a classic way of describing a cohort, i.e. 
the group of subjects used for the study, is the so-called table1 and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to “cheat” and use an R-package, like so:\nNB!: This may look a bit odd initially, but if you render your document, you should be all good!\n\nlibrary(\"table1\") # <= Yes, this should normally go at the beginning!\nmeta_data |>\n  mutate(Gender = factor(Gender),\n         Cohort = factor(Cohort)) |>\n  table1(x = formula(~ Gender + Age + Race | Cohort),\n         data = _)\n\n\n\n\n\n\nCOVID-19-Acute(N=4)\nCOVID-19-B-Non-Acute(N=8)\nCOVID-19-Convalescent(N=90)\nCOVID-19-Exposed(N=3)\nHealthy (No known exposure)(N=39)\nOverall(N=144)\n\n\n\n\nGender\n\n\n\n\n\n\n\n\nF\n1 (25.0%)\n4 (50.0%)\n33 (36.7%)\n1 (33.3%)\n17 (43.6%)\n56 (38.9%)\n\n\nM\n2 (50.0%)\n3 (37.5%)\n36 (40.0%)\n0 (0%)\n21 (53.8%)\n62 (43.1%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n1 (2.6%)\n26 (18.1%)\n\n\nAge\n\n\n\n\n\n\n\n\nMean (SD)\n50.7 (17.0)\n43.7 (7.74)\n51.5 (15.3)\n35.0 (NA)\n33.3 (9.93)\n44.9 (15.7)\n\n\nMedian [Min, Max]\n52.0 [33.0, 67.0]\n42.0 [33.0, 53.0]\n53.0 [21.0, 79.0]\n35.0 [35.0, 35.0]\n31.0 [21.0, 62.0]\n42.0 [21.0, 79.0]\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n0 (0%)\n25 (17.4%)\n\n\nRace\n\n\n\n\n\n\n\n\nAfrican American\n1 (25.0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n2 (1.4%)\n\n\nWhite\n2 (50.0%)\n7 (87.5%)\n13 (14.4%)\n0 (0%)\n28 (71.8%)\n50 (34.7%)\n\n\nAsian\n0 (0%)\n0 (0%)\n3 (3.3%)\n0 (0%)\n2 (5.1%)\n5 (3.5%)\n\n\nHispanic or Latino/a\n0 (0%)\n0 (0%)\n1 (1.1%)\n0 (0%)\n0 (0%)\n1 (0.7%)\n\n\nNative Hawaiian or Other Pacific Islander\n0 (0%)\n0 (0%)\n0 (0%)\n1 (33.3%)\n0 (0%)\n1 (0.7%)\n\n\nBlack or African American\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n3 (7.7%)\n3 (2.1%)\n\n\nMixed Race\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n1 (0.7%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n73 (81.1%)\n2 (66.7%)\n4 (10.3%)\n81 (56.3%)\n\n\n\n\n\n\nNote how good this looks! 
If you have ever done a “Table 1” before, you know how painful they can be, especially if something changes in your cohort - Dynamic reporting to the rescue!\nLastly, before we proceed, the meta_data contains HLA data for both class I and class II (see background), but here we are only interested in class I, recall these are denoted HLA-A, HLA-B and HLA-C, so make sure to remove any non-class I, i.e. the ones after, denoted D-something.\n\nT4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 and HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2 (Think: How can we rename variables? And here, just do it “manually” per variable). Remember to assign this new data to the same meta_data variable\n\n\n\n\nClick here for hint\n\n\nWhich tidyverse function subsets variables? Perhaps there is a function, which somehow matches a set of variables? And perhaps for the initiated this is compatible with regular expressions (If you don’t know what this means - No worries! 
If you do, see if you can utilise this to simplify your variable selection)\n\n\n\n\nBefore we proceed, this is the data we will carry on with:\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 11\n   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   \n   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n 1 eLH54      COVID-19-C…    NA <NA>   <NA>  \"A*0… \"A*0… \"B*0… \"B*4… \"C*0… \"C*0…\n 2 eHH173     Healthy (N…    50 M      White \"A*0… \"A*0… \"B*3… \"B*4… \"C*0… \"C*0…\n 3 eMR25      COVID-19-C…    21 F      <NA>  \"\"    \"\"    \"\"    \"\"    \"\"    \"\"   \n 4 ePD91      COVID-19-C…    52 M      White \"\"    \"\"    \"\"    \"\"    \"\"    \"\"   \n 5 eDH105     COVID-19-C…    32 F      <NA>  \"A*2… \"A*2… \"B*4… \"B*4… \"C*0… \"C*0…\n 6 ePD87      COVID-19-C…    47 M      White \"A*0… \"A*2… \"B*0… \"B*0… \"C*0… \"C*0…\n 7 eAM23      COVID-19-C…    48 M      <NA>  \"A*1… \"A*2… \"B*1… \"B*5… \"C*0… \"C*1…\n 8 eOX54      Healthy (N…    39 F      Afri… \"A*0… \"A*2… \"B*1… \"B*5… \"C*0… \"C*1…\n 9 eMR12      COVID-19-C…    NA <NA>   <NA>  \"A*0… \"A*0… \"B*4… \"B*5… \"C*0… \"C*1…\n10 eQD131     COVID-19-E…    NA <NA>   <NA>  \"A*0… \"A*3… \"B*1… \"B*5… \"C*0… \"C*0…\n\n\nNow, we have a beautiful tidy dataset, recall that this entails, that each row is an observation, each column is a variable and each cell holds one value.\n\n\n\nThe Peptide Details Data\nLet’s start with simply having a look see:\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   `TCR BioIdentity`            TCR Nucleotide Seque…¹ Experiment `ORF Coverage`\n   <chr>                        <chr>                  <chr>      <chr>         \n 1 CASSEAPGLEFGNTIYF+TCRBV02-0… ACAAAGCTGGAGGACTCAGCC… eXL30      ORF1ab        \n 2 CASSHEDRGRPGELFF+TCRBV03-01… TCCCTGGAGCTTGGTGACTCT… eMR17      ORF1ab        \n 3 CASSPPTDTQYF+TCRBV27-01+TCR… CTGATCCTGGAGTCGCCCAGC… eXL31      surface glyco…\n 4 CASSLGLAGEQYF+TCRBV07-02+TC… 
ACGATCCAGCGCACAGAGCAG… eDH113     ORF1ab        \n 5 CASSYWNEQFF+TCRBV06-05+TCRB… NNNNNNNNNNNNNTGTCGGCT… eOX52      ORF1ab        \n 6 CASSLVGGDPSTDTQYF+TCRBV13-0… TTGGAGCTGGGGGACTCAGCC… eEE226     ORF1ab        \n 7 CASSIGVGRAYEQYF+TCRBV19-01+… ACATCGGCCCAAAAGAACCCG… ePD83      ORF3a         \n 8 CSALGQGNVQFF+TCRBV29-01+TCR… CTGACTGTGAGCAACATGAGC… eEE226     ORF6          \n 9 CASSQLRYTEAFF+TCRBV04-03+TC… CACCTACACACCCTGCAGCCA… eHO124     ORF1ab        \n10 CASSLFGRGPTYNEQFF+TCRBV27-0… CCCAGCCCCAACCAGACCTCT… eLH43      ORF1ab        \n# ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`\n# ℹ 3 more variables: `Amino Acids` <chr>, `Start Index in Genome` <dbl>,\n#   `End Index in Genome` <dbl>\n\n\n\nQ3: How many observations of how many variables are in the data?\n\nThis is a rather big data set, so let us start with two “tricks” to handle this, first:\n\nWrite the data back into your data folder, using the filename peptide-detail-ci.csv.gz, note the appending of .gz, which is automatically recognised and results in gz-compression\nNow, check in your data folder, that you have two files peptide-detail-ci.csv and peptide-detail-ci.csv.gz, delete the former\nAdjust your reading-the-data-code in the “Load Data”-section, to now read in the peptide-detail-ci.csv.gz file\n\n\n\n\nClick here for hint\n\n\nJust as you can read a file, you can of course also write a file. Note the filetype we want to write here is csv. If you in the console type e.g. readr::wr and then hit the Tab key, you will see the different functions for writing different filetypes\n\nThen:\n\nT5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids. Remember to assign this new data to the same peptide_data variable to avoid cluttering your environment with redundant variables. 
Bonus: Did you know you can click the Environment pane and see which variables you have?\n\n\n\n\nOnce again, before we proceed, this is the data we will carry on with:\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 3\n   Experiment `TCR BioIdentity`                       `Amino Acids`             \n   <chr>      <chr>                                   <chr>                     \n 1 eEE226     CASSQDSDGGGNTIYF+TCRBV04-02+TCRBJ01-03  AEAELAKNVSL,AELAKNVSLDNVL \n 2 eOX54      CASSTVGGPFQPQHF+TCRBV12-X+TCRBJ01-05    MMISAGFSL                 \n 3 eXL27      CASRKTTDTQYF+TCRBV27-01+TCRBJ02-03      AFLLFLVLI,FLAFLLFLV,FYLCF…\n 4 eHO135     CAWRRGGKLFF+TCRBV30-01+TCRBJ01-04       AYKTFPPTEPK,KTFPPTEPK     \n 5 eXL36      CASSVAAAVSYNEQFF+TCRBV09-01+TCRBJ02-01  VDDPCPIHFY,VVDDPCPIHFY,YV…\n 6 eEE228     CASSFPQNTQYF+TCRBV07-09+TCRBJ02-03      FLWLLWPVT,FLWLLWPVTL,LWLL…\n 7 eEE226     CASSSRTEGSTDTQYF+TCRBV11-02+TCRBJ02-03  EEHVQIHTI                 \n 8 eOX52      CASSVEGTVNEKLFF+TCRBV09-01+TCRBJ01-04   FVDGVPFVV                 \n 9 eQD137     CSVVSGISYNEQFF+TCRBV29-01+TCRBJ02-01    AFLLFLVLI,FLAFLLFLV,FYLCF…\n10 ePD83      CASSIGLGLAEYNEQFF+TCRBV19-01+TCRBJ02-01 SEHDYQIGGYTEKW,YQIGGYTEK,…\n\n\n\nQ4: Is this tidy data? 
Why/why not?\nT6: See if you can find a way to create the below data, from the above\n\n\n\n\n\npeptide_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 5\n   Experiment CDR3b              V_gene           J_gene     `Amino Acids`      \n   <chr>      <chr>              <chr>            <chr>      <chr>              \n 1 ePD76      CASSPRPGLAGGRDTQYF TCRBV07-06       TCRBJ02-03 SSANNCTFEY,VYSSANN…\n 2 eEE226     CASSEASMNTEAFF     TCRBV06-01       TCRBJ01-01 LPAADLDDF          \n 3 eOX49      CASSRQTEAFF        TCRBV03-01/03-02 TCRBJ01-01 FLNGSCGSV          \n 4 eOX43      CASSLRGTGESEFF     TCRBV12-X        TCRBJ02-01 AFPFTIYSL,GYINVFAF…\n 5 eOX43      CASSHAASRSYEQYF    TCRBV04-01       TCRBJ02-07 APKEIIFL,KEIIFLEGE…\n 6 eEE226     CASSHWSVAEETQYF    TCRBV03-01/03-02 TCRBJ02-05 KLSYGIATV          \n 7 eJL164     CSASERSTTLGQTTQYF  TCRBV20-01       TCRBJ02-03 KLWAQCVQL          \n 8 eEE228     CAISDRLISGSTGELFF  TCRBV10-03       TCRBJ02-02 FPNITNLCPF,QPTESIV…\n 9 eQD132     CASSSRTKGYEQYF     TCRBV06-05       TCRBJ02-07 STQDLFLPFF,TQDLFLP…\n10 eOX52      CASSIGPLDSYGYTF    TCRBV19-01       TCRBJ01-02 KLSYGIATV          \n\n\n\n\n\nClick here for hint\n\n\nFirst: Compare the two datasets and identify what happened? Did any variables “disappear” and did any “appear”? Ok, so this is a bit tricky, but perhaps there is a function to separate a composite (untidy) column into a set of new variables based on a separator? But what is a separator? Just like when you read a file with Comma Separated Values, a separator denotes how a composite string is divided into fields. So, look for such a repeated value, which seems to indeed separate such fields. Also, be aware, that characters, which can mean more than one thing, may need to be “escaped” using an initial two backslashes, i.e. 
“\\x”, where x denotes the character needing to be “escaped”\n\n\nT7: Add a variable, which counts how many peptides are in each observation of Amino Acids\n\n\n\n\n\n\n\nClick here for hint\n\n\nWe have been working with the stringr package, perhaps it contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. stringr::str_ and then hit the Tab key to see relevant functions\n\n\npeptide_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 6\n   Experiment CDR3b               V_gene         J_gene `Amino Acids` n_peptides\n   <chr>      <chr>               <chr>          <chr>  <chr>              <dbl>\n 1 ePD91      CASSIGLTEAFF        TCRBV19-01     TCRBJ… ILGTVSWNL,SN…          2\n 2 eEE240     CATSRPMNTEAFF       TCRBV15-01     TCRBJ… AFLLFLVLI,FL…         11\n 3 eHH175     CASSDGPGYEQYF       TCRBV12-03/12… TCRBJ… KMKDLSPRW              1\n 4 eHO135     CASSLAGAQPQHF       TCRBV05-01     TCRBJ… LSPRWYFYY,SP…          2\n 5 eEE228     CASSPTSGINNEQFF     TCRBV18-01     TCRBJ… KLSYGIATV              1\n 6 eXL31      CASSKATGEGGNYEQYF   TCRBV21-01     TCRBJ… FLQSINFVR,FL…         13\n 7 eOX49      CASSYGLAGGEETQYF    TCRBV07-03     TCRBJ… KLWAQCVQL              1\n 8 eLH48      CASSQDVGRGVQETQYF   TCRBV03-01/03… TCRBJ… RIRGGDGKM,RI…          2\n 9 eXL31      CASSHGTGGELFF       TCRBV27-01     TCRBJ… QLMCQPILL,QL…          2\n10 eEE226     CASSQDRVVAGGQGDTQYF TCRBV03-01/03… TCRBJ… APKEIIFL,KEI…          2\n\n\n\nT8: Re-create the following plot\n\n\n\n\n\n\n\nQ4: What is the maximum number of peptides assigned to one observation?\nT9: Using the str_c() and the seq() functions, re-create the below\n\n\n\n[1] \"peptide_1\" \"peptide_2\" \"peptide_3\" \"peptide_4\" \"peptide_5\"\n\n\n\n\n\nClick here for hint\n\n\nIf you’re uncertain on how a function works, try going into the console and in this case e.g. 
type str_c(\"a\", \"b\") and seq(from = 1, to = 3) and see if you can combine these?\n\n\nT10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data\n\n\n\n\n\n\n\nClick here for hint\n\n\nIn the console, write ?separate and think about how you used it earlier. Perhaps you can not only specify a vector to separate into, but also specify a function, which returns a vector?\n\n\npeptide_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 18\n   Experiment CDR3b        V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4\n   <chr>      <chr>        <chr>  <chr>  <chr>     <chr>     <chr>     <chr>    \n 1 eOX52      CASSPPISYEQ… TCRBV… TCRBJ… ILGTVSWNL SNEKQEIL… <NA>      <NA>     \n 2 eEE226     CASPFPGQGHE… TCRBV… TCRBJ… KLSYGIATV <NA>      <NA>      <NA>     \n 3 eQD124     CASRTLGAGEL… TCRBV… TCRBJ… HTTDPSFL… <NA>      <NA>      <NA>     \n 4 eAV88      CASSPGLDYNE… TCRBV… TCRBJ… KPLEFGAT… <NA>      <NA>      <NA>     \n 5 eEE224     CSVEGLPGRET… TCRBV… TCRBJ… APAHISTI  LIVNSVLL… LLFLAFVV… SVLLFLAFV\n 6 eQD108     CASSATGALAS… TCRBV… TCRBJ… FIASFRLFA SYFIASFR… YFIASFRLF YFIASFRL…\n 7 eXL27      CASSLGDSNTE… TCRBV… TCRBJ… ELYSPIFLI LYSPIFLIV QELYSPIFL VQELYSPIF\n 8 eOX49      CASSLNHLGDR… TCRBV… TCRBJ… FVCNLLLL… LLFVTVYS… TVYSHLLLV <NA>     \n 9 eAV93      CATTEGTANTE… TCRBV… TCRBJ… SSANNCTF… VYSSANNC… <NA>      <NA>     \n10 eOX46      CASSSSTAGEQ… TCRBV… TCRBJ… FPPTSFGPL <NA>      <NA>      <NA>     \n# ℹ 10 more variables: peptide_5 <chr>, peptide_6 <chr>, peptide_7 <chr>,\n#   peptide_8 <chr>, peptide_9 <chr>, peptide_10 <chr>, peptide_11 <chr>,\n#   peptide_12 <chr>, peptide_13 <chr>, n_peptides <dbl>\n\n\n\nQ5: Now, presumably you got a warning, discuss in your group why that is?\nQ6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?\n\nNow, finally we will use what we prepared for today, data pivoting. 
There are two functions, namely pivot_wider() and pivot_longer(). Also, now, we will use a trick when developing one’s data pipeline, while working with new functions, that one might not be completely comfortable with. You have seen the sample_n() function several times above and we can use that to randomly sample n observations from data. This we can utilise to work with a smaller data set in the development phase and once we are ready, we can increase this n gradually to see if everything continues to work as anticipated.\n\nT11: Using the peptide_data, run a few sample_n() calls with varying degree of n to make sure, that you get a feeling for what is going on\nT12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data pivoting functions. Remember to start initially with sampling a smaller data set and then work on that first! Also, once you’re sure you’re good to go, reuse the peptide_data variable as we don’t want huge redundant data sets floating around in our environment\n\n\n\n\n\n\n\nClick here for hint\n\n\nIf the pivoting is not clear at all, then do what I do, create some example data:\n\nmy_data <- tibble(\n  id = str_c(\"id_\", 1:10),\n  var_1 = round(rnorm(10),1),\n  var_2 = round(rnorm(10),1),\n  var_3 = round(rnorm(10),1))\n\n…and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using pivot_wider()/pivot_longer(). 
Use View() to inspect and get a better overview of the results of pivoting.\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b                 V_gene   J_gene n_peptides peptide_n peptide\n   <chr>      <chr>                 <chr>    <chr>       <dbl> <chr>     <chr>  \n 1 eOX49      CASSRTNEQFF           TCRBV28… TCRBJ…          7 peptide_5 TLACFV…\n 2 eHO130     CASSQATGALGYGYTF      TCRBV04… TCRBJ…         10 peptide_8 SIWNLD…\n 3 ePD83      CASSTGQGLGYEQYF       TCRBV19… TCRBJ…          3 peptide_… <NA>   \n 4 eLH47      CASSYPSEGASYNEQFF     TCRBV06… TCRBJ…          1 peptide_2 <NA>   \n 5 eEE228     CSATDLAGVGEQYF        TCRBV20… TCRBJ…          7 peptide_9 <NA>   \n 6 eEE226     CASSYSIHSLLGTGGTGELFF TCRBV06… TCRBJ…          1 peptide_… <NA>   \n 7 eEE228     CASSPRGPSGIQETQYF     TCRBV07… TCRBJ…         11 peptide_1 AFLLFL…\n 8 eEE240     CASSYFGGWGANVLTF      TCRBV11… TCRBJ…          1 peptide_7 <NA>   \n 9 eQD111     CASSEGSGVVQPQHF       TCRBV10… TCRBJ…          1 peptide_3 <NA>   \n10 eQD125     CASRHSEGGVYDNEQFF     TCRBV05… TCRBJ…          1 peptide_9 <NA>   \n\n\n\nQ7: You will see some NAs in the peptide variable, discuss in your group from where these arise?\nQ8: How many rows and columns now and how does this compare with Q3? Discuss why/why not it is different?\nT13: Now, lose the redundant variables n_peptides and peptide_n, get rid of the NAs in the peptide column, and make sure that we only have unique observations (i.e. 
there are no repeated rows/observations).\n\n\n\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 5\n   Experiment CDR3b               V_gene           J_gene     peptide   \n   <chr>      <chr>               <chr>            <chr>      <chr>     \n 1 eOX54      CSATANTGELFF        TCRBV20-X        TCRBJ02-02 AFPFTIYSL \n 2 eOX52      CASSSGLASAYEQYF     TCRBV12-X        TCRBJ02-07 LLDDFVEII \n 3 eEE240     CATSEGSGANVLTF      TCRBV24-01       TCRBJ02-06 FLQSINFVR \n 4 eEE226     CSVEQASYEQYF        TCRBV29-01       TCRBJ02-07 SLIDFYLCFL\n 5 eHH175     CASSQLNTGELFF       TCRBV03-01/03-02 TCRBJ02-02 FLYIIKLIFL\n 6 eEE226     CSAWTSGETQYF        TCRBV20-X        TCRBJ02-05 FYLCFLAFLL\n 7 eEE228     CASSHGPESGLGRNQPQHF TCRBV06-X        TCRBJ01-05 NVFAFPFTI \n 8 eOX43      CASSETGTGYEQYF      TCRBV02-01       TCRBJ02-07 SLIDFYLCFL\n 9 eXL31      RASSLRRGAEQYF       TCRBV07-03       TCRBJ02-07 TLACFVLAAV\n10 eEE226     CASSLLTAEPEAFF      TCRBV07-09       TCRBJ01-01 TFKVSIWNL \n\n\n\nQ8: Now how many rows and columns and is this data tidy? Discuss in your group why/why not?\n\nAgain, we turn to the stringr package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 proteinogenic amino acids, which we symbolise using ARNDCQEGHILKMFPSTWYV.\n\nT14: Use the str_detect() function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate parameter to see what happens\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, try to play a bit around with the function in the console, type e.g. str_detect(string = \"ARND\", pattern = \"A\") and str_detect(string = \"ARND\", pattern = \"C\") and then recall, that the filter() function requires a logical vector, i.e. 
a vector of TRUE and FALSE to filter the rows\n\n\nT15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, we’re working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b            V_gene           J_gene peptide k_CDR3b k_peptide\n   <chr>      <chr>            <chr>            <chr>  <chr>     <int>     <int>\n 1 eXL31      CASSQVTGELFF     TCRBV03-01/03-02 TCRBJ… FLAFLL…      12         9\n 2 eEE224     CASVAAGELFF      TCRBV06-06       TCRBJ… INFVRI…      11         9\n 3 eQD114     CASSWEGAGDTDTQYF TCRBV06-06       TCRBJ… HTTDPS…      16        11\n 4 eHO140     CASSPPDRGNYGYTF  TCRBV18-01       TCRBJ… QSINFV…      15         9\n 5 eXL30      CSVDTGGGTEAFF    TCRBV29-01       TCRBJ… YLCFLA…      13         9\n 6 eHO140     CASSLSLPIGDEQYF  TCRBV27-01       TCRBJ… LWPVTL…      15         9\n 7 eMR12      CASSFENTGELFF    TCRBV28-01       TCRBJ… HTTDPS…      13        11\n 8 eXL30      CASSPGGDTQYF     TCRBV06-05       TCRBJ… NVFAFP…      12         9\n 9 eEE226     CASSPAGTPTF      TCRBV12-03/12-04 TCRBJ… GYINVF…      11        10\n10 eOX52      CASSWGLAGADTQYF  TCRBV07-03       TCRBJ… LYSPIF…      15         9\n\n\n\nT16: Re-create this plot\n\n\n\n\n\n\n\nQ9: What is the most predominant length of the CDR3b-sequences?\nT17: Re-create this plot\n\n\n\n\n\n\n\nQ10: What is the most predominant length of the peptide-sequences?\nQ11: Discuss in your group, if this data set is tidy or not?\n\n\npeptide_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b            V_gene     J_gene     peptide   k_CDR3b k_peptide\n   <chr>      <chr>            <chr>      <chr>      <chr>       <int>     <int>\n 1 eHH175     CASSLSIAGTGGEQYF TCRBV27-01 TCRBJ02-07 MIELSLID…      16       
 10\n 2 eQD125     CASSFRGPQETQYF   TCRBV27-01 TCRBJ02-05 HTTDPSFL…      14        11\n 3 eQD125     CASRLTGTVAYEQYF  TCRBV28-01 TCRBJ02-07 VLHSYFTS…      15        10\n 4 eEE240     CASRRTTDTQYF     TCRBV27-01 TCRBJ02-03 LIDFYLCFL      12         9\n 5 eEE226     CASSNPGQGEETQYF  TCRBV12-X  TCRBJ02-05 LPFNDGVYF      15         9\n 6 eAV93      CASSSGTGGDTEAFF  TCRBV07-09 TCRBJ01-01 RSVASQSII      15         9\n 7 eHO135     CAITPGSTGELFF    TCRBV10-03 TCRBJ02-02 LWPVTLACF      13         9\n 8 eMR16      CASSLGRLAQLNEQFF TCRBV11-02 TCRBJ02-01 FLYLYALV…      16        10\n 9 eXL30      CASRLTGTGELFF    TCRBV06-05 TCRBJ02-02 GYINVFAF…      13        10\n10 eXL27      CSAEREGLYNEQFF   TCRBV20-X  TCRBJ02-01 IELSLIDF…      14        10\n\n\n\n\nCreating one data set from two data sets\nBefore we move onto using the family of *_join() functions you prepared for today, we will just take a quick peek at the meta data again:\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 11\n   Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   \n   <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>\n 1 eHO125     COVID-19-C…    52 M      <NA>  A*02… A*02… B*39… B*44… C*07… C*07…\n 2 eHH169     Healthy (N…    24 F      Blac… A*02… A*74… B*35… B*35… C*04… C*04…\n 3 eHO129     COVID-19-C…    66 F      Asian A*24… A*24… B*15… B*40… C*08… C*15…\n 4 eLH47      COVID-19-C…    35 F      White A*01… A*02… B*07… B*08… C*07… C*07…\n 5 ePD85      Healthy (N…    27 F      <NA>  A*02… A*29… B*07… B*18… C*07… C*15…\n 6 ePD80      COVID-19-C…    67 M      <NA>  A*02… A*66… B*15… B*41… C*03… C*17…\n 7 eJL149     COVID-19-C…    60 F      <NA>  A*02… A*02… B*44… B*50… C*06… C*16…\n 8 eQD109     COVID-19-C…    61 M      <NA>  A*03… A*69… B*07… B*07… C*07… C*07…\n 9 eEE226     Healthy (N…    21 F      White A*01… A*02… B*35… B*39… C*04… C*07…\n10 eJL154     COVID-19-E…    35 F      Nati… A*02… A*29… B*15… B*44… C*04… C*16…\n\n\nRemember you can scroll 
in the data.\n\nQ12: Discuss in your group, if this data with respect to the A1, A2, B1, B2, C1 and C2 variables is a wide or a long data format?\n\nAs with the peptide_data, we will now have to use data pivoting again. I.e.:\n\nT18: use either pivot_wider() or pivot_longer() to create the following data:\n\n\n\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment Cohort                        Age Gender Race         Gene  Allele\n   <chr>      <chr>                       <dbl> <chr>  <chr>        <chr> <chr> \n 1 eDH107     COVID-19-Convalescent          72 F      <NA>         A2    \"A*03…\n 2 eQD117     COVID-19-Convalescent          70 F      <NA>         B1    \"B*35…\n 3 eAV100     COVID-19-Convalescent          29 F      <NA>         C1    \"C*03…\n 4 eHO138     COVID-19-B-Non-Acute           NA <NA>   <NA>         A2    \"\"    \n 5 eJL149     COVID-19-Convalescent          60 F      <NA>         C1    \"C*06…\n 6 eMR25      COVID-19-Convalescent          21 F      <NA>         C1    \"\"    \n 7 eQD113     COVID-19-Convalescent          36 M      <NA>         A1    \"A*03…\n 8 eHH169     Healthy (No known exposure)    24 F      Black or Af… A1    \"A*02…\n 9 eOX43      Healthy (No known exposure)    24 M      White        A1    \"A*02…\n10 eQD127     COVID-19-Convalescent          61 F      <NA>         C1    \"C*02…\n\n\nRemember, what we are aiming for here, is to create one data set from two. So:\n\nQ13: Discuss in your group, which variable(s?) 
define the same observations between the peptide_data and the meta_data?\n\nOnce you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:\n\n\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 2\n   Experiment Allele    \n   <chr>      <chr>     \n 1 eQD108     A*68:01:02\n 2 eHO130     B*08:01   \n 3 ePD82      C*14:03:01\n 4 ePD83      C*03:04   \n 5 eQD116     C*04:01:01\n 6 eQD123     A*02:01:01\n 7 eQD112     C*07:02:01\n 8 eOX43      C*03:04   \n 9 eHO134     C*07:01:01\n10 eLH45      A*02:01:01\n\n\nUse the View() function again, to look at the meta_data. Notice something? Some alleles are e.g. A*11:01, whereas others are B*51:01:02. You can find information on why, by visiting Nomenclature for Factors of the HLA System.\nLong story short, we only want to include Field 1 (allele group) and Field 2 (Specific HLA protein). You have prepared the stringr package for today. See if you can find a way to reduce e.g. B*51:01:02 to B*51:01 and then create a new variable Allele_F_1_2 accordingly, while also removing the ...x (where x is a number) subscripts from the Gene variable (It is an artifact from having the data in a wide format, where you cannot have two variables with the same name) and also, remove any NAs and \"\"s, denoting empty entries.\n\n\n\nClick here for hint\n\n\nThere are several ways this can be achieved, the easiest being to consider if perhaps a part of the string based on indices could be of interest. This term “a part of a string” is called a substring, perhaps the stringr package contains a function to work with substrings? In the console, type stringr:: and hit tab. This will display the functions available in the stringr package. 
Scroll down and find the functions starting with str_ and look for one, which might be relevant and remember you can use ?function_name to get more information on how a given function works.\n\n\n\n\n\nT19: Create the following data, according to specifications above:\n\n\nmeta_data |> \n  sample_n(10)\n\n# A tibble: 10 × 3\n   Experiment Allele     Allele_F_1_2\n   <chr>      <chr>      <chr>       \n 1 eQD128     B*39:01:01 B*39:01     \n 2 eOX46      A*02:01    A*02:01     \n 3 eLH45      C*12:03:01 C*12:03     \n 4 eQD120     A*31:01:02 A*31:01     \n 5 ePD81      B*40:02:01 B*40:02     \n 6 eXL27      C*07:04    C*07:04     \n 7 ePD79      B*07:02:01 B*07:02     \n 8 eDH105     A*24:02:01 A*24:02     \n 9 eAV91      C*05:01    C*05:01     \n10 eEE240     B*40:01    B*40:01     \n\n\nThe asterisk, i.e. * is a rather annoying character because of ambiguity, so:\n\nT20: Clean the data a bit more, by removing the asterisk and redundant variables:\n\n\n\n\n\nmeta_data |> \n  sample_n(size = 10)\n\n# A tibble: 10 × 2\n   Experiment Allele\n   <chr>      <chr> \n 1 eLH43      B44:03\n 2 eJL147     A11:01\n 3 eHH169     A02:01\n 4 eJL154     C16:01\n 5 eQD119     C07:01\n 6 eJL143     C08:02\n 7 eHH169     B35:01\n 8 eHO125     C07:01\n 9 eOX52      A02:01\n10 eLH48      B08:01\n\n\n\n\n\nClick here for hint 1\n\n\nAgain, the stringr package may come in handy. Perhaps there is a function to remove one or more such pesky characters?\n\n\n\n\nClick here for hint 2\n\n\nGetting a weird error? 
Recall, that character ambiguity needs to be “escaped”, you did this somehow earlier on…\n\nRecall the peptide_data?\n\npeptide_data |>\n  sample_n(10)\n\n# A tibble: 10 × 7\n   Experiment CDR3b             V_gene     J_gene     peptide  k_CDR3b k_peptide\n   <chr>      <chr>             <chr>      <chr>      <chr>      <int>     <int>\n 1 eXL30      CASSLEISYEQYF     TCRBV05-01 TCRBJ02-07 VPHVGEI…      13        11\n 2 eOX54      CASSASMSDTQYF     TCRBV09-01 TCRBJ02-03 KLSYGIA…      13         9\n 3 eQD111     CASSELAGADTQYF    TCRBV06-01 TCRBJ02-03 HTTDPSF…      14        11\n 4 eOX49      CSAHFPGQGFGEQFF   TCRBV20-X  TCRBJ02-01 YLCFLAF…      15         9\n 5 eHO128     CASSLQSPSSAGNEQFF TCRBV27-01 TCRBJ02-01 QSINFVR…      17         9\n 6 eOX49      CASSLWGDNEQFF     TCRBV27-01 TCRBJ02-01 FYLCFLA…      13         9\n 7 eEE240     CASSFYSSGGAEGEQFF TCRBV27-01 TCRBJ02-01 LEYHDVR…      17         9\n 8 eEE228     CASSTKGRTNTGELFF  TCRBV27-01 TCRBJ02-02 LIVNSVL…      16        10\n 9 eOX43      CASRGLAGDNSYEQYF  TCRBV25-01 TCRBJ02-07 SLIDFYL…      16        10\n10 eOX52      CASSRGTGSEQYF     TCRBV19-01 TCRBJ02-07 FLQSINF…      13         9\n\n\n\nT21: Create a dplyr pipeline, starting with the peptide_data, which joins it with the meta_data and remember to make sure that you get only unique observations of rows. Save this data into a new variable named peptide_meta_data (If you get a warning, discuss in your group what it means?)\n\n\n\n\n\n\n\nClick here for hint 1\n\n\nWhich family of functions do we use to join data? Also, perhaps here it would be prudent to start with working on a smaller data set, recall we could sample a number of rows yielding a smaller development data set\n\n\n\n\nClick here for hint 2\n\n\nYou should get a data set of around +3.000.000, take a moment to consider how that would have been to work with in Excel? Also, in case the servers are not liking this, you can consider subsetting the peptide_data prior to joining to e.g. 
100,000 or 10,000 rows.\n\n\npeptide_meta_data |>\n  sample_n(10)\n\n# A tibble: 10 × 8\n   Experiment CDR3b             V_gene   J_gene peptide k_CDR3b k_peptide Allele\n   <chr>      <chr>             <chr>    <chr>  <chr>     <int>     <int> <chr> \n 1 eEE240     CASSYGQGTPLHF     TCRBV06… TCRBJ… FPQSAP…      13         9 A02:01\n 2 eOX52      CATSDFSGSNTGELFF  TCRBV24… TCRBJ… LWPVTL…      16         9 B40:01\n 3 eEE224     CASSTQGSGELFF     TCRBV27… TCRBJ… IELSLI…      13        10 C07:04\n 4 eQD124     CASRTGSNQPQHF     TCRBV06… TCRBJ… ASAFFG…      13         9 B51:01\n 5 eOX49      CASSQDSKLTGSYEQYF TCRBV14… TCRBJ… YLYALV…      17         9 A02:01\n 6 eLH47      CASDGAGGYTF       TCRBV07… TCRBJ… WLLWPV…      11         9 C07:02\n 7 eOX52      CASSLVQGAYNEQFF   TCRBV05… TCRBJ… LLFLVL…      15         9 B15:17\n 8 eEE226     CASSLDGTPGNTIYF   TCRBV11… TCRBJ… DGVYFA…      15        10 C07:02\n 9 eXL30      CASNFFPGLDNEQFF   TCRBV02… TCRBJ… APKEII…      15         8 B35:02\n10 eAV93      CASSFTGLSYEQYF    TCRBV05… TCRBJ… VLPFND…      14        10 C04:01\n\n\n\n\n\nAnalysis\nNow, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:\n\nWhat characterises the peptides binding to the HLAs?\nWhat characterises T-cell Receptors binding to the pMHC-complexes?\n\n\nPeptides binding to HLA\nAs we have touched upon multiple times, R is very flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package ggseqlogo (More info here).\n\nT22: Subset the final peptide_meta_data data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo\n\n\n\n\nClick here for hint\n\n\nYou can pipe a vector of peptides into ggseqlogo, but perhaps you first need to pull that vector from the relevant variable in your tibble? 
Also, consider before that, that you’ll need to make sure, you are only looking at peptides of length 9\n\n\n\n\n\n\n\n\n\n\n\nT23: Repeat for e.g. B07:02 or another of your favourite alleles\n\nNow, let’s take a closer look at the sequence logo:\n\nQ14: Which positions in the peptide determine binding to HLA?\n\n\n\n\nClick here for hint\n\n\nRecall your Introduction to Bioinformatics course? And/or perhaps ask your fellow group members if they know?\n\n\n\nCDR3b-sequences binding to pMHC\n\nT24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:\n\n\n\n\n\n\n\n\n\n\n\nQ15: In your group, discuss what you see?\nT25: Play around with other combinations of k_CDR3b, Allele, and peptide and inspect how the logo changes\n\nDisclaimer: In this data set, we only get: A given CDR3b was found to recognise a given peptide in a given subject and that subject had a given haplotype - Something’s missing… Perhaps if you have had immunology, then you can spot it? There is a trick to get around this missing information, but that’s beyond the scope of what we’re working with here."
   },
   {
     "objectID": "lab05.html#epilogue",
@@ -774,7 +774,7 @@
     "href": "primer_on_linear_models_in_r.html#example",
     "title": "Primer on Linear Models in R",
     "section": "Example",
-    "text": "Example\n\n\n\n\nBackground\nLet’s say we wanted to study the genetic mechanism protecting a plant from heat shock, then:\n\nIndependent: Environmental Condition (temperature)\nDependent: Gene Expression Level (related to heat shock protection)\n\nHere, the independent variable is the temperature and the dependent variable is the gene expression level. It is clear, that the temperature, does not rely on the gene expression level, but the gene expression level of heat shock related genes, does rely on the temperature.\nSo, we keep plants under different temperatures and collect samples, from which we can extract RNA and run a transcriptomics analysis uncovering gene expression levels.\n\n\nData\nFor the data here, we are going to simulate the relationship between gene expression levels and temperature, as a function in R:\n\nrun_simulation <- function(temp){\n  measurement_error <- rnorm(n = length(temp), mean = 0, sd = 3)\n  gene_expression_level <- 2 * temp + 3 + measurement_error\n  return( gene_expression_level )\n}\n\nNote, how we’re adding some measurement error to our simulation, otherwise we would get a perfect relationship, which we all know never happens.\nNow, we can easily run simulations:\n\nrun_simulation(temp = c(15, 20, 25, 30, 35))\n\n[1] 26.90906 42.74464 50.93029 69.31215 71.14476\n\n\nLet’s just go ahead and create some data, we can work with. 
For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:\n\nset.seed(806017)\nexperiment_data <- tibble(\n  temperature = seq(from = 5, to = 50, by = 1),\n  gene_expression_level = run_simulation(temp = temperature)\n)\nexperiment_data |> \n  sample_n(10) |> \n  arrange(temperature)\n\n# A tibble: 10 × 2\n   temperature gene_expression_level\n         <dbl>                 <dbl>\n 1          12                  28.7\n 2          13                  30.2\n 3          16                  35.6\n 4          19                  44.0\n 5          22                  48.8\n 6          27                  64.6\n 7          32                  61.1\n 8          33                  73.9\n 9          35                  76.7\n10          40                  84.9\n\n\n\n\nVisualising\nNow, that we have the data, we can visualise the relationship between the temperature- and gene_expression_level-variables:\n\nmy_viz <- experiment_data |> \n  ggplot(aes(x = temperature,\n             y = gene_expression_level)) +\n  geom_point() +\n  geom_vline(xintercept = 0) +\n  geom_hline(yintercept = 0)\nmy_viz\n\n\n\n\n\n\n\n\nNow, we can easily add the best fit line using the geom_smooth()-function, where we specify that we want to use method = \"lm\" and for now, we exclude the confidence interval, by setting se = FALSE:\n\nmy_viz +\n  geom_smooth(method = \"lm\",\n              se = FALSE)\n\n\n\n\n\n\n\n\nWhat happens here, is that a best-fit line is added to the plot by calculating the line, such that the sum of the squared errors is as small as possible, where the error is the distance from the line to a given point. This is a basic linear regression and is known as Ordinary Least Squares (OLS). But what if we want to work with this regression model, beyond just adding a line to a plot?\n\n\nModelling\nOne of the super powers of R is the built-in capability to do modelling. 
Because we simulated the data (see above), we know that the true intercept is 3 and the true slope of the temperature variable is 2. Let’s see what we get, if we run a linear model:\n\nmy_lm_mdl <- lm(formula = gene_expression_level ~ temperature,\n   data = experiment_data)\nmy_lm_mdl\n\n\nCall:\nlm(formula = gene_expression_level ~ temperature, data = experiment_data)\n\nCoefficients:\n(Intercept)  temperature  \n      2.816        2.021  \n\n\nImportant, the formula notation gene_expression_level ~ temperature is central to R and should be read as: “gene_expression_level modelled as a function of temperature”, i.e. gene_expression_level is the dependent variable often denoted y and temperature is the independent variable often denoted x.\nOkay that’s pretty close! Recall the reason for the difference is, that we are adding measurement error, when we run the simulation (see above).\nIn other words our model says, that:\n\\[gene\\_expression\\_level = 2.816 + 2.021 \\cdot temperature\\]\nI.e. 
the estimate of the intercept is 2.816 and the estimate of the slope is 2.021, meaning that when the temperature = 0, we estimate that the gene_expression_level is 2.816 and for each 1 degree increase in temperature, we estimate, that the increase in gene_expression_level is 2.021.\nThese estimates are pretty close to the true model underlying our simulation:\n\\[gene\\_expression\\_level = 3 + 2 \\cdot temperature\\]\nIn general form, such a linear model can be written like so:\n\\[y = \\beta_{0} + \\beta_{1} \\cdot x_{1}\\]\nWhere the \\(\\beta\\)-coefficients are termed estimates, because that is exactly what we do, given the observed data, we estimate their values.\n\n\nWorking with a lm-object:\nThe model format you saw above, is a bit quirky, but luckily, there is a really nice way to get this kind of model object into a more tidy-format:\n\nlibrary(\"broom\")\nmy_lm_mdl |> \n  tidy()\n\n# A tibble: 2 × 5\n  term        estimate std.error statistic  p.value\n  <chr>          <dbl>     <dbl>     <dbl>    <dbl>\n1 (Intercept)     2.82    1.16        2.44 1.89e- 2\n2 temperature     2.02    0.0378     53.4  1.16e-41\n\n\nBriefly, here we see term, estimate, std.error, statistic and p.value. We discussed the term and estimate. The std.error pertains to the estimate and the statistic is used to calculate the p.value.\n\nThe P-value\nNow, because we now have a tidy object, we can simply plug-‘n’-play with other tidyverse tools, so let us visualise the p.value. Note, because of the often very large differences in p.values, we use a -log10-transformation, this means that larger values are “more significant”. 
Below, the dashed line signifies \\(p=0.05\\), so anything above that line is considered “statistically significant”:\n\nmy_lm_mdl |> \n  tidy() |> \n  ggplot(aes(x = term,\n             y = -log10(p.value))) +\n  geom_point() +\n  geom_hline(yintercept = -log10(0.05),\n             linetype = \"dashed\")\n\n\n\n\n\n\n\n\nNow, as mentioned the p-values are computed based on the statistic and are defined as: “The probability of observing a statistic as or more extreme given, that the null-hypothesis is true”. Where the null-hypothesis is that there is no effect, i.e. the estimate for the term is zero.\nFrom this, it is quite clear, that there very likely is a relationship between the gene_expression_level and temperature. In fact, we know there is, because we simulated the data.\n\n\nThe Confidence Intervals\nWe can further easily include the confidence intervals of the estimates:\n\nmy_lm_mdl_tidy <- my_lm_mdl |> \n  tidy(conf.int = TRUE,\n       conf.level = 0.95)\nmy_lm_mdl_tidy\n\n# A tibble: 2 × 7\n  term        estimate std.error statistic  p.value conf.low conf.high\n  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>\n1 (Intercept)     2.82    1.16        2.44 1.89e- 2    0.488      5.14\n2 temperature     2.02    0.0378     53.4  1.16e-41    1.94       2.10\n\n\n…and as before easily do a plug’n’play into ggplot:\n\nmy_lm_mdl_tidy |> \n  ggplot(aes(x = estimate,\n             y = term,\n             xmin = conf.low,\n             xmax = conf.high)) +\n  geom_errorbarh(height = 0.1) +\n  geom_point()\n\n\n\n\n\n\n\n\nNote, what the 0.95 = 95% confidence interval means is that: “If we were to repeat this experiment 100 times, then 95 of the times, the generated confidence interval would contain the true value”.\n\n\n\nSummary\nWhat we have gone through here is a basic linear regression, where we are aiming to model the continuous variable gene_expression_level as a function of yet another continuous variable temperature. 
We simulated data, where the true intercept and slope were 3 and 2 respectively and by fitting a linear regression model, the estimates of the intercept and slope respectively were 2.82 [0.49;5.14] and 2.02 [1.94;2.10].\nLinear models allow us to gain insights into data, by modelling relationships."
+    "text": "Example\n\n\n\n\nBackground\nLet’s say we wanted to study the genetic mechanism protecting a plant from heat shock, then:\n\nIndependent: Environmental Condition (temperature)\nDependent: Gene Expression Level (related to heat shock protection)\n\nHere, the independent variable is the temperature and the dependent variable is the gene expression level. It is clear, that the temperature, does not rely on the gene expression level, but the gene expression level of heat shock related genes, does rely on the temperature.\nSo, we keep plants under different temperatures and collect samples, from which we can extract RNA and run a transcriptomics analysis uncovering gene expression levels.\n\n\nData\nFor the data here, we are going to simulate the relationship between gene expression levels and temperature, as a function in R:\n\nrun_simulation <- function(temp){\n  measurement_error <- rnorm(n = length(temp), mean = 0, sd = 3)\n  gene_expression_level <- 2 * temp + 3 + measurement_error\n  return( gene_expression_level )\n}\n\nNote, how we’re adding some measurement error to our simulation, otherwise we would get a perfect relationship, which we all know never happens.\nNow, we can easily run simulations:\n\nrun_simulation(temp = c(15, 20, 25, 30, 35))\n\n[1] 30.52223 38.13767 54.47297 63.80020 72.71315\n\n\nLet’s just go ahead and create some data, we can work with. 
For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:\n\nset.seed(806017)\nexperiment_data <- tibble(\n  temperature = seq(from = 5, to = 50, by = 1),\n  gene_expression_level = run_simulation(temp = temperature)\n)\nexperiment_data |> \n  sample_n(10) |> \n  arrange(temperature)\n\n# A tibble: 10 × 2\n   temperature gene_expression_level\n         <dbl>                 <dbl>\n 1          12                  28.7\n 2          13                  30.2\n 3          16                  35.6\n 4          19                  44.0\n 5          22                  48.8\n 6          27                  64.6\n 7          32                  61.1\n 8          33                  73.9\n 9          35                  76.7\n10          40                  84.9\n\n\n\n\nVisualising\nNow that we have the data, we can visualise the relationship between the temperature- and gene_expression_level-variables:\n\nmy_viz <- experiment_data |> \n  ggplot(aes(x = temperature,\n             y = gene_expression_level)) +\n  geom_point() +\n  geom_vline(xintercept = 0) +\n  geom_hline(yintercept = 0)\nmy_viz\n\n\n\n\n\n\n\n\nNow, we can easily add the best-fit line using the geom_smooth()-function, where we specify that we want to use method = \"lm\" and, for now, exclude the confidence interval by setting se = FALSE:\n\nmy_viz +\n  geom_smooth(method = \"lm\",\n              se = FALSE)\n\n\n\n\n\n\n\n\nWhat happens here is that a best-fit line is added to the plot by calculating the line such that the sum of the squared errors is as small as possible, where the error is the vertical distance from the line to a given point. This is a basic linear regression and is known as Ordinary Least Squares (OLS). But what if we want to work with this regression model, beyond just adding a line to a plot?\n\n\nModelling\nOne of the superpowers of R is the built-in capability to do modelling. 
Because we simulated the data (see above), we know that the true intercept is 3 and the true slope of the temperature variable is 2. Let's see what we get if we run a linear model:\n\nmy_lm_mdl <- lm(formula = gene_expression_level ~ temperature,\n   data = experiment_data)\nmy_lm_mdl\n\n\nCall:\nlm(formula = gene_expression_level ~ temperature, data = experiment_data)\n\nCoefficients:\n(Intercept)  temperature  \n      2.816        2.021  \n\n\nImportantly, the formula notation gene_expression_level ~ temperature is central to R and should be read as: “gene_expression_level modelled as a function of temperature”, i.e. gene_expression_level is the dependent variable, often denoted y, and temperature is the independent variable, often denoted x.\nOkay, that’s pretty close! Recall that the reason for the difference is that we add measurement error when we run the simulation (see above).\nIn other words, our model says that:\n\\[gene\\_expression\\_level = 2.816 + 2.021 \\cdot temperature\\]\nI.e. 
the estimate of the intercept is 2.816 and the estimate of the slope is 2.021, meaning that when the temperature = 0, we estimate that the gene_expression_level is 2.816 and for each 1 degree increase in temperature, we estimate that the increase in gene_expression_level is 2.021.\nThese estimates are pretty close to the true model underlying our simulation:\n\\[gene\\_expression\\_level = 3 + 2 \\cdot temperature\\]\nIn general form, such a linear model can be written like so:\n\\[y = \\beta_{0} + \\beta_{1} \\cdot x_{1}\\]\nWhere the \\(\\beta\\)-coefficients are termed estimates, because that is exactly what we do: given the observed data, we estimate their values.\n\n\nWorking with a lm-object:\nThe model format you saw above is a bit quirky, but luckily, there is a really nice way to get this kind of model object into a tidier format:\n\nlibrary(\"broom\")\nmy_lm_mdl |> \n  tidy()\n\n# A tibble: 2 × 5\n  term        estimate std.error statistic  p.value\n  <chr>          <dbl>     <dbl>     <dbl>    <dbl>\n1 (Intercept)     2.82    1.16        2.44 1.89e- 2\n2 temperature     2.02    0.0378     53.4  1.16e-41\n\n\nBriefly, here we see term, estimate, std.error, statistic and p.value. We discussed the term and estimate. The std.error pertains to the estimate and the statistic is used to calculate the p.value.\n\nThe P-value\nBecause we now have a tidy object, we can simply plug-‘n’-play with other tidyverse tools, so let us visualise the p.value. Note, because of the often very large differences in p.values, we use a -log10-transformation; this means that larger values are “more significant”. 
Below, the dashed line signifies \\(p=0.05\\), so anything above that line is considered “statistically significant”:\n\nmy_lm_mdl |> \n  tidy() |> \n  ggplot(aes(x = term,\n             y = -log10(p.value))) +\n  geom_point() +\n  geom_hline(yintercept = -log10(0.05),\n             linetype = \"dashed\")\n\n\n\n\n\n\n\n\nNow, as mentioned, the p-values are computed based on the statistic and are defined as: “The probability of observing a statistic as extreme or more extreme, given that the null hypothesis is true”, where the null hypothesis is that there is no effect, i.e. the true coefficient for the term is zero.\nFrom this, it is quite clear that there very likely is a relationship between the gene_expression_level and temperature. In fact, we know there is, because we simulated the data.\n\n\nThe Confidence Intervals\nWe can further easily include the confidence intervals of the estimates:\n\nmy_lm_mdl_tidy <- my_lm_mdl |> \n  tidy(conf.int = TRUE,\n       conf.level = 0.95)\nmy_lm_mdl_tidy\n\n# A tibble: 2 × 7\n  term        estimate std.error statistic  p.value conf.low conf.high\n  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>\n1 (Intercept)     2.82    1.16        2.44 1.89e- 2    0.488      5.14\n2 temperature     2.02    0.0378     53.4  1.16e-41    1.94       2.10\n\n\n…and as before easily do a plug’n’play into ggplot:\n\nmy_lm_mdl_tidy |> \n  ggplot(aes(x = estimate,\n             y = term,\n             xmin = conf.low,\n             xmax = conf.high)) +\n  geom_errorbarh(height = 0.1) +\n  geom_point()\n\n\n\n\n\n\n\n\nNote that what a 0.95 = 95% confidence interval means is: “If we were to repeat this experiment 100 times, then about 95 of the generated confidence intervals would contain the true value”.\n\n\n\nSummary\nWhat we have gone through here is a basic linear regression, where we are aiming to model the continuous variable gene_expression_level as a function of yet another continuous variable temperature. 
We simulated data, where the true intercept and slope were 3 and 2 respectively and by fitting a linear regression model, the estimates of the intercept and slope respectively were 2.82 [0.49;5.14] and 2.02 [1.94;2.10].\nLinear models allow us to gain insights into data, by modelling relationships."
   },
   {
     "objectID": "primer_on_r_packages.html",
diff --git a/lab02.qmd b/lab02.qmd
index 62d49e3..2024777 100644
--- a/lab02.qmd
+++ b/lab02.qmd
@@ -6,7 +6,7 @@
 
 ## Schedule
 
-- 08.00 - 08.15: [Pre-course Survey Walk-through](https://raw.githack.com/r4bds/r4bds.github.io/main/pre_course_questionnaire_summary.html)
+- 08.00 - 08.15: [Pre-course Anonymous Questionnaire Walk-through](https://raw.githack.com/r4bds/r4bds.github.io/main/pre_course_questionnaire_summary.html)
 - 08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session)
 - 08.30 - 09.00: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab02.html)
 - 09.00 - 09.15: Break
diff --git a/pre_course_questionnaire_summary.html b/pre_course_questionnaire_summary.html
index 87bb73d..a263448 100644
--- a/pre_course_questionnaire_summary.html
+++ b/pre_course_questionnaire_summary.html
@@ -1174,178 +1174,121 @@ <h1>Is there any specific area of bio-research you would like to see covered?</h
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="general-bioinformatics-data-analysis" class="slide level2">
-<h2>General Bioinformatics &amp; Data Analysis:</h2>
-<ul>
-<li>General interest in bioinformatics.</li>
-<li>Comfortable handling different types of biological data in R.</li>
-<li>How to work with data and make it easier to analyze.</li>
-<li>Visualizations of big data.</li>
-</ul>
+<section id="data-analysis-machine-learning" class="slide level2">
+<h2>Data Analysis &amp; Machine Learning</h2>
+<p><em>“I’m interested in more advanced statistics, mapping, and machine learning techniques applied to biological data, especially handling large datasets like RNA-seq and proteomics.”</em></p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="genetics-genomics" class="slide level2">
-<h2>Genetics &amp; Genomics:</h2>
-<ul>
-<li>Genetics, Genomics, and Evolution.</li>
-<li>Transcriptomics, Metagenomics, Genomic data analysis, and RNA-seq.</li>
-<li>Single cell omics, Bulk RNA-seq data manipulation.</li>
-<li>Gene expression data analysis in R.</li>
-</ul>
+<section id="genetics-and-genomics" class="slide level2">
+<h2>Genetics and Genomics</h2>
+<p><em>“I would like to see topics related to gene-based disease discovery, genome sequencing, and CRISPR applications in bio-research.”</em></p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="disease-medical-research" class="slide level2">
-<h2>Disease &amp; Medical Research:</h2>
-<ul>
-<li>Personalized medicine and precision medicine.</li>
-<li>Research related to specific diseases: obesity, cancer, autoimmune diseases, infectious diseases.</li>
-<li>Drug design, clinical drug trials, and drug trial data analysis.</li>
-<li>Analysis of complex cancer patient data.</li>
-</ul>
+<section id="immunology" class="slide level2">
+<h2>Immunology</h2>
+<p><em>“Immunology research, particularly related to autoimmune diseases and tumor immunology, would be valuable to explore.”</em></p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="immunology-microbiology" class="slide level2">
-<h2>Immunology &amp; Microbiology:</h2>
-<ul>
-<li>Immunology, especially related to MHC, neoantigens, antibodies, and antigens.</li>
-<li>Clinical research in R related to immunology.</li>
-<li>Gut microbiome, Microbiome studies on cancer, and Microbiologic studies.</li>
-<li>Immune response bio-research and immune system or stem cells.</li>
-</ul>
+<section id="multi-omics-omics-data" class="slide level2">
+<h2>Multi-omics &amp; Omics Data</h2>
+<p><em>“Multi-omics approaches, including proteomics, genomics, and the analysis of RNA-seq data, are areas I’m very interested in.”</em></p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="advanced-computational-techniques" class="slide level2">
-<h2>Advanced Computational Techniques:</h2>
-<ul>
-<li>Predictive modeling and visualizations of complex networks/pathways.</li>
-<li>Deep learning in R and Artificial Intelligence.</li>
-<li>Analysis of peptide sequencing via mass spec (TD-search and de novo).</li>
-</ul>
+<section id="epidemiology-clinical-data" class="slide level2">
+<h2>Epidemiology &amp; Clinical Data</h2>
+<p><em>“It would be helpful to cover epidemiology topics, including predictive modeling of disease outbreaks and the analysis of clinical datasets.”</em></p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-</section>
-<section id="specific-bio-research-topics" class="slide level2">
-<h2>Specific Bio-research Topics:</h2>
-<ul>
-<li>Food-related research.</li>
-<li>Ecology.</li>
-<li>Plastic degradation by microorganisms or enzymes.</li>
-<li>CO2 capture by microorganisms.</li>
-<li>Mass screening and patient profiling, especially for cancer.</li>
-<li>LC-MS data, peptide prediction from proteins.</li>
-</ul>
+</section></section>
+<section>
+<section id="briefly-what-are-your-general-expectations-to-this-course" class="title-slide slide level1 center">
+<h1>Briefly, what are your general expectations to this course?</h1>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="others" class="slide level2">
-<h2>Others:</h2>
-<ul>
-<li>Some students have expressed that they are open to any topic or aren’t particularly focused on a specific area.</li>
-<li>A few are excited about the course in general and don’t have specific preferences.</li>
-<li>There’s interest in the integration of bioinformatics with scientific articles and hospital data.</li>
-</ul>
+<section id="r-programming-proficiency" class="slide level2">
+<h2>R Programming Proficiency</h2>
+<p>“I expect to become proficient in R programming, gaining confidence in writing, interpreting, and using R code in a variety of bio-research contexts. I aim to improve my R skills for future work in biological research and industry.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-</section></section>
-<section>
-<section id="briefly-what-are-your-general-expectations-to-this-course" class="title-slide slide level1 center">
-<h1>Briefly, what are your general expectations to this course?</h1>
+</section>
+<section id="data-handling-and-analysis" class="slide level2">
+<h2>Data Handling and Analysis</h2>
+<p>“My goal is to learn how to effectively handle and analyze large datasets, including cleaning, organizing, and applying statistical methods to biological data. I hope to gain the ability to manage big data and automate data analysis tasks in R.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="r-proficiency" class="slide level2">
-<h2>R Proficiency:</h2>
-<ul>
-<li>Many students wish to gain or improve proficiency in using R for data analysis.</li>
-<li>There’s an emphasis on understanding the R environment, syntax, and packages.</li>
-<li>Some students are already familiar with R and wish to polish and expand their skills, while others are complete beginners hoping to grasp the basics.</li>
-</ul>
+<section id="practical-applications-in-biology" class="slide level2">
+<h2>Practical Applications in Biology</h2>
+<p>“I want to apply the R programming skills learned in this course to real-world biological problems, such as genomic data analysis, bioinformatics, and pipeline development. Understanding how to use R in practical bio-research scenarios is a key expectation.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="bioinformatics-and-biology-application" class="slide level2">
-<h2>Bioinformatics and Biology Application:</h2>
-<ul>
-<li>Many students want to learn how to apply R specifically to biological and bioinformatic datasets.</li>
-<li>They expect to work with real-life bio data and learn how to handle and analyze data relevant to their bio studies.</li>
-</ul>
+<section id="visualization-data-presentation" class="slide level2">
+<h2>Visualization &amp; Data Presentation</h2>
+<p>“I hope to develop skills in visualizing and presenting biological data in a clear and effective way. Learning how to create plots and presentations that make large datasets more accessible is a critical aspect I expect to master.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="data-visualization-manipulation" class="slide level2">
-<h2>Data Visualization &amp; Manipulation:</h2>
-<ul>
-<li>Students are keen to learn about data visualization and manipulation in R.</li>
-<li>They are interested in using R for creating plots, visualizations, and handling various data types.</li>
-</ul>
+<section id="confidence-and-efficiency-in-using-r" class="slide level2">
+<h2>Confidence and Efficiency in Using R</h2>
+<p>“I aim to feel more confident and efficient in using R for bio-data analysis by the end of this course. I hope to reduce the time spent on coding, improve the readability of my code, and tackle intermediate challenges independently.”</p>
+<!--# ---------------------------------------------------------------------- -->
+<!--# SLIDE ---------------------------------------------------------------- -->
+<!--# ---------------------------------------------------------------------- -->
+</section></section>
+<section>
+<section id="is-there-anything-you-would-like-to-add-comments-suggestions-anything" class="title-slide slide level1 center">
+<h1>Is there anything you would like to add? Comments, suggestions, anything?</h1>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="practical-skills-for-future-application" class="slide level2">
-<h2>Practical Skills for Future Application:</h2>
-<ul>
-<li>Several students hope the course will prepare them for future projects, research, or roles that require data analysis.</li>
-<li>Some students are interested in using the skills they gain in this course for their thesis or future studies.</li>
-<li>A few want to be able to transfer the knowledge they gain to other programming languages, like Python.</li>
-</ul>
+<section id="course-pace-structure" class="slide level2">
+<h2>Course Pace &amp; Structure</h2>
+<p>“No rushing through the learning material would greatly benefit my understanding. A slower, more deliberate pace will help those of us who are new to programming.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="data-analysis-techniques" class="slide level2">
-<h2>Data Analysis Techniques:</h2>
-<ul>
-<li>Students wish to learn about different data analysis methods, including statistics, RNA sequencing data processing, etc.</li>
-<li>They hope to understand how to organize, clean, and interpret data.</li>
-</ul>
+<section id="learning-resources" class="slide level2">
+<h2>Learning Resources</h2>
+<p>“It would be really helpful to have additional learning resources, such as recommended books or websites, to support learning outside of class. Pointers for getting extra help would be appreciated.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="tool-and-package-familiarity" class="slide level2">
-<h2>Tool and Package Familiarity:</h2>
-<ul>
-<li>Some students want to familiarize themselves with specific R packages, like tidyverse.</li>
-<li>There’s an interest in learning how to utilize GitHub for shared programming.</li>
-<li>Several mention wanting to know how to use specific tools for data analysis, such as data wrangling and lab notebook style coding.</li>
-</ul>
+<section id="industry-applications" class="slide level2">
+<h2>Industry Applications</h2>
+<p>“Including company talks with content related to protein engineering and data handling would be valuable. It would be great to see how skills from the course can be applied to real industry problems, particularly in the context of protein data.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="learning-environment" class="slide level2">
-<h2>Learning Environment:</h2>
-<ul>
-<li>A few students mentioned hoping for a structured or gradual introduction, especially for those without prior knowledge.</li>
-<li>Some have heard from past students and have expectations based on word-of-mouth.</li>
-<li>A couple of students are concerned about the timing and structure of the exam.</li>
-</ul>
+<section id="community-atmosphere" class="slide level2">
+<h2>Community &amp; Atmosphere</h2>
+<p>“Looking forward to the course and excited about the learning experience! There’s a general sense of anticipation and eagerness to start the class.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
 </section>
-<section id="miscellaneous" class="slide level2">
-<h2>Miscellaneous:</h2>
-<ul>
-<li>There are mentions of topics like the application of R in various biological datasets, multiomics, and genetics.</li>
-<li>Some are looking forward to learning how to plan scientific studies or adjust chosen methods.</li>
-<li>A few students don’t have specific expectations, while others hope for a challenging but rewarding experience.</li>
-</ul>
+<section id="individual-programming-projects" class="slide level2">
+<h2>Individual Programming Projects</h2>
+<p>“I’d love to focus more on coding custom functions and understanding how they can be applied in different contexts, beyond just the basics.”</p>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
@@ -1385,8 +1328,8 @@ <h2>This course</h2>
 <h2>This course - In other words</h2>
 <ul>
 <li>Creates the foundation for you to explore the multitude of bioinformatics subjects</li>
-<li>Gives you concrete skills to handle (almost) any kind of bio data</li>
-<li>Trains your collaborative and communicative meta skills</li>
+<li>Gives you concrete tool skills to handle (almost) any kind of bio data and to do collaborative coding projects</li>
+<li>Trains your general collaborative and communicative meta skills</li>
 </ul>
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
diff --git a/pre_course_questionnaire_summary.qmd b/pre_course_questionnaire_summary.qmd
index ac988d3..aa4e338 100644
--- a/pre_course_questionnaire_summary.qmd
+++ b/pre_course_questionnaire_summary.qmd
@@ -91,180 +91,140 @@ d_pca_obj_aug |>
 
 
 
+
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## General Bioinformatics & Data Analysis:
-
-- General interest in bioinformatics.
-- Comfortable handling different types of biological data in R.
-- How to work with data and make it easier to analyze.
-- Visualizations of big data.
+## Data Analysis & Machine Learning
+_"I'm interested in more advanced statistics, mapping, and machine learning techniques applied to biological data, especially handling large datasets like RNA-seq and proteomics."_
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Genetics & Genomics:
-
-- Genetics, Genomics, and Evolution.
-- Transcriptomics, Metagenomics, Genomic data analysis, and RNA-seq.
-- Single cell omics, Bulk RNA-seq data manipulation.
-- Gene expression data analysis in R.
+## Genetics and Genomics
+_"I would like to see topics related to gene-based disease discovery, genome sequencing, and CRISPR applications in bio-research."_
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Disease & Medical Research:
+## Immunology
+_"Immunology research, particularly related to autoimmune diseases and tumor immunology, would be valuable to explore."_
 
-- Personalized medicine and precision medicine.
-- Research related to specific diseases: obesity, cancer, autoimmune diseases, infectious diseases.
-- Drug design, clinical drug trials, and drug trial data analysis.
-- Analysis of complex cancer patient data.
 
 
+<!--# ---------------------------------------------------------------------- -->
+<!--# SLIDE ---------------------------------------------------------------- -->
+<!--# ---------------------------------------------------------------------- -->
+## Multi-omics & Omics Data
+_"Multi-omics approaches, including proteomics, genomics, and the analysis of RNA-seq data, are areas I'm very interested in."_
+
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Immunology & Microbiology:
-
-- Immunology, especially related to MHC, neoantigens, antibodies, and antigens.
-- Clinical research in R related to immunology.
-- Gut microbiome, Microbiome studies on cancer, and Microbiologic studies.
-- Immune response bio-research and immune system or stem cells.
+## Epidemiology & Clinical Data
+_"It would be helpful to cover epidemiology topics, including predictive modeling of disease outbreaks and the analysis of clinical datasets."_
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Advanced Computational Techniques:
+# Briefly, what are your general expectations to this course?
+
 
-- Predictive modeling and visualizations of complex networks/pathways.
-- Deep learning in R and Artificial Intelligence.
-- Analysis of peptide sequencing via mass spec (TD-search and de novo).
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Specific Bio-research Topics:
-- Food-related research.
-- Ecology.
-- Plastic degradation by microorganisms or enzymes.
-- CO2 capture by microorganisms.
-- Mass screening and patient profiling, especially for cancer.
-- LC-MS data, peptide prediction from proteins.
+## R Programming Proficiency
+"I expect to become proficient in R programming, gaining confidence in writing, interpreting, and using R code in a variety of bio-research contexts. I aim to improve my R skills for future work in biological research and industry."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Others:
-
-- Some students have expressed that they are open to any topic or aren't particularly focused on a specific area.
-- A few are excited about the course in general and don't have specific preferences.
-- There's interest in the integration of bioinformatics with scientific articles and hospital data.
+## Data Handling and Analysis
+"My goal is to learn how to effectively handle and analyze large datasets, including cleaning, organizing, and applying statistical methods to biological data. I hope to gain the ability to manage big data and automate data analysis tasks in R."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-# Briefly, what are your general expectations to this course?
+## Practical Applications in Biology
+"I want to apply the R programming skills learned in this course to real-world biological problems, such as genomic data analysis, bioinformatics, and pipeline development. Understanding how to use R in practical bio-research scenarios is a key expectation."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## R Proficiency:
-
-- Many students wish to gain or improve proficiency in using R for data analysis.
-- There's an emphasis on understanding the R environment, syntax, and packages.
-- Some students are already familiar with R and wish to polish and expand their skills, while others are complete beginners hoping to grasp the basics.
+## Visualization & Data Presentation
+"I hope to develop skills in visualizing and presenting biological data in a clear and effective way. Learning how to create plots and presentations that make large datasets more accessible is a critical aspect I expect to master."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Bioinformatics and Biology Application:
-
-- Many students want to learn how to apply R specifically to biological and bioinformatic datasets.
-- They expect to work with real-life bio data and learn how to handle and analyze data relevant to their bio studies.
-
+## Confidence and Efficiency in Using R
+"I aim to feel more confident and efficient in using R for bio-data analysis by the end of this course. I hope to reduce the time spent on coding, improve the readability of my code, and tackle intermediate challenges independently."
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Data Visualization & Manipulation:
+# Is there anything you would like to add? Comments, suggestions, anything?
 
-- Students are keen to learn about data visualization and manipulation in R.
-- They are interested in using R for creating plots, visualizations, and handling various data types.
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Practical Skills for Future Application:
-
-- Several students hope the course will prepare them for future projects, research, or roles that require data analysis.
-- Some students are interested in using the skills they gain in this course for their thesis or future studies.
-- A few want to be able to transfer the knowledge they gain to other programming languages, like Python.
+## Course Pace & Structure
+"Not rushing through the learning material would greatly benefit my understanding. A slower, more deliberate pace will help those of us who are new to programming."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Data Analysis Techniques:
-
-- Students wish to learn about different data analysis methods, including statistics, RNA sequencing data processing, etc.
-- They hope to understand how to organize, clean, and interpret data.
+## Learning Resources
+"It would be really helpful to have additional learning resources, such as recommended books or websites, to support learning outside of class. Pointers for getting extra help would be appreciated."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Tool and Package Familiarity:
-
-- Some students want to familiarize themselves with specific R packages, like tidyverse.
-- There's an interest in learning how to utilize GitHub for shared programming.
-- Several mention wanting to know how to use specific tools for data analysis, such as data wrangling and lab notebook style coding.
+## Industry Applications
+"Including company talks with content related to protein engineering and data handling would be valuable. It would be great to see how skills from the course can be applied to real industry problems, particularly in the context of protein data."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Learning Environment:
-
-- A few students mentioned hoping for a structured or gradual introduction, especially for those without prior knowledge.
-- Some have heard from past students and have expectations based on word-of-mouth.
-- A couple of students are concerned about the timing and structure of the exam.
+## Community & Atmosphere
+"Looking forward to the course and excited about the learning experience! There's a general sense of anticipation and eagerness to start the class."
 
 
 
 <!--# ---------------------------------------------------------------------- -->
 <!--# SLIDE ---------------------------------------------------------------- -->
 <!--# ---------------------------------------------------------------------- -->
-## Miscellaneous:
-
-- There are mentions of topics like the application of R in various biological datasets, multiomics, and genetics.
-- Some are looking forward to learning how to plan scientific studies or adjust chosen methods.
-- A few students don't have specific expectations, while others hope for a challenging but rewarding experience.
-
+## Individual Programming Projects
+"I'd love to focus more on coding custom functions and understanding how they can be applied in different contexts, beyond just the basics."
 
 
 <!--# ---------------------------------------------------------------------- -->
@@ -304,8 +264,8 @@ d_pca_obj_aug |>
 ## This course - In other words
 
 - Creates the foundation for you to explore the multitude of bioinformatics subjects
-- Gives you concrete skills to handle (almost) any kind of bio data
-- Trains your collaborative and communicative meta skills
+- Gives you concrete, hands-on skills for handling (almost) any kind of bio data and for working on collaborative coding projects
+- Trains your general collaborative and communicative meta skills