diff --git a/docs/lab02.html b/docs/lab02.html
index 3d58ef6..4738de2 100644
--- a/docs/lab02.html
+++ b/docs/lab02.html
@@ -378,7 +378,7 @@

Package(s)

Schedule

@@ -1255,16 +1255,16 @@

C
# A tibble: 10 × 11
    Experiment Cohort        Age Gender Race  A1    A2    B1    B2    C1    C2   
    <chr>      <chr>       <dbl> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
- 1 eHO130     Healthy (N…    28 F      White A*02… A*03… B*07… B*08… C*07… C*07…
- 2 eLH48      COVID-19-C…    28 M      White A*03… A*24… B*08… B*14… C*03… C*08…
- 3 eOX43      Healthy (N…    24 M      White A*02… A*03… B*27… B*40… C*03… C*07…
- 4 ePD76      Healthy (N…    33 M      White A*02… A*03… B*35… B*40… C*03… C*03…
- 5 eQD115     COVID-19-C…    48 M      <NA>  A*02… A*03… B*07… B*44… C*05… C*07…
- 6 eAV100     COVID-19-C…    29 F      <NA>  A*02… A*68… B*07… B*40… C*03… C*07…
- 7 eQD109     COVID-19-C…    61 M      <NA>  A*03… A*69… B*07… B*07… C*07… C*07…
- 8 eLH59      COVID-19-C…    NA <NA>   <NA>  A*01… A*02… B*40… B*52… C*03… C*16…
- 9 eMR15      COVID-19-C…    NA <NA>   <NA>  A*03… A*32… B*07… B*07… C*07… C*07…
-10 eAM23      COVID-19-C…    48 M      <NA>  A*11… A*24… B*15… B*52… C*04… C*12…
+ 1 eHO125     COVID-19-C…    52 M      <NA>  A*02… A*02… B*39… B*44… C*07… C*07…
+ 2 eHH169     Healthy (N…    24 F      Blac… A*02… A*74… B*35… B*35… C*04… C*04…
+ 3 eHO129     COVID-19-C…    66 F      Asian A*24… A*24… B*15… B*40… C*08… C*15…
+ 4 eLH47      COVID-19-C…    35 F      White A*01… A*02… B*07… B*08… C*07… C*07…
+ 5 ePD85      Healthy (N…    27 F      <NA>  A*02… A*29… B*07… B*18… C*07… C*15…
+ 6 ePD80      COVID-19-C…    67 M      <NA>  A*02… A*66… B*15… B*41… C*03… C*17…
+ 7 eJL149     COVID-19-C…    60 F      <NA>  A*02… A*02… B*44… B*50… C*06… C*16…
+ 8 eQD109     COVID-19-C…    61 M      <NA>  A*03… A*69… B*07… B*07… C*07… C*07…
+ 9 eEE226     Healthy (N…    21 F      White A*01… A*02… B*35… B*39… C*04… C*07…
+10 eJL154     COVID-19-E…    35 F      Nati… A*02… A*29… B*15… B*44… C*04… C*16…

Remember you can scroll in the data.

@@ -1285,16 +1285,16 @@

C
# A tibble: 10 × 7
    Experiment Cohort                        Age Gender Race         Gene  Allele
    <chr>      <chr>                       <dbl> <chr>  <chr>        <chr> <chr> 
- 1 eLH47      COVID-19-Convalescent          35 F      White        A2    "A*02…
- 2 eJL160     COVID-19-Acute                 52 F      African Ame… B2    "B*81…
- 3 eAV100     COVID-19-Convalescent          29 F      <NA>         C2    "C*07…
- 4 eLH51      COVID-19-Convalescent          55 M      Asian        A1    "A*24…
- 5 eMR17      COVID-19-Convalescent          NA <NA>   <NA>         B2    "B*57…
- 6 eQD121     COVID-19-Convalescent          38 M      <NA>         C2    "C*07…
- 7 eNL192     COVID-19-Convalescent          NA <NA>   <NA>         C1    ""    
- 8 eMR23      COVID-19-Convalescent          22 F      <NA>         A1    ""    
- 9 eOX43      Healthy (No known exposure)    24 M      White        B1    "B*27…
-10 eLH42      COVID-19-Convalescent          63 M      <NA>         B1    "B*07…
+ 1 eDH107     COVID-19-Convalescent          72 F      <NA>         A2    "A*03…
+ 2 eQD117     COVID-19-Convalescent          70 F      <NA>         B1    "B*35…
+ 3 eAV100     COVID-19-Convalescent          29 F      <NA>         C1    "C*03…
+ 4 eHO138     COVID-19-B-Non-Acute           NA <NA>   <NA>         A2    ""
+ 5 eJL149     COVID-19-Convalescent          60 F      <NA>         C1    "C*06…
+ 6 eMR25      COVID-19-Convalescent          21 F      <NA>         C1    ""
+ 7 eQD113     COVID-19-Convalescent          36 M      <NA>         A1    "A*03…
+ 8 eHH169     Healthy (No known exposure)    24 F      Black or Af… A1    "A*02…
+ 9 eOX43      Healthy (No known exposure)    24 M      White        A1    "A*02…
+10 eQD127     COVID-19-Convalescent          61 F      <NA>         C1    "C*02…

Remember, what we are aiming for here is to create one data set from two. So:

@@ -1310,18 +1310,18 @@

C sample_n(10)
# A tibble: 10 × 2
-   Experiment Allele      
-   <chr>      <chr>       
- 1 eJL157     "C*07:01:01"
- 2 eHO135     "B*07:02:01"
- 3 eHO141     ""          
- 4 eJL153     "A*03:01:01"
- 5 eMR20      "C*07:02:01"
- 6 eJL154     "B*15:02:01"
- 7 ePD82      "A*26:02:01"
- 8 eQD111     "A*01:01:01"
- 9 eMR22      "C*07:18:01"
-10 eJL146     "A*02:01"   
+   Experiment Allele
+   <chr>      <chr>
+ 1 eQD108     A*68:01:02
+ 2 eHO130     B*08:01
+ 3 ePD82      C*14:03:01
+ 4 ePD83      C*03:04
+ 5 eQD116     C*04:01:01
+ 6 eQD123     A*02:01:01
+ 7 eQD112     C*07:02:01
+ 8 eOX43      C*03:04
+ 9 eHO134     C*07:01:01
+10 eLH45      A*02:01:01

Use the View() function again to look at the meta_data. Notice something? Some alleles have two fields, e.g. A*11:01, whereas others have three, e.g. B*51:01:02. You can find out why by visiting Nomenclature for Factors of the HLA System.
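The mixed resolution noted above can be harmonised by keeping only the gene letter and the first two colon-separated fields. The following is a minimal sketch, assuming the stringr package is installed; the vector `alleles` and the regular expression are illustrative and are not taken from the lab's own solution.

```r
library(stringr)

# Illustrative allele strings: two-field, three-field, and an empty entry
alleles <- c("A*11:01", "B*51:01:02", "C*07:02:01", "")

# Keep the gene letter, the literal asterisk, and the first two numeric fields.
# Strings that do not match the pattern (here the empty string) become NA.
allele_f_1_2 <- str_extract(alleles, "^[A-Z]\\*\\d+:\\d+")

allele_f_1_2
```

Inside a pipeline, the same call could sit in a `mutate()` to produce a column such as the `Allele_F_1_2` shown in the following hunk.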

@@ -1347,16 +1347,16 @@

C
# A tibble: 10 × 3
    Experiment Allele     Allele_F_1_2
    <chr>      <chr>      <chr>       
- 1 eJL157     C*07:02:01 C*07:02     
- 2 eQD118     C*03:04:01 C*03:04     
- 3 eMR13      C*07:01:01 C*07:01     
- 4 eLH54      B*40:02:01 B*40:02     
- 5 ePD82      C*08:01:01 C*08:01     
- 6 eQD121     B*57:01:01 B*57:01     
- 7 eAV88      C*07:04    C*07:04     
- 8 eDH105     B*40:01:02 B*40:01     
- 9 ePD86      C*14:02:01 C*14:02     
-10 eOX46      C*04:01    C*04:01     
+ 1 eQD128     B*39:01:01 B*39:01
+ 2 eOX46      A*02:01    A*02:01
+ 3 eLH45      C*12:03:01 C*12:03
+ 4 eQD120     A*31:01:02 A*31:01
+ 5 ePD81      B*40:02:01 B*40:02
+ 6 eXL27      C*07:04    C*07:04
+ 7 ePD79      B*07:02:01 B*07:02
+ 8 eDH105     A*24:02:01 A*24:02
+ 9 eAV91      C*05:01    C*05:01
+10 eEE240     B*40:01    B*40:01

The asterisk, i.e. * is a rather annoying character because of ambiguity, so:
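The ambiguity mentioned above arises because `*` is a quantifier in regular expressions, so it has to be escaped or matched literally. A minimal sketch of removing it, assuming the stringr package is installed; the vector `alleles` is illustrative only:

```r
library(stringr)

# Illustrative two-field alleles
alleles <- c("A*02:01", "B*44:03", "C*07:01")

# fixed() matches the asterisk literally instead of as a regex quantifier
alleles_clean <- str_remove(alleles, fixed("*"))

# Equivalent alternative: escape the asterisk in a regular expression
alleles_clean_regex <- str_remove(alleles, "\\*")

alleles_clean
```

Both variants give the same result; `fixed()` avoids having to remember the double backslash.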

@@ -1373,16 +1373,16 @@

C
# A tibble: 10 × 2
    Experiment Allele
    <chr>      <chr> 
- 1 eDH96      A02:01
- 2 eQD127     C02:02
- 3 eJL162     B55:01
- 4 eLH51      C12:04
- 5 eHO133     A32:01
- 6 eJL157     B18:01
- 7 eQD123     C07:02
- 8 eLH59      A02:01
- 9 eXL32      A01:01
-10 eLH45      C12:03
+ 1 eLH43      B44:03
+ 2 eJL147     A11:01
+ 3 eHH169     A02:01
+ 4 eJL154     C16:01
+ 5 eQD119     C07:01
+ 6 eJL143     C08:02
+ 7 eHH169     B35:01
+ 8 eHO125     C07:01
+ 9 eOX52      A02:01
+10 eLH48      B08:01
@@ -1407,18 +1407,18 @@

C sample_n(10)
# A tibble: 10 × 7
-   Experiment CDR3b                  V_gene     J_gene peptide k_CDR3b k_peptide
-   <chr>      <chr>                  <chr>      <chr>  <chr>     <int>     <int>
- 1 eEE240     CASSQRSNTGELFF         TCRBV28-01 TCRBJ… AFLLFL…      14         9
- 2 eAV93      CATSDPPGWGQGAAYSNQPQHF TCRBV24-01 TCRBJ… TLACFV…      22        10
- 3 eOX54      CSASKLDSNNEQFF         TCRBV20-01 TCRBJ… SLIDFY…      14        10
- 4 eXL27      CASSPSGAGEQFF          TCRBV27-01 TCRBJ… FLWLLW…      13         9
- 5 eXL27      CASSDPFSGFYEQYF        TCRBV05-01 TCRBJ… VYFLQS…      15         9
- 6 eOX49      CASSGAGSNQPQHF         TCRBV09-01 TCRBJ… LLLDDF…      14         9
- 7 eEE228     CASRTGGSSYNEQFF        TCRBV19-01 TCRBJ… IELSLI…      15        10
- 8 eHO135     CASSLRSNQPQHF          TCRBV27-01 TCRBJ… ITLATC…      13         9
- 9 eEE226     CASSFSDYEQYF           TCRBV05-06 TCRBJ… FLNGSC…      12         9
-10 eHO124     CATSEALQETQYF          TCRBV24-01 TCRBJ… KVFRSS…      13         9
+   Experiment CDR3b             V_gene     J_gene     peptide  k_CDR3b k_peptide
+   <chr>      <chr>             <chr>      <chr>      <chr>      <int>     <int>
+ 1 eXL30      CASSLEISYEQYF     TCRBV05-01 TCRBJ02-07 VPHVGEI…      13        11
+ 2 eOX54      CASSASMSDTQYF     TCRBV09-01 TCRBJ02-03 KLSYGIA…      13         9
+ 3 eQD111     CASSELAGADTQYF    TCRBV06-01 TCRBJ02-03 HTTDPSF…      14        11
+ 4 eOX49      CSAHFPGQGFGEQFF   TCRBV20-X  TCRBJ02-01 YLCFLAF…      15         9
+ 5 eHO128     CASSLQSPSSAGNEQFF TCRBV27-01 TCRBJ02-01 QSINFVR…      17         9
+ 6 eOX49      CASSLWGDNEQFF     TCRBV27-01 TCRBJ02-01 FYLCFLA…      13         9
+ 7 eEE240     CASSFYSSGGAEGEQFF TCRBV27-01 TCRBJ02-01 LEYHDVR…      17         9
+ 8 eEE228     CASSTKGRTNTGELFF  TCRBV27-01 TCRBJ02-02 LIVNSVL…      16        10
+ 9 eOX43      CASRGLAGDNSYEQYF  TCRBV25-01 TCRBJ02-07 SLIDFYL…      16        10
+10 eOX52      CASSRGTGSEQYF     TCRBV19-01 TCRBJ02-07 FLQSINF…      13         9
    @@ -1448,18 +1448,18 @@

    C sample_n(10)
    # A tibble: 10 × 8
    -   Experiment CDR3b              V_gene  J_gene peptide k_CDR3b k_peptide Allele
    -   <chr>      <chr>              <chr>   <chr>  <chr>     <int>     <int> <chr> 
    - 1 eHO134     CASSDRTPQETQYF     TCRBV2… TCRBJ… HTTDPS…      14        11 A24:02
    - 2 eEE224     CSALGLEVNEQYF      TCRBV2… TCRBJ… FYLCFL…      13         9 A02:01
    - 3 eEE226     CASSLGPDGYNEQFF    TCRBV0… TCRBJ… ITEEVG…      15        14 A02:01
    - 4 eEE224     CASSFSGLSYEQYF     TCRBV0… TCRBJ… IDFYLC…      14        10 C07:04
    - 5 eHH175     CASSQVQGVRSGANVLTF TCRBV0… TCRBJ… IPTNFT…      18         9 C07:02
    - 6 eDH105     CASSRGTSRNTEAFF    TCRBV1… TCRBJ… LALLLL…      15         9 B48:01
    - 7 eEE226     CSVVGTSGGHEQYF     TCRBV2… TCRBJ… DGVYFA…      14        10 B35:02
    - 8 eQD128     CASSSPSGGINEQFF    TCRBV1… TCRBJ… YLCFLA…      15         9 A02:10
    - 9 eOX43      CASSLRGTSYGYTF     TCRBV1… TCRBJ… FLPRVF…      14         9 C07:04
    -10 eOX46      CASSWDNYNEQFF      TCRBV0… TCRBJ… YIIKLI…      13         9 A02:01
    +   Experiment CDR3b             V_gene   J_gene peptide k_CDR3b k_peptide Allele
    +   <chr>      <chr>             <chr>    <chr>  <chr>     <int>     <int> <chr>
    + 1 eEE240     CASSYGQGTPLHF     TCRBV06… TCRBJ… FPQSAP…      13         9 A02:01
    + 2 eOX52      CATSDFSGSNTGELFF  TCRBV24… TCRBJ… LWPVTL…      16         9 B40:01
    + 3 eEE224     CASSTQGSGELFF     TCRBV27… TCRBJ… IELSLI…      13        10 C07:04
    + 4 eQD124     CASRTGSNQPQHF     TCRBV06… TCRBJ… ASAFFG…      13         9 B51:01
    + 5 eOX49      CASSQDSKLTGSYEQYF TCRBV14… TCRBJ… YLYALV…      17         9 A02:01
    + 6 eLH47      CASDGAGGYTF       TCRBV07… TCRBJ… WLLWPV…      11         9 C07:02
    + 7 eOX52      CASSLVQGAYNEQFF   TCRBV05… TCRBJ… LLFLVL…      15         9 B15:17
    + 8 eEE226     CASSLDGTPGNTIYF   TCRBV11… TCRBJ… DGVYFA…      15        10 C07:02
    + 9 eXL30      CASNFFPGLDNEQFF   TCRBV02… TCRBJ… APKEII…      15         8 B35:02
    +10 eAV93      CASSFTGLSYEQYF    TCRBV05… TCRBJ… VLPFND…      14        10 C04:01
    diff --git a/docs/lab06_files/figure-html/unnamed-chunk-27-1.png b/docs/lab06_files/figure-html/unnamed-chunk-27-1.png
    index dd7d219..22c2ef6 100644
    Binary files a/docs/lab06_files/figure-html/unnamed-chunk-27-1.png and b/docs/lab06_files/figure-html/unnamed-chunk-27-1.png differ
    diff --git a/docs/primer_on_linear_models_in_r.html b/docs/primer_on_linear_models_in_r.html
    index f482fb7..3e610c8 100644
    --- a/docs/primer_on_linear_models_in_r.html
    +++ b/docs/primer_on_linear_models_in_r.html
    @@ -395,7 +395,7 @@

    Data

    run_simulation(temp = c(15, 20, 25, 30, 35))
    -[1] 26.90906 42.74464 50.93029 69.31215 71.14476
    +[1] 30.52223 38.13767 54.47297 63.80020 72.71315

    Let’s just go ahead and create some data we can work with. For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:
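The temperature grid described above can be sketched in base R as follows. This is only an illustration of the sequence itself; the variable name `temps` is an assumption and does not come from the primer.

```r
# Temperatures from 5 to 50 degrees Celsius in increments of 1
temps <- seq(from = 5, to = 50, by = 1)

# Number of sampling points and a peek at the first few values
length(temps)
head(temps)
```

A vector like this could then presumably be passed as the `temp` argument of `run_simulation()`, in the same way as the five-value vector in the call shown earlier.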

    diff --git a/docs/search.json b/docs/search.json
    index 0bc310d..5745535 100644
    --- a/docs/search.json
    +++ b/docs/search.json
    @@ -116,7 +116,7 @@
    "href": "lab02.html#schedule",
    "title": "Lab 2: Data Visualisation I",
    "section": "Schedule",
    - "text": "Schedule\n\n08.00 - 08.15: Pre-course Survey Walk-through\n08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session)\n08.30 - 09.00: Lecture\n09.00 - 09.15: Break\n09.00 - 12.00: Exercises"
    + "text": "Schedule\n\n08.00 - 08.15: pre-course anonymous questionaire Walk-through\n08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session)\n08.30 - 09.00: Lecture\n09.00 - 09.15: Break\n09.00 - 12.00: Exercises"
    },
    {
    "objectID": "lab02.html#learning-materials",
    @@ -340,7 +340,7 @@
    "href": "lab05.html#creating-the-micro-report",
    "title": "Lab 5: Data Wrangling II",
    "section": "Creating the Micro-Report",
    - "text": "Creating the Micro-Report\n\nBackground\nFeel free to copy paste the one stated in the background-section above\n\n\nAim\nState the aim of the micro-report, i.e. what are the questions you are addressing?\n\n\nLoad Libraries\n\n\n\nLoad the libraries needed\n\n\nLoad Data\nRead the two data sets into variables peptide_data and meta_data.\n\n\n\nClick here for hint\n\n\nThink about which Tidyverse package deals with reading data and what are the file types we want to read here?\n\n\n\n\n\n\nData Description\nIt is customary to include a description of the data, helping the reader if the report, i.e. 
your stakeholder, to get an easy overview\n\nThe Subject Meta Data\nLet’s take a look at the meta data:\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 30\n Experiment Subject `Cell Type` `Target Type` Cohort Age Gender Race \n \n 1 eMR12 1770 PBMC C19_cI COVID-19-Con… NA \n 2 eHO132 26 PBMC C19_cI COVID-19-Con… 65 F White\n 3 eQD109 1349 PBMC C19_cI COVID-19-Con… 61 M \n 4 eEE240 20795 naive_CD8 C19_cI Healthy (No … 23 M White\n 5 eQD131 2267 PBMC C19_cI COVID-19-Exp… NA \n 6 ePD80 1027 PBMC C19_cI COVID-19-Con… 67 M \n 7 eJL154 83 PBMC C19_cI COVID-19-Exp… 35 F Nati…\n 8 eNL187 2686 B-CD8-_PBMC C19_cII COVID-19-Con… NA \n 9 eOX54 10881 naive_CD8 C19_cI Healthy (No … 39 F Afri…\n10 eOX49 10943 naive_CD8 C19_cI Healthy (No … 21 M White\n# ℹ 22 more variables: `HLA-A...9` , `HLA-A...10` ,\n# `HLA-B...11` , `HLA-B...12` , `HLA-C...13` ,\n# `HLA-C...14` , DPA1...15 , DPA1...16 , DPB1...17 ,\n# DPB1...18 , DQA1...19 , DQA1...20 , DQB1...21 ,\n# DQB1...22 , DRB1...23 , DRB1...24 , DRB3...25 ,\n# DRB3...26 , DRB4...27 , DRB4...28 , DRB5...29 ,\n# DRB5...30 \n\n\n\nQ1: How many observations of how many variables are in the data?\nQ2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?\nT1: Re-create this plot\n\nRead this first:\n\nThink about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some counting. Recall, that we can stick together a dplyr pipeline with a call to ggplot, so here we will have to count of Cohort and Gender before plotting\n\n\n\n\n\n\nDoes your plot look different somehow? Consider peeking at the hint…\n\n\n\nClick here for hint\n\n\nPerhaps not everyone agrees on how to denote NAs in data. I have seen -99, -11, _ and so on… Perhaps this can be dealt with in the instance we read the data from the file? I.e. in the actual function call to your read() function. 
Recall, how can we get information on the parameters of a ?function\n\n\nT2: Re-create this plot\n\n\n\n\n\n\n\n\n\nClick here for hint\n\n\nPerhaps there is a function, which can cut continuous observations into a set of bins?\n\n\nSTOP! Make sure you handled how NAs are denoted in the data before proceeding, see hint below T1\n\nT3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables Subject, Cell Type and Target Type\n\n\n\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 27\n Experiment Cohort Age Gender Race `HLA-A...9` `HLA-A...10` `HLA-B...11`\n \n 1 eHO126 COVID-19… 37 F \"A*01:01:0… \"A*24:02:01\" \"B*07:02:01\"\n 2 eJL160 COVID-19… 52 F Afri… \"A*01:01:0… \"A*02:01:01\" \"B*44:02:01\"\n 3 eNL192 COVID-19… NA \"\" \"\" \"\" \n 4 eHO130 Healthy … 28 F White \"A*02:01\" \"A*03:01\" \"B*07:02\" \n 5 eLH45 COVID-19… 53 M \"A*02:01:0… \"A*03:01:01\" \"B*07:02:01\"\n 6 eQD134 COVID-19… NA \"A*24:07:0… \"A*34:01:01\" \"B*15:02:01\"\n 7 eJL161 COVID-19… 31 F White \"A*01:01:0… \"A*02:01:01\" \"B*08:01:01\"\n 8 eQD116 COVID-19… 66 F \"A*03:01:0… \"A*11:01:01\" \"B*35:01:01\"\n 9 ePD86 COVID-19… 58 M White \"A*02:01:0… \"A*26:01:01\" \"B*44:27:01\"\n10 eLH47 COVID-19… 35 F White \"A*01:01:0… \"A*02:01:01\" \"B*07:02:01\"\n# ℹ 19 more variables: `HLA-B...12` , `HLA-C...13` ,\n# `HLA-C...14` , DPA1...15 , DPA1...16 , DPB1...17 ,\n# DPB1...18 , DQA1...19 , DQA1...20 , DQB1...21 ,\n# DQB1...22 , DRB1...23 , DRB1...24 , DRB3...25 ,\n# DRB3...26 , DRB4...27 , DRB4...28 , DRB5...29 ,\n# DRB5...30 \n\n\nNow, a classic way of describing a cohort, i.e. 
the group of subjects used for the study, is the so-called table1 and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to “cheat” and use an R-package, like so:\nNB!: This may look a bit odd initially, but if you render your document, you should be all good!\n\nlibrary(\"table1\") # <= Yes, this should normally go at the beginning!\nmeta_data |>\n mutate(Gender = factor(Gender),\n Cohort = factor(Cohort)) |>\n table1(x = formula(~ Gender + Age + Race | Cohort),\n data = _)\n\n\n\n\n\n\nCOVID-19-Acute(N=4)\nCOVID-19-B-Non-Acute(N=8)\nCOVID-19-Convalescent(N=90)\nCOVID-19-Exposed(N=3)\nHealthy (No known exposure)(N=39)\nOverall(N=144)\n\n\n\n\nGender\n\n\n\n\n\n\n\n\nF\n1 (25.0%)\n4 (50.0%)\n33 (36.7%)\n1 (33.3%)\n17 (43.6%)\n56 (38.9%)\n\n\nM\n2 (50.0%)\n3 (37.5%)\n36 (40.0%)\n0 (0%)\n21 (53.8%)\n62 (43.1%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n1 (2.6%)\n26 (18.1%)\n\n\nAge\n\n\n\n\n\n\n\n\nMean (SD)\n50.7 (17.0)\n43.7 (7.74)\n51.5 (15.3)\n35.0 (NA)\n33.3 (9.93)\n44.9 (15.7)\n\n\nMedian [Min, Max]\n52.0 [33.0, 67.0]\n42.0 [33.0, 53.0]\n53.0 [21.0, 79.0]\n35.0 [35.0, 35.0]\n31.0 [21.0, 62.0]\n42.0 [21.0, 79.0]\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n0 (0%)\n25 (17.4%)\n\n\nRace\n\n\n\n\n\n\n\n\nAfrican American\n1 (25.0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n2 (1.4%)\n\n\nWhite\n2 (50.0%)\n7 (87.5%)\n13 (14.4%)\n0 (0%)\n28 (71.8%)\n50 (34.7%)\n\n\nAsian\n0 (0%)\n0 (0%)\n3 (3.3%)\n0 (0%)\n2 (5.1%)\n5 (3.5%)\n\n\nHispanic or Latino/a\n0 (0%)\n0 (0%)\n1 (1.1%)\n0 (0%)\n0 (0%)\n1 (0.7%)\n\n\nNative Hawaiian or Other Pacific Islander\n0 (0%)\n0 (0%)\n0 (0%)\n1 (33.3%)\n0 (0%)\n1 (0.7%)\n\n\nBlack or African American\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n3 (7.7%)\n3 (2.1%)\n\n\nMixed Race\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n1 (0.7%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n73 (81.1%)\n2 (66.7%)\n4 (10.3%)\n81 (56.3%)\n\n\n\n\n\n\nNote how good this looks! 
If you have ever done a “Table 1” before, you know how painful they can be and especially if something changes in your cohort - Dynamic reporting to the rescue!\nLastly, before we proceed, the meta_data contains HLA data for both class I and class II (see background), but here we are only interested in class I, recall these are denoted HLA-A, HLA-B and HLA-C, so make sure to remove any non-class I, i.e. the one after, denoted D-something.\n\nT4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 oand HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2 (Think: How can we rename variables? And here, just do it “manually” per variable). Remember to assign this new data to the same meta_data variable\n\n\n\n\nClick here for hint\n\n\nWhich tidyverse function subsets variables? Perhaps there is a function, which somehow matches a set of variables? And perhaps for the initiated this is compatible with regular expressions (If you don’t know what this means - No worries! 
If you do, see if you utilise this to simplify your variable selection)\n\n\n\n\nBefore we proceed, this is the data we will carry on with:\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 11\n Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2 \n \n 1 eQD114 COVID-19-C… 73 M \"A*0… \"A*2… \"B*0… \"B*4… \"C*0… \"C*1…\n 2 eNL189 COVID-19-E… NA \"\" \"\" \"\" \"\" \"\" \"\" \n 3 eQD112 COVID-19-C… 65 M \"A*2… \"A*2… \"B*0… \"B*3… \"C*0… \"C*0…\n 4 eLH46 COVID-19-C… 57 F White \"A*0… \"A*2… \"B*3… \"B*5… \"C*1… \"C*1…\n 5 eAV88 Healthy (N… 24 M White \"A*0… \"A*0… \"B*2… \"B*4… \"C*0… \"C*0…\n 6 eNL187 COVID-19-C… NA \"\" \"\" \"\" \"\" \"\" \"\" \n 7 eJL154 COVID-19-E… 35 F Nati… \"A*0… \"A*2… \"B*1… \"B*4… \"C*0… \"C*1…\n 8 eQD124 COVID-19-B… 40 F White \"A*0… \"A*0… \"B*1… \"B*5… \"C*0… \"C*0…\n 9 eQD134 COVID-19-C… NA \"A*2… \"A*3… \"B*1… \"B*1… \"C*0… \"C*0…\n10 ePD83 Healthy (N… 29 F Asian \"A*0… \"A*0… \"B*1… \"B*4… \"C*0… \"C*0…\n\n\nNow, we have a beautiful tidy dataset, recall that this entails, that each row is an observation, each column is a variable and each cell holds one value.\n\n\n\nThe Peptide Details Data\nLet’s start with simply having a look see:\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n `TCR BioIdentity` TCR Nucleotide Seque…¹ Experiment `ORF Coverage`\n \n 1 CSGQQGYEQYF+TCRBV29-01+TCRB… ACTCTGACTGTGAGCAACATG… eXL30 surface glyco…\n 2 CASSPRTTPAPQHF+TCRBV19-01+T… GTGACATCGGCCCAAAAGAAC… eEE224 membrane glyc…\n 3 CASSEVGTLEAFF+TCRBV25-01+TC… ACCCTGGAGTCTGCCAGGCCC… eEE228 membrane glyc…\n 4 CSARLGQGSYEQYF+TCRBV20-X+TC… GTGACCAGTGCCCATCCTGAA… eXL30 surface glyco…\n 5 CASSEGLGGYEQYF+TCRBV06-01+T… NNNNTGTCGGCTGCTCCCTCC… eEE240 ORF3a \n 6 CASSHLDRGSYNEQFF+TCRBV04-01… GCCCTGCAGCCAGAAGACTCA… eEE226 surface glyco…\n 7 CASSERDPRQETQYF+TCRBV27-01+… GAGTCGCCCAGCCCCAACCAG… eQD114 ORF1ab \n 8 CASSVGGRSYEQYF+TCRBV09-01+T… CTGAGCTCTCTGGAGCTGGGG… ePD82 ORF3a \n 9 CASSPAPIAYEQYF+TCRBV06-05+T… NNNNTGTCGGCTGCTCCCTCC… eQD131 surface glyco…\n10 
CASSQETANTGELFF+TCRBV04-02+… CACACCCTGCAGCCAGAAGAC… eLH51 nucleocapsid …\n# ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`\n# ℹ 3 more variables: `Amino Acids` , `Start Index in Genome` ,\n# `End Index in Genome` \n\n\n\nQ3: How many observations of how many variables are in the data?\n\nThis is a rather big data set, so let us start with two “tricks” to handle this, first:\n\nWrite the data back into your data folder, using the filename peptide-detail-ci.csv.gz, note the appending of .gz, which is automatically recognised and results in gz-compression\nNow, check in your data folder, that you have two files peptide-detail-ci.csv and peptide-detail-ci.csv.gz, delete the former\nAdjust your reading-the-data-code in the “Load Data”-section, to now read in the peptide-detail-ci.csv.gz file\n\n\n\n\nClick here for hint\n\n\nJust as you can read a file, you can of course also write a file. Note the filetype we want to write here is csv. If you in the console type e.g. readr::wr and then hit the Tab key, you will see the different functions for writing different filetypes\n\nThen:\n\nT5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids. Remember to assign this new data to the same peptide_data variable to avoid cluttering your environment with redundant variables. 
Bonus: Did you know you can click the Environment pane and see which variables you have?\n\n\n\n\nOnce again, before we proceed, this is the data we will carry on with:\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 3\n Experiment `TCR BioIdentity` `Amino Acids` \n \n 1 eAV93 CASSILLAGGTDTQYF+TCRBV27-01+TCRBJ02-03 FLWLLWPVT,FLWLLWPVTL,…\n 2 eMR15 CASSLIQGANTEAFF+TCRBV07-09+TCRBJ01-01 CPDGVKHVY,DGVKHVYQL,F…\n 3 eQD123 CASSPQGAGSLYEQYF+TCRBV04-01+TCRBJ02-07 FLQSINFVR,FLQSINFVRI,…\n 4 eOX54 CSADTQYF+TCRBV20-01+TCRBJ02-03 FIASFRLFA,SYFIASFRLF,…\n 5 eOX52 CASSQDAGLANEQYF+TCRBV03-01/03-02+TCRBJ02-07 ELYSPIFLI,LYSPIFLIV,Q…\n 6 eAV93 CASSLVATGELFF+TCRBV05-04+TCRBJ02-02 AFPFTIYSL,GYINVFAFPF,…\n 7 eLH44 CASSLNPGEGPQNIQYF+TCRBV28-01+TCRBJ02-04 AFPFTIYSL,GYINVFAFPF,…\n 8 eAV93 CASSSRTSGWYNEQFF+TCRBV11-02+TCRBJ02-01 AIPTNFTISV,AYSNNSIAIP…\n 9 eEE240 CASSPPGAPMGQPQHF+TCRBV27-01+TCRBJ01-05 FLNGSCGSV \n10 eOX54 CATSDLPSTGTEVTGELFF+TCRBV24-01+TCRBJ02-02 QYIKWPWYI,YEQYIKWPW,Y…\n\n\n\nQ4: Is this tidy data? 
Why/why not?\nT6: See if you can find a way to create the below data, from the above\n\n\n\n\n\npeptide_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 5\n Experiment CDR3b V_gene J_gene `Amino Acids` \n \n 1 eEE240 CASSPYGGTEAFF TCRBV07-06 TCRBJ01-01 YLNTLTLAV \n 2 eXL30 CATPPRGGTGELFF TCRBV07-09 TCRBJ02-02 AFLLFLVLI,FLAFLLFLV,FYLC…\n 3 eMR16 CASSLVWGAKNIQYF TCRBV07-08 TCRBJ02-04 AFPFTIYSL,GYINVFAFPF,INV…\n 4 eQD111 CASSMTSSRDEQYF TCRBV27-01 TCRBJ02-07 HTTDPSFLGRY \n 5 eEE224 CASSFGLGSDPFF TCRBV27-01 TCRBJ02-01 AFPFTIYSL,GYINVFAFPF,INV…\n 6 eXL32 CASSPPTPAGWANEKLFF TCRBV28-01 TCRBJ01-04 TVLSFCAFA,VLSFCAFAV \n 7 eQD126 CASRPPDGGIYEQYF TCRBV06-05 TCRBJ02-07 HTTDPSFLGRY \n 8 eQD128 CASSELAGPQETQYF TCRBV06-01 TCRBJ02-05 GYQPYRVVVL,PYRVVVLSF,QPY…\n 9 eEE228 CSVLQGTEAFF TCRBV29-01 TCRBJ01-01 AEAELAKNVSL,AELAKNVSLDNVL\n10 eQD113 CASSSTSGGNEQFF TCRBV07-08 TCRBJ02-01 IGAGICASY,IPIGAGICASY \n\n\n\n\n\nClick here for hint\n\n\nFirst: Compare the two datasets and identify what happened? Did any variables “disappear” and did any “appear”? Ok, so this is a bit tricky, but perhaps there is a function to separate a composite (untidy) column into a set of new variables based on a separator? But what is a separator? Just like when you read a file with Comma Separated Values, a separator denotes how a composite string is divided into fields. So, look for such a repeated value, which seem to indeed separate such fields. Also, be aware, that character, which can mean more than one thing, may need to be “escaped” using an initial two backslashed, i.e. “\\x”, where x denotes the character needing to be “escaped”\n\n\nT7: Add a variable, which counts how many peptides are in each observation of Amino Acids\n\n\n\n\n\n\n\nClick here for hint\n\n\nWe have been working with the stringr package, perhaps the contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. 
stringr::str_ and then hit the Tab key to see relevant functions\n\n\npeptide_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 6\n Experiment CDR3b V_gene J_gene `Amino Acids` n_peptides\n \n 1 eEE226 CATQLPSTDTQYF TCRBV06-05 TCRBJ02-03 GRLQSLQTY,LITGR… 3\n 2 eEE226 CASSSPGAGTGELFF TCRBV27-01 TCRBJ02-02 HTTDPSFLGRY 1\n 3 eXL31 CASSLGGEQYF TCRBV27-01 TCRBJ02-07 TEKSNIIRGW 1\n 4 eEE228 CASRYSEAYEQYF TCRBV27-01 TCRBJ02-07 FPPTSFGPL 1\n 5 eXL27 CAPRRGAGVSEAFF TCRBV28-01 TCRBJ01-01 DFLEYHDVR,EDFLE… 5\n 6 eHO141 CASSVSGTGDADTQYF TCRBV06-01 TCRBJ02-03 YLQPRTFL,YLQPRT… 3\n 7 eAV91 CASSPERLGYTF TCRBV28-01 TCRBJ01-02 ITDVFYKENSY,SEY… 2\n 8 eEE240 CASSPDRLAGEQYF TCRBV04-02 TCRBJ02-07 KLSYGIATV 1\n 9 eAV93 CSVRDFLYNEQFF TCRBV29-01 TCRBJ02-01 AFPFTIYSL,GYINV… 7\n10 eEE226 CASSHYNGNQPQHF TCRBV27-01 TCRBJ01-05 LLDDFVEII,LLLDD… 2\n\n\n\nT8: Re-create the following plot\n\n\n\n\n\n\n\nQ4: What is the maximum number of peptides assigned to one observation?\nT9: Using the str_c() and the seq() functions, re-create the below\n\n\n\n[1] \"peptide_1\" \"peptide_2\" \"peptide_3\" \"peptide_4\" \"peptide_5\"\n\n\n\n\n\nClick here for hint\n\n\nIf you’re uncertain on how a function works, try going into the console and in this case e.g. type str_c(\"a\", \"b\") and seq(from = 1, to = 3) and see if you combine these?\n\n\nT10: Use, what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data\n\n\n\n\n\n\n\nClick here for hint\n\n\nIn the console, write ?separate and think about how you used it earlier. 
Perhaps you can not only specify a vector to separate into, but also specify a function, which returns a vector?\n\n\npeptide_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 18\n Experiment CDR3b V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4\n \n 1 eXL37 CASSLGQGADY… TCRBV… TCRBJ… VLWAHGFEL \n 2 eAV93 CASVGLAMDNE… TCRBV… TCRBJ… AFPFTIYSL GYINVFAF… INVFAFPF… MGYINVFAF\n 3 eOX54 CASSVADMNTE… TCRBV… TCRBJ… FVDGVPFVV \n 4 eOX52 CASRGLAKSSY… TCRBV… TCRBJ… AFLLFLVLI FLAFLLFLV FYLCFLAFL FYLCFLAF…\n 5 eHO138 CASSQGFPGGV… TCRBV… TCRBJ… ALNTPKDHI ATEGALNT… \n 6 eAV88 CASSQNLNEKL… TCRBV… TCRBJ… LLDDFVEII LLLDDFVEI \n 7 eOX46 CASSLGGQGSN… TCRBV… TCRBJ… FLWLLWPVT FLWLLWPV… LWLLWPVTL LWPVTLACF\n 8 eOX52 CSVDQDGIGEL… TCRBV… TCRBJ… KLSYGIATV \n 9 eOX54 CASSLVPSSGP… TCRBV… TCRBJ… FVDGVPFVV \n10 eQD114 CASSYRLATYE… TCRBV… TCRBJ… NSSPDDQI… NTNSSPDD… SSPDDQIGY SSPDDQIG…\n# ℹ 10 more variables: peptide_5 , peptide_6 , peptide_7 ,\n# peptide_8 , peptide_9 , peptide_10 , peptide_11 ,\n# peptide_12 , peptide_13 , n_peptides \n\n\n\nQ5: Now, presumable you got a warning, discuss in your group why that is?\nQ6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?\n\nNow, finally we will use the what we prepared for today, data pivoting. There are two functions, namely pivot_wider() and pivot_longer(). Also, now, we will use a trick when developing ones data pipeline, while working with new functions, that on might not be completely comfortable with. You have seen the sample_n() function several times above and we can use that to randomly sample n observations from data. 
This we can utilise to work with a smaller data set in the development face and once we are ready, we can increase this n gradually to see if everything continues to work as anticipated.\n\nT11: Using the peptide_data, run a few sample_n() calls with varying degree of n to make sure, that you get a feeling for what is going on\nT12: From the peptide_data data above, with peptide_1, peptide_2, etc. create this data set using one of the data pivoting functions. Remember to start initially with sampling a smaller data set and then work on that first! Also, once you’re sure you’re good to go, reuse the peptide_data variable as we don’t want huge redundant data sets floating around in our environment\n\n\n\n\n\n\n\nClick here for hint\n\n\nIf the pivoting is not clear at all, then do what I do, create some example data:\n\nmy_data <- tibble(\n id = str_c(\"id_\", 1:10),\n var_1 = round(rnorm(10),1),\n var_2 = round(rnorm(10),1),\n var_3 = round(rnorm(10),1))\n\n…and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using pivot_wider()/pivot_longer(). 
Use View() to inspect and get a better overview of the results of pivoting.\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene n_peptides peptide_n peptide\n \n 1 eXL30 CASSYPLGYPEAFF TCRBV06-… TCRBJ… 1 peptide_… \n 2 eMR15 CASSLLTGGPVAKNIQYF TCRBV27-… TCRBJ… 2 peptide_5 \n 3 eMR26 CASSLAGVEQYF TCRBV05-… TCRBJ… 2 peptide_7 \n 4 eXL31 CASSSSHRDPEQYF TCRBV27-… TCRBJ… 1 peptide_6 \n 5 eEE243 CASSPSPARLAGGPSNEQFF TCRBV06-… TCRBJ… 1 peptide_5 \n 6 eOX52 CASSPSTGGISYNEQFF TCRBV07-… TCRBJ… 4 peptide_… \n 7 eHH175 CASSLAGAYEQYF TCRBV05-… TCRBJ… 2 peptide_4 \n 8 eEE224 CAGQGWGQETQYF TCRBV05-… TCRBJ… 3 peptide_3 KPFERD…\n 9 eOX52 CASSEIRAGPNQPQHF TCRBV02-… TCRBJ… 1 peptide_8 \n10 eOX49 CASSLVAVLTEAFF TCRBV07-… TCRBJ… 11 peptide_… YLCFLA…\n\n\n\nQ7: You will see some NAs in the peptide variable, discuss in your group from where these arise?\nQ8: How many rows and columns now and how does this compare with Q3? Discuss why/why not it is different?\nT13: Now, lose the redundant variables n_peptides and peptide_n, get rid of the NAs in the peptide column, and make sure that we only have unique observations (i.e. there are no repeated rows/observations).\n\n\n\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 5\n Experiment CDR3b V_gene J_gene peptide \n \n 1 eXL27 CASSFGGNEQFF TCRBV07-08 TCRBJ02-01 LLFLVLIML \n 2 eOX43 CASSFGTDTQYF TCRBV27-01 TCRBJ02-03 YLCFLAFLL \n 3 eOX43 CASRGAGYSSYEQYF TCRBV25-01 TCRBJ02-07 MIELSLIDFY \n 4 eEE228 CASTSGPEQYF TCRBV12-03/12-04 TCRBJ02-07 AFPFTIYSL \n 5 eDH96 CASGFGLGDNEQFF TCRBV05-01 TCRBJ02-01 YLCFLAFLL \n 6 eAV93 CASSLLGAYEQYF TCRBV27-01 TCRBJ02-07 QPYRVVVLSF \n 7 eOX46 CASSPPTGAPYEQYF TCRBV18-01 TCRBJ02-07 VQPTESIVRF \n 8 eQD111 CASSLSEGVTDTQYF TCRBV27-01 TCRBJ02-03 HTTDPSFLGRY\n 9 eOX52 CSAREVGRIEQYF TCRBV20-X TCRBJ02-07 IELSLIDFYL \n10 eXL27 CASTFGGLAANEQFF TCRBV12-X TCRBJ02-01 YINVFAFPF \n\n\n\nQ8: Now how many rows and columns and is this data tidy? 
Discuss in your group why/why not?\n\nAgain, we turn to the stringr package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 proteogenic amino acids, which we symbolise using ARNDCQEGHILKMFPSTWYV.\n\nT14: Use the str_detect() function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate parameter so see what happens\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, try to play a bit around with the function in the console, type e.g. str_detect(string = \"ARND\", pattern = \"A\") and str_detect(string = \"ARND\", pattern = \"C\") and then recall, that the filter() function requires a logical vector, i.e. a vector of TRUE and FALSE to filter the rows\n\n\nT15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, we’re working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide\n \n 1 ePD87 CASSRTHRQGRNTDTQYF TCRBV18-01 TCRBJ02-03 LSPRWY… 18 9\n 2 eEE240 CSASTQSETQYF TCRBV20-01 TCRBJ02-05 LLFLVL… 12 9\n 3 eOX52 CASSLYGQLYQETQYF TCRBV27-01 TCRBJ02-05 TLVPQE… 16 9\n 4 eEE226 CASSTRGNTIYF TCRBV13-01 TCRBJ01-03 LWPVTL… 12 9\n 5 eAV91 CASSVVGSSYEQYF TCRBV09-01 TCRBJ02-07 SEYKGP… 14 12\n 6 eOX54 CASRSPGSYNEQFF TCRBV27-01 TCRBJ02-01 ILLIIM… 14 10\n 7 ePD76 CASSEARLAGEYEQYF TCRBV02-01 TCRBJ02-07 TPSGTW… 16 9\n 8 eLH48 CASNLMNTEAFF TCRBV05-06 TCRBJ01-01 FIASFR… 12 9\n 9 eEE228 CAILDRVVNTEAFF TCRBV06-05 TCRBJ01-01 FLQSIN… 14 9\n10 eEE228 CASSFSETGELFF TCRBV05-01 TCRBJ02-02 TLACFV… 13 10\n\n\n\nT16: Re-create this plot\n\n\n\n\n\n\n\nQ9: What is the most predominant length of the CDR3b-sequences?\nT17: Re-create this plot\n\n\n\n\n\n\n\nQ10: What 
is the most predominant length of the peptide-sequences?\nQ11: Discuss in your group, if this data set is tidy or not?\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide\n \n 1 eAV93 CASSLAVGTGYYEQYF TCRBV05-04 TCRBJ… LTDEMI… 16 9\n 2 eAV88 CASSSRSGSANTGELFF TCRBV28-01 TCRBJ… GEIPVA… 17 12\n 3 eXL30 CASSLSTANPSTDTQYF TCRBV04-01 TCRBJ… YLCFLA… 17 9\n 4 eEE228 CASYPTGTGFFGYYGYTF TCRBV27-01 TCRBJ… VQPTES… 18 10\n 5 eOX54 CASTYESLYGYTF TCRBV12-03/12… TCRBJ… NVFAFP… 13 9\n 6 eAV88 CASSSTPLTGGTYEQYF TCRBV07-09 TCRBJ… RQLLFV… 17 9\n 7 eHH175 CATVGETYEQYF TCRBV28-01 TCRBJ… MPASWV… 12 9\n 8 eLH41 CASSTGPGSEKLFF TCRBV06-05 TCRBJ… KTFPPT… 14 9\n 9 eAV93 CASSIDYSSNQPQHF TCRBV19-01 TCRBJ… SINFVR… 15 10\n10 eXL31 CSATPGGLEQFF TCRBV20-X TCRBJ… FLWLLW… 12 9\n\n\n\n\nCreating one data set from two data sets\nBefore we move onto using the family of *_join() functions you prepared for today, we will just take a quick peek at the meta data again:\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 11\n Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2 \n \n 1 eHO130 Healthy (N… 28 F White A*02… A*03… B*07… B*08… C*07… C*07…\n 2 eLH48 COVID-19-C… 28 M White A*03… A*24… B*08… B*14… C*03… C*08…\n 3 eOX43 Healthy (N… 24 M White A*02… A*03… B*27… B*40… C*03… C*07…\n 4 ePD76 Healthy (N… 33 M White A*02… A*03… B*35… B*40… C*03… C*03…\n 5 eQD115 COVID-19-C… 48 M A*02… A*03… B*07… B*44… C*05… C*07…\n 6 eAV100 COVID-19-C… 29 F A*02… A*68… B*07… B*40… C*03… C*07…\n 7 eQD109 COVID-19-C… 61 M A*03… A*69… B*07… B*07… C*07… C*07…\n 8 eLH59 COVID-19-C… NA A*01… A*02… B*40… B*52… C*03… C*16…\n 9 eMR15 COVID-19-C… NA A*03… A*32… B*07… B*07… C*07… C*07…\n10 eAM23 COVID-19-C… 48 M A*11… A*24… B*15… B*52… C*04… C*12…\n\n\nRemember you can scroll in the data.\n\nQ12: Discuss in your group, if this data with respect to the A1, A2, B1, B2, C1 and C2 variables is a wide or a long data format?\n\nAs with the peptide_data, we will now have to 
use data pivoting again. I.e.:\n\nT18: use either pivot_wider() or pivot_longer() to create the following data:\n\n\n\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment Cohort Age Gender Race Gene Allele\n \n 1 eLH47 COVID-19-Convalescent 35 F White A2 \"A*02…\n 2 eJL160 COVID-19-Acute 52 F African Ame… B2 \"B*81…\n 3 eAV100 COVID-19-Convalescent 29 F C2 \"C*07…\n 4 eLH51 COVID-19-Convalescent 55 M Asian A1 \"A*24…\n 5 eMR17 COVID-19-Convalescent NA B2 \"B*57…\n 6 eQD121 COVID-19-Convalescent 38 M C2 \"C*07…\n 7 eNL192 COVID-19-Convalescent NA C1 \"\" \n 8 eMR23 COVID-19-Convalescent 22 F A1 \"\" \n 9 eOX43 Healthy (No known exposure) 24 M White B1 \"B*27…\n10 eLH42 COVID-19-Convalescent 63 M B1 \"B*07…\n\n\nRemember, what we are aiming for here, is to create one data set from two. So:\n\nQ13: Discuss in your group, which variable(s?) define the same observations between the peptide_data and the meta_data?\n\nOnce you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:\n\n\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 2\n Experiment Allele \n \n 1 eJL157 \"C*07:01:01\"\n 2 eHO135 \"B*07:02:01\"\n 3 eHO141 \"\" \n 4 eJL153 \"A*03:01:01\"\n 5 eMR20 \"C*07:02:01\"\n 6 eJL154 \"B*15:02:01\"\n 7 ePD82 \"A*26:02:01\"\n 8 eQD111 \"A*01:01:01\"\n 9 eMR22 \"C*07:18:01\"\n10 eJL146 \"A*02:01\" \n\n\nUse the View() function again, to look at the meta_data. Notice something? Some alleles are e.g. A*11:01, whereas others are B*51:01:02. You can find information on why, by visiting Nomenclature for Factors of the HLA System.\nLong story short, we only want to include Field 1 (allele group) and Field 2 (Specific HLA protein). You have prepared the stringr package for today. See if you can find a way to reduce e.g. 
B*51:01:02 to B*51:01 and then create a new variable Allele_F_1_2 accordingly, while also removing the ...x (where x is a number) subscripts from the Gene variable (It is an artifact from having the data in a wide format, where you cannot have two variables with the same name) and also, remove any NAs and \"\"s, denoting empty entries.\n\n\n\nClick here for hint\n\n\nThere are several ways this can be achieved, the easiest being to consider if perhaps a part of the string based on indices could be of interest. This term “a part of a string” is called a substring, perhaps the stringr package contains a function to work with substrings? In the console, type stringr:: and hit Tab. This will display the functions available in the stringr package. Scroll down and find the functions starting with str_ and look for one, which might be relevant and remember you can use ?function_name to get more information on how a given function works.\n\n\n\n\n\nT19: Create the following data, according to specifications above:\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 3\n Experiment Allele Allele_F_1_2\n \n 1 eJL157 C*07:02:01 C*07:02 \n 2 eQD118 C*03:04:01 C*03:04 \n 3 eMR13 C*07:01:01 C*07:01 \n 4 eLH54 B*40:02:01 B*40:02 \n 5 ePD82 C*08:01:01 C*08:01 \n 6 eQD121 B*57:01:01 B*57:01 \n 7 eAV88 C*07:04 C*07:04 \n 8 eDH105 B*40:01:02 B*40:01 \n 9 ePD86 C*14:02:01 C*14:02 \n10 eOX46 C*04:01 C*04:01 \n\n\nThe asterisk, i.e. *, is a rather annoying character because of its ambiguity, so:\n\nT20: Clean the data a bit more, by removing the asterisk and redundant variables:\n\n\n\n\n\nmeta_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 2\n Experiment Allele\n \n 1 eDH96 A02:01\n 2 eQD127 C02:02\n 3 eJL162 B55:01\n 4 eLH51 C12:04\n 5 eHO133 A32:01\n 6 eJL157 B18:01\n 7 eQD123 C07:02\n 8 eLH59 A02:01\n 9 eXL32 A01:01\n10 eLH45 C12:03\n\n\n\n\n\nClick here for hint 1\n\n\nAgain, the stringr package may come in handy. 
Perhaps there is a function to remove one or more such pesky characters?\n\n\n\n\nClick here for hint 2\n\n\nGetting a weird error? Recall, that character ambiguity needs to be “escaped”, you did this somehow earlier on…\n\nRecall the peptide_data?\n\npeptide_data |>\n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide\n \n 1 eEE240 CASSQRSNTGELFF TCRBV28-01 TCRBJ… AFLLFL… 14 9\n 2 eAV93 CATSDPPGWGQGAAYSNQPQHF TCRBV24-01 TCRBJ… TLACFV… 22 10\n 3 eOX54 CSASKLDSNNEQFF TCRBV20-01 TCRBJ… SLIDFY… 14 10\n 4 eXL27 CASSPSGAGEQFF TCRBV27-01 TCRBJ… FLWLLW… 13 9\n 5 eXL27 CASSDPFSGFYEQYF TCRBV05-01 TCRBJ… VYFLQS… 15 9\n 6 eOX49 CASSGAGSNQPQHF TCRBV09-01 TCRBJ… LLLDDF… 14 9\n 7 eEE228 CASRTGGSSYNEQFF TCRBV19-01 TCRBJ… IELSLI… 15 10\n 8 eHO135 CASSLRSNQPQHF TCRBV27-01 TCRBJ… ITLATC… 13 9\n 9 eEE226 CASSFSDYEQYF TCRBV05-06 TCRBJ… FLNGSC… 12 9\n10 eHO124 CATSEALQETQYF TCRBV24-01 TCRBJ… KVFRSS… 13 9\n\n\n\nT21: Create a dplyr pipeline, starting with the peptide_data, which joins it with the meta_data and remember to make sure that you get only unique rows. Save this data into a new variable named peptide_meta_data (If you get a warning, discuss in your group what it means?)\n\n\n\n\n\n\n\nClick here for hint 1\n\n\nWhich family of functions do we use to join data? Also, perhaps here it would be prudent to start with working on a smaller data set, recall we could sample a number of rows yielding a smaller development data set\n\n\n\n\nClick here for hint 2\n\n\nYou should get a data set of upwards of 3,000,000 rows, take a moment to consider how that would have been to work with in Excel? Also, in case the servers are not liking this, you can consider subsetting the peptide_data prior to joining to e.g. 
100,000 or 10,000 rows.\n\n\npeptide_meta_data |>\n sample_n(10)\n\n# A tibble: 10 × 8\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide Allele\n \n 1 eHO134 CASSDRTPQETQYF TCRBV2… TCRBJ… HTTDPS… 14 11 A24:02\n 2 eEE224 CSALGLEVNEQYF TCRBV2… TCRBJ… FYLCFL… 13 9 A02:01\n 3 eEE226 CASSLGPDGYNEQFF TCRBV0… TCRBJ… ITEEVG… 15 14 A02:01\n 4 eEE224 CASSFSGLSYEQYF TCRBV0… TCRBJ… IDFYLC… 14 10 C07:04\n 5 eHH175 CASSQVQGVRSGANVLTF TCRBV0… TCRBJ… IPTNFT… 18 9 C07:02\n 6 eDH105 CASSRGTSRNTEAFF TCRBV1… TCRBJ… LALLLL… 15 9 B48:01\n 7 eEE226 CSVVGTSGGHEQYF TCRBV2… TCRBJ… DGVYFA… 14 10 B35:02\n 8 eQD128 CASSSPSGGINEQFF TCRBV1… TCRBJ… YLCFLA… 15 9 A02:10\n 9 eOX43 CASSLRGTSYGYTF TCRBV1… TCRBJ… FLPRVF… 14 9 C07:04\n10 eOX46 CASSWDNYNEQFF TCRBV0… TCRBJ… YIIKLI… 13 9 A02:01\n\n\n\n\n\nAnalysis\nNow, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:\n\nWhat characterises the peptides binding to the HLAs?\nWhat characterises T-cell Receptors binding to the pMHC-complexes?\n\n\nPeptides binding to HLA\nAs we have touched upon multiple times, R is very flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package ggseqlogo (More info here).\n\nT22: Subset the final peptide_meta_data data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo\n\n\n\n\nClick here for hint\n\n\nYou can pipe a vector of peptides into ggseqlogo, but perhaps you first need to pull that vector from the relevant variable in your tibble? Also, consider before that, that you’ll need to make sure, you are only looking at peptides of length 9\n\n\n\n\n\n\n\n\n\n\n\nT23: Repeat for e.g. B07:02 or another of your favourite alleles\n\nNow, let’s take a closer look at the sequence logo:\n\nQ14: Which positions in the peptide determines binding to HLA?\n\n\n\n\nClick here for hint\n\n\nRecall your Introduction to Bioinformatics course? 
And/or perhaps ask your fellow group members if they know?\n\n\n\nCDR3b-sequences binding to pMHC\n\nT24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:\n\n\n\n\n\n\n\n\n\n\n\nQ15: In your group, discuss what you see?\nT25: Play around with other combinations of k_CDR3b, Allele, and peptide and inspect how the logo changes\n\nDisclaimer: In this data set, we only get: A given CDR3b was found to recognise a given peptide in a given subject and that subject had a given haplotype - Something’s missing… Perhaps if you have had immunology, then you can spot it? There is a trick to get around this missing information, but that’s beyond the scope of what we’re working with here." + "text": "Creating the Micro-Report\n\nBackground\nFeel free to copy-paste the one stated in the background-section above\n\n\nAim\nState the aim of the micro-report, i.e. what are the questions you are addressing?\n\n\nLoad Libraries\n\n\n\nLoad the libraries needed\n\n\nLoad Data\nRead the two data sets into variables peptide_data and meta_data.\n\n\n\nClick here for hint\n\n\nThink about which Tidyverse package deals with reading data and what are the file types we want to read here?\n\n\n\n\n\n\nData Description\nIt is customary to include a description of the data, helping the reader of the report, i.e. 
your stakeholder, to get an easy overview\n\nThe Subject Meta Data\nLet’s take a look at the meta data:\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 30\n Experiment Subject `Cell Type` `Target Type` Cohort Age Gender Race \n \n 1 eXL27 19830 naive_CD8 C19_cI Health… 24 M White\n 2 eHO141 3238 PBMC C19_cI COVID-… NA \n 3 eHH175 20300 naive_CD8 C19_cI Health… 28 M White\n 4 eHO124 3819 PBMC C19_cI Health… 62 M \n 5 ePD100 1811 PBMC C19_cI COVID-… 66 M \n 6 ePD85 5869 naive_CD8 C19_cI Health… 27 F \n 7 eJL161 1005703 PBMC C19_cI COVID-… 31 F White\n 8 eHH173 19829 naive_CD8 C19_cI Health… 50 M White\n 9 eLH59 1770 B- depleted PBMCs C19_cII COVID-… NA \n10 ePD73 4423 naive_CD8 minigene_Set1 Health… 37 F White\n# ℹ 22 more variables: `HLA-A...9` , `HLA-A...10` ,\n# `HLA-B...11` , `HLA-B...12` , `HLA-C...13` ,\n# `HLA-C...14` , DPA1...15 , DPA1...16 , DPB1...17 ,\n# DPB1...18 , DQA1...19 , DQA1...20 , DQB1...21 ,\n# DQB1...22 , DRB1...23 , DRB1...24 , DRB3...25 ,\n# DRB3...26 , DRB4...27 , DRB4...28 , DRB5...29 ,\n# DRB5...30 \n\n\n\nQ1: How many observations of how many variables are in the data?\nQ2: Are there groupings in the variables, i.e. do certain variables “go together” somehow?\nT1: Re-create this plot\n\nRead this first:\n\nThink about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some counting. Recall, that we can stick together a dplyr pipeline with a call to ggplot, so here we will have to count of Cohort and Gender before plotting\n\n\n\n\n\n\nDoes your plot look different somehow? Consider peeking at the hint…\n\n\n\nClick here for hint\n\n\nPerhaps not everyone agrees on how to denote NAs in data. I have seen -99, -11, _ and so on… Perhaps this can be dealt with in the instance we read the data from the file? I.e. in the actual function call to your read() function. 
Recall, how can we get information on the parameters of a ?function\n\n\nT2: Re-create this plot\n\n\n\n\n\n\n\n\n\nClick here for hint\n\n\nPerhaps there is a function, which can cut continuous observations into a set of bins?\n\n\nSTOP! Make sure you handled how NAs are denoted in the data before proceeding, see hint below T1\n\nT3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables Subject, Cell Type and Target Type\n\n\n\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 27\n Experiment Cohort Age Gender Race `HLA-A...9` `HLA-A...10` `HLA-B...11`\n \n 1 eJL154 COVID-19… 35 F Nati… \"A*02:01:0… \"A*29:02:01\" \"B*15:02:01\"\n 2 eXL31 Healthy … 28 M White \"A*02:01\" \"A*29:02\" \"B*07:02\" \n 3 ePD90 COVID-19… 29 M \"\" \"\" \"\" \n 4 eQD116 COVID-19… 66 F \"A*03:01:0… \"A*11:01:01\" \"B*35:01:01\"\n 5 eMR23 COVID-19… 22 F \"\" \"\" \"\" \n 6 eJL158 COVID-19… 33 M White \"A*02:01:0… \"A*24:02:01\" \"B*15:01:01\"\n 7 eAV93 Healthy … 41 M White \"A*11:01\" \"A*68:01\" \"B*35:01\" \n 8 eLH45 COVID-19… 53 M \"A*02:01:0… \"A*03:01:01\" \"B*07:02:01\"\n 9 eLH59 COVID-19… NA \"A*01:01:0… \"A*02:01:01\" \"B*40:01:02\"\n10 eNL192 COVID-19… NA \"\" \"\" \"\" \n# ℹ 19 more variables: `HLA-B...12` , `HLA-C...13` ,\n# `HLA-C...14` , DPA1...15 , DPA1...16 , DPB1...17 ,\n# DPB1...18 , DQA1...19 , DQA1...20 , DQB1...21 ,\n# DQB1...22 , DRB1...23 , DRB1...24 , DRB3...25 ,\n# DRB3...26 , DRB4...27 , DRB4...28 , DRB5...29 ,\n# DRB5...30 \n\n\nNow, a classic way of describing a cohort, i.e. 
the group of subjects used for the study, is the so-called table1 and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to “cheat” and use an R-package, like so:\nNB!: This may look a bit odd initially, but if you render your document, you should be all good!\n\nlibrary(\"table1\") # <= Yes, this should normally go at the beginning!\nmeta_data |>\n mutate(Gender = factor(Gender),\n Cohort = factor(Cohort)) |>\n table1(x = formula(~ Gender + Age + Race | Cohort),\n data = _)\n\n\n\n\n\n\nCOVID-19-Acute(N=4)\nCOVID-19-B-Non-Acute(N=8)\nCOVID-19-Convalescent(N=90)\nCOVID-19-Exposed(N=3)\nHealthy (No known exposure)(N=39)\nOverall(N=144)\n\n\n\n\nGender\n\n\n\n\n\n\n\n\nF\n1 (25.0%)\n4 (50.0%)\n33 (36.7%)\n1 (33.3%)\n17 (43.6%)\n56 (38.9%)\n\n\nM\n2 (50.0%)\n3 (37.5%)\n36 (40.0%)\n0 (0%)\n21 (53.8%)\n62 (43.1%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n1 (2.6%)\n26 (18.1%)\n\n\nAge\n\n\n\n\n\n\n\n\nMean (SD)\n50.7 (17.0)\n43.7 (7.74)\n51.5 (15.3)\n35.0 (NA)\n33.3 (9.93)\n44.9 (15.7)\n\n\nMedian [Min, Max]\n52.0 [33.0, 67.0]\n42.0 [33.0, 53.0]\n53.0 [21.0, 79.0]\n35.0 [35.0, 35.0]\n31.0 [21.0, 62.0]\n42.0 [21.0, 79.0]\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n21 (23.3%)\n2 (66.7%)\n0 (0%)\n25 (17.4%)\n\n\nRace\n\n\n\n\n\n\n\n\nAfrican American\n1 (25.0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n2 (1.4%)\n\n\nWhite\n2 (50.0%)\n7 (87.5%)\n13 (14.4%)\n0 (0%)\n28 (71.8%)\n50 (34.7%)\n\n\nAsian\n0 (0%)\n0 (0%)\n3 (3.3%)\n0 (0%)\n2 (5.1%)\n5 (3.5%)\n\n\nHispanic or Latino/a\n0 (0%)\n0 (0%)\n1 (1.1%)\n0 (0%)\n0 (0%)\n1 (0.7%)\n\n\nNative Hawaiian or Other Pacific Islander\n0 (0%)\n0 (0%)\n0 (0%)\n1 (33.3%)\n0 (0%)\n1 (0.7%)\n\n\nBlack or African American\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n3 (7.7%)\n3 (2.1%)\n\n\nMixed Race\n0 (0%)\n0 (0%)\n0 (0%)\n0 (0%)\n1 (2.6%)\n1 (0.7%)\n\n\nMissing\n1 (25.0%)\n1 (12.5%)\n73 (81.1%)\n2 (66.7%)\n4 (10.3%)\n81 (56.3%)\n\n\n\n\n\n\nNote how good this looks! 
If you have ever done a “Table 1” before, you know how painful they can be, especially if something changes in your cohort - Dynamic reporting to the rescue!\nLastly, before we proceed, the meta_data contains HLA data for both class I and class II (see background), but here we are only interested in class I, recall these are denoted HLA-A, HLA-B and HLA-C, so make sure to remove any non-class I, i.e. the ones that follow, denoted D-something.\n\nT4: Create a new version of the meta_data, which with respect to allele-data only contains information on class I and also fix the odd naming, e.g. HLA-A...9 becomes A1 and HLA-A...10 becomes A2 and so on for B1, B2, C1 and C2 (Think: How can we rename variables? And here, just do it “manually” per variable). Remember to assign this new data to the same meta_data variable\n\n\n\n\nClick here for hint\n\n\nWhich tidyverse function subsets variables? Perhaps there is a function, which somehow matches a set of variables? And perhaps for the initiated this is compatible with regular expressions (If you don’t know what this means - No worries! 
If you do, see if you utilise this to simplify your variable selection)\n\n\n\n\nBefore we proceed, this is the data we will carry on with:\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 11\n Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2 \n \n 1 eLH54 COVID-19-C… NA \"A*0… \"A*0… \"B*0… \"B*4… \"C*0… \"C*0…\n 2 eHH173 Healthy (N… 50 M White \"A*0… \"A*0… \"B*3… \"B*4… \"C*0… \"C*0…\n 3 eMR25 COVID-19-C… 21 F \"\" \"\" \"\" \"\" \"\" \"\" \n 4 ePD91 COVID-19-C… 52 M White \"\" \"\" \"\" \"\" \"\" \"\" \n 5 eDH105 COVID-19-C… 32 F \"A*2… \"A*2… \"B*4… \"B*4… \"C*0… \"C*0…\n 6 ePD87 COVID-19-C… 47 M White \"A*0… \"A*2… \"B*0… \"B*0… \"C*0… \"C*0…\n 7 eAM23 COVID-19-C… 48 M \"A*1… \"A*2… \"B*1… \"B*5… \"C*0… \"C*1…\n 8 eOX54 Healthy (N… 39 F Afri… \"A*0… \"A*2… \"B*1… \"B*5… \"C*0… \"C*1…\n 9 eMR12 COVID-19-C… NA \"A*0… \"A*0… \"B*4… \"B*5… \"C*0… \"C*1…\n10 eQD131 COVID-19-E… NA \"A*0… \"A*3… \"B*1… \"B*5… \"C*0… \"C*0…\n\n\nNow, we have a beautiful tidy dataset, recall that this entails, that each row is an observation, each column is a variable and each cell holds one value.\n\n\n\nThe Peptide Details Data\nLet’s start with simply having a look see:\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n `TCR BioIdentity` TCR Nucleotide Seque…¹ Experiment `ORF Coverage`\n \n 1 CASSEAPGLEFGNTIYF+TCRBV02-0… ACAAAGCTGGAGGACTCAGCC… eXL30 ORF1ab \n 2 CASSHEDRGRPGELFF+TCRBV03-01… TCCCTGGAGCTTGGTGACTCT… eMR17 ORF1ab \n 3 CASSPPTDTQYF+TCRBV27-01+TCR… CTGATCCTGGAGTCGCCCAGC… eXL31 surface glyco…\n 4 CASSLGLAGEQYF+TCRBV07-02+TC… ACGATCCAGCGCACAGAGCAG… eDH113 ORF1ab \n 5 CASSYWNEQFF+TCRBV06-05+TCRB… NNNNNNNNNNNNNTGTCGGCT… eOX52 ORF1ab \n 6 CASSLVGGDPSTDTQYF+TCRBV13-0… TTGGAGCTGGGGGACTCAGCC… eEE226 ORF1ab \n 7 CASSIGVGRAYEQYF+TCRBV19-01+… ACATCGGCCCAAAAGAACCCG… ePD83 ORF3a \n 8 CSALGQGNVQFF+TCRBV29-01+TCR… CTGACTGTGAGCAACATGAGC… eEE226 ORF6 \n 9 CASSQLRYTEAFF+TCRBV04-03+TC… CACCTACACACCCTGCAGCCA… eHO124 ORF1ab \n10 CASSLFGRGPTYNEQFF+TCRBV27-0… 
CCCAGCCCCAACCAGACCTCT… eLH43 ORF1ab \n# ℹ abbreviated name: ¹​`TCR Nucleotide Sequence`\n# ℹ 3 more variables: `Amino Acids` , `Start Index in Genome` ,\n# `End Index in Genome` \n\n\n\nQ3: How many observations of how many variables are in the data?\n\nThis is a rather big data set, so let us start with two “tricks” to handle this, first:\n\nWrite the data back into your data folder, using the filename peptide-detail-ci.csv.gz, note the appending of .gz, which is automatically recognised and results in gz-compression\nNow, check in your data folder, that you have two files peptide-detail-ci.csv and peptide-detail-ci.csv.gz, delete the former\nAdjust your reading-the-data-code in the “Load Data”-section, to now read in the peptide-detail-ci.csv.gz file\n\n\n\n\nClick here for hint\n\n\nJust as you can read a file, you can of course also write a file. Note the filetype we want to write here is csv. If you in the console type e.g. readr::wr and then hit the Tab key, you will see the different functions for writing different filetypes\n\nThen:\n\nT5: As before, let’s immediately subset the peptide_data to the variables of interest: TCR BioIdentity, Experiment and Amino Acids. Remember to assign this new data to the same peptide_data variable to avoid cluttering your environment with redundant variables. 
Bonus: Did you know you can click the Environment pane and see which variables you have?\n\n\n\n\nOnce again, before we proceed, this is the data we will carry on with:\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 3\n Experiment `TCR BioIdentity` `Amino Acids` \n \n 1 eEE226 CASSQDSDGGGNTIYF+TCRBV04-02+TCRBJ01-03 AEAELAKNVSL,AELAKNVSLDNVL \n 2 eOX54 CASSTVGGPFQPQHF+TCRBV12-X+TCRBJ01-05 MMISAGFSL \n 3 eXL27 CASRKTTDTQYF+TCRBV27-01+TCRBJ02-03 AFLLFLVLI,FLAFLLFLV,FYLCF…\n 4 eHO135 CAWRRGGKLFF+TCRBV30-01+TCRBJ01-04 AYKTFPPTEPK,KTFPPTEPK \n 5 eXL36 CASSVAAAVSYNEQFF+TCRBV09-01+TCRBJ02-01 VDDPCPIHFY,VVDDPCPIHFY,YV…\n 6 eEE228 CASSFPQNTQYF+TCRBV07-09+TCRBJ02-03 FLWLLWPVT,FLWLLWPVTL,LWLL…\n 7 eEE226 CASSSRTEGSTDTQYF+TCRBV11-02+TCRBJ02-03 EEHVQIHTI \n 8 eOX52 CASSVEGTVNEKLFF+TCRBV09-01+TCRBJ01-04 FVDGVPFVV \n 9 eQD137 CSVVSGISYNEQFF+TCRBV29-01+TCRBJ02-01 AFLLFLVLI,FLAFLLFLV,FYLCF…\n10 ePD83 CASSIGLGLAEYNEQFF+TCRBV19-01+TCRBJ02-01 SEHDYQIGGYTEKW,YQIGGYTEK,…\n\n\n\nQ4: Is this tidy data? Why/why not?\nT6: See if you can find a way to create the below data, from the above\n\n\n\n\n\npeptide_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 5\n Experiment CDR3b V_gene J_gene `Amino Acids` \n \n 1 ePD76 CASSPRPGLAGGRDTQYF TCRBV07-06 TCRBJ02-03 SSANNCTFEY,VYSSANN…\n 2 eEE226 CASSEASMNTEAFF TCRBV06-01 TCRBJ01-01 LPAADLDDF \n 3 eOX49 CASSRQTEAFF TCRBV03-01/03-02 TCRBJ01-01 FLNGSCGSV \n 4 eOX43 CASSLRGTGESEFF TCRBV12-X TCRBJ02-01 AFPFTIYSL,GYINVFAF…\n 5 eOX43 CASSHAASRSYEQYF TCRBV04-01 TCRBJ02-07 APKEIIFL,KEIIFLEGE…\n 6 eEE226 CASSHWSVAEETQYF TCRBV03-01/03-02 TCRBJ02-05 KLSYGIATV \n 7 eJL164 CSASERSTTLGQTTQYF TCRBV20-01 TCRBJ02-03 KLWAQCVQL \n 8 eEE228 CAISDRLISGSTGELFF TCRBV10-03 TCRBJ02-02 FPNITNLCPF,QPTESIV…\n 9 eQD132 CASSSRTKGYEQYF TCRBV06-05 TCRBJ02-07 STQDLFLPFF,TQDLFLP…\n10 eOX52 CASSIGPLDSYGYTF TCRBV19-01 TCRBJ01-02 KLSYGIATV \n\n\n\n\n\nClick here for hint\n\n\nFirst: Compare the two datasets and identify what happened? 
Did any variables “disappear” and did any “appear”? Ok, so this is a bit tricky, but perhaps there is a function to separate a composite (untidy) column into a set of new variables based on a separator? But what is a separator? Just like when you read a file with Comma Separated Values, a separator denotes how a composite string is divided into fields. So, look for such a repeated value, which seems to indeed separate such fields. Also, be aware, that characters, which can mean more than one thing, may need to be “escaped” using two initial backslashes, i.e. “\\x”, where x denotes the character needing to be “escaped”\n\n\nT7: Add a variable, which counts how many peptides are in each observation of Amino Acids\n\n\n\n\n\n\n\nClick here for hint\n\n\nWe have been working with the stringr package, perhaps it contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. stringr::str_ and then hit the Tab key to see relevant functions\n\n\npeptide_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 6\n Experiment CDR3b V_gene J_gene `Amino Acids` n_peptides\n \n 1 ePD91 CASSIGLTEAFF TCRBV19-01 TCRBJ… ILGTVSWNL,SN… 2\n 2 eEE240 CATSRPMNTEAFF TCRBV15-01 TCRBJ… AFLLFLVLI,FL… 11\n 3 eHH175 CASSDGPGYEQYF TCRBV12-03/12… TCRBJ… KMKDLSPRW 1\n 4 eHO135 CASSLAGAQPQHF TCRBV05-01 TCRBJ… LSPRWYFYY,SP… 2\n 5 eEE228 CASSPTSGINNEQFF TCRBV18-01 TCRBJ… KLSYGIATV 1\n 6 eXL31 CASSKATGEGGNYEQYF TCRBV21-01 TCRBJ… FLQSINFVR,FL… 13\n 7 eOX49 CASSYGLAGGEETQYF TCRBV07-03 TCRBJ… KLWAQCVQL 1\n 8 eLH48 CASSQDVGRGVQETQYF TCRBV03-01/03… TCRBJ… RIRGGDGKM,RI… 2\n 9 eXL31 CASSHGTGGELFF TCRBV27-01 TCRBJ… QLMCQPILL,QL… 2\n10 eEE226 CASSQDRVVAGGQGDTQYF TCRBV03-01/03… TCRBJ… APKEIIFL,KEI… 2\n\n\n\nT8: Re-create the following plot\n\n\n\n\n\n\n\nQ4: What is the maximum number of peptides assigned to one observation?\nT9: Using the str_c() and the seq() functions, re-create the below\n\n\n\n[1] \"peptide_1\" \"peptide_2\" \"peptide_3\" 
\"peptide_4\" \"peptide_5\"\n\n\n\n\n\nClick here for hint\n\n\nIf you’re uncertain on how a function works, try going into the console and in this case e.g. type str_c(\"a\", \"b\") and seq(from = 1, to = 3) and see if you can combine these?\n\n\nT10: Use what you learned about separating in T6 and the vector-of-strings you created in T9 adjusted to the number from Q4 to create the below data\n\n\n\n\n\n\n\nClick here for hint\n\n\nIn the console, write ?separate and think about how you used it earlier. Perhaps you can not only specify a vector to separate into, but also specify a function, which returns a vector?\n\n\npeptide_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 18\n Experiment CDR3b V_gene J_gene peptide_1 peptide_2 peptide_3 peptide_4\n \n 1 eOX52 CASSPPISYEQ… TCRBV… TCRBJ… ILGTVSWNL SNEKQEIL… \n 2 eEE226 CASPFPGQGHE… TCRBV… TCRBJ… KLSYGIATV \n 3 eQD124 CASRTLGAGEL… TCRBV… TCRBJ… HTTDPSFL… \n 4 eAV88 CASSPGLDYNE… TCRBV… TCRBJ… KPLEFGAT… \n 5 eEE224 CSVEGLPGRET… TCRBV… TCRBJ… APAHISTI LIVNSVLL… LLFLAFVV… SVLLFLAFV\n 6 eQD108 CASSATGALAS… TCRBV… TCRBJ… FIASFRLFA SYFIASFR… YFIASFRLF YFIASFRL…\n 7 eXL27 CASSLGDSNTE… TCRBV… TCRBJ… ELYSPIFLI LYSPIFLIV QELYSPIFL VQELYSPIF\n 8 eOX49 CASSLNHLGDR… TCRBV… TCRBJ… FVCNLLLL… LLFVTVYS… TVYSHLLLV \n 9 eAV93 CATTEGTANTE… TCRBV… TCRBJ… SSANNCTF… VYSSANNC… \n10 eOX46 CASSSSTAGEQ… TCRBV… TCRBJ… FPPTSFGPL \n# ℹ 10 more variables: peptide_5 , peptide_6 , peptide_7 ,\n# peptide_8 , peptide_9 , peptide_10 , peptide_11 ,\n# peptide_12 , peptide_13 , n_peptides \n\n\n\nQ5: Now, presumably you got a warning, discuss in your group why that is?\nQ6: With respect to peptide_n, discuss in your group, if this is wide- or long-data?\n\nNow, finally we will use what we prepared for today, data pivoting. There are two functions, namely pivot_wider() and pivot_longer(). Also, now, we will use a trick when developing one’s data pipeline, while working with new functions, that one might not be completely comfortable with. 
You have seen the sample_n() function several times above and we can use that to randomly sample n observations from data. This we can utilise to work with a smaller data set in the development phase and once we are ready, we can increase this n gradually to see if everything continues to work as anticipated.\n\nT11: Using the peptide_data, run a few sample_n() calls with varying values of n to make sure, that you get a feeling for what is going on\nT12: From the peptide_data above, with peptide_1, peptide_2, etc. create this data set using one of the data pivoting functions. Remember to start initially with sampling a smaller data set and then work on that first! Also, once you’re sure you’re good to go, reuse the peptide_data variable as we don’t want huge redundant data sets floating around in our environment\n\n\n\n\n\n\n\nClick here for hint\n\n\nIf the pivoting is not clear at all, then do what I do, create some example data:\n\nmy_data <- tibble(\n id = str_c(\"id_\", 1:10),\n var_1 = round(rnorm(10),1),\n var_2 = round(rnorm(10),1),\n var_3 = round(rnorm(10),1))\n\n…and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using pivot_wider()/pivot_longer(). 
Use View() to inspect and get a better overview of the results of pivoting.\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene n_peptides peptide_n peptide\n \n 1 eOX49 CASSRTNEQFF TCRBV28… TCRBJ… 7 peptide_5 TLACFV…\n 2 eHO130 CASSQATGALGYGYTF TCRBV04… TCRBJ… 10 peptide_8 SIWNLD…\n 3 ePD83 CASSTGQGLGYEQYF TCRBV19… TCRBJ… 3 peptide_… \n 4 eLH47 CASSYPSEGASYNEQFF TCRBV06… TCRBJ… 1 peptide_2 \n 5 eEE228 CSATDLAGVGEQYF TCRBV20… TCRBJ… 7 peptide_9 \n 6 eEE226 CASSYSIHSLLGTGGTGELFF TCRBV06… TCRBJ… 1 peptide_… \n 7 eEE228 CASSPRGPSGIQETQYF TCRBV07… TCRBJ… 11 peptide_1 AFLLFL…\n 8 eEE240 CASSYFGGWGANVLTF TCRBV11… TCRBJ… 1 peptide_7 \n 9 eQD111 CASSEGSGVVQPQHF TCRBV10… TCRBJ… 1 peptide_3 \n10 eQD125 CASRHSEGGVYDNEQFF TCRBV05… TCRBJ… 1 peptide_9 \n\n\n\nQ7: You will see some NAs in the peptide variable, discuss in your group from where these arise?\nQ8: How many rows and columns now and how does this compare with Q3? Discuss why/why not it is different?\nT13: Now, lose the redundant variables n_peptides and peptide_n, get rid of the NAs in the peptide column, and make sure that we only have unique observations (i.e. there are no repeated rows/observations).\n\n\n\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 5\n Experiment CDR3b V_gene J_gene peptide \n \n 1 eOX54 CSATANTGELFF TCRBV20-X TCRBJ02-02 AFPFTIYSL \n 2 eOX52 CASSSGLASAYEQYF TCRBV12-X TCRBJ02-07 LLDDFVEII \n 3 eEE240 CATSEGSGANVLTF TCRBV24-01 TCRBJ02-06 FLQSINFVR \n 4 eEE226 CSVEQASYEQYF TCRBV29-01 TCRBJ02-07 SLIDFYLCFL\n 5 eHH175 CASSQLNTGELFF TCRBV03-01/03-02 TCRBJ02-02 FLYIIKLIFL\n 6 eEE226 CSAWTSGETQYF TCRBV20-X TCRBJ02-05 FYLCFLAFLL\n 7 eEE228 CASSHGPESGLGRNQPQHF TCRBV06-X TCRBJ01-05 NVFAFPFTI \n 8 eOX43 CASSETGTGYEQYF TCRBV02-01 TCRBJ02-07 SLIDFYLCFL\n 9 eXL31 RASSLRRGAEQYF TCRBV07-03 TCRBJ02-07 TLACFVLAAV\n10 eEE226 CASSLLTAEPEAFF TCRBV07-09 TCRBJ01-01 TFKVSIWNL \n\n\n\nQ8: Now how many rows and columns and is this data tidy? 
Discuss in your group why/why not?\n\nAgain, we turn to the stringr package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 proteinogenic amino acids, which we symbolise using ARNDCQEGHILKMFPSTWYV.\n\nT14: Use the str_detect() function to filter the CDR3b and peptide variables using a pattern of [^ARNDCQEGHILKMFPSTWYV] and then play with the negate parameter to see what happens\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, try to play around a bit with the function in the console, type e.g. str_detect(string = \"ARND\", pattern = \"A\") and str_detect(string = \"ARND\", pattern = \"C\") and then recall, that the filter() function requires a logical vector, i.e. a vector of TRUE and FALSE to filter the rows\n\n\nT15: Add two new variables to the data, k_CDR3b and k_peptide each signifying the length of the respective sequences\n\n\n\n\n\n\n\nClick here for hint\n\n\nAgain, we’re working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide\n \n 1 eXL31 CASSQVTGELFF TCRBV03-01/03-02 TCRBJ… FLAFLL… 12 9\n 2 eEE224 CASVAAGELFF TCRBV06-06 TCRBJ… INFVRI… 11 9\n 3 eQD114 CASSWEGAGDTDTQYF TCRBV06-06 TCRBJ… HTTDPS… 16 11\n 4 eHO140 CASSPPDRGNYGYTF TCRBV18-01 TCRBJ… QSINFV… 15 9\n 5 eXL30 CSVDTGGGTEAFF TCRBV29-01 TCRBJ… YLCFLA… 13 9\n 6 eHO140 CASSLSLPIGDEQYF TCRBV27-01 TCRBJ… LWPVTL… 15 9\n 7 eMR12 CASSFENTGELFF TCRBV28-01 TCRBJ… HTTDPS… 13 11\n 8 eXL30 CASSPGGDTQYF TCRBV06-05 TCRBJ… NVFAFP… 12 9\n 9 eEE226 CASSPAGTPTF TCRBV12-03/12-04 TCRBJ… GYINVF… 11 10\n10 eOX52 CASSWGLAGADTQYF TCRBV07-03 TCRBJ… LYSPIF… 15 9\n\n\n\nT16: Re-create this plot\n\n\n\n\n\n\n\nQ9: What is the most predominant length of the CDR3b-sequences?\nT17: Re-create this plot\n\n\n\n\n\n\n\nQ10: What is the most predominant length of 
the peptide-sequences?\nQ11: Discuss in your group, if this data set is tidy or not?\n\n\npeptide_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide\n \n 1 eHH175 CASSLSIAGTGGEQYF TCRBV27-01 TCRBJ02-07 MIELSLID… 16 10\n 2 eQD125 CASSFRGPQETQYF TCRBV27-01 TCRBJ02-05 HTTDPSFL… 14 11\n 3 eQD125 CASRLTGTVAYEQYF TCRBV28-01 TCRBJ02-07 VLHSYFTS… 15 10\n 4 eEE240 CASRRTTDTQYF TCRBV27-01 TCRBJ02-03 LIDFYLCFL 12 9\n 5 eEE226 CASSNPGQGEETQYF TCRBV12-X TCRBJ02-05 LPFNDGVYF 15 9\n 6 eAV93 CASSSGTGGDTEAFF TCRBV07-09 TCRBJ01-01 RSVASQSII 15 9\n 7 eHO135 CAITPGSTGELFF TCRBV10-03 TCRBJ02-02 LWPVTLACF 13 9\n 8 eMR16 CASSLGRLAQLNEQFF TCRBV11-02 TCRBJ02-01 FLYLYALV… 16 10\n 9 eXL30 CASRLTGTGELFF TCRBV06-05 TCRBJ02-02 GYINVFAF… 13 10\n10 eXL27 CSAEREGLYNEQFF TCRBV20-X TCRBJ02-01 IELSLIDF… 14 10\n\n\n\n\nCreating one data set from two data sets\nBefore we move onto using the family of *_join() functions you prepared for today, we will just take a quick peek at the meta data again:\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 11\n Experiment Cohort Age Gender Race A1 A2 B1 B2 C1 C2 \n \n 1 eHO125 COVID-19-C… 52 M A*02… A*02… B*39… B*44… C*07… C*07…\n 2 eHH169 Healthy (N… 24 F Blac… A*02… A*74… B*35… B*35… C*04… C*04…\n 3 eHO129 COVID-19-C… 66 F Asian A*24… A*24… B*15… B*40… C*08… C*15…\n 4 eLH47 COVID-19-C… 35 F White A*01… A*02… B*07… B*08… C*07… C*07…\n 5 ePD85 Healthy (N… 27 F A*02… A*29… B*07… B*18… C*07… C*15…\n 6 ePD80 COVID-19-C… 67 M A*02… A*66… B*15… B*41… C*03… C*17…\n 7 eJL149 COVID-19-C… 60 F A*02… A*02… B*44… B*50… C*06… C*16…\n 8 eQD109 COVID-19-C… 61 M A*03… A*69… B*07… B*07… C*07… C*07…\n 9 eEE226 Healthy (N… 21 F White A*01… A*02… B*35… B*39… C*04… C*07…\n10 eJL154 COVID-19-E… 35 F Nati… A*02… A*29… B*15… B*44… C*04… C*16…\n\n\nRemember you can scroll in the data.\n\nQ12: Discuss in your group, if this data with respect to the A1, A2, B1, B2, C1 and C2 variables is a wide or a long data format?\n\nAs with the 
peptide_data, we will now have to use data pivoting again. I.e.:\n\nT18: use either pivot_wider() or pivot_longer() to create the following data:\n\n\n\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment Cohort Age Gender Race Gene Allele\n \n 1 eDH107 COVID-19-Convalescent 72 F A2 \"A*03…\n 2 eQD117 COVID-19-Convalescent 70 F B1 \"B*35…\n 3 eAV100 COVID-19-Convalescent 29 F C1 \"C*03…\n 4 eHO138 COVID-19-B-Non-Acute NA A2 \"\" \n 5 eJL149 COVID-19-Convalescent 60 F C1 \"C*06…\n 6 eMR25 COVID-19-Convalescent 21 F C1 \"\" \n 7 eQD113 COVID-19-Convalescent 36 M A1 \"A*03…\n 8 eHH169 Healthy (No known exposure) 24 F Black or Af… A1 \"A*02…\n 9 eOX43 Healthy (No known exposure) 24 M White A1 \"A*02…\n10 eQD127 COVID-19-Convalescent 61 F C1 \"C*02…\n\n\nRemember, what we are aiming for here, is to create one data set from two. So:\n\nQ13: Discuss in your group, which variable(s?) define the same observations between the peptide_data and the meta_data?\n\nOnce you have agreed upon Experiment, then use that knowledge to subset the meta_data to the variables-of-interest:\n\n\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 2\n Experiment Allele \n \n 1 eQD108 A*68:01:02\n 2 eHO130 B*08:01 \n 3 ePD82 C*14:03:01\n 4 ePD83 C*03:04 \n 5 eQD116 C*04:01:01\n 6 eQD123 A*02:01:01\n 7 eQD112 C*07:02:01\n 8 eOX43 C*03:04 \n 9 eHO134 C*07:01:01\n10 eLH45 A*02:01:01\n\n\nUse the View() function again, to look at the meta_data. Notice something? Some alleles are e.g. A*11:01, whereas others are B*51:01:02. You can find information on why, by visiting Nomenclature for Factors of the HLA System.\nLong story short, we only want to include Field 1 (allele group) and Field 2 (Specific HLA protein). You have prepared the stringr package for today. See if you can find a way to reduce e.g. 
B*51:01:02 to B*51:01 and then create a new variable Allele_F_1_2 accordingly, while also removing the ...x (where x is a number) suffixes from the Gene variable (It is an artifact from having the data in a wide format, where you cannot have two variables with the same name) and also, remove any NAs and \"\"s, denoting empty entries.\n\n\n\nClick here for hint\n\n\nThere are several ways this can be achieved, the easiest being to consider if perhaps a part of the string based on indices could be of interest. This term “a part of a string” is called a substring, perhaps the stringr package contains a function to work with substrings? In the console, type stringr:: and hit tab. This will display the functions available in the stringr package. Scroll down and find the functions starting with str_ and look for one, which might be relevant and remember you can use ?function_name to get more information on how a given function works.\n\n\n\n\n\nT19: Create the following data, according to specifications above:\n\n\nmeta_data |> \n sample_n(10)\n\n# A tibble: 10 × 3\n Experiment Allele Allele_F_1_2\n \n 1 eQD128 B*39:01:01 B*39:01 \n 2 eOX46 A*02:01 A*02:01 \n 3 eLH45 C*12:03:01 C*12:03 \n 4 eQD120 A*31:01:02 A*31:01 \n 5 ePD81 B*40:02:01 B*40:02 \n 6 eXL27 C*07:04 C*07:04 \n 7 ePD79 B*07:02:01 B*07:02 \n 8 eDH105 A*24:02:01 A*24:02 \n 9 eAV91 C*05:01 C*05:01 \n10 eEE240 B*40:01 B*40:01 \n\n\nThe asterisk, i.e. * is a rather annoying character because of ambiguity, so:\n\nT20: Clean the data a bit more, by removing the asterisk and redundant variables:\n\n\n\n\n\nmeta_data |> \n sample_n(size = 10)\n\n# A tibble: 10 × 2\n Experiment Allele\n \n 1 eLH43 B44:03\n 2 eJL147 A11:01\n 3 eHH169 A02:01\n 4 eJL154 C16:01\n 5 eQD119 C07:01\n 6 eJL143 C08:02\n 7 eHH169 B35:01\n 8 eHO125 C07:01\n 9 eOX52 A02:01\n10 eLH48 B08:01\n\n\n\n\n\nClick here for hint 1\n\n\nAgain, the stringr package may come in handy. 
Perhaps there is a function to remove one or more such pesky characters?\n\n\n\n\nClick here for hint 2\n\n\nGetting a weird error? Recall, that character ambiguity needs to be “escaped”, you did this somehow earlier on…\n\nRecall the peptide_data?\n\npeptide_data |>\n sample_n(10)\n\n# A tibble: 10 × 7\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide\n \n 1 eXL30 CASSLEISYEQYF TCRBV05-01 TCRBJ02-07 VPHVGEI… 13 11\n 2 eOX54 CASSASMSDTQYF TCRBV09-01 TCRBJ02-03 KLSYGIA… 13 9\n 3 eQD111 CASSELAGADTQYF TCRBV06-01 TCRBJ02-03 HTTDPSF… 14 11\n 4 eOX49 CSAHFPGQGFGEQFF TCRBV20-X TCRBJ02-01 YLCFLAF… 15 9\n 5 eHO128 CASSLQSPSSAGNEQFF TCRBV27-01 TCRBJ02-01 QSINFVR… 17 9\n 6 eOX49 CASSLWGDNEQFF TCRBV27-01 TCRBJ02-01 FYLCFLA… 13 9\n 7 eEE240 CASSFYSSGGAEGEQFF TCRBV27-01 TCRBJ02-01 LEYHDVR… 17 9\n 8 eEE228 CASSTKGRTNTGELFF TCRBV27-01 TCRBJ02-02 LIVNSVL… 16 10\n 9 eOX43 CASRGLAGDNSYEQYF TCRBV25-01 TCRBJ02-07 SLIDFYL… 16 10\n10 eOX52 CASSRGTGSEQYF TCRBV19-01 TCRBJ02-07 FLQSINF… 13 9\n\n\n\nT21: Create a dplyr pipeline, starting with the peptide_data, which joins it with the meta_data and remember to make sure that you get only unique observations of rows. Save this data into a new variable named peptide_meta_data (If you get a warning, discuss in your group what it means?)\n\n\n\n\n\n\n\nClick here for hint 1\n\n\nWhich family of functions do we use to join data? Also, perhaps here it would be prudent to start with working on a smaller data set, recall we could sample a number of rows yielding a smaller development data set\n\n\n\n\nClick here for hint 2\n\n\nYou should get a data set of more than 3,000,000 rows, take a moment to consider how that would have been to work with in Excel? Also, in case the servers are not liking this, you can consider subsetting the peptide_data prior to joining to e.g. 
100,000 or 10,000 rows.\n\n\npeptide_meta_data |>\n sample_n(10)\n\n# A tibble: 10 × 8\n Experiment CDR3b V_gene J_gene peptide k_CDR3b k_peptide Allele\n \n 1 eEE240 CASSYGQGTPLHF TCRBV06… TCRBJ… FPQSAP… 13 9 A02:01\n 2 eOX52 CATSDFSGSNTGELFF TCRBV24… TCRBJ… LWPVTL… 16 9 B40:01\n 3 eEE224 CASSTQGSGELFF TCRBV27… TCRBJ… IELSLI… 13 10 C07:04\n 4 eQD124 CASRTGSNQPQHF TCRBV06… TCRBJ… ASAFFG… 13 9 B51:01\n 5 eOX49 CASSQDSKLTGSYEQYF TCRBV14… TCRBJ… YLYALV… 17 9 A02:01\n 6 eLH47 CASDGAGGYTF TCRBV07… TCRBJ… WLLWPV… 11 9 C07:02\n 7 eOX52 CASSLVQGAYNEQFF TCRBV05… TCRBJ… LLFLVL… 15 9 B15:17\n 8 eEE226 CASSLDGTPGNTIYF TCRBV11… TCRBJ… DGVYFA… 15 10 C07:02\n 9 eXL30 CASNFFPGLDNEQFF TCRBV02… TCRBJ… APKEII… 15 8 B35:02\n10 eAV93 CASSFTGLSYEQYF TCRBV05… TCRBJ… VLPFND… 14 10 C04:01\n\n\n\n\n\nAnalysis\nNow, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:\n\nWhat characterises the peptides binding to the HLAs?\nWhat characterises T-cell Receptors binding to the pMHC-complexes?\n\n\nPeptides binding to HLA\nAs we have touched upon multiple times, R is very flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package ggseqlogo (More info here).\n\nT22: Subset the final peptide_meta_data data to A02:01 and unique observations of peptides of length 9 and re-create the below sequence logo\n\n\n\n\nClick here for hint\n\n\nYou can pipe a vector of peptides into ggseqlogo, but perhaps you first need to pull that vector from the relevant variable in your tibble? Also, consider before that, that you’ll need to make sure, you are only looking at peptides of length 9\n\n\n\n\n\n\n\n\n\n\n\nT23: Repeat for e.g. B07:02 or another of your favourite alleles\n\nNow, let’s take a closer look at the sequence logo:\n\nQ14: Which positions in the peptide determines binding to HLA?\n\n\n\n\nClick here for hint\n\n\nRecall your Introduction to Bioinformatics course? 
And/or perhaps ask your fellow group members if they know?\n\n\n\nCDR3b-sequences binding to pMHC\n\nT24: Subset the peptide_meta_data, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:\n\n\n\n\n\n\n\n\n\n\n\nQ15: In your group, discuss what you see?\nT25: Play around with other combinations of k_CDR3b, Allele, and peptide and inspect how the logo changes\n\nDisclaimer: In this data set, we only get: A given CDR3b was found to recognise a given peptide in a given subject and that subject had a given haplotype - Something’s missing… Perhaps if you have had immunology, then you can spot it? There is a trick to get around this missing information, but that’s beyond scope of what we’re working with here." }, { "objectID": "lab05.html#epilogue", @@ -774,7 +774,7 @@ "href": "primer_on_linear_models_in_r.html#example", "title": "Primer on Linear Models in R", "section": "Example", - "text": "Example\n\n\n\n\nBackground\nLet’s say we wanted to study the genetic mechanism protecting a plant from heat shock, then:\n\nIndependent: Environmental Condition (temperature)\nDependent: Gene Expression Level (related to heat shock protection)\n\nHere, the independent variable is the temperature and the dependent variable is the gene expression level. 
It is clear, that the temperature, does not rely on the gene expression level, but the gene expression level of heat shock related genes, does rely on the temperature.\nSo, we keep plants under different temperatures and collect samples, from which we can extract RNA and run a transcriptomics analysis uncovering gene expression levels.\n\n\nData\nFor the data here, we are going to simulate the relationship between gene expression levels and temperature, as a function in R:\n\nrun_simulation <- function(temp){\n measurement_error <- rnorm(n = length(temp), mean = 0, sd = 3)\n gene_expression_level <- 2 * temp + 3 + measurement_error\n return( gene_expression_level )\n}\n\nNote, how we’re adding some measurement error to our simulation, otherwise we would get a perfect relationship, which we all know never happens.\nNow, we can easily run simulations:\n\nrun_simulation(temp = c(15, 20, 25, 30, 35))\n\n[1] 26.90906 42.74464 50.93029 69.31215 71.14476\n\n\nLet’s just go ahead and create some data, we can work with. 
For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:\n\nset.seed(806017)\nexperiment_data <- tibble(\n temperature = seq(from = 5, to = 50, by = 1),\n gene_expression_level = run_simulation(temp = temperature)\n)\nexperiment_data |> \n sample_n(10) |> \n arrange(temperature)\n\n# A tibble: 10 × 2\n temperature gene_expression_level\n \n 1 12 28.7\n 2 13 30.2\n 3 16 35.6\n 4 19 44.0\n 5 22 48.8\n 6 27 64.6\n 7 32 61.1\n 8 33 73.9\n 9 35 76.7\n10 40 84.9\n\n\n\n\nVisualising\nNow, that we have the data, we can visualise the relationship between the temperature- and gene_expression_level-variables:\n\nmy_viz <- experiment_data |> \n ggplot(aes(x = temperature,\n y = gene_expression_level)) +\n geom_point() +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0)\nmy_viz\n\n\n\n\n\n\n\n\nNow, we can easily add the best fit line using the geom_smooth()-function, where we specify that we want to use method = \"lm\" and for now, we exclude the confidence interval, by setting se = FALSE:\n\nmy_viz +\n geom_smooth(method = \"lm\",\n se = FALSE)\n\n\n\n\n\n\n\n\nWhat happens here, is that a best-fit line is added to the plot by calculating the line, such that the sum of the squared errors is as small as possible, where the error is the vertical distance from the line to a given point. This is a basic linear regression and is known as Ordinary Least Squares (OLS). But what if we want to work with this regression model, beyond just adding a line to a plot?\n\n\nModelling\nOne of the super powers of R is the built-in capability to do modelling. Because we simulated the data (see above), we know that the true intercept is 3 and the true slope of the temperature variable is 2. 
Let’s see what we get, if we run a linear model:\n\nmy_lm_mdl <- lm(formula = gene_expression_level ~ temperature,\n data = experiment_data)\nmy_lm_mdl\n\n\nCall:\nlm(formula = gene_expression_level ~ temperature, data = experiment_data)\n\nCoefficients:\n(Intercept) temperature \n 2.816 2.021 \n\n\nImportant, the formula notation gene_expression_level ~ temperature is central to R and should be read as: “gene_expression_level modelled as a function of temperature”, i.e. gene_expression_level is the dependent variable often denoted y and temperature is the independent variable often denoted x.\nOkay that’s pretty close! Recall the reason for the difference is, that we are adding measurement error, when we run the simulation (see above).\nIn other words our model says, that:\n\\[gene\\_expression\\_level = 2.816 + 2.021 \\cdot temperature\\]\nI.e. the estimate of the intercept is 2.816 and the estimate of the slope is 2.021, meaning that when the temperature = 0, we estimate that the gene_expression_level is 2.816 and for each 1 degree increase in temperature, we estimate, that the increase in gene_expression_level is 2.021.\nThese estimates are pretty close to the true model underlying our simulation:\n\\[gene\\_expression\\_level = 3 + 2 \\cdot temperature\\]\nIn general form, such a linear model can be written like so:\n\\[y = \\beta_{0} + \\beta_{1} \\cdot x_{1}\\]\nWhere the \\(\\beta\\)-coefficients are termed estimates, because that is exactly what we do, given the observed data, we estimate their values.\n\n\nWorking with a lm-object:\nThe model format you saw above, is a bit quirky, but luckily, there is a really nice way to get this kind of model object into a more tidy-format:\n\nlibrary(\"broom\")\nmy_lm_mdl |> \n tidy()\n\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 2.82 1.16 2.44 1.89e- 2\n2 temperature 2.02 0.0378 53.4 1.16e-41\n\n\nBriefly, here we see term, estimate, std.error, statistic and p.value. 
We discussed the term and estimate. The std.error pertains to the estimate and the statistic is used to calculate the p.value.\n\nThe P-value\nNow, because we have a tidy object, we can simply plug-‘n’-play with other tidyverse tools, so let us visualise the p.value. Note, because of the often very large differences in p.values, we use a -log10-transformation, this means that larger values are “more significant”. Below, the dashed line signifies \\(p=0.05\\), so anything above that line is considered “statistically significant”:\n\nmy_lm_mdl |> \n tidy() |> \n ggplot(aes(x = term,\n y = -log10(p.value))) +\n geom_point() +\n geom_hline(yintercept = -log10(0.05),\n linetype = \"dashed\")\n\n\n\n\n\n\n\n\nNow, as mentioned the p-values are computed based on the statistic and are defined as: “The probability of observing a statistic as extreme or more extreme, given that the null-hypothesis is true”. Where the null-hypothesis is that there is no effect, i.e. the estimate for the term is zero.\nFrom this, it is quite clear, that there very likely is a relationship between the gene_expression_level and temperature. 
In fact, we know there is, because we simulated the data.\n\n\nThe Confidence Intervals\nWe can further easily include the confidence intervals of the estimates:\n\nmy_lm_mdl_tidy <- my_lm_mdl |> \n tidy(conf.int = TRUE,\n conf.level = 0.95)\nmy_lm_mdl_tidy\n\n# A tibble: 2 × 7\n term estimate std.error statistic p.value conf.low conf.high\n \n1 (Intercept) 2.82 1.16 2.44 1.89e- 2 0.488 5.14\n2 temperature 2.02 0.0378 53.4 1.16e-41 1.94 2.10\n\n\n…and as before easily do a plug’n’play into ggplot:\n\nmy_lm_mdl_tidy |> \n ggplot(aes(x = estimate,\n y = term,\n xmin = conf.low,\n xmax = conf.high)) +\n geom_errorbarh(height = 0.1) +\n geom_point()\n\n\n\n\n\n\n\n\nNote, what the 0.95 = 95% confidence intervals means is that: “If we were to repeat this experiment many times, then about 95% of the generated confidence intervals would contain the true value”.\n\n\n\nSummary\nWhat we have gone through here is a basic linear regression, where we are aiming to model the continuous variable gene_expression_level as a function of another continuous variable temperature. We simulated data, where the true intercept and slope were 3 and 2 respectively and by fitting a linear regression model, the estimates of the intercept and slope respectively were 2.82 [0.49;5.14] and 2.02 [1.94;2.10].\nLinear models allow us to gain insights into data, by modelling relationships." + "text": "Example\n\n\n\n\nBackground\nLet’s say we wanted to study the genetic mechanism protecting a plant from heat shock, then:\n\nIndependent: Environmental Condition (temperature)\nDependent: Gene Expression Level (related to heat shock protection)\n\nHere, the independent variable is the temperature and the dependent variable is the gene expression level. 
It is clear, that the temperature, does not rely on the gene expression level, but the gene expression level of heat shock related genes, does rely on the temperature.\nSo, we keep plants under different temperatures and collect samples, from which we can extract RNA and run a transcriptomics analysis uncovering gene expression levels.\n\n\nData\nFor the data here, we are going to simulate the relationship between gene expression levels and temperature, as a function in R:\n\nrun_simulation <- function(temp){\n measurement_error <- rnorm(n = length(temp), mean = 0, sd = 3)\n gene_expression_level <- 2 * temp + 3 + measurement_error\n return( gene_expression_level )\n}\n\nNote, how we’re adding some measurement error to our simulation, otherwise we would get a perfect relationship, which we all know never happens.\nNow, we can easily run simulations:\n\nrun_simulation(temp = c(15, 20, 25, 30, 35))\n\n[1] 30.52223 38.13767 54.47297 63.80020 72.71315\n\n\nLet’s just go ahead and create some data, we can work with. 
For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:\n\nset.seed(806017)\nexperiment_data <- tibble(\n temperature = seq(from = 5, to = 50, by = 1),\n gene_expression_level = run_simulation(temp = temperature)\n)\nexperiment_data |> \n sample_n(10) |> \n arrange(temperature)\n\n# A tibble: 10 × 2\n temperature gene_expression_level\n \n 1 12 28.7\n 2 13 30.2\n 3 16 35.6\n 4 19 44.0\n 5 22 48.8\n 6 27 64.6\n 7 32 61.1\n 8 33 73.9\n 9 35 76.7\n10 40 84.9\n\n\n\n\nVisualising\nNow, that we have the data, we can visualise the relationship between the temperature- and gene_expression_level-variables:\n\nmy_viz <- experiment_data |> \n ggplot(aes(x = temperature,\n y = gene_expression_level)) +\n geom_point() +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0)\nmy_viz\n\n\n\n\n\n\n\n\nNow, we can easily add the best fit line using the geom_smooth()-function, where we specify that we want to use method = \"lm\" and for now, we exclude the confidence interval, by setting se = FALSE:\n\nmy_viz +\n geom_smooth(method = \"lm\",\n se = FALSE)\n\n\n\n\n\n\n\n\nWhat happens here, is that a best-fit line is added to the plot by calculating the line, such that the sum of the squared errors is as small as possible, where the error is the vertical distance from the line to a given point. This is a basic linear regression and is known as Ordinary Least Squares (OLS). But what if we want to work with this regression model, beyond just adding a line to a plot?\n\n\nModelling\nOne of the super powers of R is the built-in capability to do modelling. Because we simulated the data (see above), we know that the true intercept is 3 and the true slope of the temperature variable is 2. 
Let’s see what we get, if we run a linear model:\n\nmy_lm_mdl <- lm(formula = gene_expression_level ~ temperature,\n data = experiment_data)\nmy_lm_mdl\n\n\nCall:\nlm(formula = gene_expression_level ~ temperature, data = experiment_data)\n\nCoefficients:\n(Intercept) temperature \n 2.816 2.021 \n\n\nImportant, the formula notation gene_expression_level ~ temperature is central to R and should be read as: “gene_expression_level modelled as a function of temperature”, i.e. gene_expression_level is the dependent variable often denoted y and temperature is the independent variable often denoted x.\nOkay that’s pretty close! Recall the reason for the difference is, that we are adding measurement error, when we run the simulation (see above).\nIn other words our model says, that:\n\\[gene\\_expression\\_level = 2.816 + 2.021 \\cdot temperature\\]\nI.e. the estimate of the intercept is 2.816 and the estimate of the slope is 2.021, meaning that when the temperature = 0, we estimate that the gene_expression_level is 2.816 and for each 1 degree increase in temperature, we estimate, that the increase in gene_expression_level is 2.021.\nThese estimates are pretty close to the true model underlying our simulation:\n\\[gene\\_expression\\_level = 3 + 2 \\cdot temperature\\]\nIn general form, such a linear model can be written like so:\n\\[y = \\beta_{0} + \\beta_{1} \\cdot x_{1}\\]\nWhere the \\(\\beta\\)-coefficients are termed estimates, because that is exactly what we do, given the observed data, we estimate their values.\n\n\nWorking with a lm-object:\nThe model format you saw above, is a bit quirky, but luckily, there is a really nice way to get this kind of model object into a more tidy-format:\n\nlibrary(\"broom\")\nmy_lm_mdl |> \n tidy()\n\n# A tibble: 2 × 5\n term estimate std.error statistic p.value\n \n1 (Intercept) 2.82 1.16 2.44 1.89e- 2\n2 temperature 2.02 0.0378 53.4 1.16e-41\n\n\nBriefly, here we see term, estimate, std.error, statistic and p.value. 
We discussed the term and estimate. The std.error pertains to the estimate and the statistic is used to calculate the p.value.\n\nThe P-value\nNow, because we have a tidy object, we can simply plug-‘n’-play with other tidyverse tools, so let us visualise the p.value. Note, because of the often very large differences in p.values, we use a -log10-transformation, this means that larger values are “more significant”. Below, the dashed line signifies \\(p=0.05\\), so anything above that line is considered “statistically significant”:\n\nmy_lm_mdl |> \n tidy() |> \n ggplot(aes(x = term,\n y = -log10(p.value))) +\n geom_point() +\n geom_hline(yintercept = -log10(0.05),\n linetype = \"dashed\")\n\n\n\n\n\n\n\n\nNow, as mentioned the p-values are computed based on the statistic and are defined as: “The probability of observing a statistic as extreme or more extreme, given that the null-hypothesis is true”. Where the null-hypothesis is that there is no effect, i.e. the estimate for the term is zero.\nFrom this, it is quite clear, that there very likely is a relationship between the gene_expression_level and temperature. 
In fact, we know there is, because we simulated the data.\n\n\nThe Confidence Intervals\nWe can further easily include the confidence intervals of the estimates:\n\nmy_lm_mdl_tidy <- my_lm_mdl |> \n tidy(conf.int = TRUE,\n conf.level = 0.95)\nmy_lm_mdl_tidy\n\n# A tibble: 2 × 7\n term estimate std.error statistic p.value conf.low conf.high\n \n1 (Intercept) 2.82 1.16 2.44 1.89e- 2 0.488 5.14\n2 temperature 2.02 0.0378 53.4 1.16e-41 1.94 2.10\n\n\n…and as before easily do a plug’n’play into ggplot:\n\nmy_lm_mdl_tidy |> \n ggplot(aes(x = estimate,\n y = term,\n xmin = conf.low,\n xmax = conf.high)) +\n geom_errorbarh(height = 0.1) +\n geom_point()\n\n\n\n\n\n\n\n\nNote, what the 0.95 = 95% confidence intervals means is that: “If we were to repeat this experiment many times, then about 95% of the generated confidence intervals would contain the true value”.\n\n\n\nSummary\nWhat we have gone through here is a basic linear regression, where we are aiming to model the continuous variable gene_expression_level as a function of another continuous variable temperature. We simulated data, where the true intercept and slope were 3 and 2 respectively and by fitting a linear regression model, the estimates of the intercept and slope respectively were 2.82 [0.49;5.14] and 2.02 [1.94;2.10].\nLinear models allow us to gain insights into data, by modelling relationships."
}, { "objectID": "primer_on_r_packages.html", diff --git a/lab02.qmd b/lab02.qmd index 62d49e3..2024777 100644 --- a/lab02.qmd +++ b/lab02.qmd @@ -6,7 +6,7 @@ ## Schedule -- 08.00 - 08.15: [Pre-course Survey Walk-through](https://raw.githack.com/r4bds/r4bds.github.io/main/pre_course_questionnaire_summary.html) +- 08.00 - 08.15: [pre-course anonymous questionnaire Walk-through](https://raw.githack.com/r4bds/r4bds.github.io/main/pre_course_questionnaire_summary.html) - 08.15 - 08.30: Recap: RStudio Cloud, RStudio and R - The Very Basics (Live session) - 08.30 - 09.00: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab02.html) - 09.00 - 09.15: Break diff --git a/pre_course_questionnaire_summary.html b/pre_course_questionnaire_summary.html index 87bb73d..a263448 100644 --- a/pre_course_questionnaire_summary.html +++ b/pre_course_questionnaire_summary.html @@ -1174,178 +1174,121 @@

    Is there any specific area of bio-research you would like to see covered? -
    -

    General Bioinformatics & Data Analysis:

    -
      -
    • General interest in bioinformatics.
    • -
    • Comfortable handling different types of biological data in R.
    • -
    • How to work with data and make it easier to analyze.
    • -
    • Visualizations of big data.
    • -
    +
    +

    Data Analysis & Machine Learning

    +

    “I’m interested in more advanced statistics, mapping, and machine learning techniques applied to biological data, especially handling large datasets like RNA-seq and proteomics.”

    -
    -

    Genetics & Genomics:

    -
      -
    • Genetics, Genomics, and Evolution.
    • -
    • Transcriptomics, Metagenomics, Genomic data analysis, and RNA-seq.
    • -
    • Single cell omics, Bulk RNA-seq data manipulation.
    • -
    • Gene expression data analysis in R.
    • -
    +
    +

    Genetics and Genomics

    +

    “I would like to see topics related to gene-based disease discovery, genome sequencing, and CRISPR applications in bio-research.”

    -
    -

    Disease & Medical Research:

    -
      -
    • Personalized medicine and precision medicine.
    • -
    • Research related to specific diseases: obesity, cancer, autoimmune diseases, infectious diseases.
    • -
    • Drug design, clinical drug trials, and drug trial data analysis.
    • -
    • Analysis of complex cancer patient data.
    • -
    +
    +

    Immunology

    +

    “Immunology research, particularly related to autoimmune diseases and tumor immunology, would be valuable to explore.”

    -
    -

    Immunology & Microbiology:

    -
      -
    • Immunology, especially related to MHC, neoantigens, antibodies, and antigens.
    • -
    • Clinical research in R related to immunology.
    • -
    • Gut microbiome, Microbiome studies on cancer, and Microbiologic studies.
    • -
    • Immune response bio-research and immune system or stem cells.
    • -
    +
    +

    Multi-omics & Omics Data

    +

    “Multi-omics approaches, including proteomics, genomics, and the analysis of RNA-seq data, are areas I’m very interested in.”

    -
    -

    Advanced Computational Techniques:

    -
      -
    • Predictive modeling and visualizations of complex networks/pathways.
    • -
    • Deep learning in R and Artificial Intelligence.
    • -
    • Analysis of peptide sequencing via mass spec (TD-search and de novo).
    • -
    +
    +

    Epidemiology & Clinical Data

    +

    “It would be helpful to cover epidemiology topics, including predictive modeling of disease outbreaks and the analysis of clinical datasets.”

    -
    -
    -

    Specific Bio-research Topics:

    -
      -
    • Food-related research.
    • -
    • Ecology.
    • -
    • Plastic degradation by microorganisms or enzymes.
    • -
    • CO2 capture by microorganisms.
    • -
    • Mass screening and patient profiling, especially for cancer.
    • -
    • LC-MS data, peptide prediction from proteins.
    • -
    +
    +
    +
    +

    Briefly, what are your general expectations to this course?

    -
    -

    Others:

    -
      -
    • Some students have expressed that they are open to any topic or aren’t particularly focused on a specific area.
    • -
    • A few are excited about the course in general and don’t have specific preferences.
    • -
    • There’s interest in the integration of bioinformatics with scientific articles and hospital data.
    • -
    +
    +

    R Programming Proficiency

    +

    “I expect to become proficient in R programming, gaining confidence in writing, interpreting, and using R code in a variety of bio-research contexts. I aim to improve my R skills for future work in biological research and industry.”

    -
    -
    -
    -

    Briefly, what are your general expectations to this course?

    +
    +
    +

    Data Handling and Analysis

    +

    “My goal is to learn how to effectively handle and analyze large datasets, including cleaning, organizing, and applying statistical methods to biological data. I hope to gain the ability to manage big data and automate data analysis tasks in R.”

    -
    -

    R Proficiency:

    -
      -
    • Many students wish to gain or improve proficiency in using R for data analysis.
    • -
    • There’s an emphasis on understanding the R environment, syntax, and packages.
    • -
    • Some students are already familiar with R and wish to polish and expand their skills, while others are complete beginners hoping to grasp the basics.
    • -
    +
    +

    Practical Applications in Biology

    +

    “I want to apply the R programming skills learned in this course to real-world biological problems, such as genomic data analysis, bioinformatics, and pipeline development. Understanding how to use R in practical bio-research scenarios is a key expectation.”

    -
    -

    Bioinformatics and Biology Application:

    -
      -
    • Many students want to learn how to apply R specifically to biological and bioinformatic datasets.
    • -
    • They expect to work with real-life bio data and learn how to handle and analyze data relevant to their bio studies.
    • -
    +
    +

    Visualization & Data Presentation

    +

    “I hope to develop skills in visualizing and presenting biological data in a clear and effective way. Learning how to create plots and presentations that make large datasets more accessible is a critical aspect I expect to master.”

    -
    -

    Data Visualization & Manipulation:

    -
      -
    • Students are keen to learn about data visualization and manipulation in R.
    • -
    • They are interested in using R for creating plots, visualizations, and handling various data types.
    • -
    +
    +

    Confidence and Efficiency in Using R

    +

    “I aim to feel more confident and efficient in using R for bio-data analysis by the end of this course. I hope to reduce the time spent on coding, improve the readability of my code, and tackle intermediate challenges independently.”

    + + + +
    +
    +
    +

    Is there anything you would like to add? Comments, suggestions, anything?

    -
    -

    Practical Skills for Future Application:

    -
      -
    • Several students hope the course will prepare them for future projects, research, or roles that require data analysis.
    • -
    • Some students are interested in using the skills they gain in this course for their thesis or future studies.
    • -
    • A few want to be able to transfer the knowledge they gain to other programming languages, like Python.
    • -
    +
    +

    Course Pace & Structure

    +

    “No rushing through the learning material would greatly benefit my understanding. A slower, more deliberate pace will help those of us who are new to programming.”

    -
    -

    Data Analysis Techniques:

    -
      -
    • Students wish to learn about different data analysis methods, including statistics, RNA sequencing data processing, etc.
    • -
    • They hope to understand how to organize, clean, and interpret data.
    • -
    +
    +

    Learning Resources

    +

    “It would be really helpful to have additional learning resources, such as recommended books or websites, to support learning outside of class. Pointers for getting extra help would be appreciated.”

    -
    -

    Tool and Package Familiarity:

    -
      -
    • Some students want to familiarize themselves with specific R packages, like tidyverse.
    • -
    • There’s an interest in learning how to utilize GitHub for shared programming.
    • -
    • Several mention wanting to know how to use specific tools for data analysis, such as data wrangling and lab notebook style coding.
    • -
    +
    +

    Industry Applications

    +

    “Including company talks with content related to protein engineering and data handling would be valuable. It would be great to see how skills from the course can be applied to real industry problems, particularly in the context of protein data.”

    -
    -

    Learning Environment:

    -
      -
    • A few students mentioned hoping for a structured or gradual introduction, especially for those without prior knowledge.
    • -
    • Some have heard from past students and have expectations based on word-of-mouth.
    • -
    • A couple of students are concerned about the timing and structure of the exam.
    • -
    +
    +

    Community & Atmosphere

    +

    “Looking forward to the course and excited about the learning experience! There’s a general sense of anticipation and eagerness to start the class.”

    -
    -

    Miscellaneous:

    -
      -
    • There are mentions of topics like the application of R in various biological datasets, multiomics, and genetics.
    • -
    • Some are looking forward to learning how to plan scientific studies or adjust chosen methods.
    • -
    • A few students don’t have specific expectations, while others hope for a challenging but rewarding experience.
    • -
    +
    +

    Individual Programming Projects

    +

    “I’d love to focus more on coding custom functions and understanding how they can be applied in different contexts, beyond just the basics.”

    @@ -1385,8 +1328,8 @@

    This course

    This course - In other words

    • Creates the foundation for you to explore the multitude of bioinformatics subjects
    • -
    • Gives you concrete skills to handle (almost) any kind of bio data
    • -
    • Trains your collaborative and communicative meta skills
    • +
    • Gives you concrete tool skills to handle (almost) any kind of bio data and to do collaborative coding projects
    • +
    • Trains your general collaborative and communicative meta skills
    diff --git a/pre_course_questionnaire_summary.qmd b/pre_course_questionnaire_summary.qmd index ac988d3..aa4e338 100644 --- a/pre_course_questionnaire_summary.qmd +++ b/pre_course_questionnaire_summary.qmd @@ -91,180 +91,140 @@ d_pca_obj_aug |> + -## General Bioinformatics & Data Analysis: - -- General interest in bioinformatics. -- Comfortable handling different types of biological data in R. -- How to work with data and make it easier to analyze. -- Visualizations of big data. +## Data Analysis & Machine Learning +_"I'm interested in more advanced statistics, mapping, and machine learning techniques applied to biological data, especially handling large datasets like RNA-seq and proteomics."_ -## Genetics & Genomics: - -- Genetics, Genomics, and Evolution. -- Transcriptomics, Metagenomics, Genomic data analysis, and RNA-seq. -- Single cell omics, Bulk RNA-seq data manipulation. -- Gene expression data analysis in R. +## Genetics and Genomics +_"I would like to see topics related to gene-based disease discovery, genome sequencing, and CRISPR applications in bio-research."_ -## Disease & Medical Research: +## Immunology +_"Immunology research, particularly related to autoimmune diseases and tumor immunology, would be valuable to explore."_ -- Personalized medicine and precision medicine. -- Research related to specific diseases: obesity, cancer, autoimmune diseases, infectious diseases. -- Drug design, clinical drug trials, and drug trial data analysis. -- Analysis of complex cancer patient data. + + + +## Multi-omics & Omics Data +_"Multi-omics approaches, including proteomics, genomics, and the analysis of RNA-seq data, are areas I’m very interested in."_ + -## Immunology & Microbiology: - -- Immunology, especially related to MHC, neoantigens, antibodies, and antigens. -- Clinical research in R related to immunology. -- Gut microbiome, Microbiome studies on cancer, and Microbiologic studies. -- Immune response bio-research and immune system or stem cells. 
+## Epidemiology & Clinical Data +_"It would be helpful to cover epidemiology topics, including predictive modeling of disease outbreaks and the analysis of clinical datasets."_ -## Advanced Computational Techniques: +# Briefly, what are your general expectations to this course? + -- Predictive modeling and visualizations of complex networks/pathways. -- Deep learning in R and Artificial Intelligence. -- Analysis of peptide sequencing via mass spec (TD-search and de novo). -## Specific Bio-research Topics: -- Food-related research. -- Ecology. -- Plastic degradation by microorganisms or enzymes. -- CO2 capture by microorganisms. -- Mass screening and patient profiling, especially for cancer. -- LC-MS data, peptide prediction from proteins. +## R Programming Proficiency +"I expect to become proficient in R programming, gaining confidence in writing, interpreting, and using R code in a variety of bio-research contexts. I aim to improve my R skills for future work in biological research and industry." -## Others: - -- Some students have expressed that they are open to any topic or aren't particularly focused on a specific area. -- A few are excited about the course in general and don't have specific preferences. -- There's interest in the integration of bioinformatics with scientific articles and hospital data. +## Data Handling and Analysis +"My goal is to learn how to effectively handle and analyze large datasets, including cleaning, organizing, and applying statistical methods to biological data. I hope to gain the ability to manage big data and automate data analysis tasks in R." -# Briefly, what are your general expectations to this course? +## Practical Applications in Biology +"I want to apply the R programming skills learned in this course to real-world biological problems, such as genomic data analysis, bioinformatics, and pipeline development. Understanding how to use R in practical bio-research scenarios is a key expectation." 
-## R Proficiency: - -- Many students wish to gain or improve proficiency in using R for data analysis. -- There's an emphasis on understanding the R environment, syntax, and packages. -- Some students are already familiar with R and wish to polish and expand their skills, while others are complete beginners hoping to grasp the basics. +## Visualization & Data Presentation +"I hope to develop skills in visualizing and presenting biological data in a clear and effective way. Learning how to create plots and presentations that make large datasets more accessible is a critical aspect I expect to master." -## Bioinformatics and Biology Application: - -- Many students want to learn how to apply R specifically to biological and bioinformatic datasets. -- They expect to work with real-life bio data and learn how to handle and analyze data relevant to their bio studies. - +## Confidence and Efficiency in Using R +"I aim to feel more confident and efficient in using R for bio-data analysis by the end of this course. I hope to reduce the time spent on coding, improve the readability of my code, and tackle intermediate challenges independently." -## Data Visualization & Manipulation: +# Is there anything you would like to add? Comments, suggestions, anything? -- Students are keen to learn about data visualization and manipulation in R. -- They are interested in using R for creating plots, visualizations, and handling various data types. -## Practical Skills for Future Application: - -- Several students hope the course will prepare them for future projects, research, or roles that require data analysis. -- Some students are interested in using the skills they gain in this course for their thesis or future studies. -- A few want to be able to transfer the knowledge they gain to other programming languages, like Python. +## Course Pace & Structure +"No rushing through the learning material would greatly benefit my understanding. 
A slower, more deliberate pace will help those of us who are new to programming." -## Data Analysis Techniques: - -- Students wish to learn about different data analysis methods, including statistics, RNA sequencing data processing, etc. -- They hope to understand how to organize, clean, and interpret data. +## Learning Resources +"It would be really helpful to have additional learning resources, such as recommended books or websites, to support learning outside of class. Pointers for getting extra help would be appreciated." -## Tool and Package Familiarity: - -- Some students want to familiarize themselves with specific R packages, like tidyverse. -- There's an interest in learning how to utilize GitHub for shared programming. -- Several mention wanting to know how to use specific tools for data analysis, such as data wrangling and lab notebook style coding. +## Industry Applications +"Including company talks with content related to protein engineering and data handling would be valuable. It would be great to see how skills from the course can be applied to real industry problems, particularly in the context of protein data." -## Learning Environment: - -- A few students mentioned hoping for a structured or gradual introduction, especially for those without prior knowledge. -- Some have heard from past students and have expectations based on word-of-mouth. -- A couple of students are concerned about the timing and structure of the exam. +## Community & Atmosphere +"Looking forward to the course and excited about the learning experience! There's a general sense of anticipation and eagerness to start the class." -## Miscellaneous: - -- There are mentions of topics like the application of R in various biological datasets, multiomics, and genetics. -- Some are looking forward to learning how to plan scientific studies or adjust chosen methods. -- A few students don't have specific expectations, while others hope for a challenging but rewarding experience. 
- +## Individual Programming Projects +"I'd love to focus more on coding custom functions and understanding how they can be applied in different contexts, beyond just the basics." @@ -304,8 +264,8 @@ d_pca_obj_aug |> ## This course - In other words - Creates the foundation for you to explore the multitude of bioinformatics subjects -- Gives you concrete skills to handle (almost) any kind of bio data -- Trains your collaborative and communicative meta skills +- Gives you concrete tool skills to handle (almost) any kind of bio data and to do collaborative coding projects +- Trains your general collaborative and communicative meta skills