-
Notifications
You must be signed in to change notification settings - Fork 5
/
lab07.qmd
380 lines (273 loc) · 23.4 KB
/
lab07.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
# Lab 7: Collaborative Bio Data Science using GitHub via RStudio {.unnumbered}
## Package(s)
- [usethis](https://usethis.r-lib.org)
- [gitcreds](https://gitcreds.r-lib.org/reference/gitcreds_get.html)
- [git](https://git-scm.com/) (Actually not an R-package this time)
## Schedule
- 08.00 - 08.10: [Midway evaluation](https://raw.githack.com/r4bds/r4bds.github.io/main/midway_eval_lab07.html)
- 08.10 - 08.30: [Recap of Lab 6](https://raw.githack.com/r4bds/r4bds.github.io/main/recap_lab07.html)
- 08.30 - 09.00: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab07.html)
- 09.00 - 09.15: Break
- 09.00 - 12.00: [Exercises](#sec-exercises)
## Learning Materials
_Please prepare the following materials_
- Book: [Happy Git and GitHub for the useR](https://happygitwithr.com/) -- Read chapters 1 (intro), 20 (basic terms), 22 (branching), 23 (remotes). Do not pay much attention to syntax of specific commands, because we are not going to use them during the exercises, focus on the idea
- Book: [Introduction to Data Science - Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry: Chapter 40 Git and GitHub](https://rafalab.github.io/dsbook/git.html) -- Some of the information here is redundant with the previous book, but very important thing is a visualization of basic git actions and screenshots of how to perform them using RStudio
- Video: [RStudio and Git - an Overview (Part 1)](https://www.youtube.com/watch?v=KjLycV1IWqc) -- Basic git concepts, for those who prefer listen rather than read. Books, however, contain more information
- Video: [How to use Git and GitHub with R](https://www.youtube.com/watch?v=kL6L2MNqPHg) -- Basic operating on git in RStudio. Complementary to second book. You can skip to 2:50, we are not going to link to git manually either way
_Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book_
## Learning Objectives
_A student who has met the objectives of the session will be able to:_
- Explain why reproducible data analysis is important
- Identify associated relevant challenges
- Describe the components of a reproducible data analysis
- Use RStudio and GitHub (git) for collaborative bio data science projects
## Exercises {#sec-exercises}
### Prologue
![](images/git_collab.gif){fig-align="center" width=80%}
*Using `git` is completely central for collaborative bio data science. You can use `git` not only for `R`, but for any code-based collaboration. All bioinformatics departments in major players in Danish Pharma, are using `git` - Learning `git` is a key bio data science skill! Today, you will take your first steps - Happy Learning!*
- **T1:** Find the `GitHub` repository for the `ggplot2` R-package
- **Q1:** How many Contributors are there?
- **T1:** Find the `GitHub` repository for the `Linux kernel`
- **Q2:** How many Contributors are there?
- **Q3:** Discuss in your group why having an organised approach to version control is central? And consider the simple contrast of the challenges you were facing when doing the course assignments.
### Getting Started
_First, make sure to read and discuss the feedback you got from last week's assignment!_
<span style="color: red;">**The following exercises have to be done in your groups! You must move at the same pace and progress together as a team through the exercises!**</span>
<span style="color: red;">**Be aware of which editor you are using in RStudio! You must work in the "Visual" editor instead of the "Source" editor. ALL OF YOU! If one of you works in a different editor, it could cause a conflict issue!**</span>
*GitHub is **the** place for collaborative coding and different group members will have to do different tasks in a specific order, to make it through the exercises together, so... Team Up and don't rush it!*
First, select a team **<span style="color:darkgreen;">Captain</span>**, that person will have to carry out specific tasks. If **<span style="color:darkgreen;">Team</span>** is stated, then that refers to all group members and lastly if **<span style="color:darkgreen;">Crew</span>** is stated, then that is everyone but the **<span style="color:darkgreen;">Captain</span>**. Please note, that tasks are sequential, so if a task is assigned to the **<span style="color:darkgreen;">Captain</span>**, then the **<span style="color:darkgreen;">Crew</span>** has to await completion before proceeding!
#### **<span style="color:darkgreen;">Team</span>**
1. Go to [GitHub](https://github.com) and login
1. In the upper right corner, click on your profile picture. Then in the menu, click on "Your Organizations" and go into rforbiodatascienceYY (replace `YY` with year). If you are not a member, contact a TA NOW before proceeding. Next, go into repositories (top left).
#### **<span style="color:darkgreen;">Captain</span>**
1. Click "New repository"
1. Since you created the project from the organisation, the default owner of the repository is the organisation (rforbiodatascienceYY). Verify and keep that, please.
1. Name the repository `groupXX`, where `XX` is your group number, e.g. `02`
1. Select `Public`
1. Tick `Add a README file`
1. Click `Create repository`
1. Click `Settings` in the menu line starting with `<> Code`
1. Under `Access`, click `Collaborators and teams`
1. Click `Add people`
1. Write a group members username and select the role "Maintain".
1. Click the correct suggestion
1. Select `Add username to this repository`
1. Repeat for all group members
#### **<span style="color:darkgreen;">Crew</span>**
1. Refresh the page and go into the new repository. If you don't see it, check your mail and accept your invitation to join the repository.
#### **<span style="color:darkgreen;">Team</span>**
1. [Click here](https://teaching.healthtech.dtu.dk/22100/rstudio.php) to go to the course RStudio cloud server and login
1. In the upper right corner, where it should say ![](images/paths_new_proj.png)`r_for_bio_data_science`, click
1. Choose ![](images/paths_new_proj_plus.png) `New Project...`
1. Select ![](images/icon_version_control.png) **Version Control**
1. Select ![](images/icon_git.png) **Git**
1. Under `Repository URL:`, enter `https://github.com/rforbiodatascienceYY/groupXX`, where again you replace `YY` with year and replace `XX` with your group number, e.g. `02`.
1. Under `Project directory name:`, enter `lab07_git_exercises`
1. Under `Create project as a subdirectory of:`, make sure that is says `~/projects`
1. Click `Create Project`
*Congratulations! You have just cloned your first GitHub repository!*
In your `Files` pane, you should now see:
![](images/git_first_files.png){fig-align="center" width=40%}
...and now in your `Environment` pane, you should see a new `git` tab:
![](images/git_environment_pane.png){fig-align="center" width=50%}
If you click the `git` tab, you should see:
![](images/environment_pane_git_tab.png){fig-align="center" width=50%}
#### Setting up your Credentials and Personal Access Token (**<span style="color:darkgreen;">Team</span>**)
*GitHub doesn't know who you are, you're just someone who cloned your first GitHub repository and now you want to do all sorts of stuff! We can't have that, so let's fix that GitHub doesn't know who you are!*
1. In the `Console`, run the command:
```{r}
#| echo: true
#| eval: false
usethis::use_git_config(user.name = "YOUR_GITHUB_USERNAME", user.email = "THE_EMAIL_YOU_USED_TO_CREATE_YOUR_GITHUB_ACCOUNT")
```
2. In the `Console`, run the command:
```{r}
#| echo: true
#| eval: false
usethis::create_github_token()
```
3. Login
4. Under `Note`, where it says `DESCRIBE THE TOKEN'S USE CASE`, delete the text and write e.g.`R for Bio Data Science lab 7 git exercises`
5. Under `Expiration`, where it says `30 days`, change that to `90 days`
6. Do not change any other setting, but simply scroll down and click `Generate token`
7. A new `Personal access tokens (classic)` page will appear, stating your personal access token, which starts with `ghp_`, go ahead and copy it
8. Store the `PAT` somewhere save, e.g. in a password management tool. It will serve as your password when connecting to the repository via R.
9. Again in the console, run the below command and enter your `PAT` when prompted for the password:
```{r}
#| echo: true
#| eval: false
gitcreds::gitcreds_set()
```
Stop, wait a minute and make sure, that **ALL** group members are at this stage of the exercises.
#### Making your Credentials and Personal Access Token stick around (**<span style="color:darkgreen;">Team</span>**)
*The thing is, we're actually working on a Linux server here, this entails a that you're credentials and PAT exists in a cache, which is cleared every 15 minutes. This. Is. Annoying! So let's fix it!*
First of all RStudio is a (G)raphical (U)ser (I)nterface, meaning, that it allows you to do pointy-clicky stuff, but at the end of the day, everything happens at the command line. That also goes for git, so your button-pushing simply gets converted into commands, which are executed at the command line. If you are not comfortable with the command line, don't worry, the pointy-clicky stuff will be sufficient for now! If you are comfortable with the command line, I advice to get comfortable with git on the command line as well, you do get some extra bang-for-the-buck!
Now, a brief visit of the command line:
1. In the `Console` pane, click the `Terminal` tab, which gains you access to the system underlying RStudio
1. You should see something like `user@pupilX:/net/pupilX/home/people/user/projects/lab07_git_exercises$`
1. Enter `ls -a`, this will `l`ist `a`ll the files, compare with your `Files` pane, you will see that RStudio is "hiding" something from you, namely the `.git` folder
1. Enter `ls -a .git` and you will see the content of the folder, this is where all the git-magic happens
1. See the config file? That holds the configuration, enter `git config --global --list`
1. Recall you entered your GitHub username and mail? This is where it ended up! Note the `credential.helper=cache`, which tells us, that the credentials are being cached. Now, enter `git config --global credential.helper 'cache --timeout=86400'`. Hint: You can paste with Ctrl-Shift-V in the terminal on Windows.
1. Re-run the command `git config --global --list` and confirm the change
1. Go back to your `Console` and run the command `gitcreds::gitcreds_set()`. Select option 2, and re-enter your PAT.
1. If you didn't get a list of options in the prior step, restart your R session and retry.
*Congratulations! You have now used the command line and you will forever be part of an elite few, who know that everything you see in hacker-movies is BS, except perhaps for Wargames and Mr. Robot! Also, your PAT will now be good for 24h*
### Your first collaboration
#### **<span style="color:darkgreen;">Team</span>**
1. Create a new Quarto document, title it `student_id` and save id as `student_id.qmd`, where `student_id` is your... You guessed it! **Make sure that "Use visual markdown editor" is checked.**
1. In the `environment` pane, click the `git` tab
1. Tick the 3 boxes under `staged`
1. Click `commit`
1. In the upper right corner, add a `Commit message`, e.g "First commit by student_id"
1. Click the `Commit` button
1. A pop-up, will give you details on your commit, look through them and then click `Close`
1. Now, very important **ALWAYS** click the ![](images/pull.png) `Pull` button **BEFORE** clicking the ![](images/push.png) `Push` button
1. Clicking `Pull`, you should see "Already up to date."
1. Then click `Push`
- **Q4:** Discuss in your group what each of the steps did and why they are performed?
### Your next collaboration
*That first collaboration was easy right!? Well, you were all working on different files...*
#### **<span style="color:darkgreen;">Captain</span>**
1. Create a new Quarto Document, title it "Group Document" and save it as `group_document.qmd`
1. Using markdown headers, create one section for each group member, incl. yourself. Here, you can use your names, student ids or whatever you deem appropriate
1. Again in the `Environment` pane, make sure you are in the `git` tab and then tick the box next to `group_document.qmd`. If you do not see the document in this window after saving, refresh the browser and it should appear.
1. Click `Commit`
1. Add a commit message, e.g. "Add group document"
1. Click the `Commit` button
1. Click `Pull` (You should be "Already up to date.")
1. Click `Push`
1. Click `Close`
1. Again, go to the group GitHub and confirm that you see the new document you just created
#### **<span style="color:darkgreen;">Team</span>**
1. Click `Pull` and confirm that you now also have the file `group_document.qmd`
1. Open `group_document.qmd` and find your assigned section
1. In your section and your section only, enter some text, add a few code chunks with some `R`-code
1. Make sure to save the document
1. Now again, find the `group_document.qmd` in the `git` tab of the `Environment` pane and tick the box under `Staged`
1. Click `Commit`
1. Note how your changes to the document are highlighted in green
1. Add a commit message, e.g. "Update the STUDENT_ID section" and click `Commit`
1. Click `Close`
1. Click `Pull`
1. Click `Push`
1. Go to the group GitHub, find the `group_document.qmd` and click it, do you see your changes?
Once **everyone** has added to their assigned section, **everyone** should do a pull/push, so **everyone** has the complete version of the `group_document.qmd`.
### Your first branching
Ok that's pretty great so far - Right? The thing is... Consider, the `ggplot2` repository, that you found in *T1*. Thousands of companies and even more thousands of people rely on `ggplot` for advanced data visualisation. What would happen, if you wanted to add a new feature or wanted to optimise an existing one, while people were actively installing your package? They would get what stage your code was in, which may or may not be functional - Enter branching!
Below here, is an illustration, consider the **<span style="color:green;">Master</span>** the stable version people can download and use and then **<span style="color:blue;">Your Work</span>** will be the feature or update that you personally are working on. **<span style="color:orange;">Someone Else's Work</span>** will be another team and that persons work on a new feature or update. Before the last **<span style="color:green;">green</span>** circle, note how both the **<span style="color:blue;">Your Work</span>**- and **<span style="color:orange;">Someone Else's Work</span>**-branches are merged onto the **<span style="color:green;">Master</span>**-branch.
![](images/git_branching_merging.png){fig-align="center" width=90%}
*There can be even more branches*
- **Q5:** Again, find the `ggplot` GitHub site and see if you can find how many branches there are?
#### **<span style="color:darkgreen;">Team</span>**
1. Go to your RStudio session and in the `git` tab of the `Environment` pane, click ![](images/git_new_branch.png) `New Branch`
1. In branch name, enter your student id
1. You should see a pop-up `Branch 'STUDENT_ID' set up to track remote branch 'STUDENT_ID' from 'origin'.`, go ahead and click `Close`
1. Next to where you clicked `New Branch` just before, it should now say `STUDENT_ID`, click it and confirm that you see `STUDENT_ID` and `Main`
1. Look at the illustration above and make sure you are on par with that from the original branch `main`, which is equivalent with **<span style="color:green;">Master</span>**, you have created a new branch `STUDENT_ID`, which is equivalent to **<span style="color:blue;">Your Work</span>**
1. Click `STUDENT_ID` and you will get a confirmation that you are already on that branch and that you are up-to-date, click `Close`
1. In the `group_document.qmd` under your section, using markdown, enter a new sub-header and name it e.g. `New feature` or `New analysis`. Make sure that you all use the "Visual" editor for the .qmd file!
1. Enter some text, a few code chunks with a bit of `R`-code of your choosing and save the document
1. Again, in the `git` tab, tick the box under staged and click `Commit`
1. Add a commit message, e.g. `New feature from STUDENT_ID` and click `Commit`
1. Click `Close`, `Pull`, `Close`, `Push` and `Close` and then close the commit window
1. Go to your group GitHub and confirm that you now have at least 2 branches
1. Make sure you are in the main branch and then click the `group_document.qmd`, you should now **not** see your new feature/analysis that you added
1. On the left, where it says `main`, click and select your `STUDENT_ID`, you **should** now see your new feature/analysis that you added
*Congratulations! You have now succesfully done your first branching!*
### Your first branch merging
#### **<span style="color:darkgreen;">Team</span>**
1. Go to your groups GitHub page, at the top it should say `STUDENT_ID had recent pushes...`, click the `Compare & pull request`
1. Your commit message will appear and where it says `Leave a comment`, add a comment like e.g. `I'm done, all seems to be working now!` or similar
1. Click `Create pull request`
1. It should now say `This branch have no conflicts with the base branch`, confirm and click `Merge pull request`
1. Click `Confirm merge` after which it should now say `Pull request successfully merged and closed`
1. You have now fully merged, so go ahead and click `Delete branch`
1. Revisit the previous illustration and compare with your branch workflow, make sure that everyone in the groups are on par here
1. Finally return to your RStudio session and make sure you switch to the `main` branch
*Congratulations! You have now successfully done your first branch merge!*
But wait, what was this `Pull request`??? What you actually did, was:
1. Created a new branch
1. Completed a new feature/analysis
1. Push'ed the new feature/analysis to GitHub
1. Created a `Pull request` for merging your branch `STUDENT_ID` into the `main` branch
1. Approved and completed the `Pull request`
Think about e.g. again the `ggplot2` repo, if anyone could create a new branch and then do as described above, then there would be no way of making quality control. Therefore, typically such pull requests will have to be approved by someone. This can either be someone who is close-to-the-code e.g. in the case of `ggplot2`, that'd be someone like Hadley. In a company, then that might be some senior developer approving junior developers pull requests. At one point you might have seen something like "The main branch is unprotected", this is exactly that!
### Your first merge conflict
Okay, that's all good an well. Seems easy and straight forward, right? Well, that is as long as we don't have a `conflict`. A conflict is when two or more changes to one file are not compatible, consider:
```{verbatim}
Top 10 Bio Data Science Languages of ALL Time:
10. It is
9. Impossible
8. to rank them
7. Because
6. programming is subjective
5. and everyone
4. has
3. different
2. tastes
1. R
```
In fact, let's screw things up!
#### **<span style="color:darkgreen;">Captain</span>**
1. Go to the RStudio session and where you usually click `Quarto Document...`, this time let's just create a simple `Text File`
1. Copy/paste the `Top 10 Bio Data Science Languages of ALL Time` into the file
1. Save the file as `best_ds_langs.txt`
1. Make sure you are in the `main` branch
1. Do a commit/pull/push and check the file ended at your group github
#### **<span style="color:darkgreen;">Crew</span>**
Rrrrrrr... Let's commit mutiny! (Double pun intended)
1. Make sure you are in the `main` branch and then hit `Pull`
1. Open the `best_ds_langs.txt`
1. Each of you separately (wrongfully) replace `R` with an (inferior) bio data science language of your choice (you can really write anything)
1. Now, this time **we do not pull first**, simply commit and then push
1. Now you can `pull`
1. You will now get notified about a `merge conflict`
1. Close and re-open the `best_ds_langs.txt` file, it should look something like:
```{verbatim}
Top 10 Bio Data Science Languages of ALL Time:
10. It is
9. Impossible
8. to rank them
7. Because
6. programming is subjective
5. and everyone
4. has
3. different
2. tastes
<<<<<<< HEAD
1. C++
=======
1. Python
>>>>>>> 85283ca3246b7f7462ab633085f92ac5f173d3e7
```
#### **<span style="color:darkgreen;">Captain</span>**
Get your **<span style="color:darkgreen;">Crew</span>** in order!
1. Simply edit the file so that you delete everything below 2. and add the final line, which emphasises that `1. R is superior!`
1. The box under `Staged` looks a bit odd right, just click it
1. Click `Commit` and add a commit message, e.g. `Fixed merge conflict, got crew back in order!`
1. Click `Commit`, `Close`, `Pull`, `Close`, `Push`, `Close` and close the commit window
1. Return to your GitHub group and confirm all is in order and you can continue with confidence knowing that `R is superior!` (Check the file to make sure)
*Congratulations! You have now fixed your first merge conflict!*
Lastly, click the `History` button next to the `Push` button and explore the history of how your branches and commits have changed through these exercises.
### The .gitignore file
You may have noticed the `.gitignore` file?
#### **<span style="color:darkgreen;">One crew member</span>**
1. Go to your RStudio session, find the `.gitignore` file and click it
It contains a list of files and folders, which should **not** end up at your `git` repo, i.e. files which should be ignored. An important aspect of working with GitHub is that GitHub is meant for code, **not** data! Let's say we had a `data` and a `data/_raw/`, we could add those to the `.gitignore` file to avoid the data being included in our commits. Currently, files like `.RData` and folders like `.Rproj.user` should be listed in the `.gitignore` file, as these are unique to each of your own sessions and shouldn't be committed to the repo.
1. Update the `.gitignore` file and get the updated version to GitHub
### Summary
You have not gotten to play around with collaborative coding using `git` - Well done!
If you are more curious for more, please feel free to play around with a new file, edit commit/pull/push etc. Perhaps also take an extra look at the GitHub, explore and learn more 👍
### <span style="color: red;">GROUP ASSIGNMENT</span> (Important, see: [how to](assignments.qmd))
First of all, here we have just scratched the surface. You can [read much more here](https://happygitwithr.com).
Your assignment this time, will be to:
1. Go to [this post](https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/) on "PCA tidyverse style" by Claus O. Wilke, Professor of Integrative Biology
1. Again, create a reproducible micro-report *together*, this time, where you do a code-along with either the data in the post, the `gravier` data, and/or any other dataset.
1. Use your GitHub group repository to collaborate - you should all contribute. Again, make sure that you are using the "Visual" editor while working on this!
1. Your hand in will be a link to your micro-report in your GitHub group repository
1. *Note, focus here is on the `git` learning objectives and doing a code-along, which is not a copy/paste of the code, but using it as inspiration to create your own nice concise **tidy** micro-report!*
1. Make sure to check the [Assignment Guidelines](https://r4bds.github.io/assignments.html)
1. And also follow the [Course Code Styling](https://r4bds.github.io/code_styling.html)
1. **HOW TO HAND IN**: Go to `http://github.com/rforbiodatascienceYY/groupXX`, replace `YY` with year and `groupXX` with appropriate group number, then copy the link and paste it into an empty text (.txt) file. Hand in this text file. **No need to zip your html file,** since we are accessing it through the repository!