-
Notifications
You must be signed in to change notification settings - Fork 10
/
class1.qmd
284 lines (191 loc) · 11.3 KB
/
class1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
---
title: "R Coding Basics"
subtitle: "Coding Basics, Day 1"
author: "Matthew Sutcliffe, Madeline Gillman, JP Flores"
format:
html:
toc: true
---
```{css, echo = FALSE}
.output {
max-height: 500px;
overflow-y: scroll;
}
```
## Introduction
> Many biologists starting out in bioinformatics tend to equate "learning bioinformatics" with "learning how to run bioinformatics software"... This is analogous to thinking "learning molecular biology" is just "learning pipetting."
>
> --- Vince Buffalo
In Vince's quote above, replace "bioinformatics" with "coding."
Our goal for How to Learn to Code is to familiarize students with the R programming language and RStudio environment, equip students with the skills and knowledge to wrangle, visualize, and analyze data, and to provide a foundation for more advanced coding skills.
In Module 1: Coding Basics, we will cover:
- Variables
- Reproducible environments
- RStudio IDE
- Various R script and file formats
- R syntax
- Commenting, writing, and executing code
- Functions
- Data structures in R
- Data types in R
- Manipulating data types and structures
Curious about what the rest of the classes will look like?
- Module 1: Coding Basics
- Module 2: Data Visualization
- Module 3: Data Wrangling
- Module 4: Project Management (and applying everything you've learned to a real-world dataset!)
## Objectives of Coding Basics: Class 1
- Be able to create a variable, define what it is, and follow good variable naming practices
- Understand basic data structures in R
- Understand basic data types in R
- Perform basic manipulations with data structures and types
- Describe benefits of knowing how to code
## Exploring a dataset
R has a few built in datasets that we can use until we cover installing/loading packages and reading in data files. For the following examples we will use a built-in dataset in R called "iris" that has some measurements across a few species of flowers. It is one of the most popular built-in datasets in R. We will use this dataset to explore key coding concepts: **variables**, **data types**, and **functions**.
First, let's take a look at the dataset. You can view the dataset multiple ways. Let's try one--copy the below line of code into your console and run it.
```{r}
#| eval: false
iris
```
As we can see, this dataset has a few columns of numbers, in addition to the species. Let's try a few other ways to look at this dataset. As you try each method, think about what is different about each method. When would one method be more beneficial than another?
```{r}
head(iris)
```
```{r}
#View(iris)
```
You are probably already thinking of questions you need the answers to in order to familiarize yourself with this dataset. What does each row represent? Each column? How many observations (rows) do we have? What is the average petal length? Think about other questions you may want to ask. Think about how you would go about answering those questions with what you already know. Maybe you'd count each row on your screen to get the number of observations, or copy the values under `Petal.Length` into your phone calculator to calculate the mean. By the end of this class, you'll be able to do all those things very quickly in R!
## Variables
A variable is a named space in your computer's memory which can be referenced and manipulated. It's sort of a name you give "something", and that something can be just about anything.
![https://mclark45.medium.com/variables-8d0ba47d9694](data/class1_files/variables.png)
Variables in R are created (assigned) using an arrow: `<-` The variable name always goes on the left, and the thing being assigned to that variable on the right. For example:
```{r}
greeting <- "Hello"
animal <- "panda"
age <- 51
```
The value something is assigned to is often referred to as the variable name. For example, the variable name of `"Hello"` is `greeting` . We used really basic variable names--just letters, that are real words, all lowercase. Of course, there are other ways to name variables too! Play around with variable names. Try using uppercase letters, symbols, and numbers. What works, and what doesn't? Come up with some rules for variable naming. Here's some variable naming ideas to get you started:
```{r}
#| eval: false
GrEeTiNg <- "Hello"
5greeting <- "Hello"
greeting.5 <- "Hello"
greeting@5 <- "Hello"
```
Now that you know some general rules for variable naming, we can refer to the [Style Guide](https://style.tidyverse.org/syntax.html#syntax) for "proper" variable/object naming. Update your variable naming rule to include the preferred style for variable names according to the Style Guide.
And now that we know how to properly name variables, assign the iris dataset to a variable!
```{r}
iris_dataset_copy <- iris
```
## Data types
As you probably know from your own work, data can come in many forms. You can classify dragons as either "purple" or "green" and also record the number of spines on their backs as numeric types (15, 27). Data types are important to understand in R because the type of data impacts what you can do with that data. For example, it wouldn't make sense to calculate a mean for the dragon color, but it would for the number of back spines.
In R, we will focus on three basic data types that are used specify the type of data stored in a variable (there are a few more, but you probably won't ever run into them): **character, numeric,** and **logical.**
**Character:** A character represents a string value. This can be anything from a single letter to entire paragraphs. Examples include `“a”, “B”, “c is third”, "5"`
**Numeric:** A decimal value. Examples include `1.0, 3.1415926535`.
**Logical:** Logical data types have only two possible values: `TRUE` or `FALSE`.
So far, we have learned about basic data structures (vectors, matrices, etc.) and basic data types (numeric, character, logical). Now, we want to start manipulating or *doing things* to them that can be helpful.
## Converting Data Types
For example, sometimes when we read in data from a file, numbers can appear as strings of characters rather than a "numeric" type.
```{r}
my_numbers <- c("4", "2", "7", "10")
print(my_numbers)
```
How can we tell? Because the numbers above are in quotations, indicating that they are of the `character` type and R is interpreting them as text. Before doing any math or further analysis with these data points, it's a good idea to convert them to the `numeric` type first.
```{r}
my_numbers <- as.numeric(my_numbers)
print(my_numbers)
```
Note that the quotations are now gone. Now, we can do basic (or more advanced) calculations like the ones below.
```{r}
# Get minimum out of a list of values
min(my_numbers)
```
```{r}
# Get maximum out of a list of values
max(my_numbers)
```
```{r}
# Get average (mean) out of a list of values
mean(my_numbers)
```
We can also sort this list of values to go from smallest to largest. After doing so, the smallest value will be first in the list and the largest value will be last.
```{r}
my_numbers <- sort(my_numbers)
my_numbers
```
We can reverse the order to go from largest to smallest. There is an option using the `sort` function to do this.
```{r}
my_numbers <- sort(my_numbers, decreasing = TRUE)
my_numbers
```
## Accessing parts of a list
One thing we'll be doing a lot of is looking at parts of our data. For example, we might want to look at individual items in a vector. These items could be numbers or characters.
```{r}
my_data <- c("A", "B", "C", "D", "E", "F")
my_data
```
In this case, let's say I'm really interested in that "E" and want to pull it out separately from the rest of the data. I can do that with "indexing". Here, I can tell that it's the 5th item in the list, so I can extract it using the following:
```{r}
my_data[5]
```
We can also extract multiple items. If we wanted "D", "E", and "F", we can get all the values from item 4 ("D") to item 6 ("F").
```{r}
my_data[4:6]
```
Let's say we forgot to include some of our data and now we want to add it to this list. We can update `my_data` to also include these values.
```{r}
my_data <- c(my_data, "G", "H", "I")
my_data
```
Before we move on, let's cover creating vectors. We already did this several times above, but didn't discuss it. Typically, we'll want to make vectors of numbers (e.g. our data values) or vectors of characters (e.g. labels for our data). Depending on whether we use quotes or not, R will interpret them as either numeric vectors or character vectors.
```{r}
# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)
numeric_vector
# Character vector
character_vector <- c("apple", "banana", "orange")
character_vector
```
Remember the iris dataset from earlier? Let's return to it to cover extracting some of the rows or columns from this data.
We can access specific columns in one of two ways. Typically, we will want to access it by the name of the column. We do this using the name of the data frame, followed by the dollar sign, and finally the name of the column. For example:
```{r}
iris$Petal.Length
```
If we knew which column it was (or it wasn't named), we can also use indexing. Inside the brackets, we will need to indicate which \[row , column\] we want from this data frame. Since we want all the rows, we will leave the "row" blank. We can see that the Petal.Length was the 3rd column.
```{r}
iris[, 3]
```
Let's say we didn't care the exact measurement of the Petal.Length of these flowers. We only cared whether they were "big" or not, and let's say that "big" is a Petal.Length of greater than 5.
```{r}
iris$Petal.Length > 5
```
Some of them are "big" (with values of TRUE) and many of them are "small" (with values of FALSE). We can add this information to our dataset by making another column. Similar to how we extracted this column, we can also make a new one (with a name of our choice).
```{r}
iris$BigPetals <- iris$Petal.Length > 5
```
And now it is added to our dataset.
```{r}
#| class: output
iris
```
## Functions
A function is a block of code that does a task. It only executes that task when it is called/executed. Using a function in R always follows the same basic format:
`function_name(arguments)`
The arguments are passed to the function, i.e. they are values that the function will manipulate. Functions can be built into R, included in packages, or you can write your own.
Functions can do very basic tasks:
```{r}
print("Hello world!")
```
Or more complex tasks, where multiple arguments are required, each separated by a comma:
```{r}
substr(x = "Hello world!", start = 2, stop = 4)
```
We have already been using functions throughout this class--some examples include `sort()`, `min()`, and `max()`.
We will be using functions all the time in How to Learn to Code, but for today just know what a function is and what an argument is. Whenever you use a function, it's important to ensure you understand what it's doing: are you getting the expected result? Are you using the input arguments correctly? That is not only crucial for learning how to code, but how to think like a coder.
## I already need help!
Since this is a built-in dataset, we can get some help. Try running the code below:
```{r}
?iris
?mean()
```
Adding a `?` before the name of a function or data frame (built-in or from a package) pulls up a help file in the Help tab of the Output pane. If you aren't sure what a function does, this should be your first step.