diff --git a/assignments/HW1.Rmd b/assignments/HW1.Rmd index 1ce1edc..7874e79 100644 --- a/assignments/HW1.Rmd +++ b/assignments/HW1.Rmd @@ -8,16 +8,14 @@ library(ggplot2) opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) ``` -*Enter your name and EID here* +**This homework is due on Jan. 25, 2024 at 11:00pm. Please submit as a pdf file on Canvas.** -**This homework is due on Jan. 17, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - -**Problem 1: (4 pts)** Demonstrate basic command of Markdown by creating a bulleted list with three items, a numbered list with three items, and a sentence that has one word in bold and one word in italics. +**Problem 1: (8 pts)** Demonstrate basic command of Markdown by creating a bulleted list with three items, a numbered list with three items, and a sentence that has one word in bold and one word in italics. *Your text goes here.* -**Problem 2: (3 pts)** The `economics` dataset contains various time series data from the US economy: +**Problem 2: (6 pts)** The `economics` dataset contains various time series data from the US economy: ```{r} economics @@ -29,7 +27,7 @@ Use ggplot to make a line plot of the total population (`pop`, in thousands) ver # your code goes here ``` -**Problem 3: (3 pts)** Again using the `economics` dataset, now make a scatter plot (using `geom_point()`) of the number of unemployed versus the total population (`pop`), and color points by date. +**Problem 3: (6 pts)** Again using the `economics` dataset, now make a scatter plot (using `geom_point()`) of the number of unemployed versus the total population (`pop`), and color points by date. ```{r} # your code goes here diff --git a/assignments/HW1.html b/assignments/HW1.html index 58002ab..427365b 100644 --- a/assignments/HW1.html +++ b/assignments/HW1.html @@ -353,15 +353,14 @@

Homework 1

-Enter your name and EID here
-This homework is due on Jan. 17, 2023 at 11:00pm. Please
+This homework is due on Jan. 25, 2024 at 11:00pm. Please
 submit as a pdf file on Canvas.
-Problem 1: (4 pts) Demonstrate basic command of
+Problem 1: (8 pts) Demonstrate basic command of
 Markdown by creating a bulleted list with three items, a numbered list with three items, and a sentence that has one word in bold and one word in italics.
 Your text goes here.
-Problem 2: (3 pts) The economics
+Problem 2: (6 pts) The economics
 dataset contains various time series data from the US economy:

economics
## # A tibble: 574 × 6
@@ -377,12 +376,12 @@ 

Homework 1

 ## 8 1968-02-01 534. 199920 12.3 4.5 3001
 ## 9 1968-03-01 544. 200056 11.7 4.1 2877
 ## 10 1968-04-01 544 200208 12.3 4.6 2709
-## # … with 564 more rows
+## # ℹ 564 more rows
 Use ggplot to make a line plot of the total population (pop, in thousands) versus time (column date).
 # your code goes here
-Problem 3: (3 pts) Again using the
+Problem 3: (6 pts) Again using the
 economics dataset, now make a scatter plot (using geom_point()) of the number of unemployed versus the total population (pop), and color points by date.

diff --git a/assignments/HW2.Rmd b/assignments/HW2.Rmd deleted file mode 100644 index 07033de..0000000 --- a/assignments/HW2.Rmd +++ /dev/null @@ -1,51 +0,0 @@ ---- -title: "Homework 2" ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) - -# data prep: -txhouse <- txhousing %>% - filter(city %in% c('Austin', 'Houston', 'San Antonio', 'Dallas')) %>% - filter(year %in% c('2000', '2005', '2010', '2015')) %>% - group_by(city, year) %>% - summarize(total_sales = sum(sales)) - -``` - -*Enter your name and EID here* - -**This homework is due on Jan. 24, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - - -**Problem 1: (3 pts)** We will work with the dataset `txhouse` that has been derived from the `txhousing` dataset provided by **ggplot2**. See here for details of the original dataset: https://ggplot2.tidyverse.org/reference/txhousing.html. `txhouse` contains three columns: `city` (containing four Texas cities), `year` (containing four years between 2000 and 2015) and `total_sales` indicating the total number of sales for the specified year and city. - -```{r} -txhouse -``` - -Use ggplot to make a bar plot of the total housing sales (column `total_sales`) for each `city` and show one panel per `year`. You do not have to worry about the order of the bars. Hint: Use `facet_wrap()`. See slides from Class 2. - - -```{r} -# your code goes here -``` - -**Problem 2: (3 pts)** Use ggplot to make a bar plot of the total housing sales (column `total_sales`) for each `year`. Color the bar borders with color `"gray20"` and assign a fill color based on the `city` column. - -```{r} -# your code goes here -``` - -**Problem 3: (4 pts)** Modify the plot from Problem 2 by placing the bars for each city side-by-side rather than stacked. Next, reorder the bars for each `year` by `total_sales` in descending order. See slides from Class 4. 
- -```{r} -# your code goes here -``` - - - - diff --git a/assignments/HW2.html b/assignments/HW2.html deleted file mode 100644 index adb50d8..0000000 --- a/assignments/HW2.html +++ /dev/null @@ -1,451 +0,0 @@ - - - - - - - - - - - - - -Homework 2 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Enter your name and EID here
-This homework is due on Jan. 24, 2023 at 11:00pm. Please
-submit as a pdf file on Canvas.
-Problem 1: (3 pts) We will work with the dataset
-txhouse that has been derived from the
-txhousing dataset provided by ggplot2. See
-here for details of the original dataset: https://ggplot2.tidyverse.org/reference/txhousing.html.
-txhouse contains three columns: city
-(containing four Texas cities), year (containing four years
-between 2000 and 2015) and total_sales indicating the total
-number of sales for the specified year and city.
-txhouse
-## # A tibble: 16 × 3
-## # Groups:   city [4]
-##    city         year total_sales
-##    <chr>       <int>       <dbl>
-##  1 Austin       2000       18621
-##  2 Austin       2005       26905
-##  3 Austin       2010       19872
-##  4 Austin       2015       18878
-##  5 Dallas       2000       45446
-##  6 Dallas       2005       59980
-##  7 Dallas       2010       42383
-##  8 Dallas       2015       36735
-##  9 Houston      2000       52459
-## 10 Houston      2005       72800
-## 11 Houston      2010       56807
-## 12 Houston      2015       48109
-## 13 San Antonio  2000       15590
-## 14 San Antonio  2005       24034
-## 15 San Antonio  2010       18449
-## 16 San Antonio  2015       16455
-Use ggplot to make a bar plot of the total housing sales (column
-total_sales) for each city and show one panel
-per year. You do not have to worry about the order of the
-bars. Hint: Use facet_wrap(). See slides from Class 2.
-# your code goes here
-Problem 2: (3 pts) Use ggplot to make a bar plot of
-the total housing sales (column total_sales) for each
-year. Color the bar borders with color
-"gray20" and assign a fill color based on the
-city column.
-# your code goes here
-Problem 3: (4 pts) Modify the plot from Problem 2 by
-placing the bars for each city side-by-side rather than stacked. Next,
-reorder the bars for each year by total_sales
-in descending order. See slides from Class 4.
-# your code goes here
- - - - - - - - - - - - - - - diff --git a/assignments/HW3.Rmd b/assignments/HW3.Rmd deleted file mode 100644 index b3dbf86..0000000 --- a/assignments/HW3.Rmd +++ /dev/null @@ -1,70 +0,0 @@ ---- -title: "Homework 3" ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -options(scipen = 999) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) - -# data prep: -OH_pop <- midwest %>% - filter(state == "OH") %>% - arrange(desc(poptotal)) %>% - mutate(row = row_number()) %>% - filter(poptotal >= 100000) %>% - select(c(county, poptotal)) - -``` - -*Enter your name and EID here* - -**This homework is due on Feb. 7, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - - -**Problem 1: (4 pts)** For problem 1, we will work with the `diamonds` dataset. See here for details: https://ggplot2.tidyverse.org/reference/diamonds.html. - -```{r} -diamonds -``` - -(a) Use ggplot to make a bar plot of the total diamond count per `color` and show the proportion of each `cut` within each `color` category. - -(b) In two sentences, explain when to use `geom_bar()` instead of `geom_col()`. Which of these functions requires only an `x` or `y` variable? - -```{r} -# your code goes here -``` - -**Problem 2: (4 pts)** -For problem 2 and 3, we will work with the dataset `OH_pop` that contains Ohio state demographics and has been derived from the `midwest` dataset provided by **ggplot2**. See here for details of the original dataset: https://ggplot2.tidyverse.org/reference/midwest.html. `OH_pop` contains two columns: `county` and `poptotal` (the county's total population), and it only contains counties with at least 100,000 inhabitants. - -```{r} -OH_pop -``` - -(a) Use ggplot to make a scatter plot of `county` vs total population (column `poptotal`) and order the counties by increasing population. - -(b) Rename the axes and set appropriate limits, breaks and labels. Note: Do not use `xlab()` or `ylab()` to label the axes. 
- -```{r} -# your code goes here - -``` - -**Problem 3: (2 pts)** - -(a) Modify the plot from Problem 2 by changing the scale for `poptotal` to logarithmic. - -(b) Adjust the limits, breaks and labels for the logarithmic scale. - -```{r} -# your code goes here -``` - - - - - - diff --git a/assignments/HW3.html b/assignments/HW3.html deleted file mode 100644 index ceaf62c..0000000 --- a/assignments/HW3.html +++ /dev/null @@ -1,472 +0,0 @@ - - - - - - - - - - - - - -Homework 3 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Enter your name and EID here
-This homework is due on Feb. 7, 2023 at 11:00pm. Please
-submit as a pdf file on Canvas.
-Problem 1: (4 pts) For problem 1, we will work with
-the diamonds dataset. See here for details: https://ggplot2.tidyverse.org/reference/diamonds.html.
-diamonds
-## # A tibble: 53,940 × 10
-##    carat cut       color clarity depth table price     x     y     z
-##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
-##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
-##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
-##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
-##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
-##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
-##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
-##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
-##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
-##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
-## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
-## # … with 53,930 more rows
-(a) Use ggplot to make a bar plot of the total diamond count per
-color and show the proportion of each cut
-within each color category.
-(b) In two sentences, explain when to use geom_bar()
-instead of geom_col(). Which of these functions requires
-only an x or y variable?
-# your code goes here
-Problem 2: (4 pts) For problem 2 and 3, we will work
-with the dataset OH_pop that contains Ohio state
-demographics and has been derived from the midwest dataset
-provided by ggplot2. See here for details of the
-original dataset: https://ggplot2.tidyverse.org/reference/midwest.html.
-OH_pop contains two columns: county and
-poptotal (the county’s total population), and it only
-contains counties with at least 100,000 inhabitants.
-OH_pop
-## # A tibble: 25 × 2
-##    county     poptotal
-##    <chr>         <int>
-##  1 CUYAHOGA    1412140
-##  2 FRANKLIN     961437
-##  3 HAMILTON     866228
-##  4 MONTGOMERY   573809
-##  5 SUMMIT       514990
-##  6 LUCAS        462361
-##  7 STARK        367585
-##  8 BUTLER       291479
-##  9 LORAIN       271126
-## 10 MAHONING     264806
-## # … with 15 more rows
-(a) Use ggplot to make a scatter plot of county vs total
-population (column poptotal) and order the counties by
-increasing population.
-(b) Rename the axes and set appropriate limits, breaks and labels.
-Note: Do not use xlab() or ylab() to label the
-axes.
-# your code goes here
-Problem 3: (2 pts)
-(a) Modify the plot from Problem 2 by changing the scale for
-poptotal to logarithmic.
-(b) Adjust the limits, breaks and labels for the logarithmic
-scale.
-# your code goes here
- - - - - - - - - - - - - - - diff --git a/assignments/HW4.Rmd b/assignments/HW4.Rmd deleted file mode 100644 index eb36695..0000000 --- a/assignments/HW4.Rmd +++ /dev/null @@ -1,51 +0,0 @@ ---- -title: "Homework 4" ---- - -```{r global_options, include=FALSE} -library(knitr) -library(ggplot2) -library(ggridges) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) -``` - -*Enter your name and EID here* - -**This homework is due on Feb. 14, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - - -**Problem 1: (4 pts)** We will work with the `mpg` dataset provided by **ggplot2**. See here for details: https://ggplot2.tidyverse.org/reference/mpg.html - -Make two different strip charts of highway fuel economy (`hwy`) versus number of cylinders (`cyl`), the first one without horizontal jitter and second one with horizontal jitter. In both plots, please replace names of the data columns (`hwy`, `cyl`) along the axes with nice, easily readable lables. - -Explain in 1-2 sentences why the plot without jitter is misleading. - -Hint: Make sure you do not accidentally apply vertical jitter. This is a common mistake many people make. - -```{r} -# your code goes here. -``` - -*Your explanation goes here.* - - -**Problem 2: (6 pts)** For this problem, we will continue working with the `mpg` dataset. Visualize the distribution of each car's city fuel economy by class (`class`) and type of drive train (`drv`) with (i) boxplots and (ii) ridgelines. Make one plot per geom and do not use faceting. In both cases, put city mpg on the x axis and class on the y axis. Use color to indicate the car's drive train. As in Problem 1, rename the axis labels. - -The boxplot ggplot generates will have a problem. Describe what the problem is. (You do not have to solve it.) - -Hint: To change the name of the legend, use `+ labs(fill = "legend name")` - -```{r} -# your code goes here. 
-``` - -*Your explanation goes here.* - - - - - - - - - diff --git a/assignments/HW4.html b/assignments/HW4.html deleted file mode 100644 index 190a7ef..0000000 --- a/assignments/HW4.html +++ /dev/null @@ -1,435 +0,0 @@ - - - - - - - - - - - - - -Homework 4 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Enter your name and EID here
-This homework is due on Feb. 14, 2023 at 11:00pm. Please
-submit as a pdf file on Canvas.
-Problem 1: (4 pts) We will work with the
-mpg dataset provided by ggplot2. See here
-for details: https://ggplot2.tidyverse.org/reference/mpg.html
-Make two different strip charts of highway fuel economy
-(hwy) versus number of cylinders (cyl), the
-first one without horizontal jitter and second one with horizontal
-jitter. In both plots, please replace names of the data columns
-(hwy, cyl) along the axes with nice, easily
-readable lables.
-Explain in 1-2 sentences why the plot without jitter is
-misleading.
-Hint: Make sure you do not accidentally apply vertical jitter. This
-is a common mistake many people make.
-# your code goes here.
-Your explanation goes here.
-Problem 2: (6 pts) For this problem, we will
-continue working with the mpg dataset. Visualize the
-distribution of each car’s city fuel economy by class
-(class) and type of drive train (drv) with (i)
-boxplots and (ii) ridgelines. Make one plot per geom and do not use
-faceting. In both cases, put city mpg on the x axis and class on the y
-axis. Use color to indicate the car’s drive train. As in Problem 1,
-rename the axis labels.
-The boxplot ggplot generates will have a problem. Describe what the
-problem is. (You do not have to solve it.)
-Hint: To change the name of the legend, use
-+ labs(fill = "legend name")
-# your code goes here.
-Your explanation goes here.
- - - - - - - - - - - - - - - diff --git a/assignments/HW5.Rmd b/assignments/HW5.Rmd deleted file mode 100644 index ad50cfc..0000000 --- a/assignments/HW5.Rmd +++ /dev/null @@ -1,73 +0,0 @@ ---- -title: "Homework 5" -output: - html_document: - df_print: paged ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -library(colorspace) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) - -# data prep: -ufo_sightings <- - read_csv("https://wilkelab.org/classes/SDS348/data_sets/ufo_sightings_clean.csv") %>% - separate(datetime, into = c("month", "day", "year"), sep = "/") %>% - separate(year, into = c("year", "time"), sep = " ") %>% - separate(date_posted, into = c("month_posted", "day_posted", "year_posted"), sep = "/") %>% - select(-time, -month_posted, -day_posted) %>% - mutate( - year = as.numeric(year), - state = toupper(state) - ) %>% - filter(!is.na(country)) -``` - -*Enter your name and EID here* - -**This homework is due on Feb. 28, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - -**Problem 1: (4 pts)** We will work with the `ufo_sightings` dataset. - -Since 2000 (inclusive), what are the top 10 cities that have reported the most UFO sightings? Create a new dataframe to answer the question. No plots are necessary. - - -```{r} -# your code here -``` - - -**Problem 2: (4 pts)** - -Next, how has the number of UFO sightings changed for five states since 1940? Please follow these steps: - -1. Filter the dataset to keep the following five states: AZ, IL, NM, OR, WA -2. Keep only the records from 1940 and onwards. -3. Find the number of records for each year and state. -4. Output the new table below your code block. - -Your final table should be in long format and have three columns, `year`, `state`, and `count`. You will plot this table in Problem 3. 
- - -```{r} -# your code here -``` - -**Problem 3: (2 pts)** - -Use the new dataframe you made in Problem 2 and add an appropriate color scale from the `colorspace` package to the plot below. - -```{r eval = FALSE} -new_df %>% # use the dataframe from Problem 2 here, and set eval = TRUE in the chunk header - ggplot(aes(x = year, y = count, color = state)) + - geom_line() + - xlab("Year") + - ylab("UFO Sightings (Count)") + - theme_bw() -``` - - - - diff --git a/assignments/HW5.html b/assignments/HW5.html deleted file mode 100644 index abda679..0000000 --- a/assignments/HW5.html +++ /dev/null @@ -1,1707 +0,0 @@ - - - - - - - - - - - - - -Homework 5 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Enter your name and EID here
-This homework is due on Feb. 28, 2023 at 11:00pm. Please
-submit as a pdf file on Canvas.
-Problem 1: (4 pts) We will work with the
-ufo_sightings dataset.
-Since 2000 (inclusive), what are the top 10 cities that have reported
-the most UFO sightings? Create a new dataframe to answer the question.
-No plots are necessary.
-# your code here
-Problem 2: (4 pts)
-Next, how has the number of UFO sightings changed for five states
-since 1940? Please follow these steps:
-1. Filter the dataset to keep the following five states: AZ, IL, NM,
-OR, WA
-2. Keep only the records from 1940 and onwards.
-3. Find the number of records for each year and state.
-4. Output the new table below your code block.
-Your final table should be in long format and have three columns,
-year, state, and count. You will
-plot this table in Problem 3.
-# your code here
-Problem 3: (2 pts)
-Use the new dataframe you made in Problem 2 and add an appropriate
-color scale from the colorspace package to the plot
-below.
-new_df %>% # use the dataframe from Problem 2 here, and set eval = TRUE in the chunk header
-  ggplot(aes(x = year, y = count, color = state)) +
-  geom_line() +
-  xlab("Year") +
-  ylab("UFO Sightings (Count)") +
-  theme_bw()
- - - - - - - - - - - - - - - diff --git a/assignments/HW6.Rmd b/assignments/HW6.Rmd deleted file mode 100644 index bb3517e..0000000 --- a/assignments/HW6.Rmd +++ /dev/null @@ -1,70 +0,0 @@ ---- -title: "Homework 6" -output: - html_document: - df_print: paged ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -library(colorspace) -library(ggforce) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) - -# data prep: -olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv') -olympics_2002 <- olympics %>% - filter(year == 2002, season == "Winter") %>% - select(sex) %>% - count(sex) %>% - pivot_wider(names_from = sex, values_from = n) - -``` - -*Enter your name and EID here* - -**This homework is due on Mar. 7, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - -**Problem 1: (6 pts)** We will work with the dataset `olympics_2002` that contains the count of all athletes by sex for the 2002 Winter Olympics in Salt Lake City. It has been derived from the `olympics` dataset, which is described here: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md - -```{r} -olympics_2002 -``` -Follow these steps and display the modified dataframe after each step: - -1. Rearrange the dataframe into long form. The resulting dataframe will have two columns, which you should call `sex` and `count`. There will be two rows of data, one for female and one for male athletes. -2. Create a new column with the percent for each `sex` -3. Rename the values in `sex` to "Female" and "Male". - -```{r} -# your code here -``` -```{r} -# your code here -``` -```{r} -# your code here -``` - - -**Problem 2: (4 pts)** - -Now make a pie chart of the dataset you generated in Problem 1. Use `theme_void()` to remove all distracting elements. 
- -```{r} -# your code here -``` - - - - - - - - - - - - - diff --git a/assignments/HW6.html b/assignments/HW6.html deleted file mode 100644 index 68712bb..0000000 --- a/assignments/HW6.html +++ /dev/null @@ -1,1704 +0,0 @@ - - - - - - - - - - - - - -Homework 6 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Enter your name and EID here
-This homework is due on Mar. 7, 2023 at 11:00pm. Please
-submit as a pdf file on Canvas.
-Problem 1: (6 pts) We will work with the dataset
-olympics_2002 that contains the count of all athletes by
-sex for the 2002 Winter Olympics in Salt Lake City. It has been derived
-from the olympics dataset, which is described here: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
-olympics_2002
-Follow these steps and display the modified dataframe after each
-step:
-1. Rearrange the dataframe into long form. The resulting dataframe will
-have two columns, which you should call sex and
-count. There will be two rows of data, one for female and
-one for male athletes.
-2. Create a new column with the percent for each sex
-3. Rename the values in sex to “Female” and “Male”.
-# your code here
-# your code here
-# your code here
-Problem 2: (4 pts)
-Now make a pie chart of the dataset you generated in Problem 1. Use
-theme_void() to remove all distracting elements.
-# your code here
- - - - - - - - - - - - - - - diff --git a/assignments/HW7.Rmd b/assignments/HW7.Rmd deleted file mode 100644 index 3562111..0000000 --- a/assignments/HW7.Rmd +++ /dev/null @@ -1,72 +0,0 @@ ---- -title: "Homework 7" -output: - html_document: - df_print: paged ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -library(colorspace) -library(naniar) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) - -#data prep: -midwest2 <- midwest %>% - filter(state != "IN") - -#data prep for problem 3: -oceanbuoys$year <- factor(oceanbuoys$year) -oceanbuoys <- na.omit(oceanbuoys) -``` - -*Enter your name and EID here* - -**This homework is due on Mar. 28, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - -**Problem 1: (2 pts)** - -Use the color picker app from the **colorspace** package (`colorspace::choose_color()`) to create a qualitative color scale containing four colors. One of the four colors should be `#A23C42`, so you need to find three additional colors that go with this one. Use the function `swatchplot()` to plot your colors. `swatchplot()` takes in a vector of colors. - -```{r} -# your code goes here -``` - - -**Problem 2: (4 pts)** - -For this problem, we will work with the `midwest2` dataset (derived from `midwest`). In the following plot, you may notice that the axis tick labels are smaller than the axis titles, and also in a different color (gray instead of black). - -1. Use the colors you chose in Problem 1 to color the points. -2. Make the axis tick labels the same size (`size = 12`) and give them the color black (`color = "black"`) -3. Set the entire plot background to the color `"#FEF8F0"`. Make sure there are no white areas remaining, such as behind the plot panel or under the legend. 
- -```{r} -ggplot(midwest2, aes(popdensity, percollege, fill = state)) + - geom_point(shape = 21, size = 3, color = "white", stroke = 0.2) + - scale_x_log10(name = "population density") + - scale_y_continuous(name = "percent college educated") + - # your color choices go here in a scale fucntion. - theme_classic(12) + - theme( - # your theme customization code goes here - ) -``` - - - -**Problem 3: (4 pts)** - -For this problem, we will work with the `oceanbuoys` dataset from the `naniar` library that contains west pacific tropical atmosphere ocean data for 1993 and 1997. - -Write a function that converts temperature from Celsius to Fahrenheit. Then, use this function and any other data wrangling code you learned in class to make a summary table of average sea temperature and air temperature (in Fahrenheit) for each year in the dataset. The formula for converting Celsius to Fahrenheit is `Fahrenheit = (Celsius*1.8) + 32`. -```{r} -oceanbuoys -# your code goes here -``` - - - - - diff --git a/assignments/HW7.html b/assignments/HW7.html deleted file mode 100644 index 255dbaf..0000000 --- a/assignments/HW7.html +++ /dev/null @@ -1,1725 +0,0 @@ - - - - - - - - - - - - - -Homework 7 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Enter your name and EID here
-This homework is due on Mar. 28, 2023 at 11:00pm. Please
-submit as a pdf file on Canvas.
-Problem 1: (2 pts)
-Use the color picker app from the colorspace package
-(colorspace::choose_color()) to create a qualitative color
-scale containing four colors. One of the four colors should be
-#A23C42, so you need to find three additional colors that
-go with this one. Use the function swatchplot() to plot
-your colors. swatchplot() takes in a vector of colors.
-# your code goes here
-Problem 2: (4 pts)
-For this problem, we will work with the midwest2 dataset
-(derived from midwest). In the following plot, you may
-notice that the axis tick labels are smaller than the axis titles, and
-also in a different color (gray instead of black).
-1. Use the colors you chose in Problem 1 to color the points.
-2. Make the axis tick labels the same size (size = 12) and
-give them the color black (color = "black")
-3. Set the entire plot background to the color "#FEF8F0".
-Make sure there are no white areas remaining, such as behind the plot
-panel or under the legend.
-ggplot(midwest2, aes(popdensity, percollege, fill = state)) +
-  geom_point(shape = 21, size = 3, color = "white", stroke = 0.2) +
-  scale_x_log10(name = "population density") +
-  scale_y_continuous(name = "percent college educated") +
-  # your color choices go here in a scale fucntion.
-  theme_classic(12) +
-  theme(
-    # your theme customization code goes here
-  )
-Problem 3: (4 pts)
-For this problem, we will work with the oceanbuoys
-dataset from the naniar library that contains west pacific
-tropical atmosphere ocean data for 1993 and 1997.
-Write a function that converts temperature from Celsius to
-Fahrenheit. Then, use this function and any other data wrangling code
-you learned in class to make a summary table of average sea temperature
-and air temperature (in Fahrenheit) for each year in the dataset. The
-formula for converting Celsius to Fahrenheit is
-Fahrenheit = (Celsius*1.8) + 32.
-oceanbuoys
-# your code goes here
- - - - - - - - - - - - - - - diff --git a/assignments/HW8.Rmd b/assignments/HW8.Rmd deleted file mode 100644 index 15e443b..0000000 --- a/assignments/HW8.Rmd +++ /dev/null @@ -1,55 +0,0 @@ ---- -title: "Homework 8" -output: - html_document: - df_print: paged - pdf_document: default ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -library(broom) -opts_chunk$set(fig.align="center", fig.height=4, fig.width=5.5) - -#data prep: -BA_degrees <- read_csv("https://wilkelab.org/SDS375/datasets/BA_degrees.csv") -BA_degrees -``` - -*Enter your name and EID here* - -**This homework is due on April 4, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - -**Problem 1: (6 pts)** The dataset `BA_degrees` contains information about the proportion of different degrees students receive, as a function of time. - -```{r} -head(BA_degrees) -``` - -Create a subset of the `BA_degrees` dataset that only considers the degree fields "Business", "Education", and "Psychology". Then make a single plot that satisfies these three criteria: - -(a) Plot a time series of the proportion of degrees (colum `perc`) in each field over time and create a separate panel per degree field. -(b) Add a straight line fit to each panel. -(c) Order the panels by the difference between the maximum and the minimum proportion (i.e., the range of the data). - - -```{r} -# your code goes here -``` - -**Problem 2: (4 pts)** -Create a single pipeline that fits a linear model to each of the three fields from Problem 1 and outputs results in a tidy linear model summary table. The first column of the table should be `field` and the remaining columns should contain the linear model summary statistics such as `r.squared` for each field. Display the resulting table below. 
- -```{r} -# your code goes here -``` - - - - - - - - - diff --git a/assignments/HW8.html b/assignments/HW8.html deleted file mode 100644 index 583c6b8..0000000 --- a/assignments/HW8.html +++ /dev/null @@ -1,1706 +0,0 @@ - - - - - - - - - - - - - -Homework 8 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - -

Enter your name and EID here

-

This homework is due on April 4, 2023 at 11:00pm. Please -submit as a pdf file on Canvas.

-

Problem 1: (6 pts) The dataset -BA_degrees contains information about the proportion of -different degrees students receive, as a function of time.

-
head(BA_degrees)
-
- -
-

Create a subset of the BA_degrees dataset that only -considers the degree fields “Business”, “Education”, and “Psychology”. -Then make a single plot that satisfies these three criteria:

-
    -
  1. Plot a time series of the proportion of degrees (column -perc) in each field over time and create a separate panel -per degree field.
  -
  2. Add a straight line fit to each panel.
  -
  3. Order the panels by the difference between the maximum and the -minimum proportion (i.e., the range of the data).
  -
-
# your code goes here
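The three criteria can be sketched on toy data as follows. This is a minimal sketch, not a reference solution: `toy_degrees` is an invented stand-in for `BA_degrees`, the `year` column name is an assumption, and the numbers are made up for illustration.

```r
library(tidyverse)

# Toy stand-in for BA_degrees; all values are invented for illustration
toy_degrees <- tibble(
  field = rep(c("Business", "Education", "Psychology"), each = 4),
  year  = rep(c(1971, 1981, 1991, 2001), times = 3),
  perc  = c(0.14, 0.21, 0.23, 0.21,
            0.21, 0.12, 0.10, 0.08,
            0.045, 0.044, 0.053, 0.059)
)

ranked <- toy_degrees %>%
  # keep only the three fields of interest
  filter(field %in% c("Business", "Education", "Psychology")) %>%
  # order the factor levels by the range (max - min) of perc within each field
  mutate(field = fct_reorder(field, perc, .fun = function(x) max(x) - min(x)))

p <- ggplot(ranked, aes(year, perc)) +
  geom_line() +                                             # (a) time series
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # (b) straight line fit
  facet_wrap(vars(field))                                   # (a) + (c): one panel per field, in level order

p
```

Because `facet_wrap()` lays out panels in factor-level order, reordering `field` before plotting is what controls the panel order.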
-

Problem 2: (4 pts) Create a single pipeline that -fits a linear model to each of the three fields from Problem 1 and -outputs results in a tidy linear model summary table. The first column -of the table should be field and the remaining columns -should contain the linear model summary statistics such as -r.squared for each field. Display the resulting table -below.

-
# your code goes here
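One common way to build such a pipeline is the nest/map/glance pattern from tidyr, purrr, and broom. The sketch below uses an invented `toy_degrees` table in place of `BA_degrees` and assumes the model of interest is `perc ~ year`; both are assumptions for illustration.

```r
library(tidyverse)
library(broom)

# Toy stand-in for the three-field subset; values are invented
toy_degrees <- tibble(
  field = rep(c("Business", "Education", "Psychology"), each = 4),
  year  = rep(c(1971, 1981, 1991, 2001), times = 3),
  perc  = c(0.14, 0.21, 0.23, 0.21,
            0.21, 0.12, 0.10, 0.08,
            0.045, 0.044, 0.053, 0.059)
)

lm_summary <- toy_degrees %>%
  nest(data = -field) %>%                            # one row per field, data in a list column
  mutate(
    fit = map(data, ~ lm(perc ~ year, data = .x)),   # fit one linear model per field
    glance_out = map(fit, glance)                    # one-row model summary per fit
  ) %>%
  select(field, glance_out) %>%
  unnest(cols = glance_out)                          # field first, then r.squared, p.value, ...

lm_summary
```

Keeping `field` outside the nested column is what makes it the first column of the final table.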
- - - - - - - - - - - - - - - diff --git a/assignments/HW9.Rmd b/assignments/HW9.Rmd deleted file mode 100644 index 4fe2d73..0000000 --- a/assignments/HW9.Rmd +++ /dev/null @@ -1,48 +0,0 @@ ---- -title: "Homework 9" -output: - html_document: - df_print: paged ---- - -```{r global_options, include=FALSE} -library(knitr) -library(tidyverse) -library(broom) -opts_chunk$set(fig.align="center", fig.height=4.326, fig.width=7) -``` - -*Enter your name and EID here* - -**This homework is due on April 11, 2023 at 11:00pm. Please submit as a pdf file on Canvas.** - -For all problems in this homework, we will work with the `heart_disease_data` dataset, which is a simplified and recoded version of a dataset available from kaggle. You can read about the original dataset here: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download - -The `heart_disease_data` dataset contains 9 variables: `HeartDisease`(whether or not the participant has heart disease), `BMI` (body mass index), `PhysicalHealth` (how many days a month was their physical health not good), `MentalHealth` (how many days a month was their mental health not good), `ApproximateAge` (participants age), `SleepTime` (how many hours of sleep do they get in a 24-hour period), `Smoking` (1-smoker, 0-nonsmoker), `AlcoholDrinking` (1-drinks alcohol, 0-does not drink), `PhysicalActivity` (1-did physical activity or exercise during the past 30 days, 0-hardly any physical activity). Compared to the original dataset, the columns `ApproximateAge`, `Smoking`, `AlcoholDrinking`, and `PhysicalActivity` have been converted into numeric columns so they can be included in a PCA. - -**Note:** This homework is about the contents of the plots. Don't worry about styling. It's OK to use the default theme and plot labeling. 
- - -```{r message = FALSE} -heart_data <- read_csv("https://wilkelab.org/SDS375/datasets/heart_disease_data.csv") -``` - -**Problem 1: (5 pts)** - -Perform a PCA of the `heart_disease_data` dataset and make two plots, a rotation plot of components 1 and 2 and a plot of the eigenvalues, showing the amount of variance explained by the various components. - -```{r} -# your code here -``` - -```{r} -# your code here -``` - - -**Problem 2: (5 pts)** Make a scatter plot of PC 2 versus PC 1 and color by heart disease status. Then use the rotation plot from Problem 1 to describe the variables/factors by which we can separate the study participants with heart disease from the study participants without heart disease. - - -```{r} -# your code here -``` diff --git a/assignments/HW9.html b/assignments/HW9.html deleted file mode 100644 index 1af9975..0000000 --- a/assignments/HW9.html +++ /dev/null @@ -1,1713 +0,0 @@ - - - - - - - - - - - - - -Homework 9 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Enter your name and EID here

-

This homework is due on April 11, 2023 at 11:00pm. Please -submit as a pdf file on Canvas.

-

For all problems in this homework, we will work with the -heart_disease_data dataset, which is a simplified and -recoded version of a dataset available from kaggle. You can read about -the original dataset here: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download

-

The heart_disease_data dataset contains 9 variables: -HeartDisease (whether or not the participant has heart -disease), BMI (body mass index), -PhysicalHealth (how many days a month was their physical -health not good), MentalHealth (how many days a month was -their mental health not good), ApproximateAge (participant's -age), SleepTime (how many hours of sleep do they get in a -24-hour period), Smoking (1-smoker, 0-nonsmoker), -AlcoholDrinking (1-drinks alcohol, 0-does not drink), -PhysicalActivity (1-did physical activity or exercise -during the past 30 days, 0-hardly any physical activity). Compared to -the original dataset, the columns ApproximateAge, -Smoking, AlcoholDrinking, and -PhysicalActivity have been converted into numeric columns -so they can be included in a PCA.

-

Note: This homework is about the contents of the -plots. Don’t worry about styling. It’s OK to use the default theme and -plot labeling.

-
heart_data <- read_csv("https://wilkelab.org/SDS375/datasets/heart_disease_data.csv")
-

Problem 1: (5 pts)

-

Perform a PCA of the heart_disease_data dataset and make -two plots, a rotation plot of components 1 and 2 and a plot of the -eigenvalues, showing the amount of variance explained by the various -components.

-
# your code here
-
# your code here
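A PCA followed by a rotation plot and an eigenvalue plot might look like the sketch below. It runs on simulated data (`toy_heart`, invented here) with only three numeric columns, and it assumes that `HeartDisease` is stored as a non-numeric column so that `where(is.numeric)` excludes it; check this against the real data.

```r
library(tidyverse)
library(broom)

# Simulated stand-in for heart_data; values carry no real meaning
set.seed(1)
toy_heart <- tibble(
  HeartDisease   = rep(c("No", "Yes"), each = 25),
  BMI            = rnorm(50, mean = 27, sd = 4),
  SleepTime      = rnorm(50, mean = 7, sd = 1),
  ApproximateAge = rnorm(50, mean = 55, sd = 12)
)

pca_fit <- toy_heart %>%
  select(where(is.numeric)) %>%   # PCA needs numeric columns only
  scale() %>%                     # standardize to zero mean, unit variance
  prcomp()

# Rotation matrix in wide form: one row per variable, one column per PC
rotation <- pca_fit %>%
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value")

rotation_plot <- ggplot(rotation, aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow(ends = "first", length = grid::unit(8, "pt"))
  ) +
  geom_text(aes(label = column), hjust = 1) +
  coord_fixed()

# Variance explained by each component
eigenvalues <- tidy(pca_fit, matrix = "eigenvalues")

eigen_plot <- ggplot(eigenvalues, aes(PC, percent)) +
  geom_col()

rotation_plot
eigen_plot
```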
-

Problem 2: (5 pts) Make a scatter plot of PC 2 -versus PC 1 and color by heart disease status. Then use the rotation -plot from Problem 1 to describe the variables/factors by which we can -separate the study participants with heart disease from the study -participants without heart disease.

-
# your code here
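To plot PC 2 against PC 1, the component scores have to be joined back to the original rows. A minimal sketch using `broom::augment()` on simulated data is shown below; `toy_heart` and its values are invented, and the interpretation of the real plot is left to the reader.

```r
library(tidyverse)
library(broom)

# Simulated stand-in for heart_data; values carry no real meaning
set.seed(1)
toy_heart <- tibble(
  HeartDisease = rep(c("No", "Yes"), each = 25),
  BMI          = rnorm(50, mean = 27, sd = 4),
  SleepTime    = rnorm(50, mean = 7, sd = 1)
)

pca_fit <- toy_heart %>%
  select(where(is.numeric)) %>%
  scale() %>%
  prcomp()

# augment() appends the scores as .fittedPC1, .fittedPC2, ... to the original data
scores <- augment(pca_fit, toy_heart)

score_plot <- ggplot(scores, aes(.fittedPC1, .fittedPC2, color = HeartDisease)) +
  geom_point()

score_plot
```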
- - - - - - - - - - - - - - - diff --git a/assignments/Project_1.Rmd b/assignments/Project_1.Rmd deleted file mode 100644 index c91241e..0000000 --- a/assignments/Project_1.Rmd +++ /dev/null @@ -1,69 +0,0 @@ ---- -title: "Project 1" -output: html_document ---- - -```{r setup, include=FALSE} -library(tidyverse) -knitr::opts_chunk$set(echo = TRUE) -``` - -This is the dataset you will be working with: -```{r message = FALSE} -olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv') - -olympics_alpine <- olympics %>% - filter(!is.na(weight)) %>% # only keep athletes with known weight - filter(sport == "Alpine Skiing") %>% # keep only alpine skiers - mutate( - medalist = case_when( # add column to - is.na(medal) ~ FALSE, # NA values go to FALSE - !is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE - ) - ) -``` - -`olympics_alpine` is a subset of `olympics` and contains only the data for alpine skiers. More information about the original `olympics` dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md and https://www.sports-reference.com/olympics.html. - -For this project, use `olympics_alpine` to answer the following questions about the weights of alpine skiers: - -1. Are there weight differences for male and female Olympic skiers who were successful or not in earning a medal? -2. Are there weight differences for skiers who competed in different alpine skiing events? -3. How has the weight distribution of alpine skiers changed over the years? - -You should make one plot per question. - -**Hints:** - -- We recommend you use a violin plot for question 1 and boxplots for questions 2 and 3. However, you are free to use any of the plots we have discussed in class so far. -- For question 3, it may be helpful to consider only a subset of alpine skiers, such as those who competed in a specific event. 
-- To make a series of boxplots over time, you will have to add the following to your `aes()` statement: `group = year`. -- It can be a bit tricky to re-label facets generated with `facet_wrap()`. The trick is to add a `labeller` argument, for example: -```r - + facet_wrap( - # your other arguments to facet_wrap() go here - ..., - # this replaces "TRUE" with "medaled" and "FALSE" with "did not medal" - labeller = as_labeller(c(`TRUE` = "medaled", `FALSE` = "did not medal")) - ) -``` - -**Introduction:** *Your introduction here.* - -**Approach:** *Your approach here.* - -**Analysis:** - -```{r} -# Your R code here -``` - -```{r} -# Your R code here -``` - -```{r} -# Your R code here -``` - -**Discussion:** *Your discussion of results here.* diff --git a/assignments/Project_1.html b/assignments/Project_1.html deleted file mode 100644 index 6218cde..0000000 --- a/assignments/Project_1.html +++ /dev/null @@ -1,458 +0,0 @@ - - - - - - - - - - - - - -Project 1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This is the dataset you will be working with:

-
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
-
-olympics_alpine <- olympics %>% 
-  filter(!is.na(weight)) %>%             # only keep athletes with known weight
-  filter(sport == "Alpine Skiing") %>%   # keep only alpine skiers
-  mutate(
-    medalist = case_when(                # add column flagging medal winners
-      is.na(medal) ~ FALSE,              # NA values go to FALSE
-      !is.na(medal) ~ TRUE               # non-NA values (Gold, Silver, Bronze) go to TRUE
-    )
-  )
-
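As an aside, the two-branch `case_when()` above is equivalent to a single logical expression. The sketch below checks this on a small invented table (`toy_medals` is hypothetical, not part of the project data).

```r
library(dplyr)

# Tiny invented example: NA means no medal was won
toy_medals <- tibble(medal = c(NA, "Gold", NA, "Bronze"))

flagged <- toy_medals %>%
  mutate(
    # the form used in the project template
    medalist_case = case_when(
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    ),
    # shorter equivalent: TRUE wherever a medal is recorded
    medalist_short = !is.na(medal)
  )

# both columns hold the same logical values
identical(flagged$medalist_case, flagged$medalist_short)
```

The longer form in the template makes the handling of `NA` explicit, which may be why it was chosen for teaching purposes.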

olympics_alpine is a subset of olympics and -contains only the data for alpine skiers. More information about the -original olympics dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md -and https://www.sports-reference.com/olympics.html.

-

For this project, use olympics_alpine to answer the -following questions about the weights of alpine skiers:

-
    -
  1. Are there weight differences for male and female Olympic skiers who -were successful or not in earning a medal?
  -
  2. Are there weight differences for skiers who competed in different -alpine skiing events?
  -
  3. How has the weight distribution of alpine skiers changed over the -years?
  -
-

You should make one plot per question.

-

Hints:

- -
 + facet_wrap(
-    # your other arguments to facet_wrap() go here
-    ...,
-    # this replaces "TRUE" with "medaled" and "FALSE" with "did not medal"
-    labeller = as_labeller(c(`TRUE` = "medaled", `FALSE` = "did not medal"))
-  )
-
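For reference, here is a complete, runnable version of the `labeller` hint. The data frame `toy_skiers` is invented for illustration; only the `facet_wrap()` call mirrors the hint itself.

```r
library(ggplot2)

# Invented data: 16 skiers with sex, weight (kg), and medal status
toy_skiers <- data.frame(
  sex      = rep(c("F", "M"), each = 8),
  weight   = c(58, 60, 61, 63, 64, 66, 67, 70,
               76, 78, 80, 82, 85, 87, 90, 94),
  medalist = rep(c(TRUE, FALSE), times = 8)
)

p <- ggplot(toy_skiers, aes(sex, weight)) +
  geom_violin() +
  facet_wrap(
    vars(medalist),
    # replaces the facet labels "TRUE"/"FALSE" with readable text
    labeller = as_labeller(c(`TRUE` = "medaled", `FALSE` = "did not medal"))
  )

p
```

The backticks around `TRUE` and `FALSE` are needed because the names of the lookup vector must be the facet values as character strings.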

Introduction: Your introduction here.

-

Approach: Your approach here.

-

Analysis:

-
# Your R code here
-
# Your R code here
-
# Your R code here
-

Discussion: Your discussion of results -here.

- - - - - - - - - - - - - - - diff --git a/assignments/Project_1_example.html b/assignments/Project_1_example.html deleted file mode 100644 index 5a17de0..0000000 --- a/assignments/Project_1_example.html +++ /dev/null @@ -1,576 +0,0 @@ - - - - - - - - - - - - - -Project 1 Example Solution - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Claus O. Wilke, EID

-

This is the dataset you will be working with:

-
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
-
-NCbirths
-
## # A tibble: 1,409 × 10
-##    Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
-##     <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
-##  1      1     1     32    40     38     0         3147.     0      0       0
-##  2      1     2     32    37     34     0         3289.     0      0       0
-##  3      1     1     27    39     12     0         3912.     0      0       0
-##  4      1     1     27    39     15     0         3856.     0      0       0
-##  5      1     1     25    39     32     0         3430.     0      0       0
-##  6      1     1     28    43     32     0         3317.     0      0       0
-##  7      1     2     25    39     75     0         4054.     0      0       0
-##  8      1     2     15    42     25     0         3204.     0      0       1
-##  9      1     2     21    39     28     0         3402      0      0       0
-## 10      1     2     27    40     37     0         3515.     0      0       1
-## # … with 1,399 more rows
-

Questions:

-
    -
  1. Is there a relationship between whether a mother smokes or not -and her baby’s weight at birth?

  -
  2. How many mothers are smokers or non-smokers?

  -
  3. What are the age distributions of mothers of twins or -triplets?

  -
-

Introduction: We are working with the -NCbirths dataset, which contains 1409 birth records from -North Carolina in 2001. In this dataset, each row corresponds to one -birth, and there are ten columns providing information about the birth, -the mother, and the baby. Information about the birth includes whether -it is a single, twin, or triplet birth, the number of completed weeks of -gestation, and whether the birth is premature. Information about the -baby includes the sex, the weight at birth, and whether the birth weight -should be considered low. Information about the mother includes her age, -the weight gained during pregnancy, whether she is a smoker, and whether -she is married.

-

To answer the three questions, we will work with five variables, the -baby’s birthweight (column BirthWeightGm), whether the baby -was born prematurely (column Premie), whether it was a -singleton, twin, or triplet birth (column Plural), whether -the mother is a smoker or not (column Smoke), and the -mother’s age (column MomAge). The birthweight is provided -as a numeric value, in grams. The premature birth status is encoded as -0/1, where 0 means regular and 1 means premature (36 weeks or sooner). -The number of births is encoded as 1/2/3 representing singleton, twins, -and triplets, respectively. The smoking status is encoded as 0/1, where -0 means the mother is not a smoker and 1 means she is a smoker. The -mother’s age is provided in years.

-

Approach: To show the distributions of birthweights -versus the mothers’ smoking status we will be using violin plots -(geom_violin()). We also separate out regular and premature -births, because babies born prematurely have much lower birthweight and -therefore must be considered separately. Violins make it easy to compare -multiple distributions side-by-side.

-

To show the number of mothers that are smokers or non-smokers we will -use a simple bar plot (geom_bar()). Finally, to show the -distribution of mothers’ ages we will make a strip chart. The number of -twin and triplet births in the dataset is not that large, so a strip -chart is a good option here.

-

Analysis:

-

Question 1: Is there a relationship between whether a mother smokes -or not and her baby’s weight at birth?

-

To answer this question, we plot the birthweight distributions as -violins, separated by both smoking status and by whether the birth was -regular or premature.

-
# The columns `Premie` and `Smoke` are numerical but contain
-# categorical data, so we convert to factors to ensure ggplot
-# treats them correctly
-ggplot(NCbirths, aes(factor(Premie), BirthWeightGm)) +
-  geom_violin(aes(fill = factor(Smoke))) +
-  scale_x_discrete(
-    name = NULL, # remove axis title entirely
-    labels = c("regular birth", "premature birth")
-  ) +
-  scale_y_continuous(
-    name = "Birth weight (gm)"
-  ) +
-  scale_fill_manual(
-    name = "Mother",
-    labels = c("non-smoker", "smoker"),
-    # explicitly assign colors to specific data values
-    values = c(`0` = "#56B4E9", `1` = "#E69F00")
-  ) + 
-  theme_bw(12)
-

-

There is a clear difference between birthweight for regular and -premature births, and for regular births the birthweight also seems to -be lower when the mother smokes.

-

Question 2: How many mothers are smokers or non-smokers?

-

To answer this question, we make a simple bar plot of the number of -mothers by smoking status.

-
# again, convert `Smoke` into factor so it's categorical
-ggplot(NCbirths, aes(y = factor(Smoke))) +
-  geom_bar() +
-  scale_y_discrete(
-    name = NULL,
-    labels = c("non-smoker", "smoker")
-  ) +
-  scale_x_continuous(
-    # ensure there's no gap between the beginning of the bar
-    # and the edge of the plot panel
-    expand = expansion(mult = c(0, 0.1))
-  ) +
-  theme_bw(12)
-

-

The vast majority of mothers in the dataset are non-smokers (almost -1250). Fewer than 250 are smokers.

-

Question 3: What are the age distributions of mothers of twins or -triplets?

-

To answer this question, we first remove singleton births from the -dataset and then show age distributions as a strip chart.

-
NCbirths %>%
-  filter(Plural > 1) %>% # remove singleton births
-  ggplot(aes(x = factor(Plural), y = MomAge)) +
-  geom_point(
-    # jitter horizontally so points don't overlap
-    position = position_jitter(
-      width = 0.2,
-      height = 0
-    ),
-    # it's nice to make points a little bigger and give them some color
-    size = 2,
-    color = "#1E4A7F"
-  ) +
-  scale_x_discrete(
-    name = NULL,
-    labels = c("twins", "triplets")
-  ) +
-  scale_y_continuous(
-    name = "age of mother (years)"
-  ) +
-  theme_bw(12)
-

-

Mothers of twins span the entire childbearing range, from 15 years to -approximately 40 years old. By contrast, mothers of triplets tend to be -in their thirties.

-

Discussion: The smoking status of the mother appears -to have a small effect on the average birth weight for regular births. -We can see this by comparing the two left-most violins in the first -plot, where we see that they are slightly vertically shifted relative to -each other but have otherwise a comparable shape. However, a much bigger -effect comes from whether the baby is born prematurely or not. Premature -births have on average a much lower birthweight than regular births, and -the variance is also bigger (the two right-most violins are taller than -the two left-most violins). Interestingly, smoking status does not seem -to affect the distribution of birthweights for premature births much. We -can see this from the fact that the two right-most violins look -approximately the same. We would have to run a multivariate statistical -analysis to determine whether any of these observed patterns are -statistically significant.

-

There are many more births to non-smoking mothers than to smoking -mothers in the dataset. This is important because it means we have more -complete data for non-smoking mothers. Some of the differences we saw in -the first graph, such as the slightly lower variance in birthweight for -premature births to smoking mothers—as compared to premature births to -non-smoking mothers—may simply be due to a smaller data set.

-

When comparing age distributions of mothers of twins or of triplets -we see an unexpected difference. It appears that mothers of all ages, -from teenage moms to moms in their early forties, all can have twins. -By contrast, only mothers in their thirties appear to have triplets. We -can think of a possible explanation. Twin births happen due to natural -causes and therefore can occur in mothers of all ages. Triplet births, -however, are extremely unlikely to occur naturally, and most commonly -are caused by fertility treatments that cause multiple eggs to mature at -once. It is unlikely that women in their late teens or twenties will -undergo fertility treatment, whereas women in their thirties do so -frequently. We also note, however, that there are only four triplet -births in the dataset, so the lack of younger mothers could be due to -random chance. We would have to perform further analysis or run -statistical tests to develop a clearer picture of what mechanisms may have -caused the observed patterns in the data.

- - - - - - - - - - - - - - - diff --git a/assignments/Project_1_instructions.html b/assignments/Project_1_instructions.html deleted file mode 100644 index a016caa..0000000 --- a/assignments/Project_1_instructions.html +++ /dev/null @@ -1,475 +0,0 @@ - - - - - - - - - - - - - -Project 1 Instructions - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Please use the project template R Markdown document to complete your -project. The knitted R Markdown document (as a PDF) and the raw -R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on -Tues., Feb 21, 2023. These two documents will be graded -jointly, so they must be consistent (as in, don’t change the R Markdown -file without also updating the knitted document!).

-

All results presented must have corresponding code. -Any answers/results given without the corresponding R code that -generated the result will be considered absent. To be clear: if -you do calculations by hand instead of using R and then report the -results from the calculations, you will not receive -credit for those calculations. All code reported in your final -project document should work properly. Please do not include any -extraneous code or code which produces error messages. (Code which -produces warnings is acceptable, as long as you understand what the -warnings mean.)

-

For this project, you will be using an Olympic Games dataset, which -is a compilation of records for athletes that have competed in the -Olympics from Athens 1896 to Rio 2016.

-

Each record contains information including the name of the athlete -(name), their sex, their age, -their height, their weight, their -team, their nationality (noc), the -games at which they played, the year, the -olympic season, the city where the olympics -took place, the sport, the name of the event -(event), the decade during which the Olympics took place -(decade), whether or not the athlete won a gold medal -(gold), whether or not the athlete won any medal -(medalist) and if the athlete won “Gold”, “Silver”, -“Bronze” or received “no medal” (medal). More information -about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

-

We will provide you with specific questions to answer and specific -instructions on how to answer the questions. The project should be -structured as follows:

- -

We encourage you to be concise. A paragraph should typically not be -longer than 5 sentences.

-

You are not required to perform any statistical -tests in this project, but you may do so if you find it helpful to -answer your question.

-
-

Instructions

-

In the Introduction section, write a brief introduction to the -dataset, the questions, and what parts of the dataset are necessary to -answer the questions. You may repeat some of the information about the -dataset provided above, paraphrasing on your own terms. Imagine that -your project is a standalone document and the grader has no prior -knowledge of the dataset.

-

In the Approach section, describe what types of plots you are going -to make to address your questions. For each plot, provide a clear -explanation as to why this plot (e.g. boxplot, barplot, histogram, etc.) -is best for providing the information you are asking about. (You can -draw on the materials provided -here for guidance.) At least two plots should be of -different types, and at least one of the plots needs to use either color -mapping or facets.

-

In the Analysis section, provide the code that generates your plots. -Use scale functions to provide nice axis labels and guides. You are -welcome to use theme functions to customize the appearance of your plot, -but you are not required to do so. All plots must be made with -ggplot2. Do not use base R plotting functions.

-

In the Discussion section, interpret the results of your analysis. -Identify any trends revealed (or not revealed) by the plots. Speculate -about why the data looks the way it does.

-
- - - - - - - - - - - - - - - diff --git a/assignments/Project_1_rubric.pdf b/assignments/Project_1_rubric.pdf deleted file mode 100644 index 1729159..0000000 Binary files a/assignments/Project_1_rubric.pdf and /dev/null differ diff --git a/assignments/Project_2.Rmd b/assignments/Project_2.Rmd deleted file mode 100644 index 11a9a4a..0000000 --- a/assignments/Project_2.Rmd +++ /dev/null @@ -1,41 +0,0 @@ ---- -title: "Project 2" -output: html_document ---- - -```{r setup, include=FALSE} -library(tidyverse) -knitr::opts_chunk$set(echo = TRUE) -``` - -*Enter your name and EID here* - -This is the dataset you will be working with: -```{r message = FALSE} -members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv') - -members -``` - -More information about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-22/readme.md and https://www.himalayandatabase.com/. - -**Question 1:** *Your question 1 here.* - -**Question 2:** *Your question 2 here.* - -**Introduction:** *Your introduction here.* - -**Approach:** *Your approach here.* - -**Analysis:** - -```{r} -# Your R code here -``` - -```{r} -# Your R code here -``` - -**Discussion:** *Your discussion of results here.* - diff --git a/assignments/Project_2.html b/assignments/Project_2.html deleted file mode 100644 index e43f6fe..0000000 --- a/assignments/Project_2.html +++ /dev/null @@ -1,438 +0,0 @@ - - - - - - - - - - - - - -Project 2 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Enter your name and EID here

-

This is the dataset you will be working with:

-
members <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv')
-
-members
-
## # A tibble: 76,519 × 21
-##    expedition…¹ membe…² peak_id peak_…³  year season sex     age citiz…⁴ exped…⁵
-##    <chr>        <chr>   <chr>   <chr>   <dbl> <chr>  <chr> <dbl> <chr>   <chr>  
-##  1 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        40 France  Leader 
-##  2 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        41 France  Deputy…
-##  3 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        27 France  Climber
-##  4 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        40 France  Exp Do…
-##  5 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        34 France  Climber
-##  6 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        25 France  Climber
-##  7 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        41 France  Climber
-##  8 AMAD78301    AMAD78… AMAD    Ama Da…  1978 Autumn M        29 France  Climber
-##  9 AMAD79101    AMAD79… AMAD    Ama Da…  1979 Spring M        35 USA     Climber
-## 10 AMAD79101    AMAD79… AMAD    Ama Da…  1979 Spring M        37 W Germ… Climber
-## # … with 76,509 more rows, 11 more variables: hired <lgl>,
-## #   highpoint_metres <dbl>, success <lgl>, solo <lgl>, oxygen_used <lgl>,
-## #   died <lgl>, death_cause <chr>, death_height_metres <dbl>, injured <lgl>,
-## #   injury_type <chr>, injury_height_metres <dbl>, and abbreviated variable
-## #   names ¹​expedition_id, ²​member_id, ³​peak_name, ⁴​citizenship,
-## #   ⁵​expedition_role
-

More information about the dataset can be found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-22/readme.md -and https://www.himalayandatabase.com/.

-

Question 1: Your question 1 here.

-

Question 2: Your question 2 here.

-

Introduction: Your introduction here.

-

Approach: Your approach here.

-

Analysis:

-
# Your R code here
-
# Your R code here
-

Discussion: Your discussion of results -here.

- - - - - - - - - - - - - - - diff --git a/assignments/Project_2_instructions.html b/assignments/Project_2_instructions.html deleted file mode 100644 index 6e5b2c7..0000000 --- a/assignments/Project_2_instructions.html +++ /dev/null @@ -1,494 +0,0 @@ - - - - - - - - - - - - - -Project 2 Instructions - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Please use the project template R Markdown document to complete your -project. The knitted R Markdown document (as a PDF) and the raw -R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on -Tues., March 21, 2023. These two documents will be -graded jointly, so they must be consistent (as in, don’t change the R -Markdown file without also updating the knitted document!).

-

All results presented must have corresponding code. -Any answers/results given without the corresponding R code that -generated the result will be considered absent. All code -reported in your final project document should work properly. Please do -not include any extraneous code or code which produces error messages. -(Code which produces warnings is acceptable, as long as you understand -what the warnings mean.)

-

For this project, you will be using a dataset about Himalayan -expeditions, taken from the Himalayan Database, a compilation of records -for all expeditions that have climbed in the Nepal Himalaya. The dataset -members contains records for all individuals who -participated in expeditions from 1905 through Spring 2019 to more than -465 significant peaks in Nepal.

-

Each record contains information including the name of the mountain -(peak_name), the year of the expedition -(year), the season (season), the age of the -expedition member (age), their citizenship -(citizenship), whether they used oxygen -(oxygen_used), and whether they successfully summitted the -peak (success). More information about the dataset can be -found at https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-22/readme.md -and https://www.himalayandatabase.com/.

-

The project structure will be similar to Project 1. However, this -time you will define the questions that you will then answer. The final -project should be structured as follows:

- -

We encourage you to be concise. A paragraph should typically not be -longer than 5 sentences.

-

You are not required to perform any statistical -tests in this project, but you may do so if you find it helpful to -answer your questions.

-
-

Instructions

-

First state the two questions you will answer. The questions should -be conceptual and open-ended and not prompt a specific analysis. In -particular, make sure you understand the difference between a question -and an instruction.

-

This is a question: How has the weight distribution of alpine -skiers changed over the years?

-

This is not a question; it is an instruction: -Make a series of boxplots of the weight of alpine skiers versus the -year of the olympics.

-

This is a question that prompts a specific analysis; it is actually -an instruction pretending to be a question: What is the value of the -slope parameter in a regression of skier weight versus year?

-

In the Introduction section, write a brief introduction to the -dataset and describe what parts of the dataset are necessary to answer -your questions. Imagine that your project is a standalone document and -the grader has no prior knowledge of the dataset. -Important: You must provide a detailed description of -data columns you are going to use in your analysis, reproducing relevant -information from the data dictionary as necessary.

-

In the Approach section, describe what type of data wrangling you -will perform and what kind of plot you will generate to address your -questions. Provide a clear explanation as to why this plot -(e.g. boxplot, barplot, histogram, etc.) is best for providing the -information you are asking about. (You can draw on the materials provided -here for guidance.) The two plots should be of different -types, and at least one plot needs to use either color mapping or -faceting or both.

-

Across your two questions, your data wrangling code needs to use at -least three different data manipulation functions that modify data -tables, such as mutate(), filter(), -arrange(), select(), summarize(), -etc.
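As an illustration of what counts here, the sketch below chains `filter()`, `mutate()`, `summarize()`, and `arrange()` in one pipeline. The table `toy_members` is invented and only borrows a few column names from `members`; the question it answers is a placeholder, not a suggested project question.

```r
library(dplyr)

# Invented miniature of the members table
toy_members <- tibble(
  peak_name = c("Everest", "Everest", "Ama Dablam", "Ama Dablam", "Everest"),
  year      = c(1978, 1979, 1978, 1979, 1979),
  age       = c(40, 35, 27, NA, 52),
  success   = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

success_by_peak <- toy_members %>%
  filter(!is.na(age)) %>%                  # drop members with unknown age
  mutate(over_40 = age >= 40) %>%          # derive a new logical column
  group_by(peak_name) %>%
  summarize(
    n = n(),                               # members per peak
    success_rate = mean(success),          # share who summited
    share_over_40 = mean(over_40)          # share aged 40 or older
  ) %>%
  arrange(desc(success_rate))              # highest success rate first

success_by_peak
```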

-

In the Analysis section, provide the code that performs required data -wrangling and then generates your plots. You may find it helpful to -compute and output summary tables in addition to making plots. Use scale -functions to provide nice axis labels and guides. Also, use theme -functions to customize the appearance of your plot. For full -points, you will have to apply some unique styling to your -plots; you cannot rely exclusively on preexisting theme -functions. All plots must be made with ggplot2. Do not use base R -plotting functions.

In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by your plots. Speculate about why the data looks the way it does.

diff --git a/assignments/Project_2_rubric.pdf b/assignments/Project_2_rubric.pdf
deleted file mode 100644
index f9c22e5..0000000
Binary files a/assignments/Project_2_rubric.pdf and /dev/null differ
diff --git a/assignments/Project_3.Rmd b/assignments/Project_3.Rmd
deleted file mode 100644
index 2c3b69f..0000000
--- a/assignments/Project_3.Rmd
+++ /dev/null
@@ -1,30 +0,0 @@
----
-title: "Project 3"
-output: html_document
----
-
-```{r setup, include=FALSE}
-library(tidyverse)
-knitr::opts_chunk$set(echo = TRUE)
-```
-
-*Enter your name and EID here*
-
-**Introduction:** *Your introduction here.*
-
-```{r}
-# Load your dataset here
-```
-
-**Question:** *Your question here.*
-
-**Approach:** *Your approach here.*
-
-**Analysis:**
-
-```{r}
-# Your R code here
-```
-
-**Discussion:** *Your discussion of results here.*
-
diff --git a/assignments/Project_3.html b/assignments/Project_3.html
deleted file mode 100644
index 18e831b..0000000
--- a/assignments/Project_3.html
+++ /dev/null
@@ -1,412 +0,0 @@

Project 3

Enter your name and EID here

Introduction: Your introduction here.

# Load your dataset here

Question: Your question here.

Approach: Your approach here.

Analysis:

# Your R code here

Discussion: Your discussion of results here.

diff --git a/assignments/Project_3_instructions.html b/assignments/Project_3_instructions.html
deleted file mode 100644
index ddec562..0000000
--- a/assignments/Project_3_instructions.html
+++ /dev/null
@@ -1,477 +0,0 @@

Project 3 Instructions

Please use the project template R Markdown document to complete your project. The knitted R Markdown document (as a PDF) and the raw R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on Tues., April 18, 2023. These two documents will be graded jointly, so they must be consistent (as in, don’t change the R Markdown file without also updating the knitted document!).

All results presented must have corresponding code. Any answers/results given without the corresponding R code that generated the result will be considered absent. All code reported in your final project document should work properly. Please do not include any extraneous code or code which produces error messages. (Code which produces warnings is acceptable, as long as you understand what the warnings mean.)

For this project, you will be choosing your own dataset, given the following constraints: Pick one of the datasets published by the Tidy Tuesday project between June 7, 2022 and December 27, 2022 (both dates inclusive). All these datasets are available here: https://github.com/rfordatascience/tidytuesday/tree/master/data

The project structure will be similar to Project 2, except there will be only one question. The final project should be structured as follows:

- -

We encourage you to be concise. A paragraph should typically not be longer than 5 sentences.

Important: Your project needs to include some material from classes 19 or 21–24, i.e., either some statistical modeling applied to subsets of data or some dimension reduction or clustering. We recommend you do a PCA, but you are not required to do so if you use one of the other techniques.
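For orientation, a PCA along these lines might be sketched as follows. This uses base R's `prcomp()` on the built-in `iris` data, which is only a stand-in and not one of the Tidy Tuesday datasets. Which numeric columns to include, and whether scaling is appropriate, are assumptions that depend on the dataset you pick.

```r
library(ggplot2)

# run the PCA on the four numeric columns, scaling each to unit variance
pca_fit <- prcomp(iris[, 1:4], scale. = TRUE)

# combine the PC coordinates with the species labels for plotting
pca_scores <- data.frame(pca_fit$x, Species = iris$Species)

# proportion of total variance explained by each component
variance_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)

pca_plot <- ggplot(pca_scores, aes(x = PC1, y = PC2, color = Species)) +
  geom_point() +
  labs(x = "PC 1", y = "PC 2")

pca_plot
```

Looking at `variance_explained` alongside the scatter plot helps judge how much of the structure in the data the first two components actually capture.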

Instructions

In the Introduction section, write a brief introduction to the dataset and describe what parts of the dataset are necessary to answer your question. Imagine that your project is a standalone document and the grader has no prior knowledge of the dataset. Important: You must provide a detailed description of data columns you are going to use in your analysis, reproducing relevant information from the data dictionary as necessary.

Next you will state your question. The question should be conceptual and open-ended and not prompt a specific analysis. In particular, make sure you understand the difference between a question and an instruction.

This is a question: How has the weight distribution of alpine skiers changed over the years?

This is not a question; it is an instruction: Make a series of boxplots of the weight of alpine skiers versus the year of the Olympics.

This is a question that prompts a specific analysis; it is actually an instruction pretending to be a question: What is the value of the slope parameter in a regression of skier weight versus year?

In the Approach section, describe what type of data wrangling you will perform and what kind of plot(s) you will generate to address your questions. Provide a clear explanation as to why these plots (e.g. boxplot, barplot, histogram, etc.) are best for providing the information you are asking about. (You can draw on the materials provided here for guidance.)

In the Analysis section, provide the code that performs required data wrangling and then generates your plots and/or summary table. Use scale functions to provide nice axis labels and guides. Also, use theme functions to customize the appearance of your plot. For full points, you will have to apply some unique styling to your plots; you cannot rely exclusively on preexisting theme functions. All plots must be made with ggplot2. Do not use base R plotting functions.

In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by your analysis. Speculate about why the data looks the way it does.

diff --git a/assignments/Project_3_rubric.pdf b/assignments/Project_3_rubric.pdf
deleted file mode 100644
index b9c6305..0000000
Binary files a/assignments/Project_3_rubric.pdf and /dev/null differ
diff --git a/assignments/grad_assignment.Rmd b/assignments/grad_assignment.Rmd
deleted file mode 100644
index 59703da..0000000
--- a/assignments/grad_assignment.Rmd
+++ /dev/null
@@ -1,32 +0,0 @@
-```{r global_options, include=FALSE}
-library(knitr)
-opts_chunk$set(fig.align = "center", fig.height = 5, fig.width = 6)
-library(tidyverse)
-theme_set(theme_bw(base_size = 12))
-library(ggthemes)
-```
-
-## SDS 395 Report
-*Enter your name and EID here*
-
-### Instructions
-
-Write a brief (approximately 2--4 pages total, maximum is 5 pages) report on a data analysis topic of your choice. You should state a clear question, explain briefly the dataset you are using, and then provide an analysis to answer your question. All code required to perform the analysis must be included in the R Markdown document. Please do not include more than 3 figures in your analysis.
-
-The knitted R Markdown document (as a PDF) *and* the raw R Markdown file (as .Rmd) as well as any required data files should be submitted to Canvas by **March 31, 2023.** These two documents will be graded jointly, so they must be consistent (as in, don't change the R Markdown file without also updating the knitted document).
-
-Grading comments:
-
-1. This assignment will be graded pass/fail, but you can make revisions to obtain a passing grade. See the course syllabus for details.
-
-2. If you need more than 5 pages or more than 3 figures your project is too complex! You are not being graded on complexity. The goal is clarity and succinctness.
-
-3. You are encouraged to work with data from your own research. If you would need extensive amounts of code to clean up the data and prepare for the analysis, then do this ahead of time, export the data to a clean csv file, and then work with that csv file here. Just explain in a few sentences what data the csv file contains and how it was generated.
-
-4. Please don't include any extraneous code that does not contribute to answering the question.
-
-5. The project instructions do not count towards the page limit. Feel free to delete them.
-
-### Report
-
-*Please add your report here.*
diff --git a/assignments/grad_assignment.html b/assignments/grad_assignment.html
deleted file mode 100644
index 6a477df..0000000
--- a/assignments/grad_assignment.html
+++ /dev/null
@@ -1,444 +0,0 @@

grad_assignment.knit

SDS 395 Report

Enter your name and EID here

Instructions

Write a brief (approximately 2–4 pages total, maximum is 5 pages) report on a data analysis topic of your choice. You should state a clear question, explain briefly the dataset you are using, and then provide an analysis to answer your question. All code required to perform the analysis must be included in the R Markdown document. Please do not include more than 3 figures in your analysis.

The knitted R Markdown document (as a PDF) and the raw R Markdown file (as .Rmd) as well as any required data files should be submitted to Canvas by March 31, 2023. These two documents will be graded jointly, so they must be consistent (as in, don’t change the R Markdown file without also updating the knitted document).

Grading comments:

  1. This assignment will be graded pass/fail, but you can make revisions to obtain a passing grade. See the course syllabus for details.

  2. If you need more than 5 pages or more than 3 figures your project is too complex! You are not being graded on complexity. The goal is clarity and succinctness.

  3. You are encouraged to work with data from your own research. If you would need extensive amounts of code to clean up the data and prepare for the analysis, then do this ahead of time, export the data to a clean csv file, and then work with that csv file here. Just explain in a few sentences what data the csv file contains and how it was generated.

  4. Please don’t include any extraneous code that does not contribute to answering the question.

  5. The project instructions do not count towards the page limit. Feel free to delete them.
Report

Please add your report here.
