HW 7 and Project 3

wilkelab · Apr 1, 2024 · 9516f20
1 parent c05e657
commit 9516f20

Showing 13 changed files with 3,928 additions and 5 deletions.
diff --git a/assignments/HW7.Rmd b/assignments/HW7.Rmd
@@ -0,0 +1,47 @@
+---
+title: "Homework 7"
+output:
+  html_document:
+    theme:
+      version: 4
+---
+
+```{r global_options, include=FALSE}
+library(knitr)
+library(tidyverse)
+library(broom)
+opts_chunk$set(fig.align="center", fig.height=4.326, fig.width=7)
+```
+
+**This homework is due on Apr 11, 2024 at 11:00pm. Please submit as a pdf file on Canvas.**
+
+For both problems in this homework, we will work with the `heart_disease_data` dataset, which is a simplified and recoded version of a dataset available from Kaggle. You can read about the original dataset here: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download
+
+The `heart_disease_data` dataset contains 9 variables: `HeartDisease` (whether or not the participant has heart disease), `BMI` (body mass index), `PhysicalHealth` (number of days per month on which their physical health was not good), `MentalHealth` (number of days per month on which their mental health was not good), `ApproximateAge` (participant's age), `SleepTime` (hours of sleep they get in a 24-hour period), `Smoking` (1 = smoker, 0 = nonsmoker), `AlcoholDrinking` (1 = drinks alcohol, 0 = does not drink), `PhysicalActivity` (1 = did physical activity or exercise during the past 30 days, 0 = hardly any physical activity). Compared to the original dataset, the columns `ApproximateAge`, `Smoking`, `AlcoholDrinking`, and `PhysicalActivity` have been converted into numeric columns so they can be included in a PCA.
+
+**Note:** This homework is about the contents of the plots. Don't worry about styling. It's OK to use the default theme and plot labeling.
+
+
+```{r message = FALSE}
+heart_data <- read_csv("https://wilkelab.org/SDS375/datasets/heart_disease_data.csv")
+```
+
+**Problem 1: (10 pts)** 
+
+Perform a PCA of the numerical columns of the `heart_disease_data` dataset. Then make two plots: a rotation plot of components 1 and 2, and a plot of the eigenvalues showing the amount of variance explained by each component.
+
+```{r}
+# your code here
+```
+
+```{r}
+# your code here
+```
+
+
+**Problem 2: (10 pts)** Make a scatter plot of PC 2 versus PC 1 and color by heart disease status. Then use the rotation plot from Problem 1 to describe the variables/factors by which we can separate the study participants with heart disease from the study participants without heart disease. 
+
+
+```{r}
+# your code here
+```
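The workflow that Problems 1 and 2 describe (fit a PCA, plot the rotation matrix, plot the variance explained, then plot the scores colored by a grouping variable) can be sketched roughly as follows. This is a minimal illustration, not part of the commit and not a solution: it assumes the built-in `iris` dataset as a stand-in for `heart_disease_data`, and the object names (`pca_fit`, `rotation`, `eigenvalues`, `scores`) are hypothetical.

```r
library(tidyverse)
library(broom)

# Assumption: iris stands in for heart_disease_data so the sketch runs offline
pca_fit <- iris %>%
  select(where(is.numeric)) %>%  # keep only the numeric columns
  scale() %>%                    # scale to zero mean and unit variance
  prcomp()

# Rotation matrix in wide form: one row per variable, one column per PC
rotation <- pca_fit %>%
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value")

# Rotation plot of components 1 and 2
rotation_plot <- ggplot(rotation, aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0) +
  geom_text(aes(label = column), hjust = 1) +
  coord_fixed()

# Eigenvalues: `percent` holds the proportion of variance explained per PC
eigenvalues <- tidy(pca_fit, matrix = "eigenvalues")
variance_plot <- ggplot(eigenvalues, aes(PC, percent)) +
  geom_col()

# Scores: the original data augmented with .fittedPC1, .fittedPC2, ...
scores <- augment(pca_fit, iris)
score_plot <- ggplot(scores, aes(.fittedPC1, .fittedPC2, color = Species)) +
  geom_point()
```

Printing `rotation_plot`, `variance_plot`, or `score_plot` draws the corresponding figure. Reading the score plot against the rotation plot is what lets you say which variables point toward one group of observations and away from another.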
diff --git a/assignments/HW7.html b/assignments/HW7.html
diff --git a/assignments/Project_3.Rmd b/assignments/Project_3.Rmd
@@ -0,0 +1,47 @@
+---
+title: "Project 3"
+output:
+  html_document:
+    theme:
+      version: 4
+---
+
+```{r setup, include=FALSE}
+library(tidyverse)
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+In this project, you will be working with a dataset of your own choosing. **Important:** The dataset needs to be picked from the [TidyTuesday project](https://github.com/rfordatascience/tidytuesday/tree/master/data/2023), and it needs to be one that was released between May 30, 2023 and December 26, 2023 (both dates inclusive).
+
+**Hints:**
+
+- Read in your data with `readr::read_csv()`, as we have done in prior projects. **Do not use the tidytuesdayR package.** The TidyTuesday site explains for each dataset how it can be read with `readr::read_csv()`, under "Get the data here", part "Or read in the data manually".
+
+- Make sure your question is actually a question, and not a veiled instruction to perform a particular analysis.
+
+- Adjust `fig.width` and `fig.height` in the chunk headers to customize figure sizing and figure aspect ratios. These numbers are measured in inches and will usually fall between 4 and 10.
+
+You can delete these instructions from your project. Please also delete text such as *Your approach here* or `# Code for figure 1 here`.
+
+**Introduction:** *Your introduction here.*
+
+**Question:** *Your question here.*
+
+**Approach:** *Your approach here.*
+
+**Analysis:**
+
+```{r}
+# Data loading/wrangling/analysis code here
+```
+
+```{r fig.width = 5, fig.height = 5}
+# Code for figure 1 here
+```
+
+```{r fig.width = 5, fig.height = 5}
+# Code for figure 2 here
+```
+
+**Discussion:** *Your discussion of results here.*
diff --git a/assignments/Project_3.html b/assignments/Project_3.html
diff --git a/assignments/Project_3_instructions.html b/assignments/Project_3_instructions.html
diff --git a/docs/assignments/HW7.Rmd b/docs/assignments/HW7.Rmd
@@ -0,0 +1,47 @@
+---
+title: "Homework 7"
+output:
+  html_document:
+    theme:
+      version: 4
+---
+
+```{r global_options, include=FALSE}
+library(knitr)
+library(tidyverse)
+library(broom)
+opts_chunk$set(fig.align="center", fig.height=4.326, fig.width=7)
+```
+
+**This homework is due on Apr 11, 2024 at 11:00pm. Please submit as a pdf file on Canvas.**
+
+For both problems in this homework, we will work with the `heart_disease_data` dataset, which is a simplified and recoded version of a dataset available from Kaggle. You can read about the original dataset here: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download
+
+The `heart_disease_data` dataset contains 9 variables: `HeartDisease` (whether or not the participant has heart disease), `BMI` (body mass index), `PhysicalHealth` (number of days per month on which their physical health was not good), `MentalHealth` (number of days per month on which their mental health was not good), `ApproximateAge` (participant's age), `SleepTime` (hours of sleep they get in a 24-hour period), `Smoking` (1 = smoker, 0 = nonsmoker), `AlcoholDrinking` (1 = drinks alcohol, 0 = does not drink), `PhysicalActivity` (1 = did physical activity or exercise during the past 30 days, 0 = hardly any physical activity). Compared to the original dataset, the columns `ApproximateAge`, `Smoking`, `AlcoholDrinking`, and `PhysicalActivity` have been converted into numeric columns so they can be included in a PCA.
+
+**Note:** This homework is about the contents of the plots. Don't worry about styling. It's OK to use the default theme and plot labeling.
+
+
+```{r message = FALSE}
+heart_data <- read_csv("https://wilkelab.org/SDS375/datasets/heart_disease_data.csv")
+```
+
+**Problem 1: (10 pts)** 
+
+Perform a PCA of the numerical columns of the `heart_disease_data` dataset. Then make two plots: a rotation plot of components 1 and 2, and a plot of the eigenvalues showing the amount of variance explained by each component.
+
+```{r}
+# your code here
+```
+
+```{r}
+# your code here
+```
+
+
+**Problem 2: (10 pts)** Make a scatter plot of PC 2 versus PC 1 and color by heart disease status. Then use the rotation plot from Problem 1 to describe the variables/factors by which we can separate the study participants with heart disease from the study participants without heart disease. 
+
+
+```{r}
+# your code here
+```
diff --git a/docs/assignments/HW7.html b/docs/assignments/HW7.html
diff --git a/docs/assignments/Project_3.Rmd b/docs/assignments/Project_3.Rmd
@@ -0,0 +1,47 @@
+---
+title: "Project 3"
+output:
+  html_document:
+    theme:
+      version: 4
+---
+
+```{r setup, include=FALSE}
+library(tidyverse)
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+In this project, you will be working with a dataset of your own choosing. **Important:** The dataset needs to be picked from the [TidyTuesday project](https://github.com/rfordatascience/tidytuesday/tree/master/data/2023), and it needs to be one that was released between May 30, 2023 and December 26, 2023 (both dates inclusive).
+
+**Hints:**
+
+- Read in your data with `readr::read_csv()`, as we have done in prior projects. **Do not use the tidytuesdayR package.** The TidyTuesday site explains for each dataset how it can be read with `readr::read_csv()`, under "Get the data here", part "Or read in the data manually".
+
+- Make sure your question is actually a question, and not a veiled instruction to perform a particular analysis.
+
+- Adjust `fig.width` and `fig.height` in the chunk headers to customize figure sizing and figure aspect ratios. These numbers are measured in inches and will usually fall between 4 and 10.
+
+You can delete these instructions from your project. Please also delete text such as *Your approach here* or `# Code for figure 1 here`.
+
+**Introduction:** *Your introduction here.*
+
+**Question:** *Your question here.*
+
+**Approach:** *Your approach here.*
+
+**Analysis:**
+
+```{r}
+# Data loading/wrangling/analysis code here
+```
+
+```{r fig.width = 5, fig.height = 5}
+# Code for figure 1 here
+```
+
+```{r fig.width = 5, fig.height = 5}
+# Code for figure 2 here
+```
+
+**Discussion:** *Your discussion of results here.*
diff --git a/docs/assignments/Project_3.html b/docs/assignments/Project_3.html
diff --git a/docs/assignments/Project_3_instructions.html b/docs/assignments/Project_3_instructions.html
diff --git a/docs/schedule.html b/docs/schedule.html
@@ -2741,6 +2741,24 @@ <h3 id="mar-28-2024dimension-reduction-2">20. Mar 28, 2024—Dimension reduction
 </li>
 <li><a href="worksheets/dimension-reduction-2.Rmd">Worksheet</a></li>
 </ul>
+<h3 id="apr-2-2024clustering">21. Apr 2, 2024—Clustering</h3>
+<p class="nospace">
+Materials:
+</p>
+<ul>
+<li><a href="slides/clustering.html">Slides</a><br />
+</li>
+<li><a href="worksheets/clustering.Rmd">Worksheet</a></li>
+</ul>
+<h3 id="apr-4-2024hierarchical-clustering">22. Apr 4, 2024—Hierarchical clustering</h3>
+<p class="nospace">
+Materials:
+</p>
+<ul>
+<li><a href="slides/hierarchical-clustering.html">Slides</a><br />
+</li>
+<li><a href="worksheets/hierarchical-clustering.Rmd">Worksheet</a></li>
+</ul>
 <h2 id="homeworks">Homeworks</h2>
 <p>All homeworks are due by 11:00pm on the day they are due. Homeworks need to be submitted as pdf files on Canvas.</p>
 <h3 id="homework-1-due-jan-25-2024">Homework 1 (due Jan 25, 2024)</h3>
@@ -2792,6 +2810,13 @@ <h3 id="homework-6-due-apr-4-2024">Homework 6 (due Apr 4, 2024)</h3>
 <li><a href="assignments/HW6.html">HTML</a></li>
 </ul>
 <h3 id="homework-7-due-apr-11-2024">Homework 7 (due Apr 11, 2024)</h3>
+<p class="nospace">
+Materials:
+</p>
+<ul>
+<li><a href="assignments/HW7.Rmd">R Markdown template</a></li>
+<li><a href="assignments/HW7.html">HTML</a></li>
+</ul>
 <h2 id="projects">Projects</h2>
 <p>All projects are due by 11:00pm on the day they are due. Projects need to be submitted on Canvas. Please carefully read the submission instructions for each project.</p>
 <h3 id="project-1-due-feb-15-2024">Project 1 (due Feb 15, 2024)</h3>
@@ -2817,6 +2842,14 @@ <h3 id="project-2-due-mar-21-2024">Project 2 (due Mar 21, 2024)</h3>
 </ul>
 <p>Please use the example and the solutions from Project 1 as examples for Project 2.</p>
 <h3 id="project-3-due-apr-18-2024">Project 3 (due Apr 18, 2024)</h3>
+<p class="nospace">
+Materials:
+</p>
+<ul>
+<li><a href="assignments/Project_3_instructions.html">Instructions</a></li>
+<li><a href="assignments/Project_3.Rmd">Project Template (Rmd)</a></li>
+<li><a href="assignments/Project_3.html">Project Template (HTML)</a></li>
+</ul>
 <h2 class="appendix" id="reuse">Reuse</h2>
 <p>Text and figures are licensed under Creative Commons Attribution <a href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>. Any computer code (R, HTML, CSS, etc.) in slides and worksheets, including in slide and worksheet sources, is also licensed under <a href="https://github.com/wilkelab/SDS375/LICENSE.md">MIT</a>. Note that figures in slides may be pulled in from external sources and may be licensed under different terms. For such images, image credits are available in the slide notes, accessible via pressing the letter ‘p’.</p>
 <div class="sourceCode" id="cb1"><pre class="sourceCode r distill-force-highlighting-css"><code class="sourceCode r"></code></pre></div>