voter-turnout-doc.Rmd

---
title: "Impact of Voter Age on Election Outcomes"
author: "Hans Quiogue and Jackson Callaghan"
date: "5/3/2020"
output: html_document
runtime: shiny
---
<style>
    body .main-container {
        max-width: 1000px;
    }
</style>

```{r Libraries, message=FALSE, warning=FALSE, include=FALSE}
library(tidyverse)
library(readxl)
library(pander)
library(forecast)
library(DT)
library(reshape2)
library(kableExtra)
library(viridis)
library(shiny)
library(scales)
library(gridExtra)
library(fiftystater)
library(mapproj)
library(wesanderson)
```

```{r Set Theme, message=FALSE, warning=FALSE, include=FALSE}
theme_set(theme_grey())
```

```{r Data Cleaning for Historical Data, message=FALSE, warning=FALSE, include=FALSE}
# Uncleaned historical data
hist_df <- read_xlsx("a1.xlsx", skip = 5, na = "NA")
# Different voting category names to put into columns for the dataframe
voting_col_names <- c("Year", "Population-Count", "Population-Percent",
                      "Citizen-Percent", "White-Percent", "Citizen-White-Percent", 
                      "White-Non-Hispanic-Percent", "Citizen-White-Non-Hispanic-Perc", 
                      "Black-Percent", "Citizen-Black-Percent", "Asian-Percent",
                      "Citizen-Asian-Percent", "Hispanic/Other-Percent",
                      "Citizen-Hispanic/Other-Perc", "Male-Percent", "Female-Percent")
# Assigns proper column names 
colnames(hist_df) <- voting_col_names
# Converts all columns to numeric
hist_df[] <- lapply(hist_df, function(x) as.numeric(x))
# Converts year column to a categorical value
hist_df$Year <- as.factor(hist_df$Year)

# 1964 to 2018 (Goes by every 2 year)
years_length <- (2018 - 1964) / 2
# List of proper age groups and voting status to put into dataframe
age_groups <- c("Total", "18 to 24", "25 to 44", "45 to 64", "65+")
vote_status <- c("Voted", "Registered")

temp_df <- data.frame()

# Integer values representing current index values of corresponding list
curr_status <- 1
curr_age <- 1

# Loop to cleans up hist_df (Loop increments are added by 6 to account for missing columns)
for(i in seq(1, nrow(hist_df), years_length + 6)) {
  # Appends new rows to temp_df from subset of hist_df
  if(i != 1) temp_df <- bind_rows(temp_df, hist_df[(i - 1):((i - 1) + years_length), ])
  else temp_df <- bind_rows(temp_df, hist_df[i:(i + years_length), ])
}

# Appends new columns for age groups and voting statuses
temp_df <- data.frame(append(temp_df, c(Age = age_groups[curr_status], 
                                        Status = vote_status[curr_age]),
                                        after = 0), stringsAsFactors = FALSE)

temp_index = 1
# Loop that corrects age group and voting status rows
for(i in 1:(length(age_groups) * length(vote_status))) {
  temp_df$Age[temp_index:(temp_index + years_length)] <- age_groups[curr_age]
  temp_df$Status[temp_index:(temp_index + years_length)] <- vote_status[curr_status]
  temp_index <- (temp_index + years_length) + 1
  # Updates age group and voting status to match rows
  if(curr_status == 1) curr_status = curr_status + 1
  else {
    curr_status = 1
    curr_age = curr_age + 1
  }
}
# Removes unnessary rows at the nd
hist_df <- temp_df[1:279, ]

hist_df$Age <- as.factor(hist_df$Age)
hist_df$Status <- as.factor(hist_df$Status)
hist_df <- mutate(hist_df, election.type = ifelse(Year %in% seq(1964, 2020, 4), 
                                                  "Presidential", "Midterm")) 
```

## Project Description

This project will be focused on how voter turnout has affected the outcomes of United States elections using visualizations and some aggregate statistics. Specifically, we wanted to look at how turnout breaks down in terms of age. Low turnout of young voters is a major problem in America, and is typically attributed to a number of factors. In this project, we explored voter turnout data to see how the young vote affects election outcomes. The result is this html-knitted markdown file with embedded Shiny applications detailing our question, methods, and findings alongside interactive exploratory and more question-focused visualizations, and aggregate statistics. 

The main R libraries we used for this project were tidyverse, to wrangle and visualize the data, and Shiny, which made all of our plots interactive. Tidyverse includes ggplot2 and ensured that our visualizations were well refined while Shiny enhanced the plots overall for users by making them dynamic. Viridis is package that includes multiple color palletes and we used the library to make our plots consistent with a constant color scheme. Other packages we used to provide more interesting and informative plots and statistical tests were pander, forecast, fiftstarter, DT, fiftystarter, and many others.

The data sources we used were the following:

- Census: https://www.census.gov/topics/public-sector/voting.html

- The United States Elections Project: http://www.electproject.org/home

- The MIT Elections Lab: https://electionlab.mit.edu/data

## Project Summary 

We looked at a number of interactions in elections, and arrived at several interesting conclusions, as well as a few non-conclusions. As a result, our project became more of an exploratory report on some concepts in election turnout. Turnout rate to elections is lagging behind population, and so overall turnout rate has been on the decline or stagnant since 1964. Young voters aged 18-24 have the lowest turnout of any age group. Asian and Hispanic/other American voters have significantly lower turnout than other ethnicities, which holds true across different age groups. The turnout of young voters appear to affect the party leanings in votes by states, however we could not find a statistically significant conclusion supporting this. Finally, overall turnout would appear to affect the party leanings in votes by states, however this accounts for a very small portion of the variance in state party leanings. Each of these is explained in greater detail below.


## Voter Turnout Throughout the Years

Voter turnout can be described as the number of people who vote in any election. How big of a problem is turnout in the United States?

```{r echo=FALSE, fig.align='center', message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
years_label <- ggplot2:::interleave(seq(1964,2018,4), "")

hist_df %>% 
  filter(Status == "Voted" & Age == "Total") %>% 
  mutate(voted_count = (Population.Count * Population.Percent) / 100) %>% 
  ggplot() +
  aes(x = Year, group = Age) +
    geom_point(aes(y = voted_count, color = "Total Vote Count")) +
    geom_line(aes(y = voted_count, color = "Total Vote Count")) +
    geom_point(aes(y = Population.Count, color = "Eligible Population")) +
    geom_line(aes(y = Population.Count, color = "Eligible Population")) +
    labs(
     title = "Comparison Between Eligible Voters and Vote Count",
     y = "Total Population Count (In Thousands)",
     x = "Years",
     color = "Groups"
     ) + 
    scale_x_discrete(labels = years_label) +
    theme(
     plot.title = element_text(hjust = .5), 
     axis.text.x = element_text(angle = 45, hjust = 1)
     ) +
    scale_color_viridis(discrete = TRUE) +
    facet_wrap(~election.type)

hist_df %>% 
  filter(Status == "Voted" & Age == "Total") %>% 
  ggplot() +
    aes(Year, Population.Percent, group = Age) + 
    geom_point(color = "Yellow2") + geom_line(color = "Yellow2") +
    ggtitle("Total Turnout Percentage Over the Years") +
    ylab("Percentage") + 
    xlab("Years") + 
    scale_x_discrete(labels = years_label) +
    scale_color_viridis(discrete = TRUE) +
    theme(plot.title = element_text(hjust = .5)) +
    facet_wrap(~election.type)

gap <- hist_df %>% 
  filter(Status == "Voted" & Age == "Total" & (Year == "1964" | Year == "2018")) %>% 
  mutate(voted = (Population.Count * Population.Percent) / 100) %>% 
  select(Year, Population.Count, voted, Population.Percent) %>% 
  rename(`Eligible Voters (In Thousands)` = Population.Count,
         `Total Population that Voted (In Thousands)` = voted,
         `Percentage of Eligible Population that Voted` = Population.Percent) %>% 
  arrange(-desc(Year))

datatable(gap)
```

Based on the plots above, overall voter turnout has steadily increased over the years, but the gap between the total population who are eligible to vote and turnout are widening. At 1964, out of 110 million people that are eligible to vote, only 77 million voted. In 2018, out of 250 million eligible voters, only 122 million voted. While the total number of votes is increasing over time, percentage wise, voting has dropped by approximately 70% in 1964 to approximately 50% in the latest election year. Looking for the potential problems in decreasing number of voter turnout, we wanted to focus on different demographics, specifically age groups. 

```{r echo=FALSE, fig.align='center', message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
plot_title <- paste("Age Groups that Voted in the U.S. (1964-2018)")
count_label <- paste("Percent Eligible Population Voted")

hist_df %>% 
  filter(Status == "Voted" & Age != "Total") %>% 
  # mutate(test = (Population.Count * Population.Percent) / 100) %>% 
  ggplot() +
  aes(Year, Population.Percent, group = Age, color = Age) + 
  geom_point() + geom_line() + 
  ggtitle(plot_title) +
  ylab(count_label) + 
  xlab("Years") + 
  scale_color_viridis(discrete = TRUE) +
  scale_x_discrete(labels = years_label) +
  theme(plot.title = element_text(hjust = .5)) +
  facet_wrap(~election.type)

hist_df %>%
  filter(Status == "Voted" & Age != 'Total') %>% 
  ggplot() +
  aes(Age, Population.Percent, fill = Age) +
  geom_boxplot() +
  ggtitle("Mean Tunrout Percentages by Age Group") + 
  ylab("Mean Turnout Percentage") +
  xlab("Age Groups") + 
  scale_fill_viridis(discrete = TRUE) +
  theme(plot.title = element_text(hjust = .5))

mean_table <- hist_df %>% 
  filter(Status == "Voted") %>% 
  group_by(Age) %>% 
  summarize(`Mean of Age Groups` = mean(Population.Percent)) %>% 
  rename(`Age Groups` = Age)

datatable(mean_table)
```

From the plots, it seems that young voters have the least turnout compared to other age groups. Specifically, eligible people between 18 and 24 years of age have an average of 32% turnout between 1964 to 2018. Other age groups have higher average turnout percentages. What could cause such a phenomena? To explore factors that might contribute to low turnout in young people, we wanted to create some visuals and statistical tests relating to the group. To explore further with plots and statistics on voting and registration with different age groups, an interactive application is presented below.

## Interactive Plots on Voting and Registration Between Age Groups

```{r Overall Votes and Registration, echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
inputPanel(
  selectInput("plot_change", "Choose a plot:", 
              choices = list("Total Count" = 1, 
                              "Percentage" = 2),
              selected = 1),
  checkboxGroupInput("hist_ages", "Choose an age group:",
                      choices = list("Total" = 1,"18 - 24 yrs" = 2,
                                     "25 - 44 yrs" = 3,"45 - 64 yrs" = 4,
                                     "65+ yrs" = 5),
                      selected = c(1, 2, 3, 4, 5)),
  radioButtons("hist_status", "Choose voting status:",
               choices = list("Voted" = 1, "Registered" = 2),
               selected = 1),
  checkboxInput("split_type", "Split midterm/presidential", FALSE)
)

renderPlot({
  # Dataframe of filtered and selected age groups from user input
  plot_df <- hist_df[hist_df$Age %in% age_groups[as.numeric(input$hist_ages)], ]
  years_label <- ggplot2:::interleave(seq(1964,2018,4), "")
  plot_title <- paste("Age Groups that", vote_status[as.numeric(input$hist_status)], "in the U.S.")
  count_label <- paste("Total", vote_status[as.numeric(input$hist_status)], "Count (In Thousands)")
  select_vote <- as.numeric(input$hist_status)
  # Historical plot of either total count  on votes/registratiob
  if(input$plot_change == 1) {
    outplot <- plot_df %>% 
      filter(Status == vote_status[select_vote]) %>% 
      mutate(status_pop = (Population.Count * Population.Percent) / 100) %>% 
      ggplot() +
        aes(Year, status_pop, group = Age, color = Age) + 
        geom_point() + geom_line() + 
        ggtitle(plot_title) +
        ylab(count_label) + 
        xlab("Years") + 
        scale_x_discrete(labels = years_label) +
        scale_color_viridis(discrete = TRUE) +
        theme(plot.title = element_text(hjust = .5))
  }
  # Historical plot of either votes and registration data
  else {
      plot_df <- plot_df %>% filter(Status == vote_status[select_vote])
      title_perc <- paste("Percentage Between Age Groups that", vote_status[select_vote])
      # Historical plot
      outplot <- ggplot(data = plot_df) +
        aes(Year, Population.Percent, group = Age, color = Age) + 
        geom_point() + geom_line() + 
        ggtitle(title_perc) +
        ylab("Turnout Percentage") + 
        xlab("Years") + 
        scale_x_discrete(labels = years_label) +
        scale_color_viridis(discrete = TRUE) +
        theme(plot.title = element_text(hjust = .5))
  }
  if (input$split_type) {
    outplot + facet_wrap(~election.type)
  }
  else {
    outplot
  }
})

renderDataTable({
  # Dataframe of filtered and selected age groups from user input
  table_df <- hist_df[hist_df$Age %in% age_groups[as.numeric(input$hist_ages)], ]
  select_vote <- as.numeric(input$hist_status)
  # Table of historical data
  table_df %>% 
    filter(Status == vote_status[select_vote]) %>% 
    mutate(status_pop = (Population.Count * Population.Percent) / 100) %>% 
    select(Year, Age, status_pop, Population.Percent) %>%
    rename(`Population Count` = status_pop) %>% 
    rename(`Population Percentage` = Population.Percent)
})
```

## Voter Turnout and Different Race Groups

```{r message=FALSE, warning=FALSE, include=FALSE, fig.width=10, fig.fullwidth=TRUE}
# Dataframe putting hist_df into a long format based on races
race_df <- hist_df %>% filter(Status == "Voted")
race_df <- melt(race_df, id.vars = c("Age", "Year"), 
                measure.vars = c("White.Percent", "White.Non.Hispanic.Percent",
                                 "Black.Percent", "Asian.Percent",
                                 "Hispanic.Other.Percent"))

race_df <- na.omit(race_df)
```

Could race be a factor in low turnout?

```{r Plots for young voters by race, echo=FALSE, fig.align='center', message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_names <- c("White", "White (Non Hispanic)", 
                "Black", "Asian", "Hispanic/Other")
# Total Race Plots 
race_df %>% 
  filter(Age == "Total") %>% 
  ggplot(aes(Year, value, group = variable, color = variable)) +
  geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
  ggtitle("Turnout Percentage and Race") +
  ylab("Turnout Percentage") +
  xlab("Years (1964-2018)") +
  labs(color = "Race/Ethnicity") +
  scale_x_discrete(labels = years_label) +
  scale_color_viridis(discrete = TRUE, labels = race_names) +
  theme(plot.title = element_text(hjust = .5))

# All Ages and Race Plots Facetted
race_df %>% 
  ggplot(aes(Year, value, group = variable, color = variable)) +
  geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
  ggtitle("Percent Turnout of Different Age Groups and Race") +
  ylab("Turnout Percentage") +
  xlab("Years (1964-2018)") +
  labs(color = "Race/Ethnicity") +
  scale_color_viridis(discrete = TRUE, labels = race_names) +
  # theme_minimal() +
  theme(axis.text.x = element_blank(),
        plot.title = element_text(hjust = .5)) +
  facet_wrap(~Age)
```

It seems that Asians and Hispanics/other have the least voter turnout in all age groups compared to other ethnicities. One noteworthy observation is that Asians have the least number of observations over the years, which could be a potential reason on the group's low percentage numbers, and hence, something to account for.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_df %>% 
  filter(Age == "Total") %>% 
  ggplot() +
  aes(variable, value, fill = variable) +
  geom_boxplot() +
  ggtitle("Percent Turnout by Race") +
  xlab("Race/Ethnicity") +
  ylab("Turnout Percentage") +
  labs(fill = "Race/Ethnicity") +
  scale_fill_viridis(discrete = TRUE, labels = race_names) +
  scale_x_discrete(labels = race_names) + 
  theme(plot.title = element_text(hjust = .5)) 

race_df %>% 
  filter(Year == "2018" & Age != "Total") %>% 
  ggplot() +
  aes(variable, value, fill = variable) +
  geom_bar(stat = "identity") +
  ggtitle("Percent Turnout by Age and Race") +
  xlab("Race/Ethnicity") +
  ylab("Turnout Percentage") +
  facet_wrap(~Age) +
  scale_x_discrete(labels = race_names) +
  # theme_minimal() +
  labs(fill = "Race/Ethnicity") +
  scale_fill_viridis(discrete = TRUE, labels = race_names) +
  theme(
    axis.text.x = element_blank(),
    plot.title = element_text(hjust = .5)) 

race_table <- race_df
race_table <- na.omit(race_table)
race_table$variable <- as.character(race_table$variable)
race_table$variable[race_table$variable == "White.Percent"] <- "White"
race_table$variable[race_table$variable == "White.Non.Hispanic.Percent"] <- "White (Non Hispanic)"
race_table$variable[race_table$variable == "Black.Percent"] <- "Black"
race_table$variable[race_table$variable == "Asian.Percent"] <- "Asian"
race_table$variable[race_table$variable == "Hispanic.Other.Percent"] <- "Hispanic or Other"
race_table$variable <- as.factor(race_table$variable)

race_one <- race_table %>% 
  filter(Age == "Total") %>% 
  group_by(variable) %>% 
  summarize(`Mean of Age Groups` = mean(value)) %>% 
  rename(`Race Groups` = variable)

datatable(race_one)
```

When looking at the average total percentage across all race groups, Asians and Hispanics (and other race groups) approximately have a 25% voting average from 1964 to 2018. Compared to other race groups, there mean voting percentages are around 50%. 

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# Young People and Race
race_df %>% 
  filter(Age == '18 to 24') %>% 
  ggplot(aes(Year, value, group = variable, color = variable)) +
  geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
  ggtitle("Young Voters and Race") +
  ylab("Voting Percentage") +
  xlab("Years (1964-2018)") +
  labs(color = "Race/Ethnicity") +
  scale_x_discrete(labels = years_label) +
  scale_color_viridis(discrete = TRUE, labels = race_names) +
  theme(plot.title = element_text(hjust = .5))

# Young Voters and Race Facetted
race_table %>% 
  filter(Age == '18 to 24') %>% 
  ggplot(aes(Year, value, group = Age, color = variable)) +
  geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
  facet_wrap(~variable) +
  ggtitle("Young Voters and Race") +
  ylab("Voting Percentage") +
  xlab("Years (1964-2018)") +
  labs(color = "Race/Ethnicity") +
  scale_color_viridis(discrete = TRUE, labels = race_names) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        plot.title = element_text(hjust = .5)) 
```

Looking specifically at young voters (18 to 24 year olds) and the different race demographics, the same applies; Asians and Hispanics (and other race groups) have a lower voting percentage compared to other race groups.  

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_df %>% 
  filter(Age == "18 to 24") %>% 
  group_by(variable) %>%
  summarize(`Vote Percentage` = mean(value)) %>% 
  ggplot() +
  aes(variable, `Vote Percentage`, fill = variable) +
  geom_bar(stat = "identity") +
  ggtitle("Mean Percent Turnout of Young Voters by Race") +
  labs(fill = "Race/Ethnicity") +
  xlab("Race/Ethnicity") +
  scale_x_discrete(labels = race_names) + 
  scale_fill_viridis(discrete = TRUE, labels = race_names) + 
  theme(plot.title = element_text(hjust = .5)) 

race_df %>% 
  filter(Age == "18 to 24") %>% 
  ggplot() +
  aes(variable, value, fill = variable) +
  geom_boxplot() +
  ggtitle("Percent Turnout of Young Voters by Race") +
  # xlab("Race/Ethnicity") +
  ylab("Vote Percentage") +
  xlab("Race/Ethnicity") +
  labs(fill = "Race/Ethnicity") +
  scale_fill_viridis(discrete = TRUE, labels = race_names) +
  scale_x_discrete(labels = race_names) + 
  theme(plot.title = element_text(hjust = .5)) 

race_two <- race_table %>% 
  filter(Age == "18 to 24") %>% 
  group_by(variable) %>% 
  summarize(`Mean of Age Groups` = mean(value)) %>% 
  rename(`Race Groups` = variable) 

datatable(race_two)
```

Unsurprisingly, based on the young voters and race, the mean voting percentage between races/ethnicities is a lot lower compared to the total percentages on races/ethnicities and total turnout. Asians and Hispanics/other, again, have the lowest turnout with approximately 15% average voting percentage. For some particular reason, Asians and Hispanics/other have lower turnout compared to other race groups and even lower for young voter data. Below are interative plots on voter turnout and races/ethnicities to further explore the topic.

## Interactive Plots on Voter Turnout and Race Groups

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_val_names <- c("White.Percent", "White.Non.Hispanic.Percent",
                    "Black.Percent", "Asian.Percent",
                    "Hispanic.Other.Percent")

inputPanel(
  selectInput("plot_race", "Choose a plot:", 
              choices = list("All Voters" = 1, 
                              "Young Voters" = 2),
              selected = 1),
  checkboxGroupInput("race", "Choose a race group:",
                      choices = list("White" = 1,"White (Non Hispanic)" = 2,
                                     "Black" = 3,"Asian" = 4,
                                     "Hispanic/Other" = 5),
                      selected = c(1, 2, 3, 4, 5)),
  radioButtons("status", "Choose voting status:",
               choices = list("Voted" = 1, "Registered" = 2),
               selected = 1)
  )


gen_race_df <- function() {
  race_num <- as.numeric(input$race)
  status_num <- as.numeric(input$status)
  # Dataframe putting hist_df into a long format based on races
  race_df <- hist_df %>% filter(Status == vote_status[status_num])
  race_df <- melt(race_df, id.vars = c("Age", "Year"), 
                  measure.vars = race_val_names)
  race_df <- na.omit(race_df)
  # Dataframe of filtered and selected age groups from user input
  race_df <- race_df[race_df$variable %in% race_val_names[race_num], ]
  
  if(input$plot_race == 1) {
    race_df <- race_df %>% filter(Age == "Total")
  }
  else {
    race_df <- race_df %>% filter(Age == "18 to 24")
  }
  
  race_df$variable <- as.character(race_df$variable)
  race_df$variable[race_df$variable == "White.Percent"] <- "White"
  race_df$variable[race_df$variable == "White.Non.Hispanic.Percent"] <- "White (Non Hispanic)"
  race_df$variable[race_df$variable == "Black.Percent"] <- "Black"
  race_df$variable[race_df$variable == "Asian.Percent"] <- "Asian"
  race_df$variable[race_df$variable == "Hispanic.Other.Percent"] <- "Hispanic or Other"
  race_df$variable <- as.factor(race_df$variable)
  race_df
}

renderPlot({
  plot_df <- gen_race_df()
  plot_title <- paste("Percent", vote_status[as.numeric(input$status)], "by Race")
  y_title <- paste("Percent of Eligible (", vote_status[as.numeric(input$status)], ")", sep = "")
  plot_df %>% 
    ggplot(aes(Year, value, group = variable, color = variable)) +
    geom_point() + geom_line() +
    ggtitle(plot_title) +
    ylab(y_title) +
    ylim(0, 100) +
    xlab("Years (1964-2018)") +
    labs(color = "Race/Ethnicity") +
    scale_x_discrete(labels = years_label) +
    scale_color_viridis(discrete = TRUE) +
    theme(plot.title = element_text(hjust = .5))
})

renderDataTable({
  table_df <- gen_race_df()
  
  table_df %>% 
    rename(`Race Groups` = variable,
           Percentages = value)
})

```

## Race Group Testing - Multiple Linear Regression

Based on the plots, it seems that overall and specifically young Asians and Hispanics/other vote the least compared to other races. Could specific races, such as Asians or Hispanics/other, have affect voter turnout? Could we make predictions based on race and young voters? We can build and fit a multiple linear regression model, with different races/ethinicities as predictor variables and young voting percentage as the outcome variable, to test if there are no significant predictions of young voters by different race groups.

We conducted a formal F test using ANOVA to compare our specified model we want to fit to another with no independent variables.

- The null hypothesis states that the model with no independent variables fits the data as well as your model

- The alternative hypothesis says that our model fits the data better 

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# Cleaning up race values
lm_race <- race_df %>% filter(Age == "18 to 24")
lm_race$variable <- as.character(lm_race$variable)
lm_race$variable[lm_race$variable == "White.Percent"] <- "White"
lm_race$variable[lm_race$variable == "White.Non.Hispanic.Percent"] <- "White (Non Hispanic)"
lm_race$variable[lm_race$variable == "Black.Percent"] <- "Black"
lm_race$variable[lm_race$variable == "Asian.Percent"] <- "Asian"
lm_race$variable[lm_race$variable == "Hispanic.Other.Percent"] <- "Hispanic or Other"
lm_race$variable <- as.factor(lm_race$variable)
lm_race <- lm_race %>% rename(race = variable) %>% rename(percentage = value)

summary(aov(percentage ~ race, data = lm_race)) %>% pander
```

Since the p-value is significantly less than the specified significance level of .05, we reject the null hypothesis. There is statistical evidence that our model fits the data better. Hence, we could attempt fitting a multiple linear regression model. 

### Fitting the model

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
lm_fit <- lm(percentage ~ race, data = lm_race)
summary(lm_fit) %>% pander
```

Based on the results, since none of the coefficients were equal to 0, the race groups are statistically significant. 

Some interpretations we could make based on the model:

- The approximate voting percentage would increase by 16.162% for young Asians

- The approximate voting percentage would increase by 13.481% for young Blacks

- The approximate voting percentage would increase by .47% for young Hispanics/other

- The approximate voting percentage would increase by 16.796% for young Whites

- The approximate voting percentage would increase by 16% for young Whites that are not Hispanic

Comparing all the races/ethnicities together, it seems that they all have similar voting percentage outcomes, with the exception of Hispanics/other. In short, regardless of one's race, voting ncreases the voting percentage of one's race, which makes sense hypothetically. However, certain race groups, such as Hispanics seem to increase slower compared to other races/ethnicities. Due to the low R squared and adjusted R squared values however, it seems that the current model has a  higher amount of unexplainable variability. The p-value for the coefficient with Hispanics/other is also very high, so interpreting the specified race group is questionable. Regardless, we should evaluate and check the performance of the current model. 

```{r fig.fullwidth=TRUE, fig.width=10, message=FALSE, warning=FALSE, include=FALSE}
step(lm(percentage ~ race, data = lm_race), direction = "both", trace = 0) 
```

To evaluate the model, the stepwise algorithm was used in both directions and it concluded that the current model that included all races/ethnicities as the coefficients is the best to use. Since no other models could be used to compare different mean squared errors with the current model, no evaluation results or other models were shown.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
ggplot(data = lm_fit) +
  aes(lm_fit$fitted.values, lm_fit$residuals) +
  geom_point() + 
  geom_hline(yintercept = 0, color = "red", linetype = "dotted") +
  ggtitle("Residuals vs Fitted Values Plot") +
  ylab("Residuals") +
  xlab("Fitted Values") +
  theme(plot.title = element_text(hjust = .5)) 
```

We also checked the see how the model held based on a plot between residuals versus the fitted values. Unfortunately, there is a noticable pattern; a multiple linear regression model would probably not be the best model to use for prediction with race and young voters since the errors seem to be dependent on each other. In short, the current model is not accurate.

## Young Voter Turnout and Election Results

After taking a look at the factors affecting young voter turnout, we wanted to take a look at how young voter turnout influences presidential elections. By calculating the mean party score across the county, we were able to plot it against the turnout percentage. Party score is on a scale from -1 to 1, where -1 means 100% of the country voted for the Democrat Party, while 1 means 100% of the country voted for the Republican Party. This country mean was calculated from state party scores. Unfortunately, we did not have state-level turnout data broken down by age.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
pres.results <- read_csv("1976-2016-president.csv")

party.results <- pres.results %>% 
      # full_join(select(sena.results, -district)) %>% 
      # full_join(select(hous.results, -district)) %>% 
      filter(writein == FALSE) %>% 
      filter(year != 2018) %>% 
      mutate(party = replace(party, party == "democratic-farmer-labor", "democrat")) %>% 
      filter(party %in% c("democrat", "republican")) %>% 
      select(-candidate) %>% 
      pivot_wider(
          names_from = party,
          values_from = candidatevotes,
          values_fill = list(democrat = 0, republican = 0)
      ) %>% 
      # filter(totalvotes != 1) %>% 
      # party.results$democrat <- unlist(party.results$democrat, use.names = FALSE)
      mutate(p.dem = democrat / totalvotes) %>% 
      mutate(p.rep = republican / totalvotes) %>% 
      mutate(partyscore = p.rep - p.dem)
```


```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
hist.age.turnout <- hist_df %>% 
  filter(Year %in% seq(1964, 2016, 4)) %>% 
  filter(Status == "Voted") %>% 
  filter(Age != "Total") %>% 
  filter(election.type == "Presidential") %>% 
  rename(year = Year)

hist.age.turnout.res <- party.results %>% 
  select(year, state, partyscore) %>% 
  mutate(state = tolower(state)) %>% 
  pivot_wider(names_from = state, values_from = partyscore) %>% 
  mutate(country_mean = rowMeans(select(., alabama:`wyoming`))) %>% 
  filter(year %in% seq(1964, 2016, 4)) %>% 
  select(year, country_mean) %>%
  mutate(year = as.factor(year)) %>% 
  full_join(hist.age.turnout) %>% 
  mutate(year = as.numeric(year)) %>% 
  filter(year >= 1976)
```

Curiously enough, it was found that the years in which Ronald Reagan was elected presented significant outliers across all age groups, and so decided to look at the data both with and without the Reagan years.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
with.reagan.plot <- hist.age.turnout.res %>% 
  # filter(!(year %in% c(1980, 1984))) %>% 
  ggplot() +
    aes(x = Population.Percent, y = country_mean, color = Age) +
    geom_point() +
    geom_smooth(method = "lm") +
    scale_color_viridis(discrete = TRUE) +
    labs(x = "Percent Turnout", y = "Country Mean Party Score", title = "With Reagan") + 
    theme(legend.position = "none") +
    facet_wrap(~Age)

without.reagan.plot <- hist.age.turnout.res %>% 
  filter(!(year %in% c(1980, 1984))) %>%
  ggplot() +
    aes(x = Population.Percent, y = country_mean, color = Age) +
    geom_point() +
    geom_smooth(method = "lm") +
    scale_color_viridis(discrete = TRUE) +
    labs(x = "Percent Turnout", y = "Country Mean Party Score", title = "Without Reagan") + 
    theme(legend.position = "none") +
    facet_wrap(~Age)

grid.arrange(with.reagan.plot, without.reagan.plot, nrow = 1, top = "Country Mean Party Score by Age Group")

# filter(hist.age.turnout.res, Age == "18 to 24")

with.reagan <- lm(country_mean ~ Population.Percent, data = filter(hist.age.turnout.res, Age == "18 to 24"))
par(mfrow=c(2,2))
kable(summary(with.reagan)$coefficients, format = 'markdown')
kable(summary(with.reagan)$r.squared, col.names = "R Squared", format = 'markdown')
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}


# filter(hist.age.turnout.res, Age == "18 to 24")

data.without.reagan <- hist.age.turnout.res %>% 
  filter(!(year %in% c(1980, 1984))) %>%
  filter(Age == "18 to 24")

with.reagan <- lm(country_mean ~ Population.Percent, data = data.without.reagan)
par(mfrow=c(2,2))
plot(with.reagan)
kable(summary(with.reagan)$coefficients, format = 'markdown')
kable(summary(with.reagan)$r.squared, col.names = "R Squared", format = 'markdown')
```

We decided to fit a model to just young people as there are seemingly notable differences between age groups. Evaluating the appropriateness of a linear model we find some issues. Most important is that the data only barely appears to be linear, however this will be ignored as useful information may still be gained. Residuals appear to vary more or less evenly, while the residuals themselves are aproximately normally distributed. Each observation is independent enough for our purposes, however it must be noted that true and perfect independence is nearly impossible with the complex interactions between elections.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
kable(summary(with.reagan)$coefficients, format = 'markdown')
kable(summary(with.reagan)$r.squared, format = 'markdown')
```

Unfortunately, the p-value for each was incredibly large, and so we fail to reject the null hypothesis (that there is no relationship). We can't say more than it's inconclusive. In the future we would like to get state-level data for age group turnout, which may provide a more conclusive answer.

## The Future and Young Voter Turnout

What could happen for young voter turnout? To predict this, we created a time series, a vector of values specifically on a the total population of people 18 to 24 years who voted over the years.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
time_df <- hist_df %>% filter(Status == "Voted" & Age == "18 to 24")
votes_vec <- time_df$Population.Percent
votes_vec <- na.omit(votes_vec)
votes_ts <- ts(votes_vec, frequency = .5, start = c(1964, 1), end = c(2018, 1))
plot.ts(votes_ts, 
        main = "Time Series of Young Voting Percentage",
        ylab = "Voting Percentage")
```

Then to create future predictions for later years, we fitted a Holt-Winters forecasting model.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
fit <- HoltWinters(votes_ts, gamma = FALSE)
plot(fit)
pred <- forecast:::forecast.HoltWinters(fit, h = 5, lower = 0, upper = 100)
plot(pred)
```

Based on the model, it seems that voter turnout for young people is staying roughly constant, but slowly increasing over time. To explore further years, an interactive application is shown below.

## Interactive Forecasting with HoltWinters Model of Young Voters

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
inputPanel(
  numericInput("num", "Enter amount of years to make prediction", 
               value = 10, min = 1, max = 100)
  )

renderPlot({
  pred <- forecast:::forecast.HoltWinters(fit, h = (input$num[1]/2), lower = 0, upper = 100)
  plot(pred)
})
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# keeping this as a template
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# read in age demographics
raw.byage <- read_excel(
  "CPS Turnout Rates.xlsx",
  sheet = "Age",
  range = "A2:R6"
)

# pivot to tidy shape
byage <- raw.byage %>% 
  pivot_longer(cols = 2:length(raw.byage), names_to = "year", values_to = "turnout")
colnames(byage)[1] = "age.group"
# change types and mutate new type column
byage$year <- as.numeric(byage$year)
byage <- byage %>% mutate(type = ifelse(year %in% seq(1986, 2020, 4), "Midterm", "Presidential"))
byage$type <- factor(byage$type, levels = c("Presidential", "Midterm"))
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# read in age demographics
raw.byedu <- read_excel(
  "CPS Turnout Rates.xlsx",
  sheet = "Education",
  range = "A2:R6"
)

# pivot to tidy shape
byedu <- raw.byedu %>% 
  pivot_longer(cols = 2:length(raw.byedu), names_to = "year", values_to = "turnout")
colnames(byedu)[1] = "edu_lvl"
# change types and mutate new type column
byedu$year <- as.numeric(byedu$year)
byedu <- byedu %>% mutate(type = ifelse(year %in% seq(1986, 2020, 4), "Midterm", "Presidential"))
byedu$type <- factor(byedu$type, levels = c("Presidential", "Midterm"))
byedu$edu_lvl <- factor(byedu$edu_lvl, levels = c("Less Than High School", "High School Grad", "Some College to College Grad", "Post-Graduate"))
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
raw.alltime <- read_csv("U.S. VEP Turnout 1789-Present - Statistics.csv")
all.pres <- raw.alltime[, 1:2]
all.midt <- raw.alltime[, 3:4]
colnames(all.pres) <- c("year", "turnout")
colnames(all.midt) <- c("year", "turnout")

all.pres$type <- rep("Presidential", nrow(all.pres))
all.midt$type <- rep("Midterm", nrow(all.midt))

alltime <- full_join(all.pres, all.midt)
alltime$type <- factor(alltime$type, levels = c("Presidential", "Midterm"))
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
gen.turn <- read_excel(
        "1980-2014 November General Election.xlsx",
        skip = 1
    )
    
    gen.turn.2016 = read_excel(
        "2016 November General Election.xlsx",
        skip = 1
    )
    
    gen.turn <- gen.turn %>%
        select(-(2:3), -c(5, 7, 8, 11), -(12:17)) 
    
    colnames(gen.turn) <- c("year", "region", "turnout", "ballots", "VEP")
    gen.turn$region <- sapply(gen.turn$region, tolower)
    gen.turn <- mutate(gen.turn, type = ifelse(year %in% seq(1980, 2020, 4), "Presidential", "Midterm"))
    
    gen.turn.2016 <- gen.turn.2016 %>%
        select(-(2:3), -c(4, 6, 7, 10), -(11:17)) 
    
    colnames(gen.turn.2016) <- c("region", "turnout", "ballots", "VEP")
    gen.turn.2016$region <- sapply(gen.turn.2016$region, tolower)
    gen.turn.2016$year <- rep(2016, nrow(gen.turn.2016))
    gen.turn.2016$type <- rep("Presidential", nrow(gen.turn.2016))
    
    gen.turn <- full_join(gen.turn, gen.turn.2016)
    
    states.turnout <- gen.turn %>% filter(region != "united states")
    total.turnout <- gen.turn %>% filter(region == "united states")
    
party.results.mergeprep <- party.results %>% 
  mutate(state = tolower(state)) %>% 
  filter(year %in% seq(1980, 2016, 4))

states.turnout.mergeprep <- states.turnout %>% 
  filter(year %in% seq(1980, 2016, 4)) %>% 
  rename(state = region)

test2 <- full_join(party.results.mergeprep, states.turnout.mergeprep)

test2 <- test2 %>% 
  group_by(state) %>% 
  summarize(mean_state_score = mean(partyscore)) %>% 
  full_join(test2)

results <- read_csv("1976-2016-president.csv")
    party.results <- results %>% 
        mutate(party = replace(party, party == "democratic-farmer-labor", "democrat")) %>% 
        filter(party %in% c("democrat", "republican")) %>% 
        filter(writein == FALSE) %>% 
        select(-candidate) %>% 
        pivot_wider(
            names_from = party,
            values_from = candidatevotes
        ) %>% 
        mutate(p.dem = democrat / totalvotes) %>% 
        mutate(p.rep = republican / totalvotes) %>% 
        mutate(partyscore = p.rep - p.dem)
```

## Overall Turnout and Party Score

After looking at the young voting population we wanted to look at the broader affects of voter turnout.

### Interactive: Per-State Turnout and Party Score

```{r echo=FALSE, message=FALSE, warning=FALSE}
inputPanel(
  sliderInput("year", "Year", 1980, 2016, 2016, step = 2, animate = TRUE, sep = ""),
  checkboxInput(
      "relscale",
      "Use relative color scale",
      TRUE
  )
)

  splitLayout(
  renderPlot({
          
          # merge <- states.turnout %>% 
          #     filter(year == input$year)
          # 
          # merge <- inner_join(states_map, merge, by = "region")
          
          data("fifty_states")
          
          ret_plot <- states.turnout %>% 
              filter(year == input$year) %>% 
              ggplot(aes(map_id = region)) +
                  geom_map(aes(fill = turnout), map = fifty_states, color = "white") +
                  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
                  coord_map(projection = "albers", lat0=30, lat1=40) +
                  labs(title = paste("Election Turnout Rates by State (", input$year, ")", sep = ""), fill = "Turnout") +
                  theme_void() +
                  theme(legend.position = "bottom", legend.box = "horizontal") +
                  fifty_states_inset_boxes() + 
                  facet_wrap(~type)
          
          if (input$relscale) {
              ret_plot + scale_fill_viridis(label = scales::percent, guide = guide_colorbar(title.position = "top"))
          }
          else {
              ret_plot + scale_fill_viridis(limits = c(0, 1), label = scales::percent)
          }
      }
      # cache = diskCache(),
      # cacheKeyExpr = input$year
     ),
  
  renderPlot({
          party.results %>% 
              filter(year == input$year) %>% 
              ggplot(aes(map_id = tolower(state))) + # map_id must be lowercase for whatever reason
                  geom_map(aes(fill = partyscore), map = fifty_states, color = "lightgrey") +
                  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
                  coord_map(projection = "albers", lat0=30, lat1=40) +
                  labs(title = paste("Presidential Popular Vote Leaning by State (", input$year, ")", sep = ""), fill = "Party Score (-1 = 100% Democrat, 1 = 100% Republican)") +
                  theme_void() +
                  scale_fill_gradient2(
                      low = "blue", 
                      mid = "white", 
                      high = "red", 
                      limits = c(-1, 1),
                      guide = guide_colorbar(title.position = "top")
                  ) +
                  theme(legend.position = "bottom", legend.box = "horizontal") +
                  fifty_states_inset_boxes()
      }
      # cache = diskCache(),
      # cacheKeyExpr = input$year
      )
)
```

We decided to look at overall turnout and its affect on party score in recent years, so we took the party score and turnout for each state from 2000 to 2016. 

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
test2 %>% 
  filter(year >= 2000) %>% 
  ggplot() +
    aes(x = turnout, y = partyscore, color = partyscore) +
    geom_point() +
    scale_color_viridis(limits = c(-1, 1)) +
    labs(x = "Percent Turnout", y = "Party Score", title = "Party Score vs. Percent Turnout, 2000 to 2016") +
    geom_smooth(method = "lm")
    # facet_wrap(~year)

test2 %>% 
  filter(year >= 2000) %>% 
  ggplot() +
    aes(x = turnout, y = partyscore, color = partyscore) +
    geom_point() +
    scale_color_viridis(limits = c(-1, 1)) +
    labs(x = "Percent Turnout", y = "Party Score", title = "Party Score vs. Percent Turnout, by Year") +
    geom_smooth(method = "lm") +
    facet_wrap(~year)
```

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
turn.res.2000 <- test2 %>% 
  filter(year >= 2000)

fit.turn.res <- lm(partyscore ~ turnout, data = turn.res.2000)
par(mfrow=c(2,2))
plot(fit.turn.res)
```

Evaluating the model we find that the data does appear to be at least someone linear, with fairly evenly distributed residual variance, and fairly normal distribution of residuals, as shown above. Each observation is as independent as may be reasonable expected, however it must be noted that the outcomes of each vote is affected by the previous in manners which are difficult to trace. We conclude that a linear regression model is appropriate and of use.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
kable(summary(fit.turn.res)$coefficients, format = 'markdown')
kable(summary(fit.turn.res)$r.squared, col.names = "R Squared", format = 'markdown')
```


The result is a negative association (-0.5654) between percent turnout and party score, which implies that greater percent turnout pushes a state's overall vote toward the Democrat Party. This association is statistically significant, with a p-value of 0.00629 - we must reject the null hypothesis that there is no association. The Adjusted R-squared value is incredibly low at 0.0253, meaning that our model explains only about 2.5% of the variation in the data. This is likely due to the numerous factors beyond simple turnout which affect the party leanings of a given state, however it remains interesting that turnout is as significant as it is.

## Conclusion

As this was more of an exploratory report, a single conclusion cannot be reached. What our explorations have made clear, however, is that voter turnout is incredibly important to the outcomes of votes (as one might expect), and should be better encouraged. With the low turnout of Asian, Hispanic/other, and young populations, it is clear that better representation, awareness, and access need to occur for our democracy to better represent its populations.