forked from tokebe/voter-turnout-248
-
Notifications
You must be signed in to change notification settings - Fork 0
/
voter-turnout-doc.Rmd
994 lines (830 loc) · 44.9 KB
/
voter-turnout-doc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
---
title: "Impact of Voter Age on Election Outcomes"
author: "Hans Quiogue and Jackson Callaghan"
date: "5/3/2020"
output: html_document
runtime: shiny
---
<style>
body .main-container {
max-width: 1000px;
}
</style>
```{r Libraries, message=FALSE, warning=FALSE, include=FALSE}
library(tidyverse)
library(readxl)
library(pander)
library(forecast)
library(DT)
library(reshape2)
library(kableExtra)
library(viridis)
library(shiny)
library(scales)
library(gridExtra)
library(fiftystater)
library(mapproj)
library(wesanderson)
```
```{r Set Theme, message=FALSE, warning=FALSE, include=FALSE}
theme_set(theme_grey())
```
```{r Data Cleaning for Historical Data, message=FALSE, warning=FALSE, include=FALSE}
# Uncleaned historical data
hist_df <- read_xlsx("a1.xlsx", skip = 5, na = "NA")
# Different voting category names to put into columns for the dataframe
voting_col_names <- c("Year", "Population-Count", "Population-Percent",
"Citizen-Percent", "White-Percent", "Citizen-White-Percent",
"White-Non-Hispanic-Percent", "Citizen-White-Non-Hispanic-Perc",
"Black-Percent", "Citizen-Black-Percent", "Asian-Percent",
"Citizen-Asian-Percent", "Hispanic/Other-Percent",
"Citizen-Hispanic/Other-Perc", "Male-Percent", "Female-Percent")
# Assigns proper column names
colnames(hist_df) <- voting_col_names
# Converts all columns to numeric
hist_df[] <- lapply(hist_df, function(x) as.numeric(x))
# Converts year column to a categorical value
hist_df$Year <- as.factor(hist_df$Year)
# 1964 to 2018 (Goes by every 2 year)
years_length <- (2018 - 1964) / 2
# List of proper age groups and voting status to put into dataframe
age_groups <- c("Total", "18 to 24", "25 to 44", "45 to 64", "65+")
vote_status <- c("Voted", "Registered")
temp_df <- data.frame()
# Integer values representing current index values of corresponding list
curr_status <- 1
curr_age <- 1
# Loop to cleans up hist_df (Loop increments are added by 6 to account for missing columns)
for(i in seq(1, nrow(hist_df), years_length + 6)) {
# Appends new rows to temp_df from subset of hist_df
if(i != 1) temp_df <- bind_rows(temp_df, hist_df[(i - 1):((i - 1) + years_length), ])
else temp_df <- bind_rows(temp_df, hist_df[i:(i + years_length), ])
}
# Appends new columns for age groups and voting statuses
temp_df <- data.frame(append(temp_df, c(Age = age_groups[curr_status],
Status = vote_status[curr_age]),
after = 0), stringsAsFactors = FALSE)
temp_index = 1
# Loop that corrects age group and voting status rows
for(i in 1:(length(age_groups) * length(vote_status))) {
temp_df$Age[temp_index:(temp_index + years_length)] <- age_groups[curr_age]
temp_df$Status[temp_index:(temp_index + years_length)] <- vote_status[curr_status]
temp_index <- (temp_index + years_length) + 1
# Updates age group and voting status to match rows
if(curr_status == 1) curr_status = curr_status + 1
else {
curr_status = 1
curr_age = curr_age + 1
}
}
# Removes unnessary rows at the nd
hist_df <- temp_df[1:279, ]
hist_df$Age <- as.factor(hist_df$Age)
hist_df$Status <- as.factor(hist_df$Status)
hist_df <- mutate(hist_df, election.type = ifelse(Year %in% seq(1964, 2020, 4),
"Presidential", "Midterm"))
```
## Project Description
This project will be focused on how voter turnout has affected the outcomes of United States elections using visualizations and some aggregate statistics. Specifically, we wanted to look at how turnout breaks down in terms of age. Low turnout of young voters is a major problem in America, and is typically attributed to a number of factors. In this project, we explored voter turnout data to see how the young vote affects election outcomes. The result is this html-knitted markdown file with embedded Shiny applications detailing our question, methods, and findings alongside interactive exploratory and more question-focused visualizations, and aggregate statistics.
The main R libraries we used for this project were tidyverse, to wrangle and visualize the data, and Shiny, which made all of our plots interactive. Tidyverse includes ggplot2 and ensured that our visualizations were well refined while Shiny enhanced the plots overall for users by making them dynamic. Viridis is package that includes multiple color palletes and we used the library to make our plots consistent with a constant color scheme. Other packages we used to provide more interesting and informative plots and statistical tests were pander, forecast, fiftstarter, DT, fiftystarter, and many others.
The data sources we used were the following:
- Census: https://www.census.gov/topics/public-sector/voting.html
- The United States Elections Project: http://www.electproject.org/home
- The MIT Elections Lab: https://electionlab.mit.edu/data
## Project Summary
We looked at a number of interactions in elections, and arrived at several interesting conclusions, as well as a few non-conclusions. As a result, our project became more of an exploratory report on some concepts in election turnout. Turnout rate to elections is lagging behind population, and so overall turnout rate has been on the decline or stagnant since 1964. Young voters aged 18-24 have the lowest turnout of any age group. Asian and Hispanic/other American voters have significantly lower turnout than other ethnicities, which holds true across different age groups. The turnout of young voters appear to affect the party leanings in votes by states, however we could not find a statistically significant conclusion supporting this. Finally, overall turnout would appear to affect the party leanings in votes by states, however this accounts for a very small portion of the variance in state party leanings. Each of these is explained in greater detail below.
## Voter Turnout Throughout the Years
Voter turnout can be described as the number of people who vote in any election. How big of a problem is turnout in the United States?
```{r echo=FALSE, fig.align='center', message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
years_label <- ggplot2:::interleave(seq(1964,2018,4), "")
hist_df %>%
filter(Status == "Voted" & Age == "Total") %>%
mutate(voted_count = (Population.Count * Population.Percent) / 100) %>%
ggplot() +
aes(x = Year, group = Age) +
geom_point(aes(y = voted_count, color = "Total Vote Count")) +
geom_line(aes(y = voted_count, color = "Total Vote Count")) +
geom_point(aes(y = Population.Count, color = "Eligible Population")) +
geom_line(aes(y = Population.Count, color = "Eligible Population")) +
labs(
title = "Comparison Between Eligible Voters and Vote Count",
y = "Total Population Count (In Thousands)",
x = "Years",
color = "Groups"
) +
scale_x_discrete(labels = years_label) +
theme(
plot.title = element_text(hjust = .5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_color_viridis(discrete = TRUE) +
facet_wrap(~election.type)
hist_df %>%
filter(Status == "Voted" & Age == "Total") %>%
ggplot() +
aes(Year, Population.Percent, group = Age) +
geom_point(color = "Yellow2") + geom_line(color = "Yellow2") +
ggtitle("Total Turnout Percentage Over the Years") +
ylab("Percentage") +
xlab("Years") +
scale_x_discrete(labels = years_label) +
scale_color_viridis(discrete = TRUE) +
theme(plot.title = element_text(hjust = .5)) +
facet_wrap(~election.type)
gap <- hist_df %>%
filter(Status == "Voted" & Age == "Total" & (Year == "1964" | Year == "2018")) %>%
mutate(voted = (Population.Count * Population.Percent) / 100) %>%
select(Year, Population.Count, voted, Population.Percent) %>%
rename(`Eligible Voters (In Thousands)` = Population.Count,
`Total Population that Voted (In Thousands)` = voted,
`Percentage of Eligible Population that Voted` = Population.Percent) %>%
arrange(-desc(Year))
datatable(gap)
```
Based on the plots above, overall voter turnout has steadily increased over the years, but the gap between the total population who are eligible to vote and turnout are widening. At 1964, out of 110 million people that are eligible to vote, only 77 million voted. In 2018, out of 250 million eligible voters, only 122 million voted. While the total number of votes is increasing over time, percentage wise, voting has dropped by approximately 70% in 1964 to approximately 50% in the latest election year. Looking for the potential problems in decreasing number of voter turnout, we wanted to focus on different demographics, specifically age groups.
```{r echo=FALSE, fig.align='center', message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
plot_title <- paste("Age Groups that Voted in the U.S. (1964-2018)")
count_label <- paste("Percent Eligible Population Voted")
hist_df %>%
filter(Status == "Voted" & Age != "Total") %>%
# mutate(test = (Population.Count * Population.Percent) / 100) %>%
ggplot() +
aes(Year, Population.Percent, group = Age, color = Age) +
geom_point() + geom_line() +
ggtitle(plot_title) +
ylab(count_label) +
xlab("Years") +
scale_color_viridis(discrete = TRUE) +
scale_x_discrete(labels = years_label) +
theme(plot.title = element_text(hjust = .5)) +
facet_wrap(~election.type)
hist_df %>%
filter(Status == "Voted" & Age != 'Total') %>%
ggplot() +
aes(Age, Population.Percent, fill = Age) +
geom_boxplot() +
ggtitle("Mean Tunrout Percentages by Age Group") +
ylab("Mean Turnout Percentage") +
xlab("Age Groups") +
scale_fill_viridis(discrete = TRUE) +
theme(plot.title = element_text(hjust = .5))
mean_table <- hist_df %>%
filter(Status == "Voted") %>%
group_by(Age) %>%
summarize(`Mean of Age Groups` = mean(Population.Percent)) %>%
rename(`Age Groups` = Age)
datatable(mean_table)
```
From the plots, it seems that young voters have the least turnout compared to other age groups. Specifically, eligible people between 18 and 24 years of age have an average of 32% turnout between 1964 to 2018. Other age groups have higher average turnout percentages. What could cause such a phenomena? To explore factors that might contribute to low turnout in young people, we wanted to create some visuals and statistical tests relating to the group. To explore further with plots and statistics on voting and registration with different age groups, an interactive application is presented below.
## Interactive Plots on Voting and Registration Between Age Groups
```{r Overall Votes and Registration, echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
inputPanel(
selectInput("plot_change", "Choose a plot:",
choices = list("Total Count" = 1,
"Percentage" = 2),
selected = 1),
checkboxGroupInput("hist_ages", "Choose an age group:",
choices = list("Total" = 1,"18 - 24 yrs" = 2,
"25 - 44 yrs" = 3,"45 - 64 yrs" = 4,
"65+ yrs" = 5),
selected = c(1, 2, 3, 4, 5)),
radioButtons("hist_status", "Choose voting status:",
choices = list("Voted" = 1, "Registered" = 2),
selected = 1),
checkboxInput("split_type", "Split midterm/presidential", FALSE)
)
renderPlot({
# Dataframe of filtered and selected age groups from user input
plot_df <- hist_df[hist_df$Age %in% age_groups[as.numeric(input$hist_ages)], ]
years_label <- ggplot2:::interleave(seq(1964,2018,4), "")
plot_title <- paste("Age Groups that", vote_status[as.numeric(input$hist_status)], "in the U.S.")
count_label <- paste("Total", vote_status[as.numeric(input$hist_status)], "Count (In Thousands)")
select_vote <- as.numeric(input$hist_status)
# Historical plot of either total count on votes/registratiob
if(input$plot_change == 1) {
outplot <- plot_df %>%
filter(Status == vote_status[select_vote]) %>%
mutate(status_pop = (Population.Count * Population.Percent) / 100) %>%
ggplot() +
aes(Year, status_pop, group = Age, color = Age) +
geom_point() + geom_line() +
ggtitle(plot_title) +
ylab(count_label) +
xlab("Years") +
scale_x_discrete(labels = years_label) +
scale_color_viridis(discrete = TRUE) +
theme(plot.title = element_text(hjust = .5))
}
# Historical plot of either votes and registration data
else {
plot_df <- plot_df %>% filter(Status == vote_status[select_vote])
title_perc <- paste("Percentage Between Age Groups that", vote_status[select_vote])
# Historical plot
outplot <- ggplot(data = plot_df) +
aes(Year, Population.Percent, group = Age, color = Age) +
geom_point() + geom_line() +
ggtitle(title_perc) +
ylab("Turnout Percentage") +
xlab("Years") +
scale_x_discrete(labels = years_label) +
scale_color_viridis(discrete = TRUE) +
theme(plot.title = element_text(hjust = .5))
}
if (input$split_type) {
outplot + facet_wrap(~election.type)
}
else {
outplot
}
})
renderDataTable({
# Dataframe of filtered and selected age groups from user input
table_df <- hist_df[hist_df$Age %in% age_groups[as.numeric(input$hist_ages)], ]
select_vote <- as.numeric(input$hist_status)
# Table of historical data
table_df %>%
filter(Status == vote_status[select_vote]) %>%
mutate(status_pop = (Population.Count * Population.Percent) / 100) %>%
select(Year, Age, status_pop, Population.Percent) %>%
rename(`Population Count` = status_pop) %>%
rename(`Population Percentage` = Population.Percent)
})
```
## Voter Turnout and Different Race Groups
```{r message=FALSE, warning=FALSE, include=FALSE, fig.width=10, fig.fullwidth=TRUE}
# Dataframe putting hist_df into a long format based on races
race_df <- hist_df %>% filter(Status == "Voted")
race_df <- melt(race_df, id.vars = c("Age", "Year"),
measure.vars = c("White.Percent", "White.Non.Hispanic.Percent",
"Black.Percent", "Asian.Percent",
"Hispanic.Other.Percent"))
race_df <- na.omit(race_df)
```
Could race be a factor in low turnout?
```{r Plots for young voters by race, echo=FALSE, fig.align='center', message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_names <- c("White", "White (Non Hispanic)",
"Black", "Asian", "Hispanic/Other")
# Total Race Plots
race_df %>%
filter(Age == "Total") %>%
ggplot(aes(Year, value, group = variable, color = variable)) +
geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
ggtitle("Turnout Percentage and Race") +
ylab("Turnout Percentage") +
xlab("Years (1964-2018)") +
labs(color = "Race/Ethnicity") +
scale_x_discrete(labels = years_label) +
scale_color_viridis(discrete = TRUE, labels = race_names) +
theme(plot.title = element_text(hjust = .5))
# All Ages and Race Plots Facetted
race_df %>%
ggplot(aes(Year, value, group = variable, color = variable)) +
geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
ggtitle("Percent Turnout of Different Age Groups and Race") +
ylab("Turnout Percentage") +
xlab("Years (1964-2018)") +
labs(color = "Race/Ethnicity") +
scale_color_viridis(discrete = TRUE, labels = race_names) +
# theme_minimal() +
theme(axis.text.x = element_blank(),
plot.title = element_text(hjust = .5)) +
facet_wrap(~Age)
```
It seems that Asians and Hispanics/other have the least voter turnout in all age groups compared to other ethnicities. One noteworthy observation is that Asians have the least number of observations over the years, which could be a potential reason on the group's low percentage numbers, and hence, something to account for.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_df %>%
filter(Age == "Total") %>%
ggplot() +
aes(variable, value, fill = variable) +
geom_boxplot() +
ggtitle("Percent Turnout by Race") +
xlab("Race/Ethnicity") +
ylab("Turnout Percentage") +
labs(fill = "Race/Ethnicity") +
scale_fill_viridis(discrete = TRUE, labels = race_names) +
scale_x_discrete(labels = race_names) +
theme(plot.title = element_text(hjust = .5))
race_df %>%
filter(Year == "2018" & Age != "Total") %>%
ggplot() +
aes(variable, value, fill = variable) +
geom_bar(stat = "identity") +
ggtitle("Percent Turnout by Age and Race") +
xlab("Race/Ethnicity") +
ylab("Turnout Percentage") +
facet_wrap(~Age) +
scale_x_discrete(labels = race_names) +
# theme_minimal() +
labs(fill = "Race/Ethnicity") +
scale_fill_viridis(discrete = TRUE, labels = race_names) +
theme(
axis.text.x = element_blank(),
plot.title = element_text(hjust = .5))
race_table <- race_df
race_table <- na.omit(race_table)
race_table$variable <- as.character(race_table$variable)
race_table$variable[race_table$variable == "White.Percent"] <- "White"
race_table$variable[race_table$variable == "White.Non.Hispanic.Percent"] <- "White (Non Hispanic)"
race_table$variable[race_table$variable == "Black.Percent"] <- "Black"
race_table$variable[race_table$variable == "Asian.Percent"] <- "Asian"
race_table$variable[race_table$variable == "Hispanic.Other.Percent"] <- "Hispanic or Other"
race_table$variable <- as.factor(race_table$variable)
race_one <- race_table %>%
filter(Age == "Total") %>%
group_by(variable) %>%
summarize(`Mean of Age Groups` = mean(value)) %>%
rename(`Race Groups` = variable)
datatable(race_one)
```
When looking at the average total percentage across all race groups, Asians and Hispanics (and other race groups) approximately have a 25% voting average from 1964 to 2018. Compared to other race groups, there mean voting percentages are around 50%.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# Young People and Race
race_df %>%
filter(Age == '18 to 24') %>%
ggplot(aes(Year, value, group = variable, color = variable)) +
geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
ggtitle("Young Voters and Race") +
ylab("Voting Percentage") +
xlab("Years (1964-2018)") +
labs(color = "Race/Ethnicity") +
scale_x_discrete(labels = years_label) +
scale_color_viridis(discrete = TRUE, labels = race_names) +
theme(plot.title = element_text(hjust = .5))
# Young Voters and Race Facetted
race_table %>%
filter(Age == '18 to 24') %>%
ggplot(aes(Year, value, group = Age, color = variable)) +
geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
facet_wrap(~variable) +
ggtitle("Young Voters and Race") +
ylab("Voting Percentage") +
xlab("Years (1964-2018)") +
labs(color = "Race/Ethnicity") +
scale_color_viridis(discrete = TRUE, labels = race_names) +
theme_minimal() +
theme(axis.text.x = element_blank(),
plot.title = element_text(hjust = .5))
```
Looking specifically at young voters (18 to 24 year olds) and the different race demographics, the same applies; Asians and Hispanics (and other race groups) have a lower voting percentage compared to other race groups.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_df %>%
filter(Age == "18 to 24") %>%
group_by(variable) %>%
summarize(`Vote Percentage` = mean(value)) %>%
ggplot() +
aes(variable, `Vote Percentage`, fill = variable) +
geom_bar(stat = "identity") +
ggtitle("Mean Percent Turnout of Young Voters by Race") +
labs(fill = "Race/Ethnicity") +
xlab("Race/Ethnicity") +
scale_x_discrete(labels = race_names) +
scale_fill_viridis(discrete = TRUE, labels = race_names) +
theme(plot.title = element_text(hjust = .5))
race_df %>%
filter(Age == "18 to 24") %>%
ggplot() +
aes(variable, value, fill = variable) +
geom_boxplot() +
ggtitle("Percent Turnout of Young Voters by Race") +
# xlab("Race/Ethnicity") +
ylab("Vote Percentage") +
xlab("Race/Ethnicity") +
labs(fill = "Race/Ethnicity") +
scale_fill_viridis(discrete = TRUE, labels = race_names) +
scale_x_discrete(labels = race_names) +
theme(plot.title = element_text(hjust = .5))
race_two <- race_table %>%
filter(Age == "18 to 24") %>%
group_by(variable) %>%
summarize(`Mean of Age Groups` = mean(value)) %>%
rename(`Race Groups` = variable)
datatable(race_two)
```
Unsurprisingly, based on the young voters and race, the mean voting percentage between races/ethnicities is a lot lower compared to the total percentages on races/ethnicities and total turnout. Asians and Hispanics/other, again, have the lowest turnout with approximately 15% average voting percentage. For some particular reason, Asians and Hispanics/other have lower turnout compared to other race groups and even lower for young voter data. Below are interative plots on voter turnout and races/ethnicities to further explore the topic.
## Interactive Plots on Voter Turnout and Race Groups
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
race_val_names <- c("White.Percent", "White.Non.Hispanic.Percent",
"Black.Percent", "Asian.Percent",
"Hispanic.Other.Percent")
inputPanel(
selectInput("plot_race", "Choose a plot:",
choices = list("All Voters" = 1,
"Young Voters" = 2),
selected = 1),
checkboxGroupInput("race", "Choose a race group:",
choices = list("White" = 1,"White (Non Hispanic)" = 2,
"Black" = 3,"Asian" = 4,
"Hispanic/Other" = 5),
selected = c(1, 2, 3, 4, 5)),
radioButtons("status", "Choose voting status:",
choices = list("Voted" = 1, "Registered" = 2),
selected = 1)
)
gen_race_df <- function() {
race_num <- as.numeric(input$race)
status_num <- as.numeric(input$status)
# Dataframe putting hist_df into a long format based on races
race_df <- hist_df %>% filter(Status == vote_status[status_num])
race_df <- melt(race_df, id.vars = c("Age", "Year"),
measure.vars = race_val_names)
race_df <- na.omit(race_df)
# Dataframe of filtered and selected age groups from user input
race_df <- race_df[race_df$variable %in% race_val_names[race_num], ]
if(input$plot_race == 1) {
race_df <- race_df %>% filter(Age == "Total")
}
else {
race_df <- race_df %>% filter(Age == "18 to 24")
}
race_df$variable <- as.character(race_df$variable)
race_df$variable[race_df$variable == "White.Percent"] <- "White"
race_df$variable[race_df$variable == "White.Non.Hispanic.Percent"] <- "White (Non Hispanic)"
race_df$variable[race_df$variable == "Black.Percent"] <- "Black"
race_df$variable[race_df$variable == "Asian.Percent"] <- "Asian"
race_df$variable[race_df$variable == "Hispanic.Other.Percent"] <- "Hispanic or Other"
race_df$variable <- as.factor(race_df$variable)
race_df
}
renderPlot({
plot_df <- gen_race_df()
plot_title <- paste("Percent", vote_status[as.numeric(input$status)], "by Race")
y_title <- paste("Percent of Eligible (", vote_status[as.numeric(input$status)], ")", sep = "")
plot_df %>%
ggplot(aes(Year, value, group = variable, color = variable)) +
geom_point() + geom_line() +
ggtitle(plot_title) +
ylab(y_title) +
ylim(0, 100) +
xlab("Years (1964-2018)") +
labs(color = "Race/Ethnicity") +
scale_x_discrete(labels = years_label) +
scale_color_viridis(discrete = TRUE) +
theme(plot.title = element_text(hjust = .5))
})
renderDataTable({
table_df <- gen_race_df()
table_df %>%
rename(`Race Groups` = variable,
Percentages = value)
})
```
## Race Group Testing - Multiple Linear Regression
Based on the plots, it seems that overall and specifically young Asians and Hispanics/other vote the least compared to other races. Could specific races, such as Asians or Hispanics/other, have affect voter turnout? Could we make predictions based on race and young voters? We can build and fit a multiple linear regression model, with different races/ethinicities as predictor variables and young voting percentage as the outcome variable, to test if there are no significant predictions of young voters by different race groups.
We conducted a formal F test using ANOVA to compare our specified model we want to fit to another with no independent variables.
- The null hypothesis states that the model with no independent variables fits the data as well as your model
- The alternative hypothesis says that our model fits the data better
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# Cleaning up race values
lm_race <- race_df %>% filter(Age == "18 to 24")
lm_race$variable <- as.character(lm_race$variable)
lm_race$variable[lm_race$variable == "White.Percent"] <- "White"
lm_race$variable[lm_race$variable == "White.Non.Hispanic.Percent"] <- "White (Non Hispanic)"
lm_race$variable[lm_race$variable == "Black.Percent"] <- "Black"
lm_race$variable[lm_race$variable == "Asian.Percent"] <- "Asian"
lm_race$variable[lm_race$variable == "Hispanic.Other.Percent"] <- "Hispanic or Other"
lm_race$variable <- as.factor(lm_race$variable)
lm_race <- lm_race %>% rename(race = variable) %>% rename(percentage = value)
summary(aov(percentage ~ race, data = lm_race)) %>% pander
```
Since the p-value is significantly less than the specified significance level of .05, we reject the null hypothesis. There is statistical evidence that our model fits the data better. Hence, we could attempt fitting a multiple linear regression model.
### Fitting the model
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
lm_fit <- lm(percentage ~ race, data = lm_race)
summary(lm_fit) %>% pander
```
Based on the results, since none of the coefficients were equal to 0, the race groups are statistically significant.
Some interpretations we could make based on the model:
- The approximate voting percentage would increase by 16.162% for young Asians
- The approximate voting percentage would increase by 13.481% for young Blacks
- The approximate voting percentage would increase by .47% for young Hispanics/other
- The approximate voting percentage would increase by 16.796% for young Whites
- The approximate voting percentage would increase by 16% for young Whites that are not Hispanic
Comparing all the races/ethnicities together, it seems that they all have similar voting percentage outcomes, with the exception of Hispanics/other. In short, regardless of one's race, voting ncreases the voting percentage of one's race, which makes sense hypothetically. However, certain race groups, such as Hispanics seem to increase slower compared to other races/ethnicities. Due to the low R squared and adjusted R squared values however, it seems that the current model has a higher amount of unexplainable variability. The p-value for the coefficient with Hispanics/other is also very high, so interpreting the specified race group is questionable. Regardless, we should evaluate and check the performance of the current model.
```{r fig.fullwidth=TRUE, fig.width=10, message=FALSE, warning=FALSE, include=FALSE}
step(lm(percentage ~ race, data = lm_race), direction = "both", trace = 0)
```
To evaluate the model, the stepwise algorithm was used in both directions and it concluded that the current model that included all races/ethnicities as the coefficients is the best to use. Since no other models could be used to compare different mean squared errors with the current model, no evaluation results or other models were shown.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
ggplot(data = lm_fit) +
aes(lm_fit$fitted.values, lm_fit$residuals) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dotted") +
ggtitle("Residuals vs Fitted Values Plot") +
ylab("Residuals") +
xlab("Fitted Values") +
theme(plot.title = element_text(hjust = .5))
```
We also checked the see how the model held based on a plot between residuals versus the fitted values. Unfortunately, there is a noticable pattern; a multiple linear regression model would probably not be the best model to use for prediction with race and young voters since the errors seem to be dependent on each other. In short, the current model is not accurate.
## Young Voter Turnout and Election Results
After taking a look at the factors affecting young voter turnout, we wanted to take a look at how young voter turnout influences presidential elections. By calculating the mean party score across the county, we were able to plot it against the turnout percentage. Party score is on a scale from -1 to 1, where -1 means 100% of the country voted for the Democrat Party, while 1 means 100% of the country voted for the Republican Party. This country mean was calculated from state party scores. Unfortunately, we did not have state-level turnout data broken down by age.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
pres.results <- read_csv("1976-2016-president.csv")
party.results <- pres.results %>%
# full_join(select(sena.results, -district)) %>%
# full_join(select(hous.results, -district)) %>%
filter(writein == FALSE) %>%
filter(year != 2018) %>%
mutate(party = replace(party, party == "democratic-farmer-labor", "democrat")) %>%
filter(party %in% c("democrat", "republican")) %>%
select(-candidate) %>%
pivot_wider(
names_from = party,
values_from = candidatevotes,
values_fill = list(democrat = 0, republican = 0)
) %>%
# filter(totalvotes != 1) %>%
# party.results$democrat <- unlist(party.results$democrat, use.names = FALSE)
mutate(p.dem = democrat / totalvotes) %>%
mutate(p.rep = republican / totalvotes) %>%
mutate(partyscore = p.rep - p.dem)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
hist.age.turnout <- hist_df %>%
filter(Year %in% seq(1964, 2016, 4)) %>%
filter(Status == "Voted") %>%
filter(Age != "Total") %>%
filter(election.type == "Presidential") %>%
rename(year = Year)
hist.age.turnout.res <- party.results %>%
select(year, state, partyscore) %>%
mutate(state = tolower(state)) %>%
pivot_wider(names_from = state, values_from = partyscore) %>%
mutate(country_mean = rowMeans(select(., alabama:`wyoming`))) %>%
filter(year %in% seq(1964, 2016, 4)) %>%
select(year, country_mean) %>%
mutate(year = as.factor(year)) %>%
full_join(hist.age.turnout) %>%
mutate(year = as.numeric(year)) %>%
filter(year >= 1976)
```
Curiously enough, it was found that the years in which Ronald Reagan was elected presented significant outliers across all age groups, and so decided to look at the data both with and without the Reagan years.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
with.reagan.plot <- hist.age.turnout.res %>%
# filter(!(year %in% c(1980, 1984))) %>%
ggplot() +
aes(x = Population.Percent, y = country_mean, color = Age) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_viridis(discrete = TRUE) +
labs(x = "Percent Turnout", y = "Country Mean Party Score", title = "With Reagan") +
theme(legend.position = "none") +
facet_wrap(~Age)
without.reagan.plot <- hist.age.turnout.res %>%
filter(!(year %in% c(1980, 1984))) %>%
ggplot() +
aes(x = Population.Percent, y = country_mean, color = Age) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_viridis(discrete = TRUE) +
labs(x = "Percent Turnout", y = "Country Mean Party Score", title = "Without Reagan") +
theme(legend.position = "none") +
facet_wrap(~Age)
grid.arrange(with.reagan.plot, without.reagan.plot, nrow = 1, top = "Country Mean Party Score by Age Group")
# filter(hist.age.turnout.res, Age == "18 to 24")
with.reagan <- lm(country_mean ~ Population.Percent, data = filter(hist.age.turnout.res, Age == "18 to 24"))
par(mfrow=c(2,2))
kable(summary(with.reagan)$coefficients, format = 'markdown')
kable(summary(with.reagan)$r.squared, col.names = "R Squared", format = 'markdown')
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# filter(hist.age.turnout.res, Age == "18 to 24")
data.without.reagan <- hist.age.turnout.res %>%
filter(!(year %in% c(1980, 1984))) %>%
filter(Age == "18 to 24")
with.reagan <- lm(country_mean ~ Population.Percent, data = data.without.reagan)
par(mfrow=c(2,2))
plot(with.reagan)
kable(summary(with.reagan)$coefficients, format = 'markdown')
kable(summary(with.reagan)$r.squared, col.names = "R Squared", format = 'markdown')
```
We decided to fit a model to just young people as there are seemingly notable differences between age groups. Evaluating the appropriateness of a linear model we find some issues. Most important is that the data only barely appears to be linear, however this will be ignored as useful information may still be gained. Residuals appear to vary more or less evenly, while the residuals themselves are aproximately normally distributed. Each observation is independent enough for our purposes, however it must be noted that true and perfect independence is nearly impossible with the complex interactions between elections.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
kable(summary(with.reagan)$coefficients, format = 'markdown')
kable(summary(with.reagan)$r.squared, format = 'markdown')
```
Unfortunately, the p-value for each was incredibly large, and so we fail to reject the null hypothesis (that there is no relationship). We can't say more than it's inconclusive. In the future we would like to get state-level data for age group turnout, which may provide a more conclusive answer.
## The Future and Young Voter Turnout
What could happen for young voter turnout? To predict this, we created a time series, a vector of values specifically on a the total population of people 18 to 24 years who voted over the years.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
time_df <- hist_df %>% filter(Status == "Voted" & Age == "18 to 24")
votes_vec <- time_df$Population.Percent
votes_vec <- na.omit(votes_vec)
votes_ts <- ts(votes_vec, frequency = .5, start = c(1964, 1), end = c(2018, 1))
plot.ts(votes_ts,
main = "Time Series of Young Voting Percentage",
ylab = "Voting Percentage")
```
Then to create future predictions for later years, we fitted a Holt-Winters forecasting model.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
fit <- HoltWinters(votes_ts, gamma = FALSE)
plot(fit)
pred <- forecast:::forecast.HoltWinters(fit, h = 5, lower = 0, upper = 100)
plot(pred)
```
Based on the model, it seems that voter turnout for young people is staying roughly constant, but slowly increasing over time. To explore further years, an interactive application is shown below.
## Interactive Forecasting with HoltWinters Model of Young Voters
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
inputPanel(
numericInput("num", "Enter amount of years to make prediction",
value = 10, min = 1, max = 100)
)
renderPlot({
pred <- forecast:::forecast.HoltWinters(fit, h = (input$num[1]/2), lower = 0, upper = 100)
plot(pred)
})
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# keeping this as a template
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# read in age demographics
raw.byage <- read_excel(
"CPS Turnout Rates.xlsx",
sheet = "Age",
range = "A2:R6"
)
# pivot to tidy shape
byage <- raw.byage %>%
pivot_longer(cols = 2:length(raw.byage), names_to = "year", values_to = "turnout")
colnames(byage)[1] = "age.group"
# change types and mutate new type column
byage$year <- as.numeric(byage$year)
byage <- byage %>% mutate(type = ifelse(year %in% seq(1986, 2020, 4), "Midterm", "Presidential"))
byage$type <- factor(byage$type, levels = c("Presidential", "Midterm"))
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
# read in age demographics
raw.byedu <- read_excel(
"CPS Turnout Rates.xlsx",
sheet = "Education",
range = "A2:R6"
)
# pivot to tidy shape
byedu <- raw.byedu %>%
pivot_longer(cols = 2:length(raw.byedu), names_to = "year", values_to = "turnout")
colnames(byedu)[1] = "edu_lvl"
# change types and mutate new type column
byedu$year <- as.numeric(byedu$year)
byedu <- byedu %>% mutate(type = ifelse(year %in% seq(1986, 2020, 4), "Midterm", "Presidential"))
byedu$type <- factor(byedu$type, levels = c("Presidential", "Midterm"))
byedu$edu_lvl <- factor(byedu$edu_lvl, levels = c("Less Than High School", "High School Grad", "Some College to College Grad", "Post-Graduate"))
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
raw.alltime <- read_csv("U.S. VEP Turnout 1789-Present - Statistics.csv")
all.pres <- raw.alltime[, 1:2]
all.midt <- raw.alltime[, 3:4]
colnames(all.pres) <- c("year", "turnout")
colnames(all.midt) <- c("year", "turnout")
all.pres$type <- rep("Presidential", nrow(all.pres))
all.midt$type <- rep("Midterm", nrow(all.midt))
alltime <- full_join(all.pres, all.midt)
alltime$type <- factor(alltime$type, levels = c("Presidential", "Midterm"))
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
gen.turn <- read_excel(
"1980-2014 November General Election.xlsx",
skip = 1
)
gen.turn.2016 = read_excel(
"2016 November General Election.xlsx",
skip = 1
)
gen.turn <- gen.turn %>%
select(-(2:3), -c(5, 7, 8, 11), -(12:17))
colnames(gen.turn) <- c("year", "region", "turnout", "ballots", "VEP")
gen.turn$region <- sapply(gen.turn$region, tolower)
gen.turn <- mutate(gen.turn, type = ifelse(year %in% seq(1980, 2020, 4), "Presidential", "Midterm"))
gen.turn.2016 <- gen.turn.2016 %>%
select(-(2:3), -c(4, 6, 7, 10), -(11:17))
colnames(gen.turn.2016) <- c("region", "turnout", "ballots", "VEP")
gen.turn.2016$region <- sapply(gen.turn.2016$region, tolower)
gen.turn.2016$year <- rep(2016, nrow(gen.turn.2016))
gen.turn.2016$type <- rep("Presidential", nrow(gen.turn.2016))
gen.turn <- full_join(gen.turn, gen.turn.2016)
states.turnout <- gen.turn %>% filter(region != "united states")
total.turnout <- gen.turn %>% filter(region == "united states")
party.results.mergeprep <- party.results %>%
mutate(state = tolower(state)) %>%
filter(year %in% seq(1980, 2016, 4))
states.turnout.mergeprep <- states.turnout %>%
filter(year %in% seq(1980, 2016, 4)) %>%
rename(state = region)
test2 <- full_join(party.results.mergeprep, states.turnout.mergeprep)
test2 <- test2 %>%
group_by(state) %>%
summarize(mean_state_score = mean(partyscore)) %>%
full_join(test2)
results <- read_csv("1976-2016-president.csv")
party.results <- results %>%
mutate(party = replace(party, party == "democratic-farmer-labor", "democrat")) %>%
filter(party %in% c("democrat", "republican")) %>%
filter(writein == FALSE) %>%
select(-candidate) %>%
pivot_wider(
names_from = party,
values_from = candidatevotes
) %>%
mutate(p.dem = democrat / totalvotes) %>%
mutate(p.rep = republican / totalvotes) %>%
mutate(partyscore = p.rep - p.dem)
```
## Overall Turnout and Party Score
After looking at the young voting population we wanted to look at the broader affects of voter turnout.
### Interactive: Per-State Turnout and Party Score
```{r echo=FALSE, message=FALSE, warning=FALSE}
inputPanel(
sliderInput("year", "Year", 1980, 2016, 2016, step = 2, animate = TRUE, sep = ""),
checkboxInput(
"relscale",
"Use relative color scale",
TRUE
)
)
splitLayout(
renderPlot({
# merge <- states.turnout %>%
# filter(year == input$year)
#
# merge <- inner_join(states_map, merge, by = "region")
data("fifty_states")
ret_plot <- states.turnout %>%
filter(year == input$year) %>%
ggplot(aes(map_id = region)) +
geom_map(aes(fill = turnout), map = fifty_states, color = "white") +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "albers", lat0=30, lat1=40) +
labs(title = paste("Election Turnout Rates by State (", input$year, ")", sep = ""), fill = "Turnout") +
theme_void() +
theme(legend.position = "bottom", legend.box = "horizontal") +
fifty_states_inset_boxes() +
facet_wrap(~type)
if (input$relscale) {
ret_plot + scale_fill_viridis(label = scales::percent, guide = guide_colorbar(title.position = "top"))
}
else {
ret_plot + scale_fill_viridis(limits = c(0, 1), label = scales::percent)
}
}
# cache = diskCache(),
# cacheKeyExpr = input$year
),
renderPlot({
party.results %>%
filter(year == input$year) %>%
ggplot(aes(map_id = tolower(state))) + # map_id must be lowercase for whatever reason
geom_map(aes(fill = partyscore), map = fifty_states, color = "lightgrey") +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "albers", lat0=30, lat1=40) +
labs(title = paste("Presidential Popular Vote Leaning by State (", input$year, ")", sep = ""), fill = "Party Score (-1 = 100% Democrat, 1 = 100% Republican)") +
theme_void() +
scale_fill_gradient2(
low = "blue",
mid = "white",
high = "red",
limits = c(-1, 1),
guide = guide_colorbar(title.position = "top")
) +
theme(legend.position = "bottom", legend.box = "horizontal") +
fifty_states_inset_boxes()
}
# cache = diskCache(),
# cacheKeyExpr = input$year
)
)
```
We decided to look at overall turnout and its affect on party score in recent years, so we took the party score and turnout for each state from 2000 to 2016.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
test2 %>%
filter(year >= 2000) %>%
ggplot() +
aes(x = turnout, y = partyscore, color = partyscore) +
geom_point() +
scale_color_viridis(limits = c(-1, 1)) +
labs(x = "Percent Turnout", y = "Party Score", title = "Party Score vs. Percent Turnout, 2000 to 2016") +
geom_smooth(method = "lm")
# facet_wrap(~year)
test2 %>%
filter(year >= 2000) %>%
ggplot() +
aes(x = turnout, y = partyscore, color = partyscore) +
geom_point() +
scale_color_viridis(limits = c(-1, 1)) +
labs(x = "Percent Turnout", y = "Party Score", title = "Party Score vs. Percent Turnout, by Year") +
geom_smooth(method = "lm") +
facet_wrap(~year)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
turn.res.2000 <- test2 %>%
filter(year >= 2000)
fit.turn.res <- lm(partyscore ~ turnout, data = turn.res.2000)
par(mfrow=c(2,2))
plot(fit.turn.res)
```
Evaluating the model we find that the data does appear to be at least someone linear, with fairly evenly distributed residual variance, and fairly normal distribution of residuals, as shown above. Each observation is as independent as may be reasonable expected, however it must be noted that the outcomes of each vote is affected by the previous in manners which are difficult to trace. We conclude that a linear regression model is appropriate and of use.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.fullwidth=TRUE}
kable(summary(fit.turn.res)$coefficients, format = 'markdown')
kable(summary(fit.turn.res)$r.squared, col.names = "R Squared", format = 'markdown')
```
The result is a negative association (-0.5654) between percent turnout and party score, which implies that greater percent turnout pushes a state's overall vote toward the Democrat Party. This association is statistically significant, with a p-value of 0.00629 - we must reject the null hypothesis that there is no association. The Adjusted R-squared value is incredibly low at 0.0253, meaning that our model explains only about 2.5% of the variation in the data. This is likely due to the numerous factors beyond simple turnout which affect the party leanings of a given state, however it remains interesting that turnout is as significant as it is.
## Conclusion
As this was more of an exploratory report, a single conclusion cannot be reached. What our explorations have made clear, however, is that voter turnout is incredibly important to the outcomes of votes (as one might expect), and should be better encouraged. With the low turnout of Asian, Hispanic/other, and young populations, it is clear that better representation, awareness, and access need to occur for our democracy to better represent its populations.