generated from jtr13/EDAVtemplate
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathdata.Rmd
145 lines (114 loc) · 12.9 KB
/
data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
# Data
## Sources
Data used in this project was originally collected by a team based in California State University San Marcos, and led by [Dr. Dustin Calvillo](https://scholar.google.com/citations?user=PxQ8_zEAAAAJ&hl=en&oi=ao). A visit to his Google Scholar profile indicates that his recent research has focused on employing quantitative methods from cognitive psychology to the study of contemporary communication phenomena, especially mis- and disinformation.
This specific data set – [the fake news and confirmation bias](https://osf.io/fqncj) – is part of the [Confirmation bias in fact checking political claims research project](https://osf.io/e6jy4/). There is a similar dataset in the same project, which is apparently a previous version of the one used in this project (suffixes x1 and x2, respectively. Although both datasets share the same variables, a few are present in one and not in the other. For example, “disconfirm_bias” is present in x1, but not in x2; the opposite is true for “ideology” and “condition” (referring to “fact-checking conditions”). More clarification was requested via email to the principal investigator, but we are assuming that x2 is a more refined version of x1, so we opted to use the former for our analysis.
The dataset has 247 observations of 53 variables, all of which are “double.” However, the data dictionary indicates that many of those represent other types of data and have possibly been transformed by the authors before publishing the data set. For example, the variable ```pol_party``` originally has values “Democrat” and “Repubican,” which are now represented by 1 and 2, respectively; ```gender``` originally accepted values “female,” “male,” “other,” and “decline to state,” and now are coded from 1 to 4, in that order. Others are originally integers, for example, ```t_concordant```, which represents “total number f fact-checks for true politically concordant headlines.” The same is true for the Likert-type scale variable ```ideology```, whose rating ranges from 1 to 7, “from extremely liberal to extremely conservative.” A third class that was present in the original dataset was “time,” the case, for example, of ```T_dem_rt``` (“mean reading time for true pro-liberal headlines”). Lastly, there are 2 variables that gorup other variables: ```CRT```` is the “total number of CRT items answered correctly” (0-6); and ```AOT```, which is the mean of items in the “active open-minded thinking” (1-5).
One possible challenge that may arise as we start the exploratory data analysis is the transformation of time values form doubles to time and date format. In terms of missing values, although the data set seems to have been pre-processed in order to exclude NAs, there are 4 observations that contain missing values (1 NA each).
```{r}
library(haven)
library(dplyr)
data <- read_sav("fake_news_confirmation_bias_x2.sav")
#dim(data)
#str(data)
library(magrittr)
# (classes_data <- data.frame(variable = names(data),
# class = sapply(data, typeof)))
# table(classes_data$variable)
dataset = data
#extracat::visna(data)
```
## Cleaning / transformation
```{r, echo=TRUE}
#Add participant id
dataset$participant_id=1:nrow(dataset)
#Define a factor converter to convert numeric columns to factors
factor_converter = function(data,columns){
for (x in columns){
data[,x]=as.factor(data[,x])
data[,x]=addNA(data[,x])
}
return(data)
}
```
First, the data set is loaded in using the package Haven's read_sav() function, as the data is stored as a .SAV file. This type of file is outputted by the application SPSS. It is then converted to a data frame for easier use. The authors of this study clearly spent time preparing their data for publishing and little transformation was needed. Upon examination, there are many categorical variables which have already been modified to be numeric, ordinal variables. However, they are still numeric and plotting in ggplot2 often requires ordinal variables to be in factor form. As such, many columns will need to be converted to factors. This is done with a custom function factor_converter() which takes columns and a data set as a inputs and creates factor level variables, including an NA factor. One option to explore is modifying some of these ordinal variables to be binary, as using ordinal scale on data that shouldn't be ranked can obscure true relationships. Given this data relates to individual participants and their responses in the study, a new variable participant_id has been added to the data set to uniquely identify each participant. This allowed us to track whether any anomalies were related to specific individuals. For now, this allows us to move straight into exploratory data analysis.
## Missing value analysis
The authors mention that missing values have been removed prior to the data being made publicly available. However, there do seem to be a few missing observations present in the data. First we present the four variables that are missing in the data set and the overall distribution of scores for participants on these questions.
```{r}
library(gridExtra)
library(ggplot2)
#Find columns that have N/A values, select them from the data set
na_cols=unlist(lapply(dataset, function(x) any(is.na(x))))
na_cols=names(na_cols[na_cols == TRUE])
na_data=as.data.frame(select(dataset,na_cols))
na_data=factor_converter(na_data,na_cols)
#Plot histograms of columns with missing values
aot01_chart=ggplot(na_data, aes(aot01)) +
geom_histogram(stat="count") +
theme_minimal() +
xlab("Open Mindedness Question 1")
aot04_chart=ggplot(na_data, aes( aot04)) +
geom_histogram(stat="count") +
theme_minimal() +
xlab("Open Mindedness Question 4")
aot09_chart=ggplot(na_data, aes( aot09)) +
geom_histogram(stat="count") +
theme_minimal() +
xlab("Open Mindedness Question 9")
fc_att3_chart=ggplot(na_data, aes(fc_att3)) +
geom_histogram(stat="count") +
theme_minimal() +
xlab("Attitude towards Fact Checkers Question 3")
grid.arrange(aot01_chart,aot04_chart,aot09_chart,fc_att3_chart)
```
Overall, open minded thinking questions one and four received high scores among the majority of participants, while scores were slightly more widespread for question nine. The overall score for participant attitude toward fact checkers based on question three is also high. We now show below only the participants and columns that contained missing values to see if there were any similarities:
Variables:
aot: mean rating for 11 actively open-minding thinking questions (possible range of 1-5)\
aaf: mean rating for three attitude toward fact-checkers questions (possible range of 1-7)\
fc_att3: response to the third item on the attitudes toward fact-checkers scale (possible range of 1-7)\
aot01: response to the first item on the actively open-minded thinking scale (1-5)\
aot04: response to the fourth item on the actively open-minded thinking scale (1-5)\
aot09: response to the nineth item on the actively open-minded thinking scale (1-5)\
```{r}
#Show only rows that contain a NA value
dataset = data
dataset$participant_id=1:nrow(dataset)
print(as.data.frame(select(dataset[!complete.cases(dataset),],c(participant_id,na_cols))),row.names = FALSE)
```
All four participants had a different column item missing, meaning they all had one blank response to a different question. This leads us naturally to investigate further whether these participants are similar across some other dimension.
```{r}
#Show blank rows but also include other variables that are relevant, change labels for easy reading and interpretation
missing_obs=select(dataset[!complete.cases(dataset),],c(participant_id,na_cols,'gender', 'pol_party', 'ideology', 'condition','age','AOT','fact_check_att'))
colnames(missing_obs)=c("id",na_cols,'gender', 'pol_party', 'ideology', 'condition','age', 'aot','aaf')
missing_obs$gender=gsub(1,"female",missing_obs$gender)
missing_obs$gender=gsub(2,"male",missing_obs$gender)
missing_obs$condition=gsub(1,"easy",missing_obs$condition)
missing_obs$condition=gsub(2,"difficult",missing_obs$condition)
missing_obs$pol_party=gsub(1,"democrat",missing_obs$pol_party)
missing_obs$pol_party=gsub(2,"republican",missing_obs$pol_party)
missing_obs$aot=round(missing_obs$aot,2)
missing_obs$aaf=round(missing_obs$aaf,2)
print(as.data.frame(missing_obs),row.names = FALSE)
```
There seems to be little in common among the four participants at first glance. However, all four score roughly a 3.5 on the AOT mean rating for 11 actively open minded thinking questions. Considering the scale goes up to only five, this means these participants were on the higher end in terms of their willingness to explore alternate ways of thinking. Considering aot01, aot04, and aot09 are all questions intended to judge open minded thinking, it's possible that these participants believed the question was neutral, that it was outside of their realm of knowledge, or that they just couldn't think of a black and white answer so they left it blank. But, it must be noted that all three participants missing an open minded thinking question are republican and two have the highest ideology rating, meaning they are extremely conservative (participants 197 and 215). Perhaps a question was asked that disagreed with them morally, such as abortion or the death penalty. This may have led them to give no response to this question. This theory can better be confirmed by the fact that participant 215 was not very open minded in regards to the ninth question, aot09, and only scored a 2 out of 5. Participant 197 chose not to answer this question at all. This is further backed up by the fact that responses to question nine are more widespread than the other questions, it could easily be a polarizing question between parties. Both also scored relatively low on questions intended to gauge the participants attitudes towards fact checkers, 2 and 3 out of 7 respectively. Maybe they felt uncomfortable sharing their views with the the proctor. Another idea could be that these three participants simply had a hard time answering these questions because the fact checking conditions were difficult. This may have made them more guarded in terms of their willingness to answer these type of questions.
```{r}
library(forcats)
#Data transformation
c1=data.frame(variable=as.factor(c("Q1","Q2","Q3","avg")),Count=c(results$fc_att1[1],results$fc_att2[1],results$fc_att3[1],results$fact_check_att[1]))
c1$variable=fct_relevel(c1$variable,c("Q1","Q2","Q3","avg"))
c2=data.frame(variable=as.factor(c("Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8","Q9","Q10","Q11","avg")),Count=c(results$aot01[2],results$aot02[2],results$aot03[2],results$aot04[2],results$aot05[2],results$aot06[2],results$aot07[2],results$aot08[2],results$aot09[2],results$aot10[2],results$aot11[2],results$AOT[2]))
c2$variable=fct_relevel(c2$variable,c("Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8","Q9","Q10","Q11","avg"))
c3=data.frame(variable=as.factor(c("Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8","Q9","Q10","Q11","avg")),Count=c(results$aot01[3],results$aot02[3],results$aot03[3],results$aot04[3],results$aot05[3],results$aot06[3],results$aot07[3],results$aot08[3],results$aot09[3],results$aot10[3],results$aot11[3],results$AOT[3]))
c3$variable=fct_relevel(c3$variable,c("Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8","Q9","Q10","Q11","avg"))
c4=data.frame(variable=as.factor(c("Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8","Q9","Q10","Q11","avg")),Count=c(results$aot01[4],results$aot02[4],results$aot03[4],results$aot04[4],results$aot05[4],results$aot06[4],results$aot07[4],results$aot08[4],results$aot09[4],results$aot10[4],results$aot11[4],results$AOT[4]))
c4$variable=fct_relevel(c4$variable,c("Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8","Q9","Q10","Q11","avg"))
#Plotting
c11=ggplot(data=c1, aes(x=variable, y=Count)) + geom_bar(stat="identity") + xlab("Participant 1") + ylab("Score (1-5)")+theme_minimal() +ggtitle("Attitude Towards Fact Checkers")
c22=ggplot(data=c2, aes(x=variable, y=Count)) + geom_bar(stat="identity") + xlab("Participant 197") + ylab("Score (1-5)")+theme_minimal() +ylim(0, 5)
c33=ggplot(data=c3, aes(x=variable, y=Count)) + geom_bar(stat="identity") + xlab("Participant 213") + theme_minimal() +ylab("Score (1-5)")
c44=ggplot(data=c4, aes(x=variable, y=Count)) + geom_bar(stat="identity") + xlab("Participant 215") + theme_minimal() +ylab("Score (1-5)")
grid.arrange(c22,c33,c44,top=textGrob("Missing Open Minded Thinking Question"))
```
As for participant 1, she is a democrat who is extremely liberal with an ideology score of 2, with one being extremely liberal. fc_att3 was one of the questions given to participants to gauge their attitudes toward fact checkers. Overall, participant 1 scored relatively high in terms of her attitude toward fact checkers, with an aaf of 4.5/7. So it seems that overall she felt positively about fact checkers, but maybe she just didn't feel that question was appropriate. All four are still valid participants to include in the study considering all So it makes sense the author is left on itother questions were answered.
```{r}
c11
```