```{r}
#| label: setup-anova
#| include: false
library(mosaic)
library(car)
```
# Other Model Selection Approaches
The way we have been doing model selection thus far is *definitely* not
the only way. What other options are out there? Many - let's consider a
few.
This section also includes some miscellaneous R notes (making summary
tables, and sources of inspiration for cool figures).
### Rationale
Until now, we have focused on using information criteria for model
selection, in order to get very familiar with one coherent framework for
choosing variables across model types. But:
- In some fields, using hypothesis tests for variable selection is
preferred
- For datasets that are large and/or models that are complex,
`dredge()` can be a challenge (taking a very long time to run and
perhaps timing out on the server)
- Using hypothesis tests for selection is quite common, so we should
know how it's done!
### Hypotheses
Basically, for each (fixed effect) variable in a model, we'd like to
test:
$$H_0: \text{all } \beta\text{s for this variable are 0; it's not a good predictor}$$
$$H_1: \text{ at least one } \beta\text{ is non-zero; it's a good predictor}$$
We want to test these hypotheses *given that all the other predictors in
the current full model are included*. Because of this condition, and
because there are *multiple* $\beta$s for categorical predictors with
more than 2 categories, we can **not** generally just use the p-values
from the model `summary()` output.
Instead, we use `Anova()` from the `car` package. Here's an `lm()` example:
```{r, message = FALSE}
iris_mod <- lm(Petal.Length ~ Petal.Width + Species + Sepal.Length, data = iris)
summary(iris_mod)
library(car)
Anova(iris_mod)
```
Notice that `Anova()` reports *one* p-value for each predictor
(excellent!). If the p-value is small, that gives evidence against
$H_0$, and we'd conclude we should keep the predictor in the model. Many
people use $\alpha = 0.05$ as the "dividing line" between "small" and
"large" p-values and thus "statistically significant" and
"non-significant" test results, but remember the p-value is a
probability - there's no magical difference between 0.049 and 0.051.
*Warning: be careful with your capitalization! The R function `anova()`
does something similar to `Anova()` but is* **NOT** *the same and should
be avoided here -- it does sequential rather than marginal tests.*
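To see the difference on the iris model fit above, compare the sequential
tests from `anova()` with the marginal tests from `Anova()` shown earlier;
the `anova()` p-values depend on the order the predictors appear in the
formula.
```{r}
# Sequential (Type I) tests: each term is tested only after the terms
# listed before it in the formula, so reordering the predictors changes
# these p-values. Compare with the Anova() output above.
anova(iris_mod)
```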
## Backward selection
How do we use p-value-based selection to arrive at a best model? There
are many options and much controversy about different approaches; here
I'll suggest one. None of these methods are guaranteed to arrive at a
model that is theoretically "best" in some specific way, but they do
give a framework to guide decision-making and are computationally quick.
The premise is that we'd like a simple algorithm to implement, and we
will begin with a full model including all the predictors that we think
*should* or *could* reasonably be important (not just throwing in
everything possible).
### Algorithm
- Obtain p-values for all predictors in full model
- Remove the predictor with the largest p-value that you judge to be
"not small" or "not significant"
- Re-compute p-values for the new, smaller model
- Repeat until all p-values are "significant"
### Example
Let's consider a logistic regression to predict whether a person in
substance-abuse treatment is homeless.
```{r}
home_mod0 <- glm(homeless ~ sex + substance + i1 + cesd + racegrp + age,
                 data = HELPrct, family = binomial(link = 'logit'))
Anova(home_mod0)
```
Removing `age`:
```{r}
home_mod <- update(home_mod0, .~. - age)
Anova(home_mod)
```
Removing `racegrp`:
```{r}
home_mod <- update(home_mod, .~. - racegrp)
Anova(home_mod)
```
Removing `cesd` (a score indicating depression level):
```{r}
home_mod <- update(home_mod, .~. - cesd)
Anova(home_mod)
```
Removing `substance`:
```{r}
home_mod <- update(home_mod, .~. - substance)
Anova(home_mod)
```
Removing `sex`:
```{r}
home_mod <- update(home_mod, .~. - sex)
Anova(home_mod)
```
### Can't this be automated?
Strangely, ready-made functions for p-value-based backward selection are
not widely available.
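That said, the algorithm is simple enough to script yourself. Here is a
minimal sketch (the function name `backward_p` and the 0.05 cutoff are just
illustrative choices, and it assumes simple fixed-effect terms with no
interactions); it repeatedly drops the term with the largest `Anova()`
p-value until everything remaining is below the cutoff.
```{r}
# A minimal, illustrative backward-selection loop based on Anova() p-values.
# Not a polished general-purpose tool: it assumes the model can be refit
# with update() and that terms are simple (no interactions).
backward_p <- function(full_mod, alpha = 0.05) {
  current <- full_mod
  repeat {
    tests <- car::Anova(current)
    # glm fits report Pr(>Chisq); lm fits report Pr(>F)
    pvals <- if ("Pr(>Chisq)" %in% names(tests)) tests[["Pr(>Chisq)"]] else tests[["Pr(>F)"]]
    keep <- !is.na(pvals)               # drop the Residuals row, if any
    terms_tested <- rownames(tests)[keep]
    pvals <- pvals[keep]
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- terms_tested[which.max(pvals)]
    current <- update(current, as.formula(paste(". ~ . -", worst)))
  }
  current
}

backward_p(home_mod0) |> Anova()
```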
### Stepwise IC-based selection
Another option may be to use **backward stepwise selection** (same
algorithm as above), but using AIC or BIC as the criterion at each stage
instead of p-values. If the IC value is better (by *any* amount) without
a variable, it gets dropped. Variables are dropped one by one until no
further IC improvement is possible.
This evaluates many fewer models than `dredge` so should be much faster,
but may not find the best of all possible models.
For example, for our model using AIC (*note: this may or may not work
for all model types.*):
```{r, message = FALSE}
library(MASS)
stepAIC(home_mod0)
```
Note that we might still want to remove *one more* variable than
`stepAIC()` does! In the output above, removing `age` would only increase
the AIC by about 1 unit, so according to our $\Delta AIC \sim 3$ threshold
we would take `age` out too.
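For instance, a quick check (assuming `age` survives the stepwise run, as
in the output above; `trace = 0` just suppresses the step-by-step
printout):
```{r}
# Compare AIC for the stepAIC-selected model with and without age
step_mod <- stepAIC(home_mod0, trace = 0)
AIC(step_mod, update(step_mod, . ~ . - age))
```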
Using BIC instead, we need to specify the input `k = log(nrow(data))`
(the BIC penalty multiplier):
```{r}
stepAIC(home_mod0, k = log(nrow(HELPrct)))
```
To get less verbose output, set `trace = 0` -- but then you won't be able
to tell whether it might make sense to remove additional variables...
```{r}
stepAIC(home_mod0, k = log(nrow(HELPrct)), trace = 0)
```
## Summary tables
You may want to compute and display summary tables for your projects.
Here are a few examples of how to do it.
### Mean (or sd, median, IQR, etc.) by groups
Compute the mean and sd (could use any other summary stats you want,
though) for several quantitative variables, by groups.
Example: find mean and sd of iris flower `Petal.Length` and
`Petal.Width` by `Species` and display results in a pretty table. The
dataset is called `iris`.
Make a little one-row table for each variable being summarized, then
stick them together.
```{r, message = FALSE}
library(knitr)
length_stats <- iris |>
df_stats(Petal.Length ~ Species, mean, sd, long_names = FALSE) |>
mutate(variable = 'Petal Length')
width_stats <- iris |>
df_stats(Petal.Width ~ Species, mean, sd, long_names = FALSE) |>
mutate(variable = 'Petal Width')
my_table <- bind_rows(length_stats, width_stats)
kable(my_table)
```
What if we want to round all table entries to 2 digits after the
decimal?
```{r}
kable(my_table, digits = 2)
```
What if we want the column order to be Variable, Species, mean, sd, and
sort by Species and then Variable?
```{r}
my_table <- my_table |>
dplyr::select(variable, Species, mean, sd) |>
arrange(Species, variable)
kable(my_table, digits = 2)
```
What if we actually want a column for mean length, sd length, etc. and
one row per species?
```{r}
library(tidyverse)
my_table2 <- my_table |>
pivot_wider(names_from = variable,
values_from = c("mean", "sd"),
names_sep = ' ')
kable(my_table2, digits = 2, align = 'c')
```
### Proportions in categories by groups
You may also want to make a table of the proportion of observations in
each category by groups, potentially for many variables.
For just one variable, we can use `tally()`:
```{r}
tally(~substance | sex, data = HELPrct, format = 'prop') |>
kable(caption = 'Proportion using each substance', digits = 2)
```
For many variables we can use a loop. For example, we might want to know
the proportion homeless and housed **and** proportion using each
substance, both by sex, from the `HELPrct` dataset. Above we were using
the function `knitr::kable()` to make tables, but we can use
`pander::pander()` too:
```{r}
# select only the variables needed for the table,
# making the first variable the grouping variable
cat_data <- HELPrct |> dplyr::select(sex, substance, homeless)
for (i in 2:ncol(cat_data)) {
  tally(~cat_data[, i] | cat_data[, 1], format = 'prop') |>
    pander::pander(caption = paste('Proportion in each',
                                   names(cat_data)[i]))
  # can rename variables in cat_data if you want better captions
}
```
## Figures
We've made a lot of figures in this class, and almost all have been kind
of mediocre. To aim for awesome, here are a couple of great references
for inspiration, ideas, and best practices:
- *Fundamentals of Data Visualization* by Claus Wilke.
<https://serialmentor.com/dataviz/>
- <https://infogram.com/blog/20-best-data-visualizations-of-2018/>
- the visualisingdata.com blog, for example:
    - <https://www.visualisingdata.com/2019/08/10-significant-visualisation-developments-january-to-june-2019/>
    - <https://www.visualisingdata.com/2016/03/little-visualisation-design/>