-
Notifications
You must be signed in to change notification settings - Fork 21
/
ggplot2.Rmd
172 lines (136 loc) · 6.07 KB
/
ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
---
title: "Data visualization - `ggplot2`"
author: "Douglas Bates"
date: "2014-09-19"
output: ioslides_presentation
---
```{r preliminaries,cache=FALSE,echo=FALSE,results='hide'}
library(ggplot2)
library(knitr)
library(faraway)
opts_chunk$set(cache=TRUE,fig.align='center',fig.width=9,fig.height=4.5,warning=FALSE,message=FALSE)
```
# ggplot2
## The `ggplot2` graphics package
- `R` has "base" graphics functions as well as specialized plotting packages.
- The "base" graphics are a very old design. Avoid them. It is easy to create bad plots with them and difficult to create good graphics.
- The `ggplot2` package by Hadley Wickham of RStudio is the recommended visualization package.
- Hadley's book, [http://ggplot2.org](ggplot2: Elegant Graphics for Data Analysis) (Springer, 2009) describes the system.
- I will use data from the `faraway` package to accompany Julian Faraway's freely available book __Practical Regression and Anova using R__ to illustrate the use of `qplot`.
## General principles
- `ggplot2` is an implementation of Leland Wilkinson's "Grammar of Graphics" ideas
- The general approach is to build information about a plot in layers associating geometric or statistical aspects of the plot with various data characteristics.
- Frequently we create the base information about the plot then add information about the geometry then display the result.
- The `qplot` function produces a "quick plot". It can be used to create a base plot but I prefer to use `ggplot`.
# The `pima` data set
## Examining the `pima` data
```{r pima}
str(pima)
head(pima)
```
## Recoding the missing data
- As Faraway notes, several of the values of variables that cannot reasonably be zero are recorded as zero.
- A bit of research shows that these are missing data values. Also the `test` variable is a factor, not numeric.
```{r pimarecode}
pima <- within(pima, {
diastolic[diastolic == 0] <- glucose[glucose == 0] <-
triceps[triceps == 0] <- insulin[insulin == 0] <- bmi[bmi == 0] <- NA
test <- factor(test, labels=c("negative","positive"))
})
head(pima, 3)
```
# Univariate summary plots
## Histogram of diastolic blood pressure
```{r histdiastolic}
qplot(diastolic, data=pima, geom="histogram")
```
## Comments about the histogram
- Histograms are from an earlier era. You almost never want to use them.
- If you want to show the empirical density of a set of observations use an empirical density plot.
- One of the simplest things you can do to enhance a plot is to use informative axis labels, including the units if appropriate. In the case of diastolic blood pressure, "75 mmHg" is a measurement; "75" is not.
- To avoid repeating phrases like `xlab="Diastolic blood pressue (mmHg)"` create the base plot then display it with various geometries.
## Histogram of diastolic blood pressure
```{r histby}
p <- ggplot(pima, aes(x=diastolic)) + xlab("Diastolic blood pressure (mmHg)")
p + geom_histogram()
```
## Empirical density plot of diastolic blood pressure
```{r density}
p + geom_density(aes(x=diastolic))
```
## Empirical density of diastolic b.p. by test result
```{r densityby}
p + geom_density(aes(color=test))
```
# Bivariate plots
## Simple scatterplot, c.f. Fig. 1.2a, p. 13
```{r scat}
p <- ggplot(pima,aes(x=diastolic,y=diabetes))+xlab("Diastolic blood pressure (mmHg)")+ylab("Diabetes pedigree function")
p + geom_point()
```
## Adding a scatterplot smoother
```{r scat1}
p + geom_point() + geom_smooth()
```
## Multiple smoothers by group
```{r scat2}
p + geom_point(aes(color=test)) + geom_smooth(aes(color=test))
```
## Conversion to log scale
```{r scat3}
p + geom_point(aes(color=test)) + geom_smooth(aes(color=test)) + scale_x_log10()
```
## Conversion to log-log scale
```{r scat4}
p <- p + geom_point(aes(color=test)) + geom_smooth(aes(color=test))
p + scale_x_log10() + scale_y_log10()
```
# Comparative boxplots
## Comparing a continuous response by groups
- One of the things that drive me crazy when I am looking at a paper or report is non-informative comparative boxplots.
- Almost always the response is shown on the vertical axis and the groups on the horizontal axis.
- By transposing these axes the plot can be made much more informative.
- In a report or paper, the "expense" of including a plot is usually determined by the height, not the width.
- Short wide plots are preferred to tall narrow plots.
- It makes much better sense to put the continuous response on the long axis and the categorical groups on the short axis.
- In `ggplot2` you create the plot in the usual orientation then use `coord_flip()` to transpose the axes.
## Comparative boxplots - the wrong way
```{r bw1}
p <- ggplot(pima, aes(x=test,y=diabetes)) + xlab("Diabetes test result") + ylab("Diabetes pedigree function")
p + geom_boxplot()
```
## Comparative boxplot: horizontal orientation
```{r bw2,fig.height=1.5}
p + geom_boxplot() + coord_flip()
```
- You can make such a figure very short and not lose information
## Comparative boxplots: horizontal orientation, log scale
```{r bw3,fig.height=1.5}
p + geom_boxplot() + scale_y_log10() + coord_flip()
```
# Simple regression or ancova lines
## Adding a simple linear regression line - c.f. Fig. 1.3
```{r scatlm,fig.height=4}
stat500 <- data.frame(scale(stat500))
p <- ggplot(stat500,aes(x=midterm,y=final))+geom_point()+xlab("Z-score on midterm exam")+ylab("Z-score on final exam")
p + geom_smooth(method="lm") + coord_equal()
```
## Adding a reference line - c.f. Fig. 1.3
```{r scatlm1}
p+geom_smooth(method="lm")+geom_abline(intercept=0,slope=1,color="red")+coord_equal()
```
## Suppressing the confidence band
```{r scatlm2,fig.width=4.5}
p+geom_smooth(method="lm",se=FALSE)+geom_abline(color="red")+coord_equal()
```
# Ancova
## Plotting multiple groups and lines, c.f. Fig. 15.2
```{r ancova}
levels(cathedral$style) <- c("Gothic", "Romanesque")
p <- ggplot(cathedral,aes(x=x,y=y)) + xlab("Nave Height (ft)") + ylab("Total Length (ft)")
p + geom_point(aes(color=style)) + geom_smooth(aes(color=style),method="lm")
```
## Plotting multiple groups in separate panels
```{r ancova1}
p + geom_point() + geom_smooth(method="lm") + facet_grid(.~style)
```