-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathNYC311rats_animated_map.Rmd
251 lines (186 loc) · 15.1 KB
/
NYC311rats_animated_map.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "NYC311rats_animated_map"
author: "Jenny"
date: "1/12/2018"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
knitr::opts_chunk$set(fig.width=11, fig.height=5)
```
New York City maintains an [open data portal](https://opendata.cityofnewyork.us) making data from many City agencies publicly available. If you're looking for a good dataset to use for practicing data manipulation, analysis, or visualization techniques, it is a gold mine.
NYC’s 311 system takes calls or online complaints and questions about any non-emergency matter, all of which are available on the NYC Open Data website. NYC 311 has a rodent sighting category and rat sub-category, which I used for this project.
You can read my Medium [blog post](https://towardsdatascience.com/new-yorkers-should-learn-to-get-along-with-rats-because-theyre-not-leaving-a-data-visualization-db4ca516762b) about rats in NYC and some of the factors that might contribute to seasonal changes and Borough or neighborhood differences in rat 311 reports.
Load packages used
You will also have to instal `ImageMagick` on your computer (not an R package) which is wrapped by `magick` for use in `R`. It took me a bit of time to figure this out. You might want to start [here](https://cran.r-project.org/web/packages/magick/vignettes/intro.html).
```{r, packages1, results = 'hide', message = FALSE}
library(tidyverse)
library(data.table)
library(janitor)
library(ggplot2)
library(gganimate)
library(choroplethr)
library(choroplethrZip)
library(magick)
library(filesstrings)
```
You can use data stored in this [GitHub repository](https://github.com/JListman/NYC_311_animated_map) or you can download an updated version of 311 data from NYC Open Data.
Download [NYC 311 data on rat and mouse sightings](https://data.cityofnewyork.us/Social-Services/Rat-and-Mouse-Sightings/tyjc-9rwy?category=Social-Services&view_name=Rat-and-Mouse-Sightings). I dowloaded data from January, 2010 through July, 2017 and saved it as `Rat_and_Mouse_Sightings.csv`.
This was read into R with `data.table::fread`, which is faster than `read.csv`.
```{r, results = 'hide', message = FALSE, eval = FALSE}
RatMouseSightings <- fread("Rat_and_Mouse_Sightings.csv")
```
This results in a large data set with many variables not needed for this project.
Look at the first few rows to understand the variables we have.
```{r, message = FALSE, eval = FALSE}
View(head(RatMouseSightings))
```
Make a smaller dataframe by removing unneeded columns keeping only "Created Date", "Descriptor","Incident Zip", and "Borough". Look at the first few rows again to make sure the correct variables were kept.
```{r, message = FALSE, eval = FALSE}
RatMouseData <- as.data.frame(RatMouseSightings[,c(2,7,9,25)])
View(head(RatMouseData))
```
Remove the super large file from memory, since we're now working with a file that only contains 4 of the original 53 columns (variables).
```{r, results = 'hide', message = FALSE, eval = FALSE}
rm(RatMouseSightings)
```
Fix the variable names to remove spaces using `janitor::clean_names`. Then turn multiple variables from character to factor, at the same time.
```{r, results = 'hide', message = FALSE, eval = FALSE}
RatMouseData <- RatMouseData %>% clean_names()
cols <- c("Descriptor", "Incident_Zip", "Borough")
RatMouseData <- RatMouseData %>% mutate_at(cols, funs(factor(.)))
```
Make a new dataframe with the subset of rows for which Descriptor variable is "Rat Sighting".
```{r, results = 'hide', message = FALSE, eval = FALSE}
RatData <- subset(RatMouseData, Descriptor == "Rat Sighting")
```
Turn "Created_Date" variable, currently a string, into POSIXct date format. Then get Month, Year and Month_Year variables because you might want any/all of those for exploratory analysis or plotting.
Create them as factor variables instead of numeric or character. Save the dataframe of rat sighting data as a `.rds` file to preserve variable formatting for the next time you want to use it.
```{r, results = 'hide', message = FALSE, eval = FALSE}
RatData <- RatData %>%
mutate(Created_Date = as.POSIXct(Created_Date, "%m/%d/%Y %I:%M:%S %p", tz="")) %>%
mutate(Year = as.factor(year(RatData$Created_Date))) %>%
mutate(Month = as.factor(month(RatData$Created_Date))) %>%
mutate(Month_Year = as.factor(format(RatData$Created_Date, "%Y-%m")))
saveRDS(RatData, "RatData.rds")
```
```{r, results = 'hide', message = FALSE, echo = FALSE}
RatData <- readRDS("RatData.rds")
```
Read in census data NYCpopsizebyzip.csv (found in this [GitHub repository](https://github.com/JListman/NYC_311_animated_map) associated with the project) for popoulation size by NYC zip code. Add population count (`NYCpopsizebyzip$value`) to `RatData` by matching zip codes. `region` variable in `NYCpopsizebyzip` must be changed from numeric to factor in order to use `base::merge`. Use `droplevels` to remove `Incident_Zip` and `Borough` typos, NAs, or errors from NYC 311 database. Rename `value` (population count variable from `NYCpopsizebyzip`) to be self-explanatory.
```{r, readdata, results = 'hide', message = FALSE}
NYCpopsizebyzip <- read.csv("NYCpopsizebyzip.csv", header=TRUE)
NYCpopsizebyzip$region <- as.factor(NYCpopsizebyzip$region)
RatData <- RatData %>%
merge(NYCpopsizebyzip, by.x="Incident_Zip", by.y="region") %>%
rename("popsize" = value) %>%
droplevels()
```
Make a tidy Zip dataframe including new variable of calls per Month_Year by zip code with `dplyr::count`. Rename the newly created variable (automatically named `n` by `dplyr::count` default) to be self-explanatory.
```{r, make_tidydata, results = 'hide', message = FALSE}
TidyRatDataZip <- RatData %>%
count(Incident_Zip, popsize, Month_Year) %>%
rename("count_calls_Zip" = n)
```
For mapping with `choroplethrZip`, Zip needs to be named "region" and the mapped factor variable must be named "value" and needs labels for the legend to make sense.
It can be more useful to compare relative instead of absolute counts across regions, when making a map data visualization. Make new variables based on rat reports per population count per region. First make a numeric variable of rat reports per zip code population size, then bin it to create a factor variable which will be used as the mapped "value" in `choroplethrZip`.
To come up with reasonable `breaks` for the bins using the base R function `cut`, look at the distribution of the variable you want to map. Visually inspect the distribution of the factor variable to be mapped onto zip codes by printing a table. I had to change break values a couple of times before I was happy with the results. I then used `label` to make the bins appear in the map legend in a readable manner. Default labels can be ugly.
```{r, make_map_data, results = 'hide', message = FALSE}
TidyRatDataZip <- TidyRatDataZip %>%
mutate(ReportPer10K = (count_calls_Zip/popsize)*10000) %>%
mutate(value = cut(ReportPer10K,
breaks = c(0.09,0.35,0.55, 0.8, 1.1, 1.5, 2.0,3,10,41),
labels = c("0.09-0.35","0.35-0.55","0.55-0.80","0.80-1.10","1.10-1.50","1.50-2.00", "2.00-3.00","3.00-10.00","10.00-41.00"))) %>%
mutate(region = Incident_Zip)
table(TidyRatDataZip$value)
```
Make a smoothed line plot of median call density per Month_Year across all zips within each Borough to show seasonality.
First, create a new data frame with Month_Year in proper chronological order for plotting & create a new variable. I used median number of rat reports per 10,000 residents rather than mean because the distribution was quite skewed.
```{r, make_lineplot_data, results = 'hide', message = FALSE}
RatCallsByBoroughMedian <- RatData %>%
count(Incident_Zip, Borough, popsize, Month_Year) %>%
rename("count_calls_Zip" = n) %>%
mutate(ReportPer10K = (count_calls_Zip/popsize)*10000) %>%
group_by(Borough, Month_Year) %>%
summarize(MedianMonthYearReportPer10K = median(ReportPer10K))
```
Create a dummy variable, "Order", to make plotting the x-axis easier. "Order" is a number from 1 to 92, which is the number of levels for the factor variable "Month_Year". If you download your data from NYC Open Data rather than using my older data set, you'll have a different number of levels for this variable and will have to adjust the code, accordingly.
Change names of "Order" variable manually for x-axis using only January and July so x-axis is readable instead of crowded and seasonality is visible. The `\n` within each xlabel makes them print on two lines instead of a single, unreadable line.
Again, I downloaded the data in July, 2017. If you do a new download now, instead of using my stored data, you'll need to deal with Month_Year levels that are not addressed in my code.
```{r, make_lineplot_dummy_varibles, results = 'hide', message = FALSE}
RatCallsByBoroughMedian$Order <- seq_along(1:92)
xlabels <- c("Jan \n2010","July \n2010","Jan \n2011","July \n2011","Jan \n2012",
"July \n2012","Jan \n2013","July \n2013","Jan \n2014","July \n2014",
"Jan \n2015","July \n2015","Jan \n2016","July \n2016","Jan \n2017","July \n2017")
```
Make a lineplot graph using `ggplot::geom_smooth`. This also took some fiddling with the parameters of the smoothing function to make it look reasonable. I used `theme_minimal` to get rid of lots of distracting lines and then used `geom_vline` to re-insert vertical lines only at January & July axis ticks for a cleaner look.
You can add a `caption` for the graph, with your Twitter handle or website and the data source, so if someone posts the image, everyone gets credit.
```{r, make_lineplot, message = FALSE}
RatCallsByMonthBorough <- ggplot(RatCallsByBoroughMedian,aes(Order, MedianMonthYearReportPer10K)) +
geom_smooth(aes(group=Borough, color=Borough), na.rm = TRUE,
formula = y ~ s(x, k = 50), method = "gam", n= 100, se = FALSE)+
theme_minimal() +
xlab("\nMonth & Year") +
ylab("Rat Reports to 311 Per 10,000 Residents\n") +
labs(title="Rat Complaints to NYC's 311 By Borough: 2010-2017\n",
caption="@jblistman. Data source: NYC Open Data 311 Service Requests") +
scale_y_continuous(breaks=c(0,1,2,3), labels= c("0","1","2","3")) +
geom_vline(xintercept=seq(1,92,by=6), colour='gray') +
scale_x_continuous(breaks=seq(1,92,by=6),labels=xlabels) +
theme(axis.text.x=element_text(size = 13),
axis.text.y=element_text(size = 13),
axis.title.x=element_text (size = 14),
axis.title.y=element_text (size = 14),
plot.title = element_text(size = 15, hjust = 0.5),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12))
RatCallsByMonthBorough
```
Make a dataframe with the subset of variables needed for the map.
```{r, make_map_dataframe, results = 'hide', message = FALSE}
TidyRatMapData <- TidyRatDataZip[,c(3,4,6,7)]
```
Create objects and order variables needed for maps.
1. Month_Year as list in chronologicall order with `base::sort`. Combine the sort with `base::unique` to get rid of duplicate Month_Year levels because each zip has an entry for each Month_Year.
2. Make vector of Hex color codes in a color-blind friendly map color palette derived from the [colorbrewer website](http://colorbrewer2.org/#type=sequential&scheme=YlGnBu&n=9) for 9 factor levels of "value". I chose a diverging palette to emphasize the highs and lows. You can play around with this.
3. Make a vector of NYC zip codes instead of using nyc_fips (in the Census data set that we merged earlier to get population size) because county fips codes include some census zcta (close to zip codes, but not exactly) that include Long Island, which is *definitely* not part of NYC.
```{r, make_map_variables, results = 'hide', message = FALSE}
datesRats <- sort(unique(TidyRatMapData$Month_Year))
mapcolors <- c('#ffffd9','#edf8b1','#c7e9b4','#7fcdbb','#41b6c4','#1d91c0','#225ea8','#253494','#081d58')
nyc_zips <- NYCpopsizebyzip$region
```
Make maps using `zip_choropleth` which uses `ggplot2` to make map objects and then `choroplethr_animate` to write png files of each map image.
These pngs can be used as-is with `choroplethr_animate` which creates code along with the pngs to make a link to animated maps or they can be turned into a gif with `magick`.
For each Month_Year create a map. This code was adapted from a [tutorial by Kimberly Coffey](http://www.kimberlycoffey.com/blog/2015/10/17/animated-choropleth-map-for-the-web). In `scale_fill_manual` and `guides`, override the vector of color codes that correspond to binned values so that zip codes with no rat complaints for a given month are colored gray.
```{r, make_map_pngs, results="hide", message = FALSE, eval = FALSE}
RatMapsAnimate <- list()
for (i in 1:length(datesRats)) {temp.date <- datesRats[i]
df <- subset(TidyRatMapData, Month_Year == temp.date)
totalrats <- sum(df$count_calls_Zip)
title = paste("Total Rat Reports to NYC's 311 in", temp.date, ":", totalrats)
RatMapsAnimate[[i]] = zip_choropleth(df, zip_zoom = nyc_zips, title=title, reference_map=FALSE) +
scale_fill_manual(values=mapcolors,drop=FALSE,
na.translate = TRUE, na.value = "gray",
name="Rat Reports to \nNYC's 311 Per \n10,000 Residents")+
labs(caption="@jblistman. Data source: NYC Open Data 311 Service Requests")+
scale_colour_manual(values=NA) +
guides(colour=guide_legend("0", override.aes=list(colour="gray")))
}
```
Use `choroplethr_animate` to write a png image for each map plus a file with html code to combine pngs as an animation. The animation can be viewed in a browser, but we'll also turn the pngs into a gif.
```{r, make_animation, results="hide", message = FALSE, eval = FALSE}
choroplethr_animate(RatMapsAnimate)
```
`choroplethr_animate` creates files with numerical names that are out of order, alphabetically. Rename files choropleth_1.png ... choropleth_9.png to choropleth_01.png ... choropleth_09.png so when they are automatically turned into a gif with `magick`, they are ordered alphabetically as well as numerically.
`filestrings::nice_file_nums` will do this automatically for an entire directory. The current working directory already contains the png files that need to be renamed.
```{r, renumber_maps, results="hide", message = FALSE, eval = FALSE}
nice_file_nums(dir = ".", pattern = ".png")
```
Convert the .png files to a .gif with `ImageMagick`. First, change your working directory to the folder in which you've written the png files. I've named the resulting file ratmaps.gif and chose a delay of 40 between frames. Play with the delay depending on your preferences for time between frames and name your own file (or keep ratmaps, whichever). `system()` executes code as if you're using the terminal.
```{r, make_gif, results="hide", message = FALSE, eval = FALSE}
system("convert -delay 40 *.png ratmaps.gif")
```
If you don't want to save all the gigantic .png files, remove them from the working directory.
```{r, remove_pngs, results="hide", message = FALSE, eval = FALSE}
file.remove(list.files(pattern=".png"))
```