# Data
## Sources
As mentioned in Chapter 2, we focus on four data sets published by the WHO, described below. The data were downloaded from the WHO Coronavirus (COVID-19) Dashboard [website](https://covid19.who.int/data) on Dec. 13, 2022.
### Daily cases and deaths by date reported to WHO
This data set shows the spread of COVID-19 across countries, territories, and areas around the world from January 2020 to the present (our copy contains records up to Dec. 13, 2022). It contains 8 columns, described below.
| Field name | Type | Description |
|:-----------------:|:-------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Date_reported | Date | Date of reporting to WHO |
| Country_code | String | ISO Alpha-2 country code |
| Country | String | Country, territory, area |
| WHO_region | String | WHO regional offices: WHO Member States are grouped into six WHO regions -- Regional Office for Africa (AFRO), Regional Office for the Americas (AMRO), Regional Office for South-East Asia (SEARO), Regional Office for Europe (EURO), Regional Office for the Eastern Mediterranean (EMRO), and Regional Office for the Western Pacific (WPRO). |
| New_cases | Integer | New confirmed cases. Calculated by subtracting previous cumulative case count from current cumulative cases count.\* |
| Cumulative_cases | Integer | Cumulative confirmed cases reported to WHO to date. |
| New_deaths | Integer | New confirmed deaths. Calculated by subtracting previous cumulative deaths from current cumulative deaths.\* |
| Cumulative_deaths | Integer | Cumulative confirmed deaths reported to WHO to date. |
```{r}
daily_case <- read.csv(file="./data/WHO-COVID-19-global-data.csv")
# Brief view of the data.
head(daily_case)
```
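The field table describes `New_cases` and `New_deaths` as differences between consecutive cumulative counts. A minimal sketch of that relationship, using made-up numbers for one hypothetical country (the name `toy_cumulative` is ours and is not part of the WHO data):

```{r}
# Toy cumulative counts for one hypothetical country (made-up numbers).
toy_cumulative <- c(0, 3, 3, 10, 18)

# New cases on a day = that day's cumulative total minus the previous day's.
# The first day has no predecessor, so its own total is used as the increment.
toy_new <- c(toy_cumulative[1], diff(toy_cumulative))
toy_new

# Summing the daily increments recovers the cumulative series.
all(cumsum(toy_new) == toy_cumulative)
```

A check like the last line could be run per country on the real data to see whether the two columns agree, but we have not done so here.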
```{r}
# See the number of countries, territories, and areas.
cat("The number of countries, territories, and areas recorded: ", length(unique(daily_case$Country)))
```
```{r}
# See the beginning date and ending date of the records.
print(paste0("The last day of records: ", max(as.Date(daily_case$Date_reported))))
print(paste0("The first day of records: ", min(as.Date(daily_case$Date_reported))))
```
The brief view shows that the data set is tidy, which means that:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
It contains daily information for 237 countries, territories, and areas from 2020-01-03 to 2022-12-13.
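The "every row is an observation" property can also be checked mechanically instead of by eye. The sketch below assumes that an observation is identified by the pair (`Country`, `Date_reported`); the data frame `toy_daily` is a made-up stand-in for the real file.

```{r}
# Made-up stand-in for the daily data: two hypothetical countries, two days.
toy_daily <- data.frame(
  Date_reported = as.Date(c("2020-01-03", "2020-01-04", "2020-01-03", "2020-01-04")),
  Country = c("A", "A", "B", "B"),
  New_cases = c(0, 2, 1, 0)
)

# Rows are unique observations if no (Country, Date_reported) pair repeats.
n_dup <- sum(duplicated(toy_daily[, c("Country", "Date_reported")]))
n_dup

# In a complete panel, every country covers the same number of days.
days_per_country <- table(toy_daily$Country)
days_per_country
```

A result of zero duplicates and equal day counts per country would support the tidy reading of the real data set.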
### Latest reported counts of cases and deaths
Unlike the daily data set, this one focuses on cases and deaths reported in the most recent period (the last 7 days and the last 24 hours). It contains 12 columns, described below.
| Field name | Type | Description |
|:------------------------------------------------------------:|:-------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Name | String | Country, territory, area |
| WHO_region | String | WHO Region |
| Cases - cumulative total | Integer | Cumulative confirmed cases reported to WHO to date. |
| Cases - cumulative total per 100000 population | Decimal | Cumulative confirmed cases reported to WHO to date per 100,000 population. |
| Cases - newly reported in last 7 days | Integer | New confirmed cases reported in the last 7 days. Calculated by subtracting previous cumulative case count (8 days prior) from current cumulative cases count. |
| Cases - newly reported in last 7 days per 100000 population | Decimal | New confirmed cases reported in the last 7 days per 100,000 population. |
| Cases - newly reported in last 24 hours | Integer | New confirmed cases reported in the last 24 hours. Calculated by subtracting previous cumulative case count from current cumulative cases count. |
| Deaths - cumulative total | Integer | Cumulative confirmed deaths reported to WHO to date. |
| Deaths - cumulative total per 100000 population | Decimal | Cumulative confirmed deaths reported to WHO to date per 100,000 population. |
| Deaths - newly reported in last 7 days | Integer | New confirmed deaths reported in the last 7 days. Calculated by subtracting previous cumulative death count (8 days prior) from current cumulative deaths count. |
| Deaths - newly reported in last 7 days per 100000 population | Decimal | New confirmed deaths reported in the last 7 days per 100,000 population. |
| Deaths - newly reported in last 24 hours | Integer | New confirmed deaths reported in the last 24 hours. Calculated by subtracting previous cumulative death count from current cumulative deaths count. |
```{r}
reccent_case <- read.csv(file="./data/WHO-COVID-19-global-table-data.csv")
# Brief view of the data.
head(reccent_case)
```
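The "per 100000 population" columns normalize a count by population size. The population figures WHO uses are not part of this file, so the numbers below are hypothetical and only illustrate the arithmetic.

```{r}
# Hypothetical values, not taken from the WHO data.
toy_cases <- 25000        # cumulative confirmed cases
toy_population <- 5e6     # population of the hypothetical country

# Cases per 100,000 population.
toy_rate <- toy_cases * 1e5 / toy_population
toy_rate
```

Rates of this kind make small and large countries comparable, which raw counts do not.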
```{r}
# See the number of countries, territories, and areas.
# Each row describes one entity; the -1 excludes the aggregate (global) row.
cat("The number of countries, territories, and areas recorded: ", nrow(reccent_case) - 1)
```
The brief view shows that this data set is tidy as well. Because it covers only the most recent period, it has no date column. Note also that the number of countries, territories, and areas may differ from that of the previous data set.
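When two data sets report different numbers of entities, comparing the name sets shows which entries account for the gap. The sketch below uses made-up name vectors (`toy_daily_names`, `toy_recent_names`); on the real data the inputs would be the name columns of the two files.

```{r}
# Made-up entity names standing in for the two data sets.
toy_daily_names  <- c("A", "B", "C")
toy_recent_names <- c("Global", "A", "B", "D")

# Entities that appear only in the daily data set.
only_daily <- setdiff(toy_daily_names, toy_recent_names)
only_daily

# Entities that appear only in the recent data set (including aggregate rows).
only_recent <- setdiff(toy_recent_names, toy_daily_names)
only_recent
```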
### Vaccination data
This data set shows COVID-19 vaccination status in different countries, territories, and areas. The 16 columns are described below.
| Field name | Type | Description |
|:------------------------------------:|:-------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| COUNTRY | String | Country, territory, area |
| ISO3 | String | ISO Alpha-3 country code |
| WHO_REGION | String | WHO regional offices: WHO Member States are grouped into six WHO regions: Regional Office for Africa (AFRO), Regional Office for the Americas (AMRO), Regional Office for South-East Asia (SEARO), Regional Office for Europe (EURO), Regional Office for the Eastern Mediterranean (EMRO), and Regional Office for the Western Pacific (WPRO). |
| DATA_SOURCE | String | Indicates data source: - REPORTING: Data reported by Member States, or sourced from official reports - OWID: Data sourced from Our World in Data: <https://ourworldindata.org/covid-vaccinations> |
| DATE_UPDATED | Date | Date of last update |
| TOTAL_VACCINATIONS | Integer | Cumulative total vaccine doses administered |
| PERSONS_VACCINATED_1PLUS_DOSE | Decimal | Cumulative number of persons vaccinated with at least one dose |
| TOTAL_VACCINATIONS_PER100 | Integer | Cumulative total vaccine doses administered per 100 population |
| PERSONS_VACCINATED_1PLUS_DOSE_PER100 | Decimal | Cumulative persons vaccinated with at least one dose per 100 population |
| PERSONS_FULLY_VACCINATED | Integer | Cumulative number of persons fully vaccinated |
| PERSONS_FULLY_VACCINATED_PER100 | Decimal | Cumulative number of persons fully vaccinated per 100 population |
| VACCINES_USED | String | Combined short name of vaccine: "Company - Product name" (see below) |
| FIRST_VACCINE_DATE | Date | Date of first vaccinations. Equivalent to start/launch date of the first vaccine administered in a country. |
| NUMBER_VACCINES_TYPES_USED | Integer | Number of vaccine types used per country, territory, area |
| PERSONS_BOOSTER_ADD_DOSE | Integer | Persons received booster or additional dose |
| PERSONS_BOOSTER_ADD_DOSE_PER100 | Decimal | Persons received booster or additional dose per 100 population |
```{r}
vcc <- read.csv(file = "./data/vaccination-data.csv")
# Brief view of the data.
head(vcc)
```
```{r}
# See the number of countries, territories, and areas.
cat("The number of countries, territories, and areas recorded: ", length(unique(vcc$COUNTRY)))
```
The number of countries, territories, and areas is smaller than the total number of WHO members, which suggests that some places have not reported vaccination data or still lack vaccines.
### Vaccination metadata
This data set is smaller than the previous one and contains no information about the vaccinated population. Its 10 columns are described below.
| Field name | Type | Description |
|:------------------:|:------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| ISO3 | String | ISO Alpha-3 country code |
| VACCINE_NAME | String | Combined short name of vaccine: "Company - Product name" (see below) |
| PRODUCT_NAME | String | Name or label of vaccine product, or type of vaccine (if unnamed). |
| COMPANY_NAME | String | Marketing authorization holder of vaccine product. |
| FIRST_VACCINE_DATE | Date | Date of first vaccinations. Equivalent to start/launch date of the first vaccine administered in a country. |
| AUTHORIZATION_DATE | Date | Date vaccine product was authorised for use in the country, territory, area. |
| START_DATE | Date | Start/launch date of vaccination with vaccine type (excludes vaccinations during clinical trials). |
| END_DATE | Date | End date of vaccine rollout |
| COMMENT | String | Comments related to vaccine rollout |
| DATA_SOURCE | String | Indicates data source - REPORTING: Data reported by Member States, or sourced from official reports - OWID: Data sourced from Our World in Data: <https://ourworldindata.org/covid-vaccinations> |
```{r}
vcc_meta <- read.csv(file = "./data/vaccination-metadata.csv")
# Brief view of the data.
head(vcc_meta)
```
```{r}
# See the number of countries, territories, and areas.
cat("The number of countries, territories, and areas recorded: ", length(unique(vcc_meta$ISO3)))
```
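Judging from its fields, the metadata appears to hold one row per country and vaccine product, so the number of products per country can be derived by counting distinct `VACCINE_NAME` values within each `ISO3` code. This is an assumption about the file's layout; the sketch uses made-up codes and product names.

```{r}
# Made-up metadata: one row per (country code, vaccine product).
toy_meta <- data.frame(
  ISO3 = c("AAA", "AAA", "BBB", "CCC", "CCC", "CCC"),
  VACCINE_NAME = c("X - x1", "Y - y1", "X - x1", "X - x1", "Y - y1", "Z - z1")
)

# Number of distinct vaccine products recorded per country code.
products_per_country <- tapply(
  toy_meta$VACCINE_NAME, toy_meta$ISO3,
  function(v) length(unique(v))
)
products_per_country
```

On the real data, these counts could be compared with `NUMBER_VACCINES_TYPES_USED` in the vaccination data set as a consistency check.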
## Cleaning / transformation
Since the data sets we use are all tidy, we do not need to clean or transform them further. However, the next chapter uses monthly information, so we create a monthly data set of reported cases and deaths from `WHO-COVID-19-global-data.csv`.
```{r}
library(lubridate)
library(tidyverse)
library(redav)
```
```{r}
# Label each record with its two-digit year and month, e.g. "20-01".
daily_case$month <- format(as.Date(daily_case$Date_reported, format = "%Y-%m-%d"), "%y-%m")
# Aggregate to one row per country and month.
month_case <- daily_case %>%
  group_by(WHO_region, Country_code, Country, month) %>%
  summarize(
    New_cases = sum(New_cases),
    New_deaths = sum(New_deaths),
    Cumulative_cases = max(Cumulative_cases),
    Cumulative_deaths = max(Cumulative_deaths)
  )
write.csv(month_case, file = "./data/monthly_data.csv")
```
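To make the aggregation rule concrete, here is a small sketch on made-up records that span a month boundary. New counts are summed within a month, while the cumulative count takes the monthly maximum. Using `max()` as the month-end value assumes the cumulative series never decreases, which retrospective corrections in the real data could violate. The data frame `toy_daily2` is hypothetical.

```{r}
library(dplyr)

# Made-up daily records for one hypothetical country across two months.
toy_daily2 <- data.frame(
  Country = c("A", "A", "A", "A"),
  Date_reported = c("2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"),
  New_cases = c(1, 2, 4, 0),
  Cumulative_cases = c(1, 3, 7, 7)
)

toy_monthly <- toy_daily2 %>%
  mutate(month = format(as.Date(Date_reported), "%y-%m")) %>%
  group_by(Country, month) %>%
  summarize(
    New_cases = sum(New_cases),                 # total new cases in the month
    Cumulative_cases = max(Cumulative_cases),   # cumulative count at month end
    .groups = "drop"
  )
toy_monthly
```

January collects 3 new cases and ends at a cumulative count of 3; February adds 4 and ends at 7.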
## Missing value analysis
```{r}
# Total number of missing cells in each data set.
daily_case_na <- sum(is.na(daily_case))
reccent_case_na <- sum(is.na(reccent_case))
vcc_na <- sum(is.na(vcc))
vcc_meta_na <- sum(is.na(vcc_meta))
print(paste0("The number of missing values in the daily cases and deaths data set: ", daily_case_na))
print(paste0("The number of missing values in the recent cases and deaths data set: ", reccent_case_na))
print(paste0("The number of missing values in the vaccination data set: ", vcc_na))
print(paste0("The number of missing values in the vaccination metadata set: ", vcc_meta_na))
```
Of the four data sets, the vaccination metadata set has the most missing values.
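Total counts depend on the size of each table, so a per-column breakdown and the share of missing cells are often easier to compare. A minimal sketch on a made-up data frame (`toy_na` is hypothetical):

```{r}
# Made-up data frame with a few missing cells.
toy_na <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE)
)

# Missing values per column, sorted so the most affected column comes first.
na_per_column <- sort(colSums(is.na(toy_na)), decreasing = TRUE)
na_per_column

# Share of missing cells in the whole table.
na_share <- mean(is.na(toy_na))
na_share
```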
```{r, fig.width=14}
plot_missing(month_case, percent = FALSE)
```
Only some country codes are missing in `WHO-COVID-19-global-data.csv`, because some of the entries reported to WHO are not countries. These missing values will not affect our exploration.
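One further cause worth ruling out, which we have not verified against the file: with its default settings, `read.csv` treats the string `"NA"` as a missing value, and `NA` is also the ISO Alpha-2 code for Namibia. The sketch below reproduces the effect on a two-line made-up CSV and shows that passing `na.strings = ""` keeps the code intact.

```{r}
# Made-up two-row CSV; "NA" here is a country code, not a missing value.
toy_csv <- "Country_code,Country\nNA,Namibia\nFR,France\n"

# Default behaviour: the string "NA" is parsed as a missing value.
default_read <- read.csv(text = toy_csv)
is.na(default_read$Country_code)

# With na.strings = "", only empty fields count as missing.
strict_read <- read.csv(text = toy_csv, na.strings = "")
strict_read$Country_code
```

Listing the rows where `Country_code` is missing would show whether this applies to the real data.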
```{r, fig.width=14}
plot_missing(reccent_case, percent = FALSE)
```
Most of the missing values belong to `Deaths...newly.report`. This is because some countries have had no recent deaths or no longer report COVID-19 deaths. These missing values will not affect our exploration.
```{r, fig.width=14}
plot_missing(vcc, percent = FALSE)
```
Most rows in `vaccination-data.csv` are complete.
```{r, fig.width=14}
plot_missing(vcc_meta, percent = FALSE)
```
All values of `END_DATE` and `COMMENT` are missing, but since we use neither field in our analysis, these missing values will not affect our exploration.
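Columns that are missing throughout can be detected and set aside programmatically. The sketch below does this on a made-up data frame (`toy_meta2` is hypothetical); it is one possible approach and not a step we apply in this chapter.

```{r}
# Made-up metadata in which two columns contain no values at all.
toy_meta2 <- data.frame(
  ISO3 = c("AAA", "BBB"),
  END_DATE = c(NA, NA),
  COMMENT = c(NA, NA),
  START_DATE = c("2021-01-05", NA)
)

# A column is entirely missing when it has zero non-missing cells.
all_missing <- colSums(!is.na(toy_meta2)) == 0
empty_columns <- names(toy_meta2)[all_missing]
empty_columns

# Keep only the columns that carry at least one value.
toy_meta2_trimmed <- toy_meta2[, !all_missing, drop = FALSE]
names(toy_meta2_trimmed)
```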