-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRMArkdown.Rmd
289 lines (219 loc) · 11.5 KB
/
RMArkdown.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
---
title: "Weather Events and their Impacts on Human Health and Economics"
date: "`r Sys.Date()`"
author: "Zaid Hassan"
output:
rmdformats::readthedown:
code_folding: show
self_contained: true
thumbnails: false
lightbox: true
keep_md: yes
df_print: paged
highlight: tango
---
<a href="https://github.com/RexGalilae/Weather-Data-Rmd" class="github-corner" aria-label="View source on GitHub"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#151513; color:#fff; position: absolute; top: 0; border: 0; right: 0;" aria-hidden="true"><path d="M0,0 L115,115 L130,115 L142,142 L250,250 L250,0 Z"></path><path d="M128.3,109.0 C113.8,99.7 119.0,89.6 119.0,89.6 C122.0,82.7 120.5,78.6 120.5,78.6 C119.2,72.0 123.4,76.3 123.4,76.3 C127.3,80.9 125.5,87.3 125.5,87.3 C122.9,97.6 130.6,101.9 134.4,103.2" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="M115.0,115.0 C114.9,115.1 118.7,116.5 119.8,115.4 L133.7,101.6 C136.9,99.2 139.9,98.4 142.2,98.6 C133.8,88.0 127.5,74.4 143.8,58.0 C148.5,53.4 154.0,51.2 159.7,51.0 C160.3,49.4 163.2,43.6 171.4,40.1 C171.4,40.1 176.1,42.5 178.8,56.2 C183.1,58.6 187.2,61.8 190.9,65.4 C194.5,69.0 197.7,73.2 200.1,77.6 C213.8,80.2 216.3,84.9 216.3,84.9 C212.7,93.1 206.9,96.0 205.4,96.6 C205.1,102.4 203.0,107.8 198.3,112.5 C181.9,128.9 168.3,122.5 157.7,114.1 C157.9,116.9 156.7,120.9 152.7,124.9 L141.0,136.5 C139.8,137.7 141.6,141.9 141.8,141.8 Z" fill="currentColor" class="octo-body"></path></svg></a><style>.github-corner:hover .octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style>
## Synopsis
```{r image-render, echo=FALSE, fig.align="center", fig.width=6, cache=TRUE, fig.cap="A photo of Hurricane Florence taken from the ISS (Courtesy: NASA)"}
library(jpeg)
library(grid)
img <- readJPEG("hurriet-pasha.jpg")
grid.raster(img)
```
The NOAA Storm Database receives Storm Data from the National Weather Service from across the US. This project aims to quantify the impact of various documented storms from an Economic as well as Human perspective. The idea is to compare and contrast various events to answer the following questions:
1. Across the United States, which types of events are **most harmful with respect to population health**?
2. Across the United States, which types of events have the **greatest economic consequences**?
This work was done as a part of a project towards the completion of the [Reproducible Research](http://www.coursera.org/learn/reproducible-research) course in the [Data Science Specialization](http://www.coursera.org/specializations/jhu-data-science). This `knitr` generated publication documents all the work done (in R) towards the completion of said project.
*Read the full [data documentation](https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf)*
## Load up all dependencies and set global variables
```{r setup, message = FALSE, warning = FALSE, cache=FALSE}
library(dplyr)
library(ggplot2)
library(knitr)
library(xtable)
library(magrittr)
library(data.table)
knitr::opts_chunk$set(
fig.align = "center",
fig.height = 6,
fig.path = "figs/fig-",
fig.width = 6,
message = FALSE,
warning = FALSE,
comment = NA
)
```
## Download Data if Missing
```{r download-data, message=FALSE, warning=FALSE}
destfile <- "repdata_data_StormData.csv.bz2"
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists(destfile)){
res <- tryCatch(download.file(fileURL,
destfile=destfile,
method="auto"),
error=function(e) 1)
}
```
## Load Data onto R
```{r load-data, message=FALSE, warning=FALSE, cache=TRUE}
df <- as.data.table(read.csv("repdata_data_StormData.csv.bz2"))
assertthat::assert_that(exists("df"))
```
## The Data {.tabset}
### Table
```{r table, echo=FALSE}
df
```
### Columns
```{r cols, echo=FALSE, results='asis'}
print(xtable(as.data.frame(colnames(df))), type = "html")
```
## Data Processing
### Data Subsetting
Since we're only interested in:
* Event Type `EVTYPE`
* Fatalities `FATALITIES`
* Injuries `INJURIES`
* Damange to Property `PROPDMG`E`PROPDMGEXP`
* Damange to Crops `CROPDMG`E`CROPDMGEXP`
```{r subset, cache = TRUE}
dmg <- df[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
```
### Quantifying Data
*Making the Data Visualization-ready*
The dataset used some interesting notation to represent the amount of damage done to their crops and property. For example,
<a id = "ideal"></a>
`r data.frame(X = 8E5, Y = 6E3, "..." = "...")`
is represented in two columns as,
<a id = "curr"></a>
`r data.frame(X = 8, XEXP = 5, Y = 6, YEXP = "K", "..." = "...")`
Let's have a look at the `"xEXP"` values present in the dataset.
```{r unique-propexps}
unique(dmg$PROPDMGEXP)
```
```{r unique-cropexps}
unique(dmg$CROPDMGEXP)
```
Corresponding to each unique symbol, we need to assign a numeric value. Let's create key-value pairs with our assigned exponents as values to facilitate our transition from this current state (shown [here](#curr)) to [this](#ideal).
### 1. Change all `xEXP` entries to uppercase.
```{r to-upper, cache=TRUE}
cols <- c("PROPDMGEXP", "CROPDMGEXP")
dmg %<>% mutate_at(cols, toupper)
dmg <- as.data.table(dmg)
```
### 2. Map property and crop damage alphanumeric exponents to numeric values.
```{r mappings}
propDmgKey <- c("\"\"" = 10^0,
"-" = 10^0,
"+" = 10^0,
"0" = 10^0,
"1" = 10^1,
"2" = 10^2,
"3" = 10^3,
"4" = 10^4,
"5" = 10^5,
"6" = 10^6,
"7" = 10^7,
"8" = 10^8,
"9" = 10^9,
"H" = 10^2,
"K" = 10^3,
"M" = 10^6,
"B" = 10^9)
cropDmgKey <- c("\"\"" = 10^0,
"?" = 10^0,
"0" = 10^0,
"K" = 10^3,
"M" = 10^6,
"B" = 10^9)
```
### 3. Replace the values in `PROPDMGEXP` with their corresponding numeric values.
```{r replace-keys, cache=FALSE}
dmg[, PROPDMGEXP := propDmgKey[as.character(dmg[,PROPDMGEXP])]]
dmg[is.na(PROPDMGEXP), PROPDMGEXP := 10^0 ]
dmg[, CROPDMGEXP := cropDmgKey[as.character(dmg[,CROPDMGEXP])] ]
dmg[is.na(CROPDMGEXP), CROPDMGEXP := 10^0 ]
```
*NOTE:* Using `mutate_at` along with a key-value fetcher function for this task didn't work as intended and instead, copied the same values across all rows. This is why it was avoided in this chunk.
### 4. Use `mutate` to create columns for `PropertyDamage`, `CropDamage` and `TotalDamage` and get rid of the raw column data
```{r calc-dmg, cache=TRUE}
dmg %<>%
mutate(PropertyDamage = PROPDMG*PROPDMGEXP, CropDamage = CROPDMG*CROPDMGEXP) %>%
mutate(TotalDamage = PropertyDamage+CropDamage, TotalCasualties = FATALITIES+INJURIES) %>%
select(-c("PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
```
### 5. Preview the Dataframe before proceeding
```{r prev}
dmg[0:100,]
```
## Results
### Summary of Data
Let's first `summarize` the data to find the total Economic and Manpower Damage of these Weather events
```{r dmg-summary, cache=TRUE}
damageReport <- dmg %>%
group_by(EVTYPE) %>%
summarize(PropertyDamage = sum(PropertyDamage),
CropDamage = sum(CropDamage),
TotalDamage = sum(TotalDamage),
Injuries = sum(INJURIES),
Fatalities = sum(FATALITIES),
TotalCasualties = sum(TotalCasualties))
damageReport
```
### Prepping up the Data for Bar Plotting
Since `ggplot` works by plotting a set of variables grouped by a `fill` aesthetic, we need to `melt` our report data to make our report fully plot-ready. But first, to split it into its human and economic facets:
```{r split}
econLosses <- damageReport[ ,c("EVTYPE","PropertyDamage","CropDamage", "TotalDamage")]
humanLosses <- damageReport[, c("EVTYPE","Injuries", "Fatalities", "TotalCasualties")]
```
Then, to order by Total Losses
``` {r order}
econLosses <- econLosses[order(econLosses$TotalDamage, decreasing = TRUE),]
humanLosses <- humanLosses[order(humanLosses$TotalCasualties, decreasing = TRUE),]
```
Take the Top 10 most costly events for each. The rest will merely clutter up our graph
```{r top-10}
econLosses <- econLosses[1:10,]
humanLosses <- humanLosses[1:10,]
```
Finally, melt the data
```{r melt}
econLosses <- melt(econLosses)
humanLosses <- melt(humanLosses)
```
### Plotting
```{r hloss-bar, fig.cap="A Bar Plot of the Top 10 Events with the Most Reported Casualties"}
# Specify Aesthetic mappings
plot <- ggplot(humanLosses, aes(x = reorder(EVTYPE, -value), y= value, fill= variable))
# Specify Bar Chart specs
plot = plot + geom_bar(stat = "identity", position = "dodge")
# Set x-axis label to blank
plot = plot + xlab(element_blank())
# Set y-axis label
plot = plot + ylab("Casualties")
# Prevent clutter around the x-axis
plot = plot + theme(axis.text.x = element_text(angle=45, hjust=1), axis.title.x = element_blank())
# Set chart title and center it
plot = plot + ggtitle("Top 10 Deadliest Events") + theme(plot.title = element_text(hjust = 0.5))
plot
```
```{r eloss-bar, fig.cap="A Bar Plot of the Top 10 Costliest Types of Weather Events"}
# Specify Aesthetic mappings
plot <- ggplot(econLosses, aes(x = reorder(EVTYPE, -value), y= value, fill= variable))
# Specify Bar Chart specs
plot = plot + geom_bar(stat = "identity", position = "dodge")
# Set y-axis label
plot = plot + ylab("Cost ($)")
# Prevent clutter around the x-axis
plot = plot + theme(axis.text.x = element_text(angle=45, hjust=1), axis.title.x = element_blank())
# Set chart title and center it
plot = plot + ggtitle("Top 10 Costliest Events") + theme(plot.title = element_text(hjust = 0.5))
plot
```
The data for Casualties is pretty clear on what the major contributor to weather-event related deaths is with Tornados taking a sizeable lead over the rest in both Injuries as well as Fatalities. The following four events are tied pretty evenly with each other while the events further down the list start appearing progressively insignificant next to one another.
On the other hand, the data on Economic Costs a steady progression a la [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law) with Floods still holding an indisputable lead over damages to Property and Crops with Typhoon and Tornados (no less) not very far behind. Interestingly, one may also notice that a few Events show extreme selectivity towards one type of Economic Resource. Upon closer inspection, however, it seems obvious why.
## Concluding Words
This work was done as a part of a project towards the completion of the [Reproducible Research](http://www.coursera.org/learn/reproducible-research) course in the [Data Science Specialization](http://www.coursera.org/specializations/jhu-data-science).
Keeping focus and familiarity in mind,I went with the `readthedocs`-esque layout available for `Rmd` courtesy of [juba](https://github.com/juba/rmdformats). I've made my full code available on [github](https://github.com/RexGalilae/Weather-Data-Rmd). If there are any suggestions, feel free to make them `r emo::ji("smile")`
-------------------------------------------------------------
*Created by Zaid Hassan aka alhazen on `r Sys.Date()`*