-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwrangle.qmd
357 lines (242 loc) · 14.5 KB
/
wrangle.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
---
title: "Wrangle"
---
## Module Learning Objectives
By the end of this module, you will be able to:
- <u>Manipulate</u> rows and columns with `dplyr`'s `select` and `filter` functions
- <u>Create</u> new columns with `dplyr`'s `mutate` function and fill them conditionally with `case_when` (also from `dplyr`)
- <u>Use</u> `tidyr`'s `separate_wider_delim` function to split a column into two
## What are Tidy Data?
What are some common things you like to do with your data? Maybe remove rows or columns, do calculations and add the results as new columns? These operations (and others) are called "data wrangling". The data we get to work with are rarely, if ever, in the format we need to do our analyses and data wrangling can help bridge that gap. `dplyr` and `tidyr` are two R packages from the Tidyverse that provide a fairly complete and extremely powerful set of functions for us to do virtually all needed wrangling quickly. Here we introduce some commonly used functions from these two packages.
We can use `glimpse` from the `dplyr` package to look at part of the data while also getting some relevant structural information (i.e., what type of data are in each column, etc.).
```{r libraries-wrangle, message = F}
# install.packages("tidyverse", "palmerpenguins")
library(tidyverse)
library(palmerpenguins)
dplyr::glimpse(penguins)
```
## Selecting Columns
To start off, how do we do a fundamental action like selecting the columns we want? `dplyr`'s `select` function provides us with a straightforward way to do just that. We only need to provide the column names!
<p align="center">
<img src="images/wrangle-select.png" alt="Graphic of a dataframe with columns labeled 'A', 'B', and 'C' being pare down to just the 'A' column" width="50%"/>
</p>
Note that even if you `select` just one column, a dataframe will be returned. Whereas if you use the `$` operator you get a vector (e.g., `data$column` returns a vector, not a dataframe).
### `select` Example: Including & Excluding
::: callout-note
## Example
To select only the `species`, `island`, and `body_mass_g` columns, we can use the following code:
```{r select-include}
# Provide the name of the data and then the columns that you want!
penguins_selected <- dplyr::select(.data = penguins, species, island, body_mass_g)
# Look at the product
dplyr::glimpse(penguins_selected)
```
If we want to remove specific columns, we can use the `-` operator.
```{r select-exclude}
# Put a "-" in front of each column you would like to remove from your dataframe
penguins_selected <- dplyr::select(.data = penguins, -flipper_length_mm, -sex)
# Look at the product
dplyr::glimpse(penguins_selected)
```
:::
### `select` Example: Selecting with Helper Functions
::: callout-note
## Example
If we want to select the columns that contain length measurements, we can manually type `bill_length_mm` and `flipper_length_mm`, but there's actually an easier way using the `contains` function, also from the `dplyr` package. Enter a string that matches what you're looking for among the column names.
```{r select-contains}
# Enter a matching string inside of a `select` call
penguins_selected <- dplyr::select(.data = penguins, dplyr::contains("length"))
# Look at the product
dplyr::glimpse(penguins_selected)
```
:::
### `select` Example: Selecting a Range of Columns
:::callout-note
## Example
Now what if we wanted all the columns from the first column `species` to the sixth column `body_mass_g`? We can use a colon, `:`, between the first (leftmost) and last (rightmost) columns in the range that we want to include.
```{r select-range}
# Use a colon to indicate a range of columns you want to select
penguins_selected <- dplyr::select(.data = penguins, species:body_mass_g)
# Look at the product
dplyr::glimpse(penguins_selected)
```
Conveniently selecting a range of columns can be especially useful when you have a big dataframe and don't want to exhaustively list every column by name.
:::
## Subsetting Rows
Instead of selecting certain columns, how can we get a subset of rows that meet certain conditions? For example, in the diagram below, how can we filter for rows that contain a diamond shape? We can use `dplyr`'s handy `filter` function along with logical and boolean operators!
<p align="center">
<img src="images/wrangle-filter.png" alt="Graphic of a table with only an 'A' column and rows with different shapes getting subsetted to only the row with a diamond" width="25%"/>
</p>
For reference, here are the operators we can use to specify our conditions with `filter`.
- `==` -- two items are exactly equal to one another
- `!=` -- two items are not equal to one another
- `<`/`>` -- one item is less/greater than the other
- `<=`/`>=` -- one item is less/greater than or equal to the other
- `|` -- one statement _or_ another is `TRUE`
- `&` -- _both_ statements are `TRUE`
- `%in%` -- a value is in a set of other values
You may have noticed that `filter` accepts the same operators that base R's `subset` function does. This is no accident and `filter` is one of the more accessible Tidyverse functions because the syntax it shares with its base R equivalent.
To get familiar with these operators, let's see some examples!
### `filter` Example: Exactly Equal
:::callout-note
## Example
To make a subset of our data that only contains information on Chinstrap penguins, we would use the `==` operator for "exactly equal to"
```{r filter-equal}
# Get all the rows where the species is "Chinstrap"
penguins_filtered <- dplyr::filter(.data = penguins, species == "Chinstrap")
head(penguins_filtered)
```
Note that we need to write the value we're looking for as a character string bookended by quotation marks.
:::
### `filter` Example: Either / Or
:::callout-note
## Example
What if we wanted to get all the rows where the penguin species is "Chinstrap" **or** "Gentoo"? In other words, we want all the rows where *either* condition is true. There are two options to do this. The first option is to use the "or" operator (`|`) between each of the conditions.
```{r filter-either}
# Get all the rows where the species is "Chinstrap" or "Gentoo"
penguins_filtered <- dplyr::filter(.data = penguins,
species == "Chinstrap" | species == "Gentoo")
unique(penguins_filtered$species)
```
This method works fine for a few options but begins to get cumbersome when you have many possible conditions that you'd like to retain. In these cases you can use the `%in%` operator followed by a vector of values that you want to include in your filter.
```{r filter-in}
# Get all the rows where the species is "Chinstrap" or "Gentoo"
penguins_filtered <- dplyr::filter(.data = penguins,
species %in% c("Chinstrap", "Gentoo"))
unique(penguins_filtered$species)
```
:::
### `filter` Example: Multiple Conditions
:::callout-note
## Example
We can also keep rows where both conditions are met by using the `&` operator to specify multiple conditions that must *all* be true. To keep only the rows where the species is "Adelie" **and** the island is "Dream", we can use the following code:
```{r filter-and}
# Get all the rows where the species is "Adelie" and the island is "Dream"
penguins_filtered <- dplyr::filter(.data = penguins,
species == "Adelie" & island == "Dream")
dplyr::glimpse(penguins_filtered)
```
:::
### `filter` Example: Greater Than / Less Than
:::callout-note
## Example
When subsetting by numeric columns, we can use greater than (`>`) and less than (`<`) to capture the range of possible values that meet that criteria. If you want to include an "or equal to" clause, just add an equal sign to the right of the greater/less than sign (e.g., `>=` or `<=`).
For instance, we can subset the data for only penguins whose bills are longer than 50 millimeters.
```{r filter-numeric}
# Filter based on bill length
penguins_filtered <- dplyr::filter(.data = penguins, bill_length_mm > 50)
sort(penguins_filtered$bill_length_mm)
```
Note that when filtering for numeric columns we do not need the quotation marks around the number(s) we use to filter.
:::
### `filter` Example: Exclusion Criteria
:::callout-note
## Example
Sometimes it's faster to subset the rows that **do not** meet a condition, rather than listing everything that we do want to keep. This is where the `!=` operator (or "not equal to") becomes useful. More generally, the exclamation mark indicates negation in the operator.
```{r filter-exclude}
# Get all the rows where the species is NOT "Chinstrap"
penguins_filtered <- dplyr::filter(.data = penguins, species != "Chinstrap")
dplyr::glimpse(penguins_filtered)
```
:::
### Challenge: `filter`
:::callout-important
## Your Turn!
Using `filter`, how would you get all of the rows that **do not** have any `NA` values in the `sex` column?
:::
## Making and Modifying Columns
Aside from selecting columns and subsetting rows, we may want to create new columns in our data. For instance, in the diagram below, we have a dataframe that only contains column A, and then we add new columns B and C. We can use `dplyr`'s `mutate` function to add a new column, while keeping the existing columns.
<p align="center">
<img src="images/wrangle-mutate.png" alt="Graphic of a table with an 'A' column gaining a 'B' and 'C' column" width="50%"/>
</p>
The general syntax to add a new column to your dataframe is as follows:
```{r mutate-syntax, eval = F}
your_data_v2 <- dplyr::mutate(.data = your_data, new_column_name = what_it_contains)
```
### `mutate` Example: Making New Columns
:::callout-note
## Example
If we wanted to add a new column that has the penguin's body mass in kilograms, we can do some arithmetic on the `body_mass_g` column and store the result in a new column.
```{r mutate-create}
# Create a new column with the penguins' body mass in kilograms
penguins_mutated <- dplyr::mutate(.data = penguins, body_mass_kg = body_mass_g / 1000)
dplyr::glimpse(penguins_mutated)
```
:::
### `mutate` Example: Overwriting Existing Columns
:::callout-note
## Example
Additionally, `mutate` can be used to overwrite an existing column. If we give the new column the same name as an existing column, the existing column will be **replaced**. As you can see, `island` is currently a factor. To change its class to a character, we would need to overwrite the column.
```{r mutate-overwrite}
# Check current format of the `island` column
class(penguins$island)
# Modify the existing island column
penguins_mutated <- dplyr::mutate(.data = penguins, island = as.character(island))
# the `island` column is now a character!
class(penguins_mutated$island)
```
Now `island` is a character column!
:::
### Conditional Operations
Sometimes in data wrangling we'll want to generate a new column where the contents of the column are dependent upon an existing column but we have many separate "if X then Y" type statements. Such statements are called "conditional" statements in programming. You may already be familiar with base R's `ifelse` function for handling cases where you have an either/or condition.
In the Tidyverse--specifically `dplyr`--we have `case_when` for handling multiple conditions in an efficient and relatively straightforward way! Why are we talking about `case_when` here? Because you can use `case_when` inside of a `mutate` to create a new column based on the conditions that you specify.
Here is what the general syntax for this operation looks like:
```{r case_when-syntax, eval = F}
your_data_v2 <- dplyr::mutate(.data = your_data,
new_column_name = dplyr::case_when(
condition1 ~ value_for_condition1,
condition2 ~ value_for_condition2,
condition3 ~ value_for_condition3,
...
TRUE ~ value_if_no_conditions_are_met))
```
Let's look at an example to make this somewhat more tangible.
### `mutate` + `case_when` Example: Creating a New Column Conditionally
:::callout-note
## Example
Suppose we want to add a new column called `flipper_rank` that contains the following:
- "short" if `flipper_length_mm` is \< 190 mm
- "long" if `flipper_length_mm` is \>= 190 mm
```{r case-when}
# Enter your conditions on the left side and the values on the right side of the tilde
penguins_mutated <- dplyr::mutate(.data = penguins,
flipper_rank = dplyr::case_when(
flipper_length_mm < 190 ~ "short",
flipper_length_mm >= 190 ~ "long")
)
dplyr::glimpse(penguins_mutated)
```
:::
### Challenge: `mutate` + `case_when`
:::callout-important
## Your Turn!
Using `mutate` and `case_when`, create a new column called `size_bin` that contains the following:
- "large" if body mass is greater than 4500 grams
- "medium" if body mass is greater than 3000 grams, and less than or equal to 4500 grams
- "small" if body mass is less than or equal to 3000 grams
:::
### Splitting a Column into Multiple Columns
Another relatively common task in data wrangling involves splitting the contents of one column into several columns. To demonstrate, let's first make a new column that contains the full scientific names for these penguins using `mutate` and `case_when`.
```{r penguins-spp}
# Remember that conditions are on the left side and the values are on the right side of the tilde
penguins_v1 <- dplyr::mutate(.data = penguins,
scientific_name = dplyr::case_when(
species == "Adelie" ~ "Pygoscelis_adeliae",
species == "Chinstrap" ~ "Pygoscelis_antarcticus",
species == "Gentoo" ~ "Pygoscelis_papua"))
dplyr::glimpse(penguins_v1)
```
If we want to split the scientific name into genus and specific epithet, we can use the `separate_wider_delim` function from the `tidyr` package.
### `separate_wider_delim` Example: Splitting a Column Apart
:::callout-note
## Example
Using our new scientific name column, suppose we want to split it so that `scientific_name` becomes two columns: `genus` and `epithet`. Using `tidyr`'s `separate_wider_delim` function we can do this in a single step!
```{r separate}
# Indicate the column you want to split, the separator, and the new column names!
penguins_separated <- tidyr::separate_wider_delim(data = penguins_v1,
cols = scientific_name,
delim = "_",
names = c("genus", "epithet"))
dplyr::glimpse(penguins_separated)
```
:::