wrangle.qmd

---
title: "Wrangle"
---

## Module Learning Objectives

By the end of this module, you will be able to:

- <u>Manipulate</u> rows and columns with `dplyr`'s `select` and `filter` functions
- <u>Create</u> new columns with `dplyr`'s `mutate` function and fill them conditionally with `case_when` (also from `dplyr`) 
- <u>Use</u> `tidyr`'s `separate_wider_delim` function to split a column into two

## What are Tidy Data?

What are some common things you like to do with your data? Maybe remove rows or columns, do calculations and add the results as new columns? These operations (and others) are called "data wrangling". The data we get to work with are rarely, if ever, in the format we need to do our analyses and data wrangling can help bridge that gap. `dplyr` and `tidyr` are two R packages from the Tidyverse that provide a fairly complete and extremely powerful set of functions for us to do virtually all needed wrangling quickly. Here we introduce some commonly used functions from these two packages.

We can use `glimpse` from the `dplyr` package to look at part of the data while also getting some relevant structural information (i.e., what type of data are in each column, etc.).

```{r libraries-wrangle, message = F}
# install.packages("tidyverse", "palmerpenguins")
library(tidyverse)
library(palmerpenguins)

dplyr::glimpse(penguins)
```

## Selecting Columns

To start off, how do we do a fundamental action like selecting the columns we want? `dplyr`'s `select` function provides us with a straightforward way to do just that. We only need to provide the column names!

<p align="center">
<img src="images/wrangle-select.png" alt="Graphic of a dataframe with columns labeled 'A', 'B', and 'C' being pare down to just the 'A' column" width="50%"/>
</p>

Note that even if you `select` just one column, a dataframe will be returned. Whereas if you use the `$` operator you get a vector (e.g., `data$column` returns a vector, not a dataframe).

### `select` Example: Including & Excluding

::: callout-note 
## Example

To select only the `species`, `island`, and `body_mass_g` columns, we can use the following code:

```{r select-include}
# Provide the name of the data and then the columns that you want!
penguins_selected <- dplyr::select(.data = penguins, species, island, body_mass_g)

# Look at the product
dplyr::glimpse(penguins_selected)
```

If we want to remove specific columns, we can use the `-` operator.

```{r select-exclude}
# Put a "-" in front of each column you would like to remove from your dataframe
penguins_selected <- dplyr::select(.data = penguins, -flipper_length_mm, -sex)

# Look at the product
dplyr::glimpse(penguins_selected)
```
:::

### `select` Example: Selecting with Helper Functions

::: callout-note
## Example

If we want to select the columns that contain length measurements, we can manually type `bill_length_mm` and `flipper_length_mm`, but there's actually an easier way using the `contains` function, also from the `dplyr` package. Enter a string that matches what you're looking for among the column names.

```{r select-contains}
# Enter a matching string inside of a `select` call
penguins_selected <- dplyr::select(.data = penguins, dplyr::contains("length"))

# Look at the product
dplyr::glimpse(penguins_selected)
```
:::

### `select` Example: Selecting a Range of Columns

:::callout-note
## Example

Now what if we wanted all the columns from the first column `species` to the sixth column `body_mass_g`? We can use a colon, `:`, between the first (leftmost) and last (rightmost) columns in the range that we want to include.

```{r select-range}
# Use a colon to indicate a range of columns you want to select
penguins_selected <- dplyr::select(.data = penguins, species:body_mass_g)

# Look at the product
dplyr::glimpse(penguins_selected)
```

Conveniently selecting a range of columns can be especially useful when you have a big dataframe and don't want to exhaustively list every column by name.
:::

## Subsetting Rows

Instead of selecting certain columns, how can we get a subset of rows that meet certain conditions? For example, in the diagram below, how can we filter for rows that contain a diamond shape? We can use `dplyr`'s handy `filter` function along with logical and boolean operators!

<p align="center">
<img src="images/wrangle-filter.png" alt="Graphic of a table with only an 'A' column and rows with different shapes getting subsetted to only the row with a diamond" width="25%"/>
</p>

For reference, here are the operators we can use to specify our conditions with `filter`.

- `==` -- two items are exactly equal to one another
- `!=` -- two items are not equal to one another 
- `<`/`>` -- one item is less/greater than the other 
- `<=`/`>=` -- one item is less/greater than or equal to the other 
- `|` -- one statement _or_ another is `TRUE`
- `&` -- _both_ statements are `TRUE`
- `%in%` -- a value is in a set of other values

You may have noticed that `filter` accepts the same operators that base R's `subset` function does. This is no accident and `filter` is one of the more accessible Tidyverse functions because the syntax it shares with its base R equivalent.

To get familiar with these operators, let's see some examples!

### `filter` Example: Exactly Equal

:::callout-note
## Example

To make a subset of our data that only contains information on Chinstrap penguins, we would use the `==` operator for "exactly equal to"

```{r filter-equal}
# Get all the rows where the species is "Chinstrap" 
penguins_filtered <- dplyr::filter(.data = penguins, species == "Chinstrap")

head(penguins_filtered)
```

Note that we need to write the value we're looking for as a character string bookended by quotation marks.
:::

### `filter` Example: Either / Or

:::callout-note
## Example

What if we wanted to get all the rows where the penguin species is "Chinstrap" **or** "Gentoo"? In other words, we want all the rows where *either* condition is true. There are two options to do this. The first option is to use the "or" operator (`|`) between each of the conditions.

```{r filter-either}
# Get all the rows where the species is "Chinstrap" or "Gentoo"
penguins_filtered <- dplyr::filter(.data = penguins,
                             species == "Chinstrap" | species == "Gentoo")

unique(penguins_filtered$species)
```

This method works fine for a few options but begins to get cumbersome when you have many possible conditions that you'd like to retain. In these cases you can use the `%in%` operator followed by a vector of values that you want to include in your filter.

```{r filter-in}
# Get all the rows where the species is "Chinstrap" or "Gentoo"
penguins_filtered <- dplyr::filter(.data = penguins,
                             species %in% c("Chinstrap", "Gentoo"))

unique(penguins_filtered$species)
```
:::

### `filter` Example: Multiple Conditions

:::callout-note
## Example

We can also keep rows where both conditions are met by using the `&` operator to specify multiple conditions that must *all* be true. To keep only the rows where the species is "Adelie" **and** the island is "Dream", we can use the following code:

```{r filter-and}
# Get all the rows where the species is "Adelie" and the island is "Dream"
penguins_filtered <- dplyr::filter(.data = penguins,
                             species == "Adelie" & island == "Dream")

dplyr::glimpse(penguins_filtered)
```
:::

### `filter` Example: Greater Than / Less Than

:::callout-note
## Example

When subsetting by numeric columns, we can use greater than (`>`) and less than (`<`) to capture the range of possible values that meet that criteria. If you want to include an "or equal to" clause, just add an equal sign to the right of the greater/less than sign (e.g., `>=` or `<=`). 

For instance, we can subset the data for only penguins whose bills are longer than 50 millimeters.

```{r filter-numeric}
# Filter based on bill length
penguins_filtered <- dplyr::filter(.data = penguins, bill_length_mm > 50)

sort(penguins_filtered$bill_length_mm)
```

Note that when filtering for numeric columns we do not need the quotation marks around the number(s) we use to filter.
:::

### `filter` Example: Exclusion Criteria

:::callout-note
## Example

Sometimes it's faster to subset the rows that **do not** meet a condition, rather than listing everything that we do want to keep. This is where the `!=` operator (or "not equal to") becomes useful. More generally, the exclamation mark indicates negation in the operator.

```{r filter-exclude}
# Get all the rows where the species is NOT "Chinstrap"
penguins_filtered <- dplyr::filter(.data = penguins, species != "Chinstrap")

dplyr::glimpse(penguins_filtered)
```
:::

### Challenge: `filter`

:::callout-important
## Your Turn!

Using `filter`, how would you get all of the rows that **do not** have any `NA` values in the `sex` column?
:::

## Making and Modifying Columns

Aside from selecting columns and subsetting rows, we may want to create new columns in our data. For instance, in the diagram below, we have a dataframe that only contains column A, and then we add new columns B and C. We can use `dplyr`'s `mutate` function to add a new column, while keeping the existing columns.

<p align="center">
<img src="images/wrangle-mutate.png" alt="Graphic of a table with an 'A' column gaining a 'B' and 'C' column" width="50%"/>
</p>

The general syntax to add a new column to your dataframe is as follows:

```{r mutate-syntax, eval = F}
your_data_v2 <- dplyr::mutate(.data = your_data, new_column_name = what_it_contains)
```

### `mutate` Example: Making New Columns

:::callout-note
## Example

If we wanted to add a new column that has the penguin's body mass in kilograms, we can do some arithmetic on the `body_mass_g` column and store the result in a new column.

```{r mutate-create}
# Create a new column with the penguins' body mass in kilograms
penguins_mutated <- dplyr::mutate(.data = penguins, body_mass_kg = body_mass_g / 1000)

dplyr::glimpse(penguins_mutated)
```
:::

### `mutate` Example: Overwriting Existing Columns

:::callout-note
## Example

Additionally, `mutate` can be used to overwrite an existing column. If we give the new column the same name as an existing column, the existing column will be **replaced**. As you can see, `island` is currently a factor. To change its class to a character, we would need to overwrite the column.

```{r mutate-overwrite}
# Check current format of the `island` column
class(penguins$island)

# Modify the existing island column
penguins_mutated <- dplyr::mutate(.data = penguins, island = as.character(island))

# the `island` column is now a character!
class(penguins_mutated$island)
```

Now `island` is a character column!
:::

### Conditional Operations

Sometimes in data wrangling we'll want to generate a new column where the contents of the column are dependent upon an existing column but we have many separate "if X then Y" type statements. Such statements are called "conditional" statements in programming. You may already be familiar with base R's `ifelse` function for handling cases where you have an either/or condition.

In the Tidyverse--specifically `dplyr`--we have `case_when` for handling multiple conditions in an efficient and relatively straightforward way! Why are we talking about `case_when` here? Because you can use `case_when` inside of a `mutate` to create a new column based on the conditions that you specify.

Here is what the general syntax for this operation looks like:

```{r case_when-syntax, eval = F}
your_data_v2 <- dplyr::mutate(.data = your_data,
                              new_column_name = dplyr::case_when(
                                  condition1 ~ value_for_condition1,
                                  condition2 ~ value_for_condition2,
                                  condition3 ~ value_for_condition3,
                                  ...
                                  TRUE ~ value_if_no_conditions_are_met))
```

Let's look at an example to make this somewhat more tangible.

### `mutate` + `case_when` Example: Creating a New Column Conditionally

:::callout-note
## Example

Suppose we want to add a new column called `flipper_rank` that contains the following:

-   "short" if `flipper_length_mm` is \< 190 mm
-   "long" if `flipper_length_mm` is \>= 190 mm

```{r case-when}
# Enter your conditions on the left side and the values on the right side of the tilde
penguins_mutated <- dplyr::mutate(.data = penguins,
                                  flipper_rank = dplyr::case_when(
                                    flipper_length_mm < 190 ~ "short",
                                    flipper_length_mm >= 190 ~ "long")
)

dplyr::glimpse(penguins_mutated)
```
:::

### Challenge: `mutate` + `case_when`

:::callout-important
## Your Turn!

Using `mutate` and `case_when`, create a new column called `size_bin` that contains the following:

-   "large" if body mass is greater than 4500 grams
-   "medium" if body mass is greater than 3000 grams, and less than or equal to 4500 grams
-   "small" if body mass is less than or equal to 3000 grams
:::

### Splitting a Column into Multiple Columns

Another relatively common task in data wrangling involves splitting the contents of one column into several columns. To demonstrate, let's first make a new column that contains the full scientific names for these penguins using `mutate` and `case_when`.

```{r penguins-spp}
# Remember that conditions are on the left side and the values are on the right side of the tilde
penguins_v1 <- dplyr::mutate(.data = penguins, 
                             scientific_name = dplyr::case_when(
                               species == "Adelie" ~ "Pygoscelis_adeliae",
                               species == "Chinstrap" ~ "Pygoscelis_antarcticus",
                               species == "Gentoo" ~ "Pygoscelis_papua"))

dplyr::glimpse(penguins_v1)
```

If we want to split the scientific name into genus and specific epithet, we can use the `separate_wider_delim` function from the `tidyr` package.

### `separate_wider_delim` Example: Splitting a Column Apart

:::callout-note
## Example

Using our new scientific name column, suppose we want to split it so that `scientific_name` becomes two columns: `genus` and `epithet`. Using `tidyr`'s `separate_wider_delim` function we can do this in a single step!

```{r separate}
# Indicate the column you want to split, the separator, and the new column names!
penguins_separated <- tidyr::separate_wider_delim(data = penguins_v1,
                                                  cols = scientific_name,
                                                  delim = "_",
                                                  names = c("genus", "epithet"))

dplyr::glimpse(penguins_separated)
```
:::