vignettes/multilevel.Rmd

---
title: "Multilevel Correlations"
output:
  rmarkdown::html_vignette:
    toc: true
    fig_width: 10.08
    fig_height: 6
tags: [r, correlation, types]
vignette: >
  %\VignetteIndexEntry{Multilevel Correlations}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
bibliography: bibliography.bib
---

---

This vignette can be cited as:

```{r cite}
citation("correlation")
```

---

```{r, include=FALSE}
library(knitr)
options(
  knitr.kable.NA = "",
  digits = 2,
  out.width = "100%",
  message = FALSE,
  warning = FALSE,
  dpi = 450
)

if (!requireNamespace("ggplot2", quietly = TRUE) ||
  !requireNamespace("BayesFactor", quietly = TRUE) ||
  !requireNamespace("lme4", quietly = TRUE)) {
  knitr::opts_chunk$set(eval = FALSE)
}
```

## Data

Imagine we have an experiment in which **10 individuals** completed a task with
**100 trials**. For each of the 1000 trials (10 * 100) in total,
we measured two things, **V1** and **V2**, and we are interested in
**investigating the link between these two variables**.

We will generate data using the
[`simulate_simpson()`](https://easystats.github.io/bayestestR/reference/simulate_simpson.html)
function from this package and look at its summary:

```{r}
library(correlation)

data <- simulate_simpson(n = 100, groups = 10)

summary(data)
```

Now let's visualize the two variables:

```{r}
library(ggplot2)

ggplot(data, aes(x = V1, y = V2)) +
  geom_point() +
  geom_smooth(colour = "black", method = "lm", se = FALSE) +
  theme_classic()
```

That seems pretty straightforward! It seems like there is a **negative
correlation** between V1 and V2. Let's test this.

## Simple correlation

```{r}
correlation(data)
```

Indeed, there is  a **strong, negative and significant correlation** between V1
and V2.

Great, can we go ahead and **publish these results in _PNAS_**?

## The Simpson's Paradox

Not so fast! Ever heard of the [**Simpson's Paradox**](https://en.wikipedia.org/wiki/Simpson%27s_paradox)?

Let's colour our datapoints by group (by individuals):

```{r}
library(ggplot2)

ggplot(data, aes(x = V1, y = V2)) +
  geom_point(aes(colour = Group)) +
  geom_smooth(aes(colour = Group), method = "lm", se = FALSE) +
  geom_smooth(colour = "black", method = "lm", se = FALSE) +
  theme_classic()
```

Mmh, interesting. It seems like, for each subject, the relationship is
different. The (global) negative trend seems to be an artifact of **differences between the groups** and could be spurious!

**Multilevel *(as in multi-group)* ** correlations allow us to account for
**differences between groups**. It is based on a partialization of the group,
entered as a random effect in a mixed linear regression.

You can compute them with the
[**correlations**](https://github.com/easystats/correlation) package by setting
the `multilevel` argument to `TRUE`.

```{r}
correlation(data, multilevel = TRUE)
```

For completeness, let's also see if its Bayesian cousin agrees with it:

```{r}
correlation(data, multilevel = TRUE, bayesian = TRUE)
```

**Dayum!** 
We were too hasty in our conclusions! Taking the group into account
seems to be super important.

_Note_: In this simple case where only two variables are of interest, it would be
of course best to directly proceed using a mixed regression model instead of
correlations. That being said, the latter can be useful for exploratory
analysis, when multiple variables are of interest, or in combination with a
network or structural approach.