CompactInference/140-Inference-isnt-everything.Rmd at master · debbieyuster/CompactInference · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
# Remember, inference isn't everything

```{r include = FALSE}
library(dplyr)
library(tidyr)
library(ggformula)
```


Study design, sampling bias, etc.

## Confounding

MAYBE ... Cornfield 1959's rebuttal of Fisher. Note that we are out of the era  of classical inference. Cornfield showed that the common cause of smoking and cancer would have to be unplausibly strong in order to rule out a direct link from smoking to cancer.


# Some history

Jerome Cornfield considering Fisher's proposal that a confounder, not smoking, is responsible for lung cancer and the *observed* correlation of smoking with cancer.  Reading: [Schield-1999](/articles/Cornfields-conditions.pdf)

The question  he addressed: How strong would the hypothetical confounder have to be to account entirely for the observed relationship between smoking and cancer?

In the following A = smoking, B = Fisher's confounder. The risk ratio, r, was already known  to be about 9.

> *If an agent, A, with no causal effect upon the risk of a disease, nevertheless, because of a positive correlation with some other causal agent, B,  shows an apparent risk, r, for those exposed to A, relative to those not so exposed, then the prevalence of B, among those exposed to A, relative to the prevalence among those not so exposed, must be greater than r.* --  Cornfield J et al. *Smoking and lung cancer: recent evidence
and a discussion of some questions*. JNCI 1959;22:173–203. Reprint available [here](http://www.statlit.org/CP/Cornfield/2009-Int-J-Epidemiol-Reprint-1959-Cornfield.pdf).


Traditionally, introductory statistics courses emphasize the idea that "correlation is not causation." This is only  true  in  the sense that knowing that there is a correlation  between X and Y does not tell you which way the direct causal link between X and Y, if any, goes.

I prefer to say that "correlation *is* causation."

a. X $\rightarrow$ Y
b. X $\leftarrow$ Y
c. X $\leftarrow$ C $\rightarrow$ Y
d. X $\rightarrow$ C $\leftarrow$ Y

There are other possibilities that are variations on (c) and (d) with a causal connection between X and Y, for instance

![](images/X-C-Y.png)      or   ![](images/three-RRR.png)

Idea of confounding interval ... Accept that there might be confounding along the lines of the diagrams above.


```{r echo = FALSE, warning = FALSE}
confound <- function(Rt, Re, R) {
  helper <- function(R) {
    theta = (pi/2) - acos(Rt)
    beta = atan(Re * sqrt(1-R^2) / sqrt(R^2))

    top  <- theta + beta
    bottom <- theta - beta

    lower <- cos(top) / cos(theta)
    upper <- cos(bottom) /  cos(theta)

    data.frame(R  = R, lower  = lower, upper = upper,
                         top = top*180/pi, bottom = bottom*180/pi,
                         theta  = theta*180/pi)
  }
  bind_rows(lapply(as.list(R), helper))
}
R <- sqrt(seq(0.025, 0.5, by = 0.01))
Weak <- confound(0.2, 0.2, R) %>% mutate(R_squared = R^2)
Moderate <- confound(0.4, 0.4,  R) %>% mutate(R_squared = R^2)
Strong <- confound(0.6, 0.6, R) %>% mutate(R_squared = R^2)

P <- gf_ribbon(upper + lower ~ R_squared, data = Strong) %>%
  gf_ribbon(upper + lower ~ R_squared, data  = Moderate,
            fill = "gold", alpha = .5)  %>%
  gf_ribbon(upper + lower ~ R_squared, data = Weak,
            fill = "blue", alpha = .5) %>%
  gf_lims(y  = c(-0.5, 1.5)) %>%
  gf_labs(title = "Plausible confounding bounds",
          y = "Multiplier on effect size",
          x = expression({R^2}[xy]))  %>%
  gf_text( 0.15 ~ 0.15, label =  expression({R^2}[xc] == 0.35)) %>%
  gf_text( 0.65 ~ 0.18, label =  expression({R^2}[xc] == 0.16)) %>%
  gf_text( 0.93 ~ 0.21, label =  expression({R^2}[xc] == 0.04))

  P
```
# A proposed policy

1. Did a perfect experiment? No need for plausible confounding  bounds.
2. Did a real experiment? Use *weak* plausible confounding bounds.
3. Know a lot about  the system you're observing and confident that the  significant confounders have been adjusted for? Use *weak* plausible confounding bounds.
4. Not sure what all the confounders might be, but controlled for the ones you know about? Use moderate plausible confounding bounds.
5. Got some data and you want to use it to figure out the relationship between X and Y? Use strong  plausible confounding  bounds.


## Techniques when "degrees of freedom" are unknown,  as  with machine  learning methods.