forked from dtkaplan/CompactInference
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path140-Inference-isnt-everything.Rmd
More file actions
92 lines (64 loc) · 4.26 KB
/
140-Inference-isnt-everything.Rmd
File metadata and controls
92 lines (64 loc) · 4.26 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
# Remember, inference isn't everything
```{r include = FALSE}
library(dplyr)
library(tidyr)
library(ggformula)
```
Study design, sampling bias, etc.
## Confounding
MAYBE ... Cornfield 1959's rebuttal of Fisher. Note that we are out of the era of classical inference. Cornfield showed that the common cause of smoking and cancer would have to be unplausibly strong in order to rule out a direct link from smoking to cancer.
# Some history
Jerome Cornfield considering Fisher's proposal that a confounder, not smoking, is responsible for lung cancer and the *observed* correlation of smoking with cancer. Reading: [Schield-1999](/articles/Cornfields-conditions.pdf)
The question he addressed: How strong would the hypothetical confounder have to be to account entirely for the observed relationship between smoking and cancer?
In the following A = smoking, B = Fisher's confounder. The risk ratio, r, was already known to be about 9.
> *If an agent, A, with no causal effect upon the risk of a disease, nevertheless, because of a positive correlation with some other causal agent, B, shows an apparent risk, r, for those exposed to A, relative to those not so exposed, then the prevalence of B, among those exposed to A, relative to the prevalence among those not so exposed, must be greater than r.* -- Cornfield J et al. *Smoking and lung cancer: recent evidence
and a discussion of some questions*. JNCI 1959;22:173–203. Reprint available [here](http://www.statlit.org/CP/Cornfield/2009-Int-J-Epidemiol-Reprint-1959-Cornfield.pdf).
Traditionally, introductory statistics courses emphasize the idea that "correlation is not causation." This is only true in the sense that knowing that there is a correlation between X and Y does not tell you which way the direct causal link between X and Y, if any, goes.
I prefer to say that "correlation *is* causation."
a. X $\rightarrow$ Y
b. X $\leftarrow$ Y
c. X $\leftarrow$ C $\rightarrow$ Y
d. X $\rightarrow$ C $\leftarrow$ Y
There are other possibilities that are variations on (c) and (d) with a causal connection between X and Y, for instance
 or 
Idea of confounding interval ... Accept that there might be confounding along the lines of the diagrams above.
```{r echo = FALSE, warning = FALSE}
confound <- function(Rt, Re, R) {
helper <- function(R) {
theta = (pi/2) - acos(Rt)
beta = atan(Re * sqrt(1-R^2) / sqrt(R^2))
top <- theta + beta
bottom <- theta - beta
lower <- cos(top) / cos(theta)
upper <- cos(bottom) / cos(theta)
data.frame(R = R, lower = lower, upper = upper,
top = top*180/pi, bottom = bottom*180/pi,
theta = theta*180/pi)
}
bind_rows(lapply(as.list(R), helper))
}
R <- sqrt(seq(0.025, 0.5, by = 0.01))
Weak <- confound(0.2, 0.2, R) %>% mutate(R_squared = R^2)
Moderate <- confound(0.4, 0.4, R) %>% mutate(R_squared = R^2)
Strong <- confound(0.6, 0.6, R) %>% mutate(R_squared = R^2)
P <- gf_ribbon(upper + lower ~ R_squared, data = Strong) %>%
gf_ribbon(upper + lower ~ R_squared, data = Moderate,
fill = "gold", alpha = .5) %>%
gf_ribbon(upper + lower ~ R_squared, data = Weak,
fill = "blue", alpha = .5) %>%
gf_lims(y = c(-0.5, 1.5)) %>%
gf_labs(title = "Plausible confounding bounds",
y = "Multiplier on effect size",
x = expression({R^2}[xy])) %>%
gf_text( 0.15 ~ 0.15, label = expression({R^2}[xc] == 0.35)) %>%
gf_text( 0.65 ~ 0.18, label = expression({R^2}[xc] == 0.16)) %>%
gf_text( 0.93 ~ 0.21, label = expression({R^2}[xc] == 0.04))
P
```
# A proposed policy
1. Did a perfect experiment? No need for plausible confounding bounds.
2. Did a real experiment? Use *weak* plausible confounding bounds.
3. Know a lot about the system you're observing and confident that the significant confounders have been adjusted for? Use *weak* plausible confounding bounds.
4. Not sure what all the confounders might be, but controlled for the ones you know about? Use moderate plausible confounding bounds.
5. Got some data and you want to use it to figure out the relationship between X and Y? Use strong plausible confounding bounds.
## Techniques when "degrees of freedom" are unknown, as with machine learning methods.