-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
243 lines (181 loc) · 9.06 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk[["set"]](
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
cache = FALSE
)
old_options <- options(scipen = 999) # turn off scientific notation
```
# `gips` <a href="https://przechoj.github.io/gips/"><img src="man/figures/logo.svg" align="right" height="139" /></a>
<!-- badges: start -->
[](https://lifecycle.r-lib.org/articles/stages.html#stable)
[](https://CRAN.R-project.org/package=gips)
[](https://github.com/PrzeChoj/gips/actions/workflows/R-CMD-check.yaml)
[](https://app.codecov.io/gh/PrzeChoj/gips?branch=main)

<!-- badges: end -->
gips - Gaussian model Invariant by Permutation Symmetry
`gips` is an R package that looks for permutation symmetries in the multivariate Gaussian sample. Such symmetries reduce the free parameters in the unknown covariance matrix. This is especially useful when the number of variables is substantially larger than the number of observations.
## `gips` will help you with two things:
1. Finding hidden symmetries between the variables. `gips` can be used as an exploratory tool for searching the space of permutation symmetries of the Gaussian vector. Useful in the Exploratory Data Analysis (EDA).
2. Covariance estimation. The Maximum Likelihood Estimator (MLE) for the covariance matrix is known to exist if and only if the number of variables is less or equal to the number of observations. Additional knowledge of symmetries significantly weakens this requirement. Moreover, the reduction of model dimension brings the advantage in terms of precision of covariance estimation.
## Installation
From [CRAN](https://CRAN.R-project.org/package=gips):
``` r
# Install the released version from CRAN:
install.packages("gips")
```
From [GitHub](https://github.com/PrzeChoj/gips):
``` r
# Install the development version from GitHub:
# install.packages("remotes")
remotes::install_github("PrzeChoj/gips")
```
## Examples
### Example 1 - EDA
Assume we have the data, and we want to understand its structure:
```{r example_mean_unknown0}
library(gips)
Z <- HSAUR2::aspirin
# Renumber the columns for better readability:
Z[, c(2, 3)] <- Z[, c(3, 2)]
```
```{r example_mean_unknown1, include=FALSE}
names(Z) <- NULL
plot_cosmetic_modifications <- function(gg_plot_object) {
my_col_names <- c(
"deaths after placebo", "deaths after Aspirin",
"treated with placebo", "treated with Aspirin"
)
suppressMessages( # message from ggplot2
out <- gg_plot_object +
ggplot2::scale_x_continuous(
labels = my_col_names,
breaks = 1:4
) +
ggplot2::scale_y_reverse(
labels = my_col_names,
breaks = 1:4
) +
ggplot2::theme(
title = ggplot2::element_text(face = "bold", size = 16),
axis.text.y = ggplot2::element_text(face = "bold", size = 10),
axis.text.x = ggplot2::element_text(face = "bold", size = 10, angle = 6)
)
)
out +
ggplot2::geom_text(
ggplot2::aes(
label = round(covariance, -3)
),
fontface = "bold", size = 6
) +
ggplot2::theme(legend.position = "none")
}
```
Assume the data `Z` is normally distributed.
```{r example_mean_unknown2}
dim(Z)
number_of_observations <- nrow(Z) # 7
p <- ncol(Z) # 4
S <- cov(Z)
round(S, -3)
g <- gips(S, number_of_observations)
plot_cosmetic_modifications(plot(g, type = "heatmap"))
```
Remember, we analyze the covariance matrix. We can see some nearly identical colors in the estimated covariance matrix. E.g., variances of columns 1 and 2 are very similar (`S[1,1]` $\approx$ `S[2,2]`), and variances of columns 3 and 4 are very similar (`S[3,3]` $\approx$ `S[4,4]`). What is more, Covariances are also similar (`S[1,3]` $\approx$ `S[1,4]` $\approx$ `S[2,3]` $\approx$ `S[2,4]`). Are those approximate equalities coincidental? Or do they reflect some underlying data properties? It is hard to decide purely by looking at the matrix.
`find_MAP()` will use the Bayesian model to quantify if the approximate equalities are coincidental. Let's see if it will find this relationship:
```{r example_mean_unknown3}
g_MAP <- find_MAP(g, optimizer = "brute_force")
g_MAP
```
The `find_MAP` found the relationship (1,2)(3,4). The variances [1,1] and [2,2] as well as [3,3] and [4,4] are so close to each other that it is reasonable to consider them equal. Similarly, the covariances [1,3] and [2,4]; just as covariances [2,3] and [1,4], also will be considered equal:
```{r example_mean_unknown4}
S_projected <- project_matrix(S, g_MAP)
round(S_projected)
plot_cosmetic_modifications(plot(g_MAP, type = "heatmap"))
```
This `S_projected` matrix can now be interpreted as a more stable covariance matrix estimator.
We can also interpret the output as suggesting that there is no change in covariance for treatment with Aspirin.
### Example 2 - modeling
First, construct data for the example:
```{r example_mean_known1}
# Prepare model, multivariate normal distribution
p <- 6
n <- 4
mu <- numeric(p)
sigma_matrix <- matrix(
data = c(
1.1, 0.8, 0.6, 0.4, 0.6, 0.8,
0.8, 1.1, 0.8, 0.6, 0.4, 0.6,
0.6, 0.8, 1.1, 0.8, 0.6, 0.4,
0.4, 0.6, 0.8, 1.1, 0.8, 0.6,
0.6, 0.4, 0.6, 0.8, 1.1, 0.8,
0.8, 0.6, 0.4, 0.6, 0.8, 1.1
),
nrow = p, byrow = TRUE
) # sigma_matrix is a matrix invariant under permutation (1,2,3,4,5,6)
# Generate example data from a model:
Z <- withr::with_seed(2022,
code = MASS::mvrnorm(n, mu = mu, Sigma = sigma_matrix)
)
# End of prepare model
```
```{r example_mean_known1_1, echo=FALSE}
plot(gips(sigma_matrix, 1), type = "heatmap") +
ggplot2::labs(title = "This is the real covariance matrix\nWe want to estimate it based on n = 4 observations", x = "", y = "")
```
Suppose we do not know the actual covariance matrix $\Sigma$ and we want to estimate it. We cannot use the standard MLE because it does not exists ($4 < 6$, $n < p$).
We will assume it was generated from the normal distribution with the mean $0$.
```{r example_mean_known2}
library(gips)
dim(Z)
number_of_observations <- nrow(Z) # 4
p <- ncol(Z) # 6
# Calculate the covariance matrix from the data:
S <- (t(Z) %*% Z) / number_of_observations
```
Make the gips object out of data:
```{r example_mean_known3}
g <- gips(S, number_of_observations, was_mean_estimated = FALSE)
```
We can see the standard estimator of the covariance matrix:
$$\hat{\Sigma} = (1/n) \cdot \Sigma_{i=1}^n \Big( Z^{(i)}\cdot\big({Z^{(i)}}^\top\big) \Big)$$
It is not MLE (again, because MLE does not exist for $n < p$):
```{r example_mean_known3_1}
plot(g, type = "heatmap") + ggplot2::ggtitle("Covariance estimated in standard way")
```
Let's find the Maximum A Posteriori Estimator for the permutation. Space is small ($6! = 720$), so it is reasonable to browse the whole of it:
```{r example_mean_known4}
g_map <- find_MAP(g, optimizer = "brute_force")
g_map
```
We see that the found permutation is over a hundred times more likely than making no additional assumption. That means the additional assumptions are justified.
```{r example_mean_known5}
summary(g_map)$n0
summary(g_map)$n0 <= number_of_observations # 1 <= 4
```
What is more, we see the number of observations ($4$) is bigger or equal to $n_0 = 1$, so we can estimate the covariance matrix with the Maximum Likelihood estimator:
```{r example_mean_known6}
S_projected <- project_matrix(S, g_map)
S_projected
# Plot the found matrix:
plot(g_map, type = "heatmap") + ggplot2::ggtitle("Covariance estimated with `gips`")
```
We see `gips` found the data's structure, and we could estimate the covariance matrix with huge accuracy only with a small amount of data and additional reasonable assumptions.
Note that the rank of the `S` matrix was 4, while the rank of the `S_projected` matrix was 6 (full rank).
# Further reading
For more examples and introduction, see `vignette("gips", package="gips")` or its [pkgdown page](https://przechoj.github.io/gips/articles/gips.html).
For an in-depth analysis of the package performance, capabilities, and comparison with similar packages, see the article "Learning permutation symmetries with gips in R" by `gips` developers Adam Chojecki, Paweł Morgen, and Bartosz Kołodziejek, [Journal of Statistical Software](<doi:10.18637/jss.v112.i07>).
# Acknowledgment
Project was funded by Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme.
The development was carried out with the support of the Laboratory of Bioinformatics and Computational Genomics and the High Performance Computing Center of the Faculty of Mathematics and Information Science Warsaw University of Technology.
```{r options_back, include = FALSE}
options(old_options) # back to the original options
```