Commit 9a565ee

committed
update writing on version control vignette
1 parent 8a2b5c7 commit 9a565ee

1 file changed

+62
-17
lines changed
vignettes/version_control.Rmd

Lines changed: 62 additions & 17 deletions
@@ -19,9 +19,12 @@ options(width = 83)
1919

2020
## Introduction
2121

22-
This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient. All details on the actual file format are described in `vignette("plain_text", package = "git2rdata")`. Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function.
22+
This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient.
23+
`vignette("plain_text", package = "git2rdata")` describes all details on the actual file format.
24+
Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function.
2325

24-
We will not illustrate the efficiency of `write_vc()` and `read_vc()` since that is covered in `vignette("efficiency", package = "git2rdata")`.
26+
We will not illustrate the efficiency of `write_vc()` and `read_vc()`.
27+
`vignette("efficiency", package = "git2rdata")` covers those topics.
2528

2629
## Setup
2730

@@ -52,13 +55,22 @@ str(x)
5255

5356
## Assumptions
5457

55-
A critical assumption made by `git2rdata` is that all information is contained within the dataframe itself. Each row is an observation, each column is a variable and only the variables are named. This implies that two observations switching place does not alter the information content. Nor does switching two variables.
58+
A critical assumption made by `git2rdata` is that the dataframe itself contains all information.
59+
Each row is an observation and each column is a variable.
60+
The dataframe has `colnames` but no `rownames`.
61+
This implies that two observations switching place does not alter the information content.
62+
Nor does switching two variables.
5663

57-
Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files. Two observations switching place in a plain text file _is_ a change, although the information content^[_sensu_ `git2rdata`] doesn't change. Therefore `git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content.
64+
Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files.
65+
Two observations switching place in a plain text file _is_ a change, although the information content^[_sensu_ `git2rdata`] doesn't change.
66+
`git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content.
5867

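The assumption above can be made concrete with a minimal base R sketch. It is not part of the vignette or of `git2rdata`: `canonical()` is a hypothetical helper and the data are made up for illustration. It shows that a dataframe with its rows and columns switched differs in layout but carries the same information once both versions are put in a fixed order.

```r
# Hypothetical helper (not a git2rdata function): put a dataframe in a fixed
# layout by ordering the columns alphabetically, sorting the rows on all
# columns and dropping the rownames.
canonical <- function(df) {
  df <- df[, sort(colnames(df)), drop = FALSE]
  df <- df[do.call(order, unname(as.list(df))), , drop = FALSE]
  rownames(df) <- NULL
  df
}

# Illustrative data, not taken from the vignette.
original <- data.frame(A = c(1, 2, 3), B = c(10, 11, 12))
# Same observations and variables, but rows and columns switched place.
shuffled <- original[c(3, 1, 2), c("B", "A")]

identical(original, shuffled)                       # FALSE: the layout differs
identical(canonical(original), canonical(shuffled)) # TRUE: same information
```
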
5968
## Sorting Observations
6069

61-
Version control systems often track changes in plain text files based on row based differences. In layman's terms they only record which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated in the minimal example below.
70+
Version control systems often track changes in plain text files based on row-based differences.
71+
In layman's terms, they record which lines were removed from the file and which lines were inserted at what location.
72+
Changing an existing line implies removing the old version and inserting the new one.
73+
The minimal example below illustrates this.
6274

6375
Original version
6476

@@ -69,7 +81,9 @@ A,B
6981
3,12
7082
```
7183

72-
Altered version. The row containing `1, 10` was moved to the last line. The row containing `3,12` was changed to `3,0`
84+
Altered version.
85+
The row containing `1, 10` moved to the last line.
86+
The row containing `3,12` changed to `3,0`.
7387

7488
```
7589
A,B
@@ -108,17 +122,26 @@ A,B
108122
+3,0
109123
```
110124

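A rough base R sketch of why row order matters for the diff. The helpers `as_lines()` and `sort_rows()` are hypothetical, the data are illustrative, and the position-by-position comparison is only a crude stand-in for what a real diff tool does. It compares the two versions as written and again after sorting both on `A`.

```r
# Hypothetical helper: render a dataframe as comma separated text lines.
as_lines <- function(df) {
  c(
    paste(colnames(df), collapse = ","),
    do.call(paste, c(unname(as.list(df)), sep = ","))
  )
}

# Hypothetical helper: sort the rows of a dataframe on the variables in `by`.
sort_rows <- function(df, by) {
  df[do.call(order, unname(as.list(df[by]))), , drop = FALSE]
}

# Illustrative data: one row moved to the end, one value changed (12 -> 0).
original <- data.frame(A = c(1, 2, 3), B = c(10, 11, 12))
altered  <- data.frame(A = c(2, 3, 1), B = c(11, 0, 10))

# Number of lines that differ at the same position, as written.
changed_unsorted <- sum(as_lines(original) != as_lines(altered))
# The same count after sorting both versions on A.
changed_sorted <- sum(
  as_lines(sort_rows(original, "A")) != as_lines(sort_rows(altered, "A"))
)

changed_unsorted # 3
changed_sorted   # 1
```

With a fixed row order, only the line whose content really changed shows up as different.
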
111-
This is where the `sorting` argument comes into play. If this argument is not provided when a file is written for the first time, it will yield a warning about the lack of sorting. The observations will be written in their current order. New versions of the file will not apply any sorting either, leaving this burden to the user. This is illustrated by the changed hash for the data file in the example below, whereas the metadata is not changed (no change in hash).
125+
This is where the `sorting` argument comes into play.
126+
If you omit this argument when writing a file for the first time, `write_vc()` yields a warning about the lack of sorting.
127+
`write_vc()` then writes the observations in their current order.
128+
New versions of the file will not apply any sorting either, leaving this burden to the user.
129+
The changed hash for the data file illustrates this in the example below.
130+
The metadata hash remains the same.
112131

113132
```{r row_order}
114133
library(git2rdata)
115134
write_vc(x, file = "row_order", root = root)
116135
write_vc(x[sample(nrow(x)), ], file = "row_order", root = root)
117136
```
118137

119-
`sorting` should contain a vector of variable names. The observations are automatically sorted along these variables prior to writing. However, we now get an error because the set of sorting variables has changed. The set of sorting variables is stored in the metadata. Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.
138+
`sorting` should contain a vector of variable names.
139+
The observations are automatically sorted along these variables.
140+
Now we get an error because the set of sorting variables has changed.
141+
The metadata stores the set of sorting variables.
142+
Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.
120143

121-
From this moment on we will store the output of `write_vc()` in an object to minimize the output.
144+
From this moment on we will store the output of `write_vc()` in an object to reduce output.
122145

123146
```{r apply_sorting, error = TRUE}
124147
fn <- write_vc(x, "row_order", root, sorting = "y")
@@ -131,7 +154,10 @@ fn <- write_vc(x, "row_order", root, sorting = "y", strict = FALSE)
131154
fn <- write_vc(x, "row_order", root, sorting = c("y", "x"), strict = FALSE)
132155
```
133156

134-
Once the sorting is defined we may omit the `sorting` argument when writing new versions. The sorting as defined in the existing metadata will be used to sort the observations. A check for potential ties will be performed and results in a warning when ties are found.
157+
Once we have defined the sorting, we may omit the `sorting` argument when writing new versions.
158+
`write_vc()` uses the sorting as defined in the existing metadata.
159+
It checks for potential ties.
160+
Ties result in a warning.
135161

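Ties matter because rows with identical values for all sorting variables have no uniquely defined order. The sketch below is one way to detect them in base R. It is an assumption about how such a check could look, not the check `write_vc()` performs internally, and `has_ties()` is a hypothetical helper.

```r
# Hypothetical helper: TRUE when at least two rows share the same values
# for every sorting variable.
has_ties <- function(df, sorting) {
  anyDuplicated(df[sorting]) > 0
}

# Illustrative data.
df <- data.frame(
  x = c(1, 1, 2),
  y = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

has_ties(df, "x")         # TRUE: two rows have x == 1
has_ties(df, c("x", "y")) # FALSE: the combination of x and y is unique
```

Adding variables to `sorting` until the combination is unique removes the ties.
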
136162
```{r update_sorted}
137163
print_file <- function(file, root, n = -1) {
@@ -156,7 +182,10 @@ B,A
156182
13,3
157183
```
158184

159-
The resulting diff is maximal because every single row was updated. Yet none of the information was changed. Hence, it is crucial to maintain column order when storing dataframes as plain text files under version control. This is illustrated on a more realistic data set in the `vignette("efficiency", package = "git2rdata")` vignette.
185+
The resulting diff is maximal because every single row changed.
186+
Yet none of the information changed.
187+
Hence, maintaining column order is crucial when storing dataframes as plain text files under version control.
188+
`vignette("efficiency", package = "git2rdata")` illustrates this on a more realistic data set.
160189

161190
```diff
162191
-A,B
@@ -169,7 +198,11 @@ The resulting diff is maximal because every single row was updated. Yet none of
169198
+13,3
170199
```
171200

172-
`git2rdata` tackles this problem by storing the order of the columns in the metadata. The order is defined by the order in the dataframe when it is written for the first time. From that moment on, the same order will be reused. The example below writes the same data set twice. The second version contains exactly the same information but randomizes the order of the observations and the columns. The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
201+
When `write_vc()` writes a dataframe for the first time, it stores the original order of the columns in the metadata.
202+
From that moment on, `write_vc()` uses the order stored in the metadata.
203+
The example below writes the same data set twice.
204+
The second version contains identical information but randomizes the order of the observations and the columns.
205+
The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
173206

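The idea of reusing a stored column order can be sketched in base R as follows. `stored_order` is a hypothetical stand-in for the column order recorded in the metadata; how `write_vc()` reads and applies it internally is not shown here.

```r
# Hypothetical stand-in for the column order recorded on the first write.
stored_order <- c("A", "B")

# A new version of the data with the columns in a different order.
new_version <- data.frame(B = c(10, 11, 13), A = c(1, 2, 3))

# Reapply the stored order before writing, so the columns line up with the
# previous version of the file.
restored <- new_version[, stored_order, drop = FALSE]

colnames(new_version) # "B" "A"
colnames(restored)    # "A" "B"
```
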
174207
```{r variable_order}
175208
write_vc(x, "column_order", root, sorting = c("x", "abc"))
@@ -180,7 +213,8 @@ print_file("column_order.tsv", root, n = 5)
180213

181214
## Handling Factors Optimized
182215

183-
`vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how a factor can be stored more efficiently when storing their index in the data file and the indices and labels in the metadata. We take this even a bit further: what happens if new data arrives and an extra factor level is required?
216+
`vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how we can store a factor more efficiently by storing its indices in the data file and the indices and labels in the metadata.
217+
We take this even a bit further: what happens if new data arrives and we need an extra factor level?
184218

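One way to picture an extra level arriving is a label-to-index map in which new labels get fresh indices while existing labels keep theirs. The sketch below is an illustration under that assumption, not the `git2rdata` implementation. `update_index()` is a hypothetical helper and the starting indices are made up.

```r
# Hypothetical helper: extend a named integer vector (label -> index) with
# new labels, keeping the indices that are already in use.
update_index <- function(index, labels) {
  new_labels <- setdiff(labels, names(index))
  start <- if (length(index) > 0) max(index) else 0L
  c(index, setNames(start + seq_along(new_labels), new_labels))
}

# Illustrative existing map and a new version of the data with an extra level.
old_index <- c(blue = 1L, red = 2L)
new_index <- update_index(old_index, c("blue", "green", "red"))
new_index # blue = 1, red = 2, green = 3

# Encode the new data with the updated map: rows that already existed keep
# the same index, so their lines in the data file do not change.
encoded <- unname(new_index[c("red", "blue", "green")])
encoded # 2 1 3
```
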
185219
```{r factor}
186220
old <- data.frame(color = c("red", "blue"))
@@ -204,7 +238,9 @@ fn <- write_vc(updated, "factor", root, strict = FALSE)
204238
print_file("factor.yml", root)
205239
```
206240

207-
The next example removes the `"blue"` level and switches the order of the remaining levels. Notice that again the existing indices are retained. The order of the labels and indices reflects their new ordering.
241+
The next example removes the `"blue"` level and switches the order of the remaining levels.
242+
Notice that the metadata retains the existing indices.
243+
The order of the labels and indices reflects their new ordering.
208244

209245
```{r factor_deleted}
210246
deleted <- data.frame(
@@ -224,7 +260,9 @@ print_file("factor.yml", root)
224260

225261
## Relabelling a Factor
226262

227-
The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`. Notice that both the labels and the indices are updated. Hence creating a large diff, where just updating the labels would be sufficient.
263+
The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`.
264+
Notice that both the labels and the indices change.
265+
This creates a large diff, where updating only the labels would suffice.
228266

229267
```{r}
230268
write_vc(old, "write_vc", root, sorting = "color")
@@ -236,7 +274,12 @@ write_vc(relabeled, "write_vc", root, strict = FALSE)
236274
print_file("write_vc.yml", root)
237275
```
238276

239-
Therefore we created `relabel()`, which changes only the labels in the metadata. It takes three arguments: the name of the data file, the root and the change. `change` accepts two formats, a list or a dataframe. The name of the list must match with the variable name of a factor in the data. Each element of the list must be a named vector, the name being the existing label and the value the new label. The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
277+
We created `relabel()`, which changes the labels in the metadata while maintaining their indices.
278+
It takes three arguments: the name of the data file, the root and the change.
279+
`change` accepts two formats, a list or a dataframe.
280+
The name of each list element must match the name of a factor variable in the data.
281+
Each element of the list must be a named vector, the name being the existing label and the value the new label.
282+
The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
240283

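The two formats described for `change` can be built in base R as shown below. The new labels (`"crimson"`, `"navy"`) are hypothetical and only serve as an illustration; the sketch constructs the objects without calling `relabel()`.

```r
# List format: one element per factor, named after the factor variable.
# Each element is a named vector: name = existing label, value = new label.
change_list <- list(
  color = c(red = "crimson", blue = "navy")
)

# Dataframe format: one row per changed label, with the variables
# `factor`, `old` and `new`.
change_df <- data.frame(
  factor = "color",
  old = c("red", "blue"),
  new = c("crimson", "navy"),
  stringsAsFactors = FALSE
)

# Both objects describe the same relabelling.
identical(names(change_list$color), change_df$old)  # TRUE
identical(unname(change_list$color), change_df$new) # TRUE
```

Either object is meant to be passed as the `change` argument of `relabel()`.
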
241284
```{r}
242285
write_vc(old, "relabel", root, sorting = "color")
@@ -247,4 +290,6 @@ relabel("relabel", root,
247290
print_file("relabel.yml", root)
248291
```
249292

250-
A _caveat_: `relabel()` only makes sense when the data file uses optimized storage. The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff. Therefore `relabel()` will only handle the optimized storage.
293+
A _caveat_: `relabel()` does not make sense when the data file uses verbose storage.
294+
The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff.
295+
Hence, `relabel()` requires the optimized storage.