vignettes/version_control.Rmd
62 additions & 17 deletions
Original file line number
Diff line number
Diff line change
@@ -19,9 +19,12 @@ options(width = 83)
19
19
20
20
## Introduction
21
21
22
-
This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient. All details on the actual file format are described in `vignette("plain_text", package = "git2rdata")`. Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function.
22
+
This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient.
23
+
`vignette("plain_text", package = "git2rdata")` describes all details on the actual file format.
24
+
Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function.
23
25
24
-
We will not illustrate the efficiency of `write_vc()` and `read_vc()` since that is covered in `vignette("efficiency", package = "git2rdata")`.
26
+
We will not illustrate the efficiency of `write_vc()` and `read_vc()`.
27
+
`vignette("efficiency", package = "git2rdata")` covers those topics.
25
28
26
29
## Setup
27
30
@@ -52,13 +55,22 @@ str(x)
52
55
53
56
## Assumptions
54
57
55
-
A critical assumption made by `git2rdata` is that all information is contained within the dataframe itself. Each row is an observation, each column is a variable and only the variables are named. This implies that two observations switching place does not alter the information content. Nor does switching two variables.
58
+
A critical assumption made by `git2rdata` is that the dataframe itself contains all information.
59
+
Each row is an observation and each column is a variable.
60
+
The dataframe has `colnames` but no `rownames`.
61
+
This implies that switching two observations does not alter the information content.
62
+
Nor does switching two variables.
56
63
57
-
Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files. Two observations switching place in a plain text file _is_ a change, although the information content^[_sensu_`git2rdata`] doesn't change. Therefore `git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content.
64
+
Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files.
65
+
Two observations switching places in a plain text file _is_ a change, although the information content^[_sensu_ `git2rdata`] doesn't change.
66
+
`git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content.
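A minimal base R sketch of this point, not taken from `git2rdata`: the objects `x` and `shuffled` and the helper `as_lines()` are made up for illustration. Both dataframes hold the same observations, yet their plain text representations differ line by line.

```r
# Two dataframes with the same observations in a different row order.
x <- data.frame(A = c(1, 2, 3), B = c(10, 11, 12))
shuffled <- x[c(2, 3, 1), ]

# Hypothetical helper: the lines a plain text writer would produce.
as_lines <- function(df) {
  capture.output(write.csv(df, row.names = FALSE, quote = FALSE))
}

# The text differs, so a version control system records a change ...
text_is_identical <- identical(as_lines(x), as_lines(shuffled))
# ... although both files contain exactly the same set of lines.
same_content <- setequal(as_lines(x), as_lines(shuffled))

print(text_is_identical)
print(same_content)
```

Here `text_is_identical` is `FALSE` while `same_content` is `TRUE`: the diff is non-empty even though no observation changed.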
58
67
59
68
## Sorting Observations
60
69
61
-
Version control systems often track changes in plain text files based on row based differences. In layman's terms they only record which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated in the minimal example below.
70
+
Version control systems often track changes in plain text files as row-based differences.
71
+
In layman's terms, they record which lines are removed from the file and which lines are inserted at what location.
72
+
Changing an existing line implies removing the old version and inserting the new one.
73
+
The minimal example below illustrates this.
62
74
63
75
Original version
64
76
@@ -69,7 +81,9 @@ A,B
69
81
3,12
70
82
```
71
83
72
-
Altered version. The row containing `1, 10` was moved to the last line. The row containing `3,12` was changed to `3,0`
84
+
Altered version.
85
+
The row containing `1,10` moved to the last line.
86
+
The row containing `3,12` changed to `3,0`.
73
87
74
88
```
75
89
A,B
@@ -108,17 +122,26 @@ A,B
108
122
+3,0
109
123
```
110
124
111
-
This is where the `sorting` argument comes into play. If this argument is not provided when a file is written for the first time, it will yield a warning about the lack of sorting. The observations will be written in their current order. New versions of the file will not apply any sorting either, leaving this burden to the user. This is illustrated by the changed hash for the data file in the example below, whereas the metadata is not changed (no change in hash).
125
+
This is where the `sorting` argument comes into play.
126
+
If you don't provide this argument when writing a file for the first time, `write_vc()` yields a warning about the lack of sorting.
127
+
`write_vc()` then writes the observations in their current order.
128
+
New versions of the file will not apply any sorting either, leaving this burden to the user.
129
+
The changed hash for the data file illustrates this in the example below.
`sorting` should contain a vector of variable names. The observations are automatically sorted along these variables prior to writing. However, we now get an error because the set of sorting variables has changed. The set of sorting variables is stored in the metadata. Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.
138
+
`sorting` should contain a vector of variable names.
139
+
`write_vc()` sorts the observations along these variables before writing.
140
+
Now we get an error because the set of sorting variables has changed.
141
+
The metadata stores the set of sorting variables.
142
+
Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.
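The effect of a fixed sorting can be sketched in base R. This is an illustration under assumptions, not the internals of `write_vc()`: the helpers `as_lines()` and `sort_along()` and the example data are hypothetical.

```r
# Hypothetical helper: the lines a plain text writer would produce.
as_lines <- function(df) {
  capture.output(write.csv(df, row.names = FALSE, quote = FALSE))
}

# Hypothetical helper: sort a dataframe along a vector of variable names.
sort_along <- function(df, sorting) {
  # unname() prevents column names from matching arguments of order()
  df[do.call(order, unname(as.list(df[sorting]))), , drop = FALSE]
}

x <- data.frame(A = c(1, 2, 3), B = c(12, 10, 11))
shuffled <- x[c(2, 3, 1), ]

# Same sorting variables: the row order on input no longer matters.
stable <- identical(
  as_lines(sort_along(x, "A")),
  as_lines(sort_along(shuffled, "A"))
)

# Different sorting variables: same data, different line order.
changed <- !identical(
  as_lines(sort_along(x, "A")),
  as_lines(sort_along(x, "B"))
)

print(stable)
print(changed)
```

Both values are `TRUE`. The second comparison shows why changing the sorting variables between versions can rewrite many lines at once.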
120
143
121
-
From this moment on we will store the output of `write_vc()` in an object to minimize the output.
144
+
From this moment on we will store the output of `write_vc()` in an object to reduce output.
Once the sorting is defined we may omit the `sorting` argument when writing new versions. The sorting as defined in the existing metadata will be used to sort the observations. A check for potential ties will be performed and results in a warning when ties are found.
157
+
Once we have defined the sorting, we may omit the `sorting` argument when writing new versions.
158
+
`write_vc()` uses the sorting as defined in the existing metadata.
159
+
It checks for potential ties.
160
+
Ties result in a warning.
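A tie means that two or more observations share the same values for all sorting variables, so their relative order is not determined. A base R sketch of such a check follows; the helper `has_ties()` is hypothetical and is not necessarily how `write_vc()` implements it.

```r
# Hypothetical helper: TRUE when the sorting variables do not
# uniquely identify every observation.
has_ties <- function(df, sorting) {
  anyDuplicated(df[sorting]) > 0
}

x <- data.frame(A = c(1, 1, 2), B = c(10, 11, 12))

ties_on_a <- has_ties(x, "A")           # A alone has a duplicated value
ties_on_ab <- has_ties(x, c("A", "B"))  # A and B together are unique

print(ties_on_a)
print(ties_on_ab)
```

Sorting on `A` alone leaves the order of the first two rows undefined, whereas adding `B` resolves the tie.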
135
161
136
162
```{r update_sorted}
137
163
print_file <- function(file, root, n = -1) {
@@ -156,7 +182,10 @@ B,A
156
182
13,3
157
183
```
158
184
159
-
The resulting diff is maximal because every single row was updated. Yet none of the information was changed. Hence, it is crucial to maintain column order when storing dataframes as plain text files under version control. This is illustrated on a more realistic data set in the `vignette("efficiency", package = "git2rdata")` vignette.
185
+
The resulting diff is maximal because every single row changed.
186
+
Yet none of the information changed.
187
+
Hence, maintaining column order is crucial when storing dataframes as plain text files under version control.
188
+
`vignette("efficiency", package = "git2rdata")` illustrates this on a more realistic data set.
160
189
161
190
```diff
162
191
-A,B
@@ -169,7 +198,11 @@ The resulting diff is maximal because every single row was updated. Yet none of
169
198
+13,3
170
199
```
171
200
172
-
`git2rdata` tackles this problem by storing the order of the columns in the metadata. The order is defined by the order in the dataframe when it is written for the first time. From that moment on, the same order will be reused. The example below writes the same data set twice. The second version contains exactly the same information but randomizes the order of the observations and the columns. The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
201
+
When `write_vc()` writes a dataframe for the first time, it stores the original order of the columns in the metadata.
202
+
From that moment on, `write_vc()` uses the order stored in the metadata.
203
+
The example below writes the same data set twice.
204
+
The second version contains identical information but randomizes the order of the observations and the columns.
205
+
The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
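The idea can be sketched in base R. This illustrates the principle only and is not the `git2rdata` implementation: `stored_order` stands in for the column order kept in the metadata, and `as_lines()` is a hypothetical helper.

```r
# Hypothetical helper: the lines a plain text writer would produce.
as_lines <- function(df) {
  capture.output(write.csv(df, row.names = FALSE, quote = FALSE))
}

x <- data.frame(A = c(1, 2, 3), B = c(12, 10, 11), C = c("a", "b", "c"))

# First write: remember the column order (stand-in for the metadata).
stored_order <- names(x)
first <- as_lines(x[order(x$A), ])

# Later write: same information, rows and columns in random order.
set.seed(1)
shuffled <- x[sample(nrow(x)), sample(ncol(x))]

# Restore the stored column order, then apply the row sorting.
restored <- shuffled[stored_order]
second <- as_lines(restored[order(restored$A), ])

unchanged <- identical(first, second)
print(unchanged)
```

`unchanged` is `TRUE` whatever permutation `sample()` draws, because both the column order and the row order are re-derived from stored information rather than from the incoming dataframe.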
@@ -180,7 +213,8 @@ print_file("column_order.tsv", root, n = 5)
180
213
181
214
## Handling Factors Optimized
182
215
183
-
`vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how a factor can be stored more efficiently when storing their index in the data file and the indices and labels in the metadata. We take this even a bit further: what happens if new data arrives and an extra factor level is required?
216
+
`vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how we can store a factor more efficiently by storing its index in the data file and the indices and labels in the metadata.
217
+
We take this even a bit further: what happens if new data arrives and we need an extra factor level?
The next example removes the `"blue"` level and switches the order of the remaining levels. Notice that again the existing indices are retained. The order of the labels and indices reflects their new ordering.
241
+
The next example removes the `"blue"` level and switches the order of the remaining levels.
242
+
Notice that the metadata retains the existing indices.
243
+
The order of the labels and indices reflects their new ordering.
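One way to picture index retention is a lookup table from labels to indices that only ever gains entries for new labels. The base R sketch below is a hypothetical illustration: the helper `update_index()` is made up, the labels other than `"blue"` are invented, and the rule that a new label receives the next unused integer is an assumption rather than documented `git2rdata` behaviour.

```r
# Hypothetical helper: keep the index of every existing label and
# give new labels the next unused integer (an assumption).
update_index <- function(old_map, new_levels) {
  kept <- old_map[intersect(new_levels, names(old_map))]
  added <- setdiff(new_levels, names(old_map))
  new_idx <- seq_along(added) + max(c(0L, old_map))
  # Return the map in the order of the new levels.
  c(kept, setNames(new_idx, added))[new_levels]
}

old_map <- c(blue = 1L, red = 2L, yellow = 3L)

# An extra level arrives: existing indices stay, "green" gets a new one.
with_green <- update_index(old_map, c("blue", "red", "yellow", "green"))

# "blue" is removed and the remaining levels switch order:
# the indices stay attached to their labels.
without_blue <- update_index(old_map, c("yellow", "red"))

print(with_green)
print(without_blue)
```

Because an index never moves to another label, the rows of the data file that refer to unchanged labels stay byte-for-byte identical.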
The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`. Notice that both the labels and the indices are updated. Hence creating a large diff, where just updating the labels would be sufficient.
263
+
The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`.
264
+
Notice that `write_vc()` updates both the labels and the indices.
265
+
This creates a large diff, where updating the labels alone would suffice.
Therefore we created `relabel()`, which changes only the labels in the metadata. It takes three arguments: the name of the data file, the root and the change. `change` accepts two formats, a list or a dataframe. The name of the list must match with the variable name of a factor in the data. Each element of the list must be a named vector, the name being the existing label and the value the new label. The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
277
+
We created `relabel()`, which changes the labels in the metadata while maintaining their indices.
278
+
It takes three arguments: the name of the data file, the root and the change.
279
+
`change` accepts two formats, a list or a dataframe.
280
+
Each name in the list must match the variable name of a factor in the data.
281
+
Each element of the list must be a named vector, the name being the existing label and the value the new label.
282
+
The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
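The two formats described above can be written down as plain R objects. The sketch below only builds the objects and does not call `relabel()`. The factor name `color` follows the example in this vignette, while the new labels and the converter `list_to_df()` are made up for illustration.

```r
# List format: one element per factor, each a named vector
# (name = existing label, value = new label).
change_list <- list(
  color = c(red = "rood", blue = "blauw")
)

# Dataframe format: one row per changed label.
change_df <- data.frame(
  factor = "color",
  old = c("red", "blue"),
  new = c("rood", "blauw"),
  stringsAsFactors = FALSE
)

# Hypothetical converter showing that both formats carry the same information.
list_to_df <- function(change) {
  do.call(rbind, lapply(names(change), function(v) {
    data.frame(
      factor = v,
      old = names(change[[v]]),
      new = unname(change[[v]]),
      stringsAsFactors = FALSE
    )
  }))
}

print(list_to_df(change_list))
print(change_df)
```

The list format is compact when a single factor changes. The dataframe format is convenient when the changes already live in a table, for example a translation table read from a file.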
240
283
241
284
```{r}
242
285
write_vc(old, "relabel", root, sorting = "color")
@@ -247,4 +290,6 @@ relabel("relabel", root,
247
290
print_file("relabel.yml", root)
248
291
```
249
292
250
-
A _caveat_: `relabel()` only makes sense when the data file uses optimized storage. The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff. Therefore `relabel()` will only handle the optimized storage.
293
+
A _caveat_: `relabel()` does not make sense when the data file uses verbose storage.
294
+
The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff.
295
+
Hence, `relabel()` requires the optimized storage.