Commit 42e5905

Merge pull request #16 from inbo/hansvancalster-patch-3
Update version_control.Rmd
2 parents 271cad9 + 70a6347 commit 42e5905

1 file changed

+12
-12
lines changed

vignettes/version_control.Rmd

Lines changed: 12 additions & 12 deletions
@@ -58,7 +58,7 @@ Version control systems like [git](https://git-scm.com/), [subversion](https://s

 ## Sorting observations

-Version control system often track changes on plain text files based on row based differences. In layman's terms it only records which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated is the minimal example below.
+Version control systems often track changes on plain text files based on row based differences. In layman's terms it only records which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated is the minimal example below.

 Original version

@@ -89,7 +89,7 @@ A,B
 +1,10
 ```

-Ensuring that the observations are always sorted in the same way thus helps minimizing the diff. The sorted version of the same altered version looks the the example below.
+Ensuring that the observations are always sorted in the same way thus helps minimizing the diff. The sorted version of the same altered version looks like the example below.

 ```
 A,B
@@ -98,7 +98,7 @@ A,B
 3,0
 ```

-Diff between original and the sorted alternate version. Notice that all changes revert to actual changes in the information content. Another benefit is changes are easily spotted in the diff. A deletion without insertion on the next line is a removed observation. An insertion without preceding deletion is a new observation. A deletion followed by an insertion is an updated observation.
+Diff between original and the sorted alternate version. Notice that all changes revert to actual changes in the information content. Another benefit is that changes are easily spotted in the diff. A deletion without insertion on the next line is a removed observation. An insertion without preceding deletion is a new observation. A deletion followed by an insertion is an updated observation.

 ```diff
 A,B
@@ -108,23 +108,23 @@ A,B
 +3,0
 ```

-This is where the `sorting` argument comes into play. If this argument is not provided when a file is written for the first time, it will yields a warning about the lack of sorting. The observations will be written in their current order. New versions of the file will not apply any sorting either. Leaving this burden to the user. This is illustrated by the changed hash for the data file in the example below, whereas the metadata is not changed (no change it hash).
+This is where the `sorting` argument comes into play. If this argument is not provided when a file is written for the first time, it will yield a warning about the lack of sorting. The observations will be written in their current order. New versions of the file will not apply any sorting either, leaving this burden to the user. This is illustrated by the changed hash for the data file in the example below, whereas the metadata is not changed (no change in hash).

 ```{r row_order}
 library(git2rdata)
 write_vc(x, file = "row_order", root = root)
 write_vc(x[sample(nrow(x)), ], file = "row_order", root = root)
 ```

-`sorting` should contain a vector of variable names. The observations are automatically sorted along these variables prior to writing. However, we now get an error because the set of sorting variables has changed. The set of sorting variables is stored in the metadata. Changing the sorting can potentially lead to large diffs, which `git2rdata` try to avoid as much as possible.
+`sorting` should contain a vector of variable names. The observations are automatically sorted along these variables prior to writing. However, we now get an error because the set of sorting variables has changed. The set of sorting variables is stored in the metadata. Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.

 From this moment on we will store the output of `write_vc()` in an object to minimize the output.

 ```{r apply_sorting, error = TRUE}
 fn <- write_vc(x, "row_order", root, sorting = "y")
 ```

-Using `strict = FALSE` turns such errors into warnings and allows to update the file. Notice that we get a new warning: the variable we used for sorted results in ties, thus the order of the observations is not guaranteed to be stable. The solution is to use more or different variables. We'll need `strict = FALSE` again to override the change in sorting variables.
+Using `strict = FALSE` turns such errors into warnings and allows to update the file. Notice that we get a new warning: the variable we used for sorting resulted in ties, thus the order of the observations is not guaranteed to be stable. The solution is to use more or different variables. We'll need `strict = FALSE` again to override the change in sorting variables.

 ```{r update_sorting}
 fn <- write_vc(x, "row_order", root, sorting = "y", strict = FALSE)
@@ -169,7 +169,7 @@ The resulting diff is maximal because every single row was updated. Yet none of
 +13,3
 ```

-`git2rdata` tackles this problem by storing the order of the columns in the metadata. The order is defined by the order in the dataframe when it is written for the first time. From that moment on, the same order will be reused. The example below writes the same data set twice. The second version contains the exactly the same information but randomizes the order of the observations and the columns. The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
+`git2rdata` tackles this problem by storing the order of the columns in the metadata. The order is defined by the order in the dataframe when it is written for the first time. From that moment on, the same order will be reused. The example below writes the same data set twice. The second version contains exactly the same information but randomizes the order of the observations and the columns. The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.

 ```{r variable_order}
 write_vc(x, "column_order", root, sorting = c("x", "abc"))
@@ -188,7 +188,7 @@ write_vc(old, "factor", root, sorting = "color")
 print_file("factor.yml", root)
 ```

-Let's add an observation with a new factor level. If we store the updated dataframe in a new file, we see that the indices are different. The factor level `"blue"` remains unchanged, but `"red"` becomes the third level and get index `3` instead of index `2`. This could lead to a large diff where as the potential semantics (and thus the information content) are not changed.
+Let's add an observation with a new factor level. If we store the updated dataframe in a new file, we see that the indices are different. The factor level `"blue"` remains unchanged, but `"red"` becomes the third level and get index `3` instead of index `2`. This could lead to a large diff whereas the potential semantics (and thus the information content) are not changed.

 ```{r factor2}
 updated <- data.frame(color = c("red", "green", "blue"))
@@ -204,7 +204,7 @@ fn <- write_vc(updated, "factor", root, strict = FALSE)
 print_file("factor.yml", root)
 ```

-The next example removes the `"blue"` level and switches the order of the remaining levels. Notice that again the existing index are retained. The order of the labels and indices reflects their new ordering.
+The next example removes the `"blue"` level and switches the order of the remaining levels. Notice that again the existing indices are retained. The order of the labels and indices reflects their new ordering.

 ```{r factor_deleted}
 deleted <- data.frame(
@@ -213,7 +213,7 @@ write_vc(deleted, "factor", root, sorting = "color", strict = FALSE)
 print_file("factor.yml", root)
 ```

-Changing an factor to an ordered factor or _vice versa_ will also keep existing level indices.
+Changing a factor to an ordered factor or _vice versa_ will also keep existing level indices.

 ```{r factor_ordered}
 ordered <- data.frame(
@@ -236,7 +236,7 @@ write_vc(relabeled, "write_vc", root, strict = FALSE)
 print_file("write_vc.yml", root)
 ```

-Therefore we created `relabel()`, which changes only the labels in the metadata. It takes three arguments: the name of the data file, the root and the change. `change` accepts two formats, a list or a dataframe. The named of the list must match with the variable name of a factor in the data. Each element of the list must be named vector, the name being the existing label and the value the new label. The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
+Therefore we created `relabel()`, which changes only the labels in the metadata. It takes three arguments: the name of the data file, the root and the change. `change` accepts two formats, a list or a dataframe. The name of the list must match with the variable name of a factor in the data. Each element of the list must be a named vector, the name being the existing label and the value the new label. The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.

 ```{r}
 write_vc(old, "relabel", root, sorting = "color")
@@ -247,4 +247,4 @@ relabel("relabel", root,
 print_file("relabel.yml", root)
 ```

-A _caveat_: `relabel()` only makes sense with the data file uses optimized storage. The verbose mode stores the factor labels and not their indices, in which case relabels a label will always yield a large diff. Therefore `relabel()` will only handle the optimized storage.
+A _caveat_: `relabel()` only makes sense when the data file uses optimized storage. The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff. Therefore `relabel()` will only handle the optimized storage.
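
Reviewer note on the caveat in the last changed line: the reasoning can be checked with a small base R sketch. This is only an illustration of index-based versus label-based storage and is not how `git2rdata` is implemented; the variable names are hypothetical, and it assumes that optimized storage writes the integer codes of a factor while verbose storage writes the labels as text.

```r
# Base R illustration only; git2rdata is not used here.
color <- factor(c("red", "blue", "red"), levels = c("blue", "red"))

# Assumed "optimized" representation: integer codes in the data,
# labels kept separately (in the metadata).
codes_before <- as.integer(color)
labels_before <- as.character(color)

# Relabel "red" to "crimson" without touching the observations.
levels(color)[levels(color) == "red"] <- "crimson"

codes_after <- as.integer(color)
labels_after <- as.character(color)

# The integer codes are untouched, so a file holding only codes would not change.
identical(codes_before, codes_after) # TRUE

# Assumed "verbose" representation: every row carrying the old label changes.
changed_verbose <- sum(labels_before != labels_after)
changed_verbose
```

With codes in the data file, a relabel touches only the label table. With labels in the data file, every affected row would appear in the diff, which matches the motivation given for restricting `relabel()` to optimized storage.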
