Commit 2ed454e

Tweak the split_by vignette
1 parent 3f44c55 commit 2ed454e

1 file changed: +9 -3 lines changed

vignettes/split_by.Rmd

Lines changed: 9 additions & 3 deletions
@@ -123,16 +123,22 @@ update_geom_defaults("smooth", list(colour = "#356196"))
 ## Introduction

 Sometimes, a large dataframe has one or more variables with a small number of unique combinations.
-E.g. a dataframe with factor variables.
+E.g. a dataframe with one or more factor variables.
+Storing the entire dataframe as a single text file requires storing lots of replicated data.
+Each row stores the information for every variable, even if a subset of these variables remains constant over a subset of the data.

 In such a case we can use the `split_by` argument of `write_vc()`.
 This will store the large dataframe over a set of tab separated files.
 One file for every combination of the variables defined by `split_by`.
-Every partial data file holds one combination of `split_by`.
+Every partial data file holds the other variables for one combination of `split_by`.
 We remove the `split_by` variables from the partial data files, reducing their size.
 We add an `index.tsv` containing the combinations of the `split_by` variables and a unique hash for each combination.
 This hash becomes the base name of the partial data files.

+Splitting the dataframe into smaller files makes them easier to handle in a version control system.
+The overall size depends on the amount of replication in the dataframe.
+More on that in the next section.
+
 ## When to Split the Dataframe

 Let's set the following variables:
@@ -151,7 +157,7 @@ Let's set the following variables:

 Storing the dataframe with `write_vc()` without `split_by` requires $h_s + h_r + 1$ bytes for the header and $s + r + 1$ bytes for every observation.
 The total number of bytes is $T_0 = h_s + h_r + 1 + N (s + r + 1)$.
-The $+ 1$ originates from the tab character to separate the `split_by` variables from the remaining variables.
+Both $+ 1$ terms originate from the tab character that separates the `split_by` variables from the remaining variables.

 Storing the dataframe with `write_vc()` with `split_by` requires an index file to store the combinations of the `split_by` variables.
 It will use $h_s$ bytes for the header and $N_s s$ for the data.
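
As a rough illustration of the unsplit total $T_0$ in the hunk above, here is the arithmetic with made-up sizes. The values are assumptions chosen only to keep the numbers simple. The reading of the symbols is inferred from the formula, because the variable definitions fall outside this diff: $h_s$ and $h_r$ are taken as header bytes, $s$ and $r$ as bytes per observation, and $N$ as the number of observations. Assume $h_s = 10$, $h_r = 20$, $s = 8$, $r = 12$ and $N = 1000$:

$$
T_0 = h_s + h_r + 1 + N (s + r + 1) = 10 + 20 + 1 + 1000 \times (8 + 12 + 1) = 31 + 21000 = 21031
$$

With these assumed numbers the header is negligible and the per-observation term dominates. In each 21-byte row, 8 bytes belong to the `split_by` variables, which is the part that the partial data files no longer repeat.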
