@@ -123,16 +123,22 @@ update_geom_defaults("smooth", list(colour = "#356196"))
 ## Introduction

 Sometimes, a large dataframe has one or more variables with a small number of unique combinations.
-E.g. a dataframe with factor variables.
+E.g. a dataframe with one or more factor variables.
+Storing the entire dataframe as a single text file requires storing lots of replicated data.
+Each row stores the information for every variable, even if a subset of these variables remains constant over a subset of the data.

 In such a case we can use the `split_by` argument of `write_vc()`.
 This will store the large dataframe over a set of tab separated files.
 One file for every combination of the variables defined by `split_by`.
-Every partial data file holds one combination of `split_by`.
+Every partial data file holds the other variables for one combination of `split_by`.
 We remove the `split_by` variables from the partial data files, reducing their size.
 We add an `index.tsv` containing the combinations of the `split_by` variables and a unique hash for each combination.
 This hash becomes the base name of the partial data files.

+Splitting the dataframe into smaller files makes them easier to handle in a version control system.
+The overall size depends on the amount of replication in the dataframe.
+More on that in the next section.
+
 ## When to Split the Dataframe

 Let's set the following variables:
@@ -151,7 +157,7 @@ Let's set the following variables:

 Storing the dataframe with `write_vc()` without `split_by` requires $h_s + h_r + 1$ bytes for the header and $s + r + 1$ bytes for every observation.
 The total number of bytes is $T_0 = h_s + h_r + 1 + N (s + r + 1)$.
-The $+ 1$ originates from the tab character to separate the `split_by` variables from the remaining variables.
+Both $+ 1$ terms originate from the tab character that separates the `split_by` variables from the remaining variables.

 Storing the dataframe with `write_vc()` with `split_by` requires an index file to store the combinations of the `split_by` variables.
 It will use $h_s$ bytes for the header and $N_s s$ for the data.