Commit 7fa1193

Add design docs written while implementing generational compaction
1 parent 3b942eb commit 7fa1193

2 files changed

+406
-0
lines changed

Lines changed: 241 additions & 0 deletions
# Crash safety

In 3.4 the process for ending compaction in `finish_compaction` goes like this:

0. state during compaction:

        db.couch
        db.couch.compact.data
        db.couch.compact.meta

1. rename the data file:

        db.couch
        db.couch.compact
        db.couch.compact.meta

2. delete the original DB file:

        db.couch.compact
        db.couch.compact.meta

3. rename the .compact file:

        db.couch
        db.couch.compact.meta

4. delete the meta file:

        db.couch

Crash safety: if a crash occurs after step 2, then on restart the
`db.couch.compact` file is taken to be the canonical file, renamed to
`db.couch`, and we continue as normal.
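The four steps and the restart rule can be modelled on an ordinary filesystem. The following is a minimal Python sketch, not CouchDB's Erlang implementation: `finish_compaction` here only mirrors the file operations listed above, `crash_after` is a hypothetical parameter that stops the sequence early to imitate a crash, and `recover` encodes only the rule that a `db.couch.compact` file found without `db.couch` becomes the canonical file.

```python
import os
import tempfile


def make_compacting_db(dirpath):
    """Create the three files that exist during compaction (state 0)."""
    contents = {
        "db.couch": "original",
        "db.couch.compact.data": "compacted",
        "db.couch.compact.meta": "meta",
    }
    for name, text in contents.items():
        with open(os.path.join(dirpath, name), "w") as f:
            f.write(text)


def finish_compaction(dirpath, crash_after=None):
    """Apply steps 1-4 in order; return early after `crash_after` steps."""
    def path(name):
        return os.path.join(dirpath, name)

    steps = [
        lambda: os.rename(path("db.couch.compact.data"), path("db.couch.compact")),
        lambda: os.remove(path("db.couch")),
        lambda: os.rename(path("db.couch.compact"), path("db.couch")),
        lambda: os.remove(path("db.couch.compact.meta")),
    ]
    for number, step in enumerate(steps, start=1):
        step()
        if number == crash_after:
            return  # simulated crash: later steps never run


def recover(dirpath):
    """On restart, promote db.couch.compact if db.couch is missing."""
    db = os.path.join(dirpath, "db.couch")
    compact = os.path.join(dirpath, "db.couch.compact")
    if not os.path.exists(db) and os.path.exists(compact):
        os.rename(compact, db)
    with open(db) as f:
        return f.read()


with tempfile.TemporaryDirectory() as scratch:
    make_compacting_db(scratch)
    finish_compaction(scratch, crash_after=2)
    print(sorted(os.listdir(scratch)))
    print(recover(scratch))
```

With `crash_after=2`, only `db.couch.compact` and `db.couch.compact.meta` are left behind, and `recover` returns the compacted contents after renaming the file into place.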

Question: why can't we get the same behaviour by directly renaming
`db.couch.compact.data` to `db.couch`, producing the same state as after step 3?
Rename in the same directory is atomic and we should have stopped writing to the
original `db.couch` file in order to perform this switch-over.

Answer: this might not be true on Windows, where you cannot remove a file that
is in use. We have to close `db.couch`, remove it, then rename the compacted
file into its place. The rename of `.compact.data` to `.compact` signals a state
where compaction completed and we were in the middle of swapping the files over,
so nothing new has been written to `db.couch` and we can resume by using
`db.couch.compact`. This is not necessarily the case if `db.couch.compact.data`
still exists.

After `open_db_file` returns, the code in `init` that reads/creates the DB
header also deletes all the compaction files (`.compact`, `.compact.data`,
`.compact.meta`) if the header was newly created, rather than read from existing
file data.

For generational compaction, there are several cases to consider.


## A. Compacting gen 0

In this case we're appending to `db.1.couch`; references to new data written
there end up in `db.couch.compact.data`, along with references to _existing_
data that was already in `db.{1,2,...}.couch`. The following files exist during
compaction:

    db.couch
    db.1.couch
    db.couch.compact.data
    db.couch.compact.meta

On completion, none of the generational files will be removed. Therefore all
pointers in `db.couch` and `db.couch.compact.data` remain valid and we are free
to use either file as our canonical DB file. The original cleanup procedure can
be used without modification.


## B. Compacting gen 1, 2, ...

One generation up from 0, we have data being moved from gen 1 to 2, or in
general from G to G+1. Files existing during compaction are:

    db.couch
    db.1.couch
    db.2.couch
    db.couch.compact.data
    db.couch.compact.meta

When compacting gen 1, if `db.couch` points to gen 1 then
`db.couch.compact.data` will refer to gen 2. Also, data from `db.couch` has to
be moved to `db.couch.compact.data` in order to avoid being lost on completion.
All pointers to gen 2 and above remain unmodified.

On completion, `db.1.couch` will be removed on the grounds that all its data has
been moved to `db.2.couch` and nothing new has been written to `db.1.couch`.
This is because the compactor is the only thing that writes to `db.G.couch` and
only one compaction per shard runs at a time.

However, it is only safe to remove `db.1.couch` when nothing is referring to it
any more, i.e. after `db.couch.compact.data` is renamed to `db.couch.compact`.
Once this has happened, any future DB access will either use `db.couch.compact`,
or the contents of it after moving to `db.couch`, not the original `db.couch`,
and so the old pointers into `db.1.couch` have expired.

At any point before `db.couch.compact.data` is renamed, the data in `db.1.couch`
is still being referenced, and so it cannot be removed.

Therefore we can extend the cleanup process to:

1. Rename `db.couch.compact.data` to `db.couch.compact`
2. Delete `db.1.couch`
3. Delete `db.couch`
4. Rename `db.couch.compact` to `db.couch`
5. Delete `db.couch.compact.meta`

The reason for putting the deletion of `db.1.couch` as early as possible is to
reduce the set of crash scenarios where this file remains in place. If
`db.1.couch` remains after a crash, this is not _unsafe_ (i.e. it does not cause
a consistency problem or data loss) but it does leave a pile of unreferenced
data that needs to be cleaned up. The simplest way to achieve this would be to
re-run compaction of gen 1, which could be triggered by noticing the file has an
active size of 0. Possibly Smoosh could prioritise the generation with the
largest proportion of garbage when deciding what to compact next.
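The prioritisation idea in the last sentence could look like the following. This is a hypothetical Python sketch, not Smoosh's actual logic: the `GenerationStats` type, its `file_size` and `active_size` fields, and the `pick_generation` function are all names invented for illustration. It ranks generations by the fraction of each file that is not live data, so a leftover `db.1.couch` with an active size of 0 ranks highest.

```python
from dataclasses import dataclass


@dataclass
class GenerationStats:
    generation: int
    file_size: int    # bytes on disk (assumed metric)
    active_size: int  # bytes still referenced by the DB file (assumed metric)


def garbage_ratio(stats):
    """Fraction of the file that is unreferenced; 0.0 for an empty file."""
    if stats.file_size == 0:
        return 0.0
    return (stats.file_size - stats.active_size) / stats.file_size


def pick_generation(all_stats):
    """Return the generation with the largest proportion of garbage, or None."""
    candidates = [s for s in all_stats if garbage_ratio(s) > 0]
    if not candidates:
        return None
    return max(candidates, key=garbage_ratio).generation


stats = [
    GenerationStats(generation=0, file_size=4096, active_size=3072),
    GenerationStats(generation=1, file_size=8192, active_size=0),  # leftover after a crash
    GenerationStats(generation=2, file_size=16384, active_size=12288),
]
print(pick_generation(stats))  # 1
```

Using a ratio rather than an absolute byte count means a small file that is entirely garbage is still picked ahead of a large file that is mostly live.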


## C. Compacting the last generation

Say the DB has a maximum generation of 2. This means that normally, the existing
files are:

    db.couch
    db.1.couch
    db.2.couch

During compaction of gen 2, a temporary additional generation is created along
with the usual compaction files:

    db.couch
    db.1.couch
    db.2.couch
    db.2.couch.compact.maxgen
    db.couch.compact.data
    db.couch.compact.meta

Live data is copied from `db.2.couch` to `db.2.couch.compact.maxgen`, but the
pointers stored in `db.couch.compact.data` refer to gen 2, with the intention
that `db.2.couch.compact.maxgen` will eventually be renamed to `db.2.couch`
rather than letting the generation number grow indefinitely.

This requires a further change to the cleanup process:

1. Rename `db.couch.compact.data` to `db.couch.compact`
2. Delete `db.2.couch`
3. Rename `db.2.couch.compact.maxgen` to `db.2.couch`
4. Delete `db.couch`
5. Rename `db.couch.compact` to `db.couch`
6. Delete `db.couch.compact.meta`

A crash after steps 1, 2, or 3 produces one of these states:

    Step 1                      Step 2                      Step 3
    -------------------------   -------------------------   -------------------------
    db.couch                    db.couch                    db.couch
    db.2.couch (old)                                        db.2.couch (new)
    db.2.couch.compact.maxgen   db.2.couch.compact.maxgen
    db.couch.compact            db.couch.compact            db.couch.compact

These states have an ambiguity to them; all of them will cause `db.couch` to be
used when the DB is re-opened, but it's not clear whether the data in
`db.2.couch` is valid for that file or not. It creates a state where you need to
determine which data to preserve based on the presence of all the other files,
which is complicated. In state 1 you can continue using the old `db.2.couch`,
but in the other states you need to decide to open `db.couch.compact` instead
and possibly clean up the `db.2.couch.compact.maxgen` file. This suggests that
removing `db.couch` earlier in the process might be good.

1. Rename `db.couch.compact.data` to `db.couch.compact`
2. Delete `db.couch`
3. Delete `db.2.couch`
4. Rename `db.2.couch.compact.maxgen` to `db.2.couch`
5. Rename `db.couch.compact` to `db.couch`
6. Delete `db.couch.compact.meta`

Having done that, we need to extend the recovery path where we fail to find
`db.couch` and check for `db.couch.compact`. The possible crash states of the
above process are:

    Step 1                      Step 2                      Step 3                      Step 4
    -------------------------   -------------------------   -------------------------   -------------------------
    db.couch
    db.2.couch (old)            db.2.couch (old)                                        db.2.couch (new)
    db.2.couch.compact.maxgen   db.2.couch.compact.maxgen   db.2.couch.compact.maxgen
    db.couch.compact            db.couch.compact            db.couch.compact            db.couch.compact

In the first state, we would open `db.couch` on restart, and this refers to data
in the _old_ `db.2.couch`. We can continue to use this and either leave
`db.2.couch.compact.maxgen` in place for some future compaction of gen 2, or we
can delete it as we're not using any data in it.

In all the other states we will use `db.couch.compact` and therefore need to
complete the process of moving `db.2.couch.compact.maxgen` to `db.2.couch`
before using it. If `db.2.couch` exists, we remove it, and then we rename
`db.2.couch.compact.maxgen` to `db.2.couch`. This process is safe if we crash
after removing the old `db.2.couch`.
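The crash states in the table can be derived mechanically by replaying the reordered steps over a set of file names. The Python sketch below does this; it tracks names only (so it cannot tell the old `db.2.couch` from the new one), omits `db.1.couch` and `db.couch.compact.meta` as the table does, and `canonical_file` is a hypothetical helper encoding the rule that `db.couch` is opened if present and `db.couch.compact` otherwise.

```python
def rename(src, dst):
    """Build a step that renames src to dst within a set of file names."""
    def apply(files):
        assert src in files, f"cannot rename missing file {src}"
        return (files - {src}) | {dst}
    return apply


def delete(name):
    """Build a step that removes name from a set of file names."""
    def apply(files):
        return files - {name}
    return apply


# The reordered cleanup process; step 6 (the meta file) is not tracked here.
STEPS = [
    rename("db.couch.compact.data", "db.couch.compact"),
    delete("db.couch"),
    delete("db.2.couch"),
    rename("db.2.couch.compact.maxgen", "db.2.couch"),
    rename("db.couch.compact", "db.couch"),
]

START = frozenset({
    "db.couch",
    "db.2.couch",
    "db.2.couch.compact.maxgen",
    "db.couch.compact.data",
})


def crash_states(start=START, steps=STEPS):
    """Return the file set left behind by a crash after each step."""
    states = []
    files = set(start)
    for step in steps:
        files = step(files)
        states.append(frozenset(files))
    return states


def canonical_file(files):
    """Hypothetical rule: open db.couch if present, else db.couch.compact."""
    return "db.couch" if "db.couch" in files else "db.couch.compact"


for number, files in enumerate(crash_states()[:4], start=1):
    print(number, canonical_file(files), sorted(files))
```

Only the first crash state selects `db.couch`; the other three select `db.couch.compact`, which matches the two cases discussed above.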

We could also try to resolve these two questions independently:

1. What to do if both `db.couch` and `db.couch.compact` exist
2. What to do if `db.G.couch` where `G > max_generation` exists

But as we've seen, the cleanup operations for both these questions create states
where the mutual ordering of their operations is important and it would be wise
to minimise the set of possible such states. Therefore we suggest the following
recovery routine:

1. Attempt to open `db.couch`. If this succeeds, remove any generation files
   above `max_generation`. Otherwise...

2. If `db.M+1.couch` where `M = max_generation` exists, then remove `db.M.couch`
   then rename `db.M+1.couch` to `db.M.couch`

3. Rename `db.couch.compact` to `db.couch`

4. Open `db.couch`

This process works correctly if the generation files are cleaned up _before_ the
rename of `db.couch.compact` to `db.couch` in `finish_compaction`. Otherwise the
`.compact` file that indicates the partial completion of this process may not
exist following a crash and this makes it harder to tell which generation files
are valid on recovery.
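The recovery routine can be sketched on a real filesystem as follows. This is an illustrative Python model under stated assumptions, not the CouchDB code: "attempting to open `db.couch`" is approximated by checking that the file exists, the temporary generation is named `db.M+1.couch` as in the routine above, and the function name `recover` and its `max_generation` parameter are hypothetical.

```python
import os
import re
import tempfile


def recover(dirpath, max_generation):
    """Follow the four recovery steps; return the path of the DB file to open."""
    def path(name):
        return os.path.join(dirpath, name)

    db = path("db.couch")
    if os.path.exists(db):
        # Step 1: db.couch is usable, so higher generations are leftovers.
        for name in os.listdir(dirpath):
            match = re.fullmatch(r"db\.(\d+)\.couch", name)
            if match and int(match.group(1)) > max_generation:
                os.remove(path(name))
        return db

    # Step 2: finish moving the temporary generation into place.
    top = path(f"db.{max_generation}.couch")
    extra = path(f"db.{max_generation + 1}.couch")
    if os.path.exists(extra):
        if os.path.exists(top):
            os.remove(top)
        os.rename(extra, top)

    # Step 3: promote the compacted file (raises FileNotFoundError if absent).
    os.rename(path("db.couch.compact"), db)

    # Step 4: the caller opens the returned file.
    return db


with tempfile.TemporaryDirectory() as scratch:
    for name in ["db.couch.compact", "db.2.couch", "db.3.couch"]:
        with open(os.path.join(scratch, name), "w") as f:
            f.write(name)
    recover(scratch, max_generation=2)
    print(sorted(os.listdir(scratch)))
```

In the demo, `db.couch` is missing, so `db.3.couch` replaces `db.2.couch` and `db.couch.compact` becomes `db.couch`, leaving only `db.2.couch` and `db.couch`. Running `recover` a second time on the result takes the step 1 branch and changes nothing.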

This also highlights that letting users modify `max_generation` for a DB is not
safe while compaction is happening, because it may lead to confusion during
recovery that could cause data loss (i.e. mistaken deletion of
`db.2.couch.compact.maxgen`).

If we allow post-creation changes to `max_generation`, then:

- Increasing it is "free"; all existing data remains valid but it simply becomes
  _possible_ for future compactions to create higher generations.

- Decreasing it requires a "reversed" compaction to move data from higher
  generations to lower ones followed by deleting the emptied generation(s). And
  so this change should properly be thought of as requesting a compaction with
  special properties.
