|
| 1 | +# Crash safety |
| 2 | + |
| 3 | +In 3.4 the process for ending compaction in `finish_compaction` goes like this: |
| 4 | + |
| 5 | + 0. state during compaction: |
| 6 | + |
| 7 | + db.couch |
| 8 | + db.couch.compact.data |
| 9 | + db.couch.compact.meta |
| 10 | + |
| 11 | + 1. rename the data file: |
| 12 | + |
| 13 | + db.couch |
| 14 | + db.couch.compact |
| 15 | + db.couch.compact.meta |
| 16 | + |
| 17 | + 2. delete the original DB file: |
| 18 | + |
| 19 | + db.couch.compact |
| 20 | + db.couch.compact.meta |
| 21 | + |
| 22 | + 3. rename the .compact file: |
| 23 | + |
| 24 | + db.couch |
| 25 | + db.couch.compact.meta |
| 26 | + |
| 27 | + 4. delete the meta file: |
| 28 | + |
| 29 | + db.couch |
| 30 | + |
| 31 | +Crash safety: if a crash occurs after step 2, then on restart the |
| 32 | +`db.couch.compact` file is taken to be the canonical file, renamed to |
| 33 | +`db.couch`, and we continue as normal. |
| 34 | + |
| 35 | +Question: why can't we get the same behaviour by directly renaming |
| 36 | +`db.couch.compact.data` to `db.couch`, producing the same state as after step 3? |
| 37 | +Rename in the same directory is atomic and we should have stopped writing to the |
| 38 | +original `db.couch` file in order to perform this switch-over. |
| 39 | + |
| 40 | +Answer: this might not be true on Windows, where you cannot remove a file that |
| 41 | +is in use; we have to close `db.couch`, remove it, then rename the compacted file into |
| 42 | +its place. The rename of `.compact.data` to `.compact` signals a state where |
| 43 | +compaction completed and we were in the middle of swapping the files over, so |
| 44 | +nothing new has been written to `db.couch` and we can resume by using |
| 45 | +`db.couch.compact`. This is not necessarily the case if `db.couch.compact.data` |
| 46 | +still exists. |
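
The four steps and the restart rule above can be sketched as follows. This is a minimal illustration in Python, not the actual Erlang code; the function names `finish_compaction_sketch` and `recover_after_crash` are hypothetical, and the existence check on the meta file is an added assumption.

```python
import os

def finish_compaction_sketch(db_path):
    """Swap the compacted file into place using the four steps listed above."""
    data = db_path + ".compact.data"
    compact = db_path + ".compact"
    meta = db_path + ".compact.meta"
    os.rename(data, compact)     # step 1: signals "compaction done, swap in progress"
    os.remove(db_path)           # step 2: the original must already be closed on Windows
    os.rename(compact, db_path)  # step 3
    if os.path.exists(meta):     # assumption: tolerate a missing meta file
        os.remove(meta)          # step 4

def recover_after_crash(db_path):
    """On restart, treat a leftover .compact file as canonical if db_path is gone."""
    compact = db_path + ".compact"
    if not os.path.exists(db_path) and os.path.exists(compact):
        os.rename(compact, db_path)
    return db_path
```

If both `db.couch` and `db.couch.compact` exist (a crash after step 1), this sketch leaves `db.couch` in use and does not touch the `.compact` file.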
| 47 | + |
| 48 | +After `open_db_file` returns, the code in `init` that reads/creates the DB |
| 49 | +header also deletes all the compaction files (`.compact`, `.compact.data`, |
| 50 | +`.compact.meta`) if the header was newly created, rather than read from existing |
| 51 | +file data. |
| 52 | + |
| 53 | +For generational compaction, there are several cases to consider. |
| 54 | + |
| 55 | + |
| 56 | +## A. Compacting gen 0 |
| 57 | + |
| 58 | +In this case we're appending to `db.1.couch`; references to new data written |
| 59 | +there end up in `db.couch.compact.data`, along with references to _existing_ |
| 60 | +data that was already in `db.{1,2,...}.couch`. The following files exist during |
| 61 | +compaction: |
| 62 | + |
| 63 | + db.couch |
| 64 | + db.1.couch |
| 65 | + db.couch.compact.data |
| 66 | + db.couch.compact.meta |
| 67 | + |
| 68 | +On completion, none of the generational files will be removed. Therefore all |
| 69 | +pointers in `db.couch` and `db.couch.compact.data` remain valid and we are free |
| 70 | +to use either file as our canonical DB file. The original cleanup procedure can |
| 71 | +be used without modification. |
| 72 | + |
| 73 | + |
| 74 | +## B. Compacting gen 1, 2, ... |
| 75 | + |
| 76 | +One generation up from 0, we have data being moved from gen 1 to 2, or in |
| 77 | +general from G to G+1. Files existing during compaction are: |
| 78 | + |
| 79 | + db.couch |
| 80 | + db.1.couch |
| 81 | + db.2.couch |
| 82 | + db.couch.compact.data |
| 83 | + db.couch.compact.meta |
| 84 | + |
| 85 | +When compacting gen 1, if `db.couch` points to gen 1 then |
| 86 | +`db.couch.compact.data` will refer to gen 2. Also, data from `db.couch` has to |
| 87 | +be moved to `db.couch.compact.data` in order to avoid being lost on completion. |
| 88 | +All pointers to gen 2 and above remain unmodified. |
| 89 | + |
| 90 | +On completion, `db.1.couch` will be removed on the grounds that all its data has |
| 91 | +been moved to `db.2.couch` and nothing new has been written to `db.1.couch`. |
| 92 | +This is because the compactor is the only thing that writes to `db.G.couch` and |
| 93 | +only one compaction per shard runs at a time. |
| 94 | + |
| 95 | +However, it is only safe to remove `db.1.couch` when nothing is referring to it |
| 96 | +any more, i.e. after `db.couch.compact.data` is renamed to `db.couch.compact`. |
| 97 | +Once this has happened, any future DB access will either use `db.couch.compact`, |
| 98 | +or the contents of it after moving to `db.couch`, not the original `db.couch`, |
| 99 | +and so the old pointers into `db.1.couch` have expired. |
| 100 | + |
| 101 | +At any point before `db.couch.compact.data` is renamed, the data in `db.1.couch` |
| 102 | +is still being referenced, and so it cannot be removed. |
| 103 | + |
| 104 | +Therefore we can extend the cleanup process to: |
| 105 | + |
| 106 | +1. Rename `db.couch.compact.data` to `db.couch.compact` |
| 107 | +2. Delete `db.1.couch` |
| 108 | +3. Delete `db.couch` |
| 109 | +4. Rename `db.couch.compact` to `db.couch` |
| 110 | +5. Delete `db.couch.compact.meta` |
| 111 | + |
| 112 | +The reason for putting the deletion of `db.1.couch` as early as possible is to |
| 113 | +reduce the set of crash scenarios where this file remains in place. If |
| 114 | +`db.1.couch` remains after a crash, this is not _unsafe_ (i.e. it does not cause |
| 115 | +a consistency problem or data loss) but it does leave a pile of unreferenced |
| 116 | +data that needs to be cleaned up. The simplest way to achieve this would be to |
| 117 | +re-run compaction of gen 1, which could be triggered by noticing the file has an |
| 118 | +active size of 0. Possibly Smoosh could prioritise the generation with the |
| 119 | +largest proportion of garbage when deciding what to compact next. |
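
The extended cleanup for compacting generation G into G+1 can be sketched like this. It is an illustration only, written in Python rather than Erlang; the function name and its `db_dir`, `name` and `gen` parameters are hypothetical.

```python
import os

def finish_generation_compaction(db_dir, name, gen):
    """Cleanup after compacting generation `gen` (>= 1) into `gen + 1`."""
    db = os.path.join(db_dir, f"{name}.couch")
    old_gen = os.path.join(db_dir, f"{name}.{gen}.couch")
    os.rename(db + ".compact.data", db + ".compact")  # 1. old pointers expire here
    os.remove(old_gen)                                # 2. safe only after step 1
    os.remove(db)                                     # 3.
    os.rename(db + ".compact", db)                    # 4.
    os.remove(db + ".compact.meta")                   # 5.
```

The only ordering constraint that matters for safety is that step 2 never runs before step 1; a crash between them leaves the old generation file behind as unreferenced garbage, not as lost data.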
| 120 | + |
| 121 | + |
| 122 | +## C. Compacting the last generation |
| 123 | + |
| 124 | +Say the DB has a maximum generation of 2. This means that normally, the existing |
| 125 | +files are: |
| 126 | + |
| 127 | + db.couch |
| 128 | + db.1.couch |
| 129 | + db.2.couch |
| 130 | + |
| 131 | +During compaction of gen 2, a temporary additional generation is created along |
| 132 | +with the usual compaction files: |
| 133 | + |
| 134 | + db.couch |
| 135 | + db.1.couch |
| 136 | + db.2.couch |
| 137 | + db.2.couch.compact.maxgen |
| 138 | + db.couch.compact.data |
| 139 | + db.couch.compact.meta |
| 140 | + |
| 141 | +Live data is copied from `db.2.couch` to `db.2.couch.compact.maxgen`, but the |
| 142 | +pointers stored in `db.couch.compact.data` refer to gen 2, with the intention |
| 143 | +that `db.2.couch.compact.maxgen` will eventually be renamed to `db.2.couch` |
| 144 | +rather than letting the generation number grow indefinitely. |
| 145 | + |
| 146 | +This requires a further change to the cleanup process: |
| 147 | + |
| 148 | +1. Rename `db.couch.compact.data` to `db.couch.compact` |
| 149 | +2. Delete `db.2.couch` |
| 150 | +3. Rename `db.2.couch.compact.maxgen` to `db.2.couch` |
| 151 | +4. Delete `db.couch` |
| 152 | +5. Rename `db.couch.compact` to `db.couch` |
| 153 | +6. Delete `db.couch.compact.meta` |
| 154 | + |
| 155 | +A crash after steps 1, 2, or 3 produces one of these states: |
| 156 | + |
| 157 | +    Step 1                       Step 2                       Step 3 |
| 158 | +    -------------------------    -------------------------    ------------------------- |
| 159 | +    db.couch                     db.couch                     db.couch |
| 160 | +    db.2.couch (old)                                          db.2.couch (new) |
| 161 | +    db.2.couch.compact.maxgen    db.2.couch.compact.maxgen |
| 162 | +    db.couch.compact             db.couch.compact             db.couch.compact |
| 163 | + |
| 164 | +These states have an ambiguity to them; all of them will cause `db.couch` to be |
| 165 | +used when the DB is re-opened, but it's not clear whether the data in |
| 166 | +`db.2.couch` is valid for that file or not. It creates a state where you need to |
| 167 | +determine which data to preserve based on the presence of all the other files, |
| 168 | +which is complicated. In state 1 you can continue using the old `db.2.couch`, |
| 169 | +but in the other states you need to decide to open `db.couch.compact` instead |
| 170 | +and possibly clean up the `db.2.couch.compact.maxgen` file. This suggests that |
| 171 | +removing `db.couch` earlier in the process might be good. |
| 172 | + |
| 173 | +1. Rename `db.couch.compact.data` to `db.couch.compact` |
| 174 | +2. Delete `db.couch` |
| 175 | +3. Delete `db.2.couch` |
| 176 | +4. Rename `db.2.couch.compact.maxgen` to `db.2.couch` |
| 177 | +5. Rename `db.couch.compact` to `db.couch` |
| 178 | +6. Delete `db.couch.compact.meta` |
| 179 | + |
| 180 | +Having done that, we need to extend the recovery path where we fail to find |
| 181 | +`db.couch` and check for `db.couch.compact`. The possible crash states of the |
| 182 | +above process are: |
| 183 | + |
| 184 | +    Step 1                       Step 2                       Step 3                       Step 4 |
| 185 | +    -------------------------    -------------------------    -------------------------    ------------------------- |
| 186 | +    db.couch |
| 187 | +    db.2.couch (old)             db.2.couch (old)                                          db.2.couch (new) |
| 188 | +    db.2.couch.compact.maxgen    db.2.couch.compact.maxgen    db.2.couch.compact.maxgen |
| 189 | +    db.couch.compact             db.couch.compact             db.couch.compact             db.couch.compact |
| 190 | + |
| 191 | +In the first state, we would open `db.couch` on restart, and this refers to data |
| 192 | +in the _old_ `db.2.couch`. We can continue to use this and either leave |
| 193 | +`db.2.couch.compact.maxgen` in place for some future compaction of gen 2, or we |
| 194 | +can delete it as we're not using any data in it. |
| 195 | + |
| 196 | +In all the other states we will use `db.couch.compact` and therefore need to |
| 197 | +complete the process of moving `db.2.couch.compact.maxgen` to `db.2.couch` |
| 198 | +before using it. If `db.2.couch` exists, we remove it, and then we rename |
| 199 | +`db.2.couch.compact.maxgen` to `db.2.couch`. This process is safe if we crash |
| 200 | +after removing the old `db.2.couch`. |
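
The crash states discussed above can be enumerated mechanically by replaying the reordered cleanup steps against a set of file names. The sketch below is a pure simulation in Python that touches no real files; the helper `crash_states` and the tuple encoding of steps are hypothetical. Unlike the table, it also lists `db.1.couch` and `db.couch.compact.meta`.

```python
def crash_states(initial, steps):
    """Return the sorted list of files present after each step, in order."""
    files, states = set(initial), []
    for op, *args in steps:
        if op == "delete":
            files.remove(args[0])
        else:  # "rename": the source disappears, the target is created
            src, dst = args
            files.remove(src)
            files.add(dst)
        states.append(sorted(files))
    return states

initial = [
    "db.couch", "db.1.couch", "db.2.couch",
    "db.2.couch.compact.maxgen", "db.couch.compact.data", "db.couch.compact.meta",
]
steps = [
    ("rename", "db.couch.compact.data", "db.couch.compact"),
    ("delete", "db.couch"),
    ("delete", "db.2.couch"),
    ("rename", "db.2.couch.compact.maxgen", "db.2.couch"),
    ("rename", "db.couch.compact", "db.couch"),
    ("delete", "db.couch.compact.meta"),
]

for n, state in enumerate(crash_states(initial, steps), start=1):
    print(f"after step {n}: {' '.join(state)}")
```

Swapping the order of entries in `steps` is a quick way to see which intermediate states a different ordering would introduce.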
| 201 | + |
| 202 | +We could also try to resolve these two questions independently: |
| 203 | + |
| 204 | +1. What to do if both `db.couch` and `db.couch.compact` exist |
| 205 | +2. What to do if `db.G.couch` where `G > max_generation` exists |
| 206 | + |
| 207 | +But as we've seen, the cleanup operations for both these questions create states |
| 208 | +where the mutual ordering of their operations is important and it would be wise |
| 209 | +to minimise the set of possible such states. Therefore we suggest the following |
| 210 | +recovery routine: |
| 211 | + |
| 212 | +1. Attempt to open `db.couch`. If this succeeds, remove any generation files |
| 213 | + above `max_generation`. Otherwise... |
| 214 | + |
| 215 | +2. If `db.M+1.couch` where `M = max_generation` exists, then remove `db.M.couch` |
| 216 | + (if present), then rename `db.M+1.couch` to `db.M.couch` |
| 217 | + |
| 218 | +3. Rename `db.couch.compact` to `db.couch` |
| 219 | + |
| 220 | +4. Open `db.couch` |
| 221 | + |
| 222 | +This process works correctly if the generation files are cleaned up _before_ the |
| 223 | +rename of `db.couch.compact` to `db.couch` in `finish_compaction`. Otherwise the |
| 224 | +`.compact` file that indicates the partial completion of this process may not |
| 225 | +exist following a crash and this makes it harder to tell which generation files |
| 226 | +are valid on recovery. |
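
A minimal sketch of the suggested recovery routine follows, in Python rather than Erlang. Several things here are assumptions: the function name `recover` and its parameters are hypothetical, a file-existence check stands in for "attempt to open `db.couch`", the temporary generation uses the `db.M+1.couch` naming from step 2 of the routine, and a missing `db.couch.compact` is tolerated rather than treated as an error.

```python
import os
import re

def recover(db_dir, name, max_gen):
    """Return the path of the DB file to open after applying the recovery steps."""
    db = os.path.join(db_dir, f"{name}.couch")

    def gen_path(g):
        return os.path.join(db_dir, f"{name}.{g}.couch")

    if os.path.exists(db):
        # Step 1: db.couch is usable, so drop generation files above max_gen.
        pattern = re.compile(rf"^{re.escape(name)}\.(\d+)\.couch$")
        for entry in os.listdir(db_dir):
            match = pattern.match(entry)
            if match and int(match.group(1)) > max_gen:
                os.remove(os.path.join(db_dir, entry))
        return db

    # Step 2: finish moving the temporary generation into place.
    extra = gen_path(max_gen + 1)
    if os.path.exists(extra):
        if os.path.exists(gen_path(max_gen)):
            os.remove(gen_path(max_gen))
        os.rename(extra, gen_path(max_gen))

    # Step 3: promote the compacted file (assumption: skip if it is absent).
    if os.path.exists(db + ".compact"):
        os.rename(db + ".compact", db)

    # Step 4: the caller opens the returned path.
    return db
```

Because step 2 removes the old top generation before renaming, re-running the routine after a crash partway through it reaches the same final state.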
| 227 | + |
| 228 | +This also highlights that letting users modify `max_generation` for a DB is not |
| 229 | +safe while compaction is happening, because it may lead to confusion during |
| 230 | +recovery that could cause data loss (i.e. mistaken deletion of |
| 231 | +`db.2.couch.compact.maxgen`). |
| 232 | + |
| 233 | +If we allow post-creation changes to `max_generation`, then: |
| 234 | + |
| 235 | +- Increasing it is "free"; all existing data remains valid but it simply becomes |
| 236 | + _possible_ for future compactions to create higher generations. |
| 237 | + |
| 238 | +- Decreasing it requires a "reversed" compaction to move data from higher |
| 239 | + generations to lower ones followed by deleting the emptied generation(s). And |
| 240 | + so this change should properly be thought of as requesting a compaction with |
| 241 | + special properties. |