Draft
Generational compaction #5583
jcoglan wants to merge 11 commits into apache:main from neighbourhoodie:feat/generational-compaction
Conversation
To support a generational storage model, the #st struct needs to have
multiple file handles open. Whereas we currently back a shard with a
single file, `db.suffix.couch`, the generational model will augment this
with a set of "generation" files named `db.1.suffix.couch`,
`db.2.suffix.couch`, etc. The original `db.suffix.couch` file is
henceforth referred to as "gen-0".
Each of these file handles needs to be monitored by the incref/decref
functions and so we replace the `fd` and `fd_monitor` fields with a pair
of `{fd, monitor}` stored in the `fd` field. The new `gen_fds` field
stores a list of such pairs, and points at the `db.{1,2,...}.couch`
files.
The number of generational files opened is determined by a new field in
the DB header named `max_generation`. This defaults to 0, so all
existing databases stay on the current storage model and must
explicitly opt in to generational storage.
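As a rough illustration (not the engine's full record), the handle-related fields described above might look like the following sketch; all other `#st` fields are omitted:

```erlang
%% Sketch only: the real #st in couch_bt_engine.hrl has many more
%% fields. Note that max_generation lives in the DB header, not in
%% #st itself.
-record(st, {
    filepath,        % path to the gen-0 db.suffix.couch file
    fd,              % {Fd, Monitor} pair for the gen-0 file
    gen_fds = [],    % list of {Fd, Monitor} pairs for db.1.suffix.couch,
                     % db.2.suffix.couch, ..., up to max_generation
    header           % DB header; carries max_generation (defaults to 0)
}).
```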
Here we also add a set of functions that the engine and compactor will
need for managing generational files (a sketch of two of them follows
the list):
- `generation_file_path()`: returns the path to the Nth generation file;
returns the normal `db.suffix.couch` path for gen-0.
- `open_generation_file()`: opens and monitors the Nth generation file.
- `open_generation_files()`: opens and monitors all the files for
generations from 1 to N.
- `maybe_open_generation_files()`: opens and monitors all the generation
files unless the `compacting` option is set; the compactor does not
need to re-open the generation files, as it shares the existing
handles with the engine (i.e. we don't open multiple handles to the
same file).
- `open_additional_generation_file()`: when compacting the highest
generation, we will open an extra temporary file for its live data to
be moved into; if `max_generation` = M then this causes `gen_fds` to
contain M+1 file handles.
- `reopen_generation_file()`: once the file `db.N.couch` has been
compacted into `db.N+1.couch`, this function will remove and reopen
the existing `db.N.couch` file so that it becomes empty.
- `delete_generational_files()`: when deleting the database, this
removes all the generational files.
- `get_fd()`: returns the file handle for the Nth generation, or the
original gen-0 `db.suffix.couch` file.
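A minimal sketch of `generation_file_path()` and `get_fd()` under the naming scheme above; the exact signatures and `#st` field access are assumptions for illustration:

```erlang
%% Sketch: path for the Nth generation file. Gen-0 keeps the original
%% db.suffix.couch path; gen N inserts the number after the db name,
%% giving db.N.suffix.couch.
generation_file_path(FilePath, 0) ->
    FilePath;
generation_file_path(FilePath, Gen) when is_integer(Gen), Gen > 0 ->
    Dir = filename:dirname(FilePath),
    [DbName, Rest] = string:split(filename:basename(FilePath), "."),
    filename:join(Dir, lists:concat([DbName, ".", Gen, ".", Rest])).

%% Sketch: fetch the {Fd, Monitor} pair for generation Gen; gen-0 uses
%% the handle stored in #st.fd, higher generations index into gen_fds.
get_fd(#st{fd = Fd}, 0) ->
    Fd;
get_fd(#st{gen_fds = GenFds}, Gen) when Gen > 0 ->
    lists:nth(Gen, GenFds).
```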
In the generational storage model, all new docs/revs continue to be
written to "gen-0", the `db.suffix.couch` file. On compaction, live data
is "promoted" to the next generation; data in `db.couch` is moved to
`db.1.couch`, data in `db.1.couch` to `db.2.couch`, etc. Therefore, doc
body and attachment pointers need to include a representation of which
file they reside in.
This is accomplished by storing a pair of `{Gen, Ptr}` instead of just
`Ptr` when a body/attachment is written to generation 1 or above. When
writing to gen-0, we continue to just store the pointer, rather than
wrapping it in `{0, Ptr}`. This means that we continue to write
backwards-compatible data for databases that have not opted in to
generational storage, and it makes sure we can continue to read existing
data, as pointers stored in gen-0 look the same as they always have.
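For illustration, the pointer encoding described above can be sketched as follows; the helper names are hypothetical, and real disambiguation depends on the engine's actual pointer terms:

```erlang
%% Sketch: gen-0 pointers stay in their existing bare form so old data
%% remains readable; generations >= 1 are tagged with their number.
make_pointer(0, Ptr) ->
    Ptr;
make_pointer(Gen, Ptr) when Gen > 0 ->
    {Gen, Ptr}.

%% Sketch: recover the generation from a stored pointer, assuming a
%% tagged pointer is distinguishable from a bare gen-0 pointer.
pointer_generation({Gen, _Ptr}) when is_integer(Gen), Gen > 0 ->
    Gen;
pointer_generation(_BarePtr) ->
    0.
```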
This commit implements the generational compaction scheme wherein live data is "promoted" to a higher generation by the compactor. Each compaction run targets a specific generation N, from 0 up to the database's maximum generation M. If a database has gen-0 file `db.couch`, then the compactor works as follows:

- The compactor still creates `db.couch.compact.data` and `db.couch.compact.meta` files. If N = M then it also opens the file `db.M.couch.compact.maxgen`, and this file is added to the end of `gen_fds`, creating a temporary generation M+1 file.
- The compactor shares the `gen_fds` file handles with the main DB engine, so that only one file handle exists for these files at a time. Since only the compactor writes to generational files, it may be safe for it to open its own handles, but that is not currently implemented.
- All the *structure* of the database -- the by-id and by-seq trees, purge history, metadata, etc -- remains in the gen-0 file, that is, the new structure continues to be built in `db.couch.compact.data`. Only *data*, i.e. document bodies and attachments, is ever stored in a higher generation.
- If an attachment is currently stored in gen N, then it is copied into gen N+1. If it resides in a different non-zero generation, it remains where it is. If it resides in gen-0, and N > 0, then it is copied to `db.couch.compact.data`, since the original `db.couch` file will be discarded at the end of compaction.
- Document bodies follow the same rule, with one addition: if they contain any attachment pointers that have been moved by the previous rule, then a new copy of the document must be stored with updated attachment pointers. If the document is currently in gen N, then it is copied to gen N+1 with updated attachments. Otherwise, a fresh copy is written to its current generation -- either a generational file, or `db.couch.compact.data`.
- If N = M = 0, then doc/attachment data is copied from `db.couch` to `db.couch.compact.data`, rather than to `db.1.couch`. This means compaction continues to work as it currently does for existing databases.
- When compaction is complete, `db.couch.compact.data` is moved to `db.couch`. If N > 0 then `db.N.couch` is removed and reopened. Any live data it contained should now reside in `db.N+1.couch`. If N = M, then `db.M.couch.compact.maxgen` is moved to `db.M.couch`, and `gen_fds` reverts to its normal size.
- When N = M, i.e. we are compacting the max generation, the target generation will be the M+1 entry in `gen_fds`, but this file will eventually be moved to `db.M.couch`. Therefore we need to write pointers to this file's data with generation M, even though it is at position M+1 in `gen_fds` when it is being written to.
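The copy rules in the list above can be summarised in one hypothetical helper; it returns the generation recorded in the rewritten pointer when compacting generation N of a database with max_generation M (a sketch, not the committed code):

```erlang
%% Sketch of the promotion rules. 0 means the new gen-0 file,
%% db.couch.compact.data.
promotion_target(0, 0, 0) ->
    0;        % N = M = 0: classic compaction into compact.data
promotion_target(Gen, N, M) when Gen =:= N, N =:= M ->
    M;        % data goes to the temporary maxgen file, which later
              % becomes db.M.couch, so pointers are written with gen M
promotion_target(Gen, N, _M) when Gen =:= N ->
    N + 1;    % live data in the target generation is promoted
promotion_target(Gen, _N, _M) when Gen > 0 ->
    Gen;      % other non-zero generations are left in place
promotion_target(0, N, _M) when N > 0 ->
    0.        % gen-0 data follows the rebuilt gen-0 file
```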
This adds a parameter named `gen` to the `PUT /db` and `POST /db/_compact` endpoints. On `PUT /db` it sets the `max_generation` of the database at creation time; on `POST /db/_compact` it selects which generation to compact. The parameter defaults to zero in both endpoints.
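For example (assuming `gen` is passed in the query string, like the existing `q` and `n` parameters), creating a database with up to three generations and later compacting its generation 1 might look like:

```
PUT /dbname?gen=3
POST /dbname/_compact?gen=1
```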
In order for smoosh to trigger compactions of generations above 0, we need to store per-generation size information, rather than just storing the total for all the shard's files. The key changes are:

- `#full_doc_info.sizes` can now store a list of #size_info rather than a single record.
- `couch_db_updater:add_sizes()` uses the generation of the leaf pointer to build a list of #size_info, one for each generation. If there is only a single generation, then a single #size_info is returned, so that we continue to store a single #size_info record for non-generational databases and maximise backwards compatibility.
- In `couch_bt_engine`: `get_partition_info()` sums the sizes of each generation to return the total size of the partition shard; `split_sizes()` and `join_sizes()` can work on a list of #size_info as well as a single record; and `reduce_sizes()` can merge two lists of #size_info records.
- `couch_db_updater:flush_trees()` and `couch_bt_engine_compactor:copy_docs()` fold the attachment sizes into the active and external sizes when the end result is a multi-generation list of sizes.
- `couch_db:get_size_info()` returns a list of #size_info records. The first one is calculated for gen-0 as normal, i.e. the active size is obtained by adding all the tree sizes to the size of the stored data. For higher generations, the active size is just the size of the stored data.
- `fabric_db_info:merge_results()` continues to return a single object for the `sizes` of non-generational databases, but returns an array of per-generation size info for generational ones.
- `couch_db_updater:estimate_size()` sums the sizes of all generations to estimate the total size.
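As a sketch of the list-merging behaviour of `reduce_sizes()` described above, assuming zero-padding when the two lists differ in length (the real function handles further shapes):

```erlang
%% As defined in couch_db.hrl:
-record(size_info, {active = 0, external = 0}).

%% Sketch: merge two size summaries, each either a single #size_info
%% or a per-generation list of them.
reduce_sizes(Sizes1, Sizes2) when is_list(Sizes1), is_list(Sizes2) ->
    merge_size_lists(Sizes1, Sizes2);
reduce_sizes(#size_info{} = S1, #size_info{} = S2) ->
    #size_info{
        active = S1#size_info.active + S2#size_info.active,
        external = S1#size_info.external + S2#size_info.external
    }.

%% Merge per-generation lists pairwise; a missing entry counts as empty.
merge_size_lists(Rest, []) -> Rest;
merge_size_lists([], Rest) -> Rest;
merge_size_lists([S1 | Rest1], [S2 | Rest2]) ->
    [reduce_sizes(S1, S2) | merge_size_lists(Rest1, Rest2)].
```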
Now that we store per-generation size information, we can make smoosh trigger compaction when any generation passes a channel's thresholds. We achieve this by adjusting the events that smoosh reacts to, so that it considers a specific generation for compaction:

- When the `updated` event occurs, enqueue the affected database at generation 0, since all new data is written to gen-0.
- In `couch_bt_engine:finish_compaction_int()`, we return the compaction's target generation in the result. In `couch_db_engine:finish_compaction()` we use this value to emit a `compacted_into_generation` event. This notifies smoosh that the target generation has gained new data and should be considered for compaction into the generation above it.

The generation is then fed into `find_channel()` and `get_priority()` so that these functions examine the correct size information when deciding whether to trigger compaction. We also include the source generation in the compaction's "key" to identify which generation of a DB is being compacted, so that it resumes correctly after pausing or crashing.
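A sketch of how smoosh might map these events to the generation it enqueues; the event shapes and function names here are illustrative, not the committed API:

```erlang
%% Sketch: gen-0 gains data on every update; generation Gen gains data
%% whenever a compaction finishes with Gen as its target.
generation_for_event(updated) ->
    0;
generation_for_event({compacted_into_generation, Gen}) ->
    Gen.

%% Sketch: include the source generation in the compaction key so a
%% paused or crashed compaction of a specific generation resumes
%% correctly.
compaction_key(DbName, Gen) ->
    {DbName, Gen}.
```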
(This is a draft we're opening for discussion. The bulk of required information on design background, analysis and implementation is in the commits, including some design docs added to the repo. We will flesh this PR out as the feature gets closer to being ready.)
Overview
This PR implements a "generational" storage model in couch_bt_engine, which @janl and I have been working on. Its aim is to improve the performance of compaction on large databases with seldom-changing documents, where every compaction run currently has to copy a mostly-unchanged set of data into the new file.

The generational model splits a shard's data storage into multiple generations, where the usual db.couch file is "generation 0". On compaction, live data in this file is promoted into generation 1. The next time generation 0 is compacted, it does not have to copy the same set of data again, as much of it will have been moved to another file.

Further detail on the design and analysis is in design docs we have committed to the repo; see https://github.com/neighbourhoodie/couchdb/blob/feat/generational-compaction/src/couch/doc/generational-compaction. The commit messages give further details about the implementation.
Open questions
Testing recommendations
Related Issues or Pull Requests
Checklist
- Any new configurable parameters are documented in `rel/overlay/etc/default.ini`
- Documentation changes were made in the `src/docs` folder