Proposal: drop bbolt for chain storage
Rationale: bbolt is the largest CPU and memory consumer during initial sync, entirely due to storing headers and cfilters and fetching blocks for difficulty validation. The features provided by bbolt (in particular, transaction atomicity) are not necessary for storage of chain data. That, plus a few other desirable features (listed below), could be better provided by a new chain db implementation.
NOTE: this is NOT about moving the entirety of dcrwallet off bbolt, ONLY the chain data (headers and cfilters).
Downsides of bbolt
The following measurements are from a sample chain sync of a new SPV dcrwallet instance connected to a single remote source known to have good connectivity and being up to date. Sync height was 885252.
- Significant CPU performance hit compared to max theoretical
  - Total Wall time: 3m43s
  - Total CPU profile time: 1m25s (86s)
  - Approximate bbolt cpu time: 47s (~54% of CPU time)
  - Min possible: ~1.4s to write the min possible size on the same hardware
- Significant Mem performance hit compared to max theoretical
  - ~16GB total allocated memory
  - ~12.8GB bbolt related allocated memory
  - Min possible: close to zero additional memory over the non-bbolt related memory
- Significant disk size hit compared to max theoretical
  - 966MB total bbolt db size
  - Min possible without compression:
    - ~160MB for headers
    - ~279MB for CFilters
    - ~439MB total
Desired New Features
The following would be the new features desired out of a new chain db:
- Smallest possible performance hit
- Ability for fast/batch import of chain data
  - For example, if downloaded from a torrent or HTTP source
- Ability for sharing across multiple wallets/apps
  - Useful for installations that use multiple wallets
  - Ideally it should allow simultaneous access from multiple apps through some RPC interface
- Take into account Decred chain characteristics
  - Few alternative main chain tips
  - Low probability of > 1 block reorg (but still possible)
  - Very low probability of > 6 block reorg (technically possible, but very unlikely)
  - Make use of the AssumeValid and MinKnownChainWork chain parameters
  - Make use of max possible sizes of CFilters
  - Make use of the fact that older blocks are rarely accessed
- Modular enough that it could be used by products other than dcrwallet
  - Nice to have, but not a significant priority
Overall Design
This is a rough first draft of the design for this DB:
- Write a custom DB implementation (as opposed to using an existing DB engine)
- Split data storage into two arenas:
  - Long term arena (LTA): blocks after 6 confs
  - Short term arena (STA): the tip block plus its 5 immediate ancestors
The storage strategy for each arena may be different, according to what provides best throughput during IBD and ongoing sync (after IBD completes).
Long Term Arena
- Write directly to file
- Designed for fast iteration (backwards and forwards)
- Harder to rollback (may have to resort to just erasing all reorged blocks)
- Ready-made format (if you download it, it is complete and ready to use - modulo integrity and validation checks)
Short Term Arena
- Fixed size (max size for 6 blocks + cfilters)
- Circular buffer
- Indexed from memory, index rebuilt on restart
Long Term Arena File Format
Files
The long term arena is composed of four files. Before the actual data, each file starts with an 8 byte file header:
- Headers file
- Magic Header bytes 0x6463726865616430 ("dcrhead0")
- CFilters file
- Magic header bytes 0x6463726366733030 ("dcrcfs00")
- Index file
- Magic header bytes 0x6463726964783030 ("dcridx00")
- Sidechains File
- Magic header bytes 0x6463727364633030 ("dcrsdc00")
Headers File
File for storing headers. This file works as raw header storage. Each entry has the following format:
[180 bytes]: Header
This file MAY have padding at the end, filled with zero bytes.
Because headers have a fixed size, the last entry can easily be found by iterating backwards: it is the last header-sized slot that is not all zeros.
Headers are only put into the headers file of the LTA after they have at least 6 confs (see algorithms below). Due to this, the block hash of a header (other than the last one) can be found without requiring a hashing operation by simply looking at the PrevBlock field of the next header.
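The backward scan described above can be sketched as follows. This is a simplified illustration operating on an in-memory byte slice; a real implementation would read from an `*os.File`, and the function name is hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
)

const headerSize = 180    // fixed Decred header size
const fileHeaderSize = 8  // 8-byte magic ("dcrhead0")

// lastHeaderOffset scans a headers file image backwards in
// headerSize strides and returns the offset of the last slot that is
// not all zeros (i.e. not padding), or -1 if no header is present.
func lastHeaderOffset(file []byte) int64 {
	zero := make([]byte, headerSize)
	// Number of complete header slots after the magic bytes.
	n := (len(file) - fileHeaderSize) / headerSize
	for i := n - 1; i >= 0; i-- {
		off := fileHeaderSize + i*headerSize
		if !bytes.Equal(file[off:off+headerSize], zero) {
			return int64(off)
		}
	}
	return -1
}

func main() {
	// Fake file: magic + 2 non-zero headers + 1 zero-padded slot.
	file := make([]byte, fileHeaderSize+3*headerSize)
	copy(file, "dcrhead0")
	file[fileHeaderSize] = 1               // first header non-zero
	file[fileHeaderSize+headerSize+10] = 7 // second header non-zero
	fmt.Println(lastHeaderOffset(file))    // 188 (offset of second header)
}
```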
CFilters file
File for storing CFilters. Each entry has the following format:
[4 bytes] Total entry size
[4 bytes] CFilter size
[n bytes] CFilter
[4 bytes] CFilter proof index
[4 bytes] Number of proof entries
[n*32 bytes] CFilter inclusion proof
[4 bytes] Block Height
[4 bytes] Total entry size
After the last valid entry, this file has a footer with the following data:
[20 bytes] All ones
This file MAY have padding at the end, filled with zero bytes.
The goal of having a footer is to explicitly mark the end of (what should be) valid data. Note the footer itself is not a valid entry: the total entry size would be larger than the file itself and the height would be larger than the last header. The footer is meant to be easy to find when iterating backwards from the end of the file.
The goal of having the total entry size on both ends of the struct is to allow backwards iteration of the CFilters file: after identifying the start of the footer, the last valid entry can be correctly read by reading the final total entry size and prior entries can be iterated backwards.
This will be useful during DB startup, to validate the last entry of the CFilters file matches the last header.
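The backward iteration enabled by the duplicated total-entry-size field can be sketched as below. The byte order and function names are assumptions for illustration; the proposal does not pin down endianness:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// entryStartsBackwards walks entry start offsets in reverse, given the
// offset just past the last valid entry (i.e. the start of the footer).
// Each entry records its total size in both its first and last 4 bytes.
func entryStartsBackwards(file []byte, end int) []int {
	var starts []int
	for end > 0 {
		// Read the trailing 4-byte total entry size.
		total := int(binary.LittleEndian.Uint32(file[end-4 : end]))
		start := end - total
		starts = append(starts, start)
		end = start
	}
	return starts
}

// makeEntry builds a minimal entry of the given total size with the
// size recorded at both ends (the middle fields are left zeroed).
func makeEntry(total int) []byte {
	e := make([]byte, total)
	binary.LittleEndian.PutUint32(e[:4], uint32(total))
	binary.LittleEndian.PutUint32(e[total-4:], uint32(total))
	return e
}

func main() {
	var file []byte
	file = append(file, makeEntry(40)...)
	file = append(file, makeEntry(64)...)
	file = append(file, makeEntry(52)...)
	fmt.Println(entryStartsBackwards(file, len(file))) // [104 40 0]
}
```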
Index file
File for storing the index of block hashes to block height and CFilter entry location. This can be rebuilt entirely from the headers and CFilters file if needed, so it could be an optional feature.
Each entry contains the following:
[4 bytes] Reduced block hash
[4 bytes] CFilter offset location
[4 bytes] Flags
    [bits 0:28] Total CFilter entry size
    [bits 29:30] Unused
    [bit 31] Has Full Block Hash
[0 or 32 bytes]: Full block hash
Because entries are stored in height order, the height of a block is implicit (the first entry in the file corresponds to block 1, and so on). Because header sizes are fixed, the offset into the headers file is also implicit.
The "reduced block hash" is similar to the "short block key" in dcrd: it's the first 32 bits of the block hash.
The MSB of the flags field indicates whether the full block hash is present in this entry, while bits 0 to 28 hold the total CFilter entry size.
The full block hash is only present in instances where there is a collision in the reduced block hash. In this case, the first block where a particular reduced block hash occurs does not have the flag set, while later occurrences have it set. The wallet can reconstruct the index appropriately. On the current mainnet, as of block 885208, only 84 blocks have a collision in their 32 bit reduced block hash key.
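A minimal sketch of packing and unpacking the flags field, following the bit layout above (function names are illustrative):

```go
package main

import "fmt"

// Bit layout of the index entry flags field: bits 0:28 hold the total
// CFilter entry size, bit 31 marks that a full block hash follows.
const hasFullHashFlag = uint32(1) << 31
const sizeMask = uint32(1)<<29 - 1 // bits 0:28

func packFlags(cfEntrySize uint32, hasFullHash bool) uint32 {
	f := cfEntrySize & sizeMask
	if hasFullHash {
		f |= hasFullHashFlag
	}
	return f
}

func unpackFlags(f uint32) (cfEntrySize uint32, hasFullHash bool) {
	return f & sizeMask, f&hasFullHashFlag != 0
}

func main() {
	f := packFlags(1234, true)
	size, full := unpackFlags(f)
	fmt.Println(size, full) // 1234 true
}
```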
This file contains a footer. The last entries of this file are:
[44 bytes] All Ones
[12 bytes] All zeros
The footer is intended as a marker that the last write operation of the index was finished correctly. When reading the index file, if the footer is not correctly present, the file may be assumed as corrupted and should be rebuilt from scratch. Both entries of the footer are invalid: the all zeros entry points to offset location 0 in the CFilters file, while the all ones points to an entry that will be at an offset larger than the CFilters file size.
This file MAY have padding filled with zero bytes after the last two entries. The footer, therefore, ensures the presence of at least 12 zero bytes at the end (but more are possible).
The goal for the index file is to allow the wallet to quickly load the known blockchain into an in-memory index, for fast header and CFilter lookup. The current mainnet chain as of height 885208 would require (at a minimum, disregarding overhead for pointers and other structures) about 10.6 MB of RAM to keep this index in memory.
There is an alternative design we may pursue for the index that does not require keeping any data in memory. The drawback for this design is that it has an increased disk space requirement and requires disk accesses for fetching data from the index. See Appendix 1 at the bottom for more details.
Sidechains File
File for storing blocks that were once part of the main chain but that have been reorged out.
The entries in this file are rarely needed - they are only used in certain scenarios that involve failed writes and multi-wallet usage of the DB.
Each entry in this file has the following format:
[4 bytes] Total entry size
[4 bytes] CFilter size
[n bytes] CFilter
[4 bytes] CFilter proof index
[4 bytes] Number of proof entries
[n*32 bytes] CFilter inclusion proof
[180 bytes] header
[32 bytes] block hash
[4 bytes] Total entry size
This file may be padded with zero bytes at the end.
Note that entries' fields are organized in reverse of what would be ordinarily expected: block hash and header appear at the end of the entry. This is meant to ease reverse iteration of this file, when looking for a block with specific hash or height.
There is no particular structure to this file. However, it is expected that later blocks will appear after earlier blocks. When a reorg greater than depth 1 happens, blocks that were reorged out will appear contiguously, in block sequence order.
Short Term Arena
This arena is meant to store the most recent 6 blocks: the tip block plus its five immediate ancestors. It is organized as a single file used in a circular buffer fashion: a new tip block causes one block to be evicted from this arena after that block is written to the long term arena, and the evicted block's storage location is then overwritten with the newly received block (assuming the new block extends the chain, as opposed to creating a sidechain).
This file starts with an 8 byte magic header: 0x6463727374613030 ("dcrsta00").
Each entry in the file has the following structure:
[180 bytes] Block Header
[4 bytes] CFilter size
[1174036 bytes] CFilter
[4 bytes] CFilter proof index
[4 bytes] Number of proof entries
[64*32 bytes] CFilter inclusion proof
[32 bytes] Special File Checksum
The CFilter storage space is pre-selected according to the max possible CFilter size. It is pre-determined in order to ensure each entry has a completely fixed size on-disk. If the max possible sizes of CFilters change due to a consensus rule change, then this would require a db upgrade.
The max number of CFilter inclusion proofs is set to 64 for a similar reason. While the currently deployed consensus rules enforce that the inclusion proof for CFilters has size 0, we opt to increase the max possible size of the inclusion proof in order to ensure future rules upgrades do not require a db upgrade. This would only be required if a very large number of new commitments were added to the header.
The special file checksum is special in the sense that it is not a standard file checksum/hash computed by processing the file in-order. Instead, the checksum is calculated by hashing the data for the block, plus its 5 immediate ancestors in block sequence, excluding their own special file checksum. This is important to note, because the blocks in the short term arena file are NOT stored in height order. The checksum is meant to ensure the last write operation completed successfully: during startup, while loading the blocks in the short term arena, the wallet may verify that the checksum of the tip block matches the expected one based on prior blocks, and if not, then it should discard all blocks in this arena and re-fetch them (or rewind until the checksum matches).
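The checksum computation can be sketched as below. SHA-256 is a placeholder here, since the choice of checksum function is an open question (see Appendix 6), and the entry serialization is simplified to opaque byte slices:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// specialChecksum hashes the serialized entry data (header, CFilter,
// proof) of a block and its five immediate ancestors, in block
// sequence order, excluding the stored checksums themselves.
func specialChecksum(entries [][]byte) [32]byte {
	h := sha256.New()
	for _, e := range entries {
		h.Write(e)
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	blocks := [][]byte{[]byte("b1"), []byte("b2"), []byte("b3"),
		[]byte("b4"), []byte("b5"), []byte("tip")}
	// Reordering the ancestors changes the checksum, so the startup
	// check also catches blocks written to the wrong buffer slot.
	reordered := [][]byte{[]byte("b2"), []byte("b1"), []byte("b3"),
		[]byte("b4"), []byte("b5"), []byte("tip")}
	sum := specialChecksum(blocks)
	fmt.Println(sum == specialChecksum(blocks), sum == specialChecksum(reordered)) // true false
}
```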
Any number of entries in this file may be completely filled with zeroes (for example, because a reorg happened or because the file was recreated after a failed write operation).
Algorithms For Blockchain Evolution
Of particular importance is considering write failures before/during/after each step of operation, how that can be detected (i.e. what is the disk state and startup validation procedures) and what action should be taken.
Connecting new blocks
This is the procedure for connecting recently mined, individual blocks. This is the primary write function during steady-state operation of the DB (after IBD has completed) and is the most likely to fail due to lack of disk space (because it is the primary one that may require growing the files).
Assumptions:
- All blocks more than 6 blocks below the current tip have been written to the long term arena.
- The tip block and its immediate ancestors have been written to the short term arena.
- The block index is fully populated.
- The block that is being connected extends the main chain tip.
- Both the new block header and CFilter data have been fetched.
Process:
1. Calculate the new Special File Checksum of the new block + its most recent 5 ancestors.
2. Grow any files that need it (pad them with zeroes).
3. Store the oldest header from the STA in the headers file of the LTA.
4. Store the oldest header's CFilter in the CFilters file of the LTA.
5. Store the oldest header's metadata in the index file of the LTA.
6. Overwrite the entry of the oldest header in the STA with the new block data.
Step 6 can be roughly understood as the "commit" stage: if the checksum of the most recent block in the STA is correct and the most recent block of the LTA is at a 6 block delta from it, then the prior write operation was successful.
Failures before that step are detected by the prior validation check. Failures during steps 1 to 5 are easily recovered: the write operation can simply be repeated (completely, or starting from where it failed).
A failure at step 6 means the most recent block of the STA is damaged (its checksum is mismatched) and should be discarded: in that case, the client code (wallet) will fetch the most recent block after DB startup completes. The faulty entry may be cleaned up by being zeroed.
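The connect flow above can be modeled with a toy in-memory version of the two arenas. Heights stand in for full block data and the type names are hypothetical; real code would write to files and fsync at each step:

```go
package main

import "fmt"

// chainDB models the connect-block flow: the short term arena is a
// 6-slot circular buffer; connecting a block first moves the oldest
// entry to the long term arena, then overwrites its slot.
type chainDB struct {
	lta  []int  // heights in the long term arena, in order
	sta  [6]int // circular buffer of the 6 most recent heights
	next int    // slot to overwrite on the next connect
	n    int    // number of occupied STA slots
}

func (db *chainDB) connect(height int) {
	if db.n == len(db.sta) {
		// Steps 3-5: persist the evicted (oldest) block in the LTA.
		db.lta = append(db.lta, db.sta[db.next])
	} else {
		db.n++
	}
	// Step 6: overwrite the oldest slot with the new block ("commit").
	db.sta[db.next] = height
	db.next = (db.next + 1) % len(db.sta)
}

func main() {
	var db chainDB
	for h := 1; h <= 8; h++ {
		db.connect(h)
	}
	fmt.Println(db.lta) // [1 2]
	fmt.Println(db.sta) // [7 8 3 4 5 6]
}
```

Note how the STA contents are not in height order after wrap-around, which is why the special file checksum (rather than position) ties the buffer together.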
Adding an Alt Chain Tip
This happens purely in memory. The memory-only sidechain forest is updated to track the new chain tip, but otherwise no file modification is necessary.
Perform a <= 6 depth reorg
This is the procedure for connecting an alternative chain tip that DOES cause a reorg. This happens episodically within the Decred chain when a new block is found that extends an alternative chain tip such that the resulting chain has more work than the current best chain.
Assumptions:
- All blocks more than 6 blocks below the current tip have been written to the long term arena.
- The block index is fully populated.
- The alternative headers and CFilter data have been fetched.
- The DB has already determined the new chain contains an alternative chain tip that DOES cause a reorg.
Process:
1. If the new tip also extends the chain past the 6 block depth mark, perform the standard block connection process for blocks that will end up at depth > 6 (but during step 6, do not overwrite the older entries but rather zero them).
2. Calculate the new Special File Checksum for each of the new blocks + its most recent 5 ancestors (including the ones outside the STA).
3. Store the (soon to be) reorged-out blocks to the sidechains file.
4. Zero out all disconnected blocks in the STA in reverse block sequence order.
5. Write the new blocks in the STA in block sequence order.
Step 3 is necessary so that, even if a failure happens in the software after the entire reorg procedure is carried out, the DB clients (wallet) will be able to fetch the header and CFilter of the reorged-out block in order to carry out their own reorg procedures (e.g. rewinding transactions). It is particularly necessary when a single DB is used for multiple wallets (which may not be online when the DB is updated).
Steps 4 and 5 are split so that, upon restart, it is obvious whether one of them failed and whether the STA is correct or not. Failures during step 4 mean only part of the old block data has been erased, while failures during step 5 mean only part of the new block data has been written. In either case, that entry's checksum (which is the last information written to disk for an entry) will fail to validate.
Doing steps 4 and 5 separately may be seen as superfluous (as the overwrite would erase the data anyway), and doing it in block sequence order (as opposed to file order) may be seen as taking a performance hit, but doing it in this fashion will increase the chances of having as many blocks as possible written correctly to disk, in case of write failures.
Note that no > 1 block depth reorg has ever been publicly documented by long running Decred nodes, and thus it is expected that only depth 1 reorgs will be needed in practice (though this algorithm should support reorgs up to depth 6).
Perform a > 6 depth reorg
This is the hardest reorg to perform in the DB and the one that could most likely cause corruption. Therefore, this process is geared towards correctness as opposed to performance.
Assumptions:
- All blocks at depth greater than the reorg depth have been written to the long term arena.
- The block index is fully populated.
- The alternative headers and CFilter data have been fetched.
- The DB has already determined the new chain contains an alternative chain tip that DOES cause a > 6 depth reorg.
Process:
1. Calculate the new Special File Checksum for the new tip + its most recent 5 ancestors.
2. Store the (soon to be) reorged-out blocks to the sidechains file.
3. Zero out the STA file.
4. Zero out the disconnected entries in the headers file.
5. Zero out the disconnected entries in the CFilters file.
6. Zero out the disconnected entries in the index file.
7. Grow any files as needed.
8. Add the headers that are > 6 blocks from the new tip to the headers file.
9. Add the CFilters entries that are > 6 blocks from the new tip to the CFilters file.
10. Add the index entries that are > 6 blocks from the new tip to the index file.
11. Recreate the STA in block sequence order.
The first disk operation must be adding the reorged-out blocks to the sidechains file to ensure any clients may fetch that data if it is needed to perform its own reorg procedure.
The rationale for zeroing out files before performing any of the write operations is the same as for regular reorgs: this increases the confidence that write failures will leave the DB in a state that is at least partially consistent in the sense that blocks earlier than one that is detected as broken are known to be correct.
The rationale for zeroing out the headers file first (as opposed to last) is that headers missing their CFilters would cause clients (i.e. the wallet) to fetch the missing CFilters, while extra CFilters without corresponding headers are a faulty state detected and corrected during DB startup (the extra CFilters and index entries may be erased). The first option could cause the client to attempt to fetch CFilters that, at a minimum, are no longer needed and, in the worst case, will no longer be provided by any nodes. The second is easy to recover from after the extra data is erased: syncing continues as if the wallet were out of sync.
Note that if the process fails before step 8, the blocks that would be reorged-in (the new chain) are not recorded and will have to be downloaded again after startup.
Initial Block Download
During IBD, a very large number of headers and CFilters is fetched at once, in batches. It is therefore ideal to make use of that information to avoid performing redundant writes. In particular, writes to the Short Term Arena are unnecessary, because it is likely that a new batch of headers that will soon be received will overwrite it.
Writes to the LTA files can be batched: a batch of headers, CFilters and block index changes may be written directly, instead of being written block by block (assuming the clients have validated the correctness of the headers).
Additionally, the files can be grown directly in one operation, once the target sync height is known (however, care should be taken to limit the max file growth possible when syncing from untrusted sources).
The IBD batch addition process may be carried out until the tip is within 6 blocks from either the target sync height or the estimated blockchain height (calculated based on the target block production interval). After that, the DB should switch to the standard, steady-state operation mode.
DB Startup
Several validations must be performed during DB startup in order to ensure the data is consistent, and to fix faulty states in case of write failures. The following is a rough process for loading the DB from disk and creating its in-memory structures:
- Load the last header from the LTA headers file
- Load the last CFilter from the CFilters file
  - If the height of the last filter is greater than that of the last header, remove filters until they match
  - Verify the CFilter proof is valid for the last header
    - For blocks before DCP0005, there should be an inclusion proof of size 1 equal to the block hash
    - Remove filters that fail validation in reverse block sequence order
  - If the height of the last valid filter is less than that of the last header, mark the DB as requiring missing CFilters
- Load the index file
  - Identify and validate the footer
  - If the height of the last entry is greater than that of the last CFilter, remove entries until they match
  - If the height of the last entry is less than that of the last CFilter, create entries until they match
  - If the reduced block hash of the last entry in the index does not match the corresponding header, rewind the index file until they match
- Load the sidechains file
  - Identify the last entry and record its height and hash
- Load the Short Term Arena file
  - Load all (non-zero) entries into memory
  - Validate all checksums
    - Remove entries whose checksum does not match
  - If there are CFilters in the STA that are not in the CFilters file, add them
    - If this fills in all missing CFilters, mark the DB as NOT requiring missing CFilters
  - If there are blocks in the STA that are already stored in the LTA (duplicated), remove them from the STA
Final Discussion
This proposal is meant to establish whether we want to move the wallet to a new database implementation for its chain data.
If this is deemed as a reasonable goal, then the design in this proposal may be tweaked as necessary and implemented as a new independent package for testing. It can then be integrated into the wallet, with appropriate code to migrate the state of old wallets.
Appendix 1: Alternative Index File Design
Instead of storing the index by height (one entry per block), we may opt to store the index by reduced hash. Assume we use 24 bits of the reduced hash as key. We can create a file with 2^24 entries, each taking 16 bytes for a total of ~268MB. Each entry in this file would correspond to the presence of a block with that reduced hash value.
If the entry is all zeroes, that block does not exist. Otherwise, it may be set to one of two values (appropriately flagged):
- Height, CFilter offset and total size of CFilter entry
- Offset within the index file that has the full table of block hash to index and CFilter entry data
The second value type is necessary when 24 bits are not sufficient to disambiguate between different blocks (i.e. more than one block shares the same reduced hash key). As of height 885208, there are 22815 blocks sharing the first 24 bits of their reduced hash key, with the max number of blocks sharing the same prefix being 4.
This design sacrifices disk space and query performance for reduced RAM usage.
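The fixed-slot lookup of this alternative design can be sketched as follows. The byte order used to derive the 24-bit key is an assumption; the proposal does not pin it down:

```go
package main

import "fmt"

// Each possible 24-bit reduced hash key maps to a fixed 16-byte slot,
// giving a 2^24 * 16 bytes (~268MB) index file.
const slotSize = 16

// slotOffset returns the file offset of the slot for a block hash,
// using its first three bytes as the reduced 24-bit key.
func slotOffset(blockHash [32]byte) int64 {
	key := uint32(blockHash[0])<<16 | uint32(blockHash[1])<<8 | uint32(blockHash[2])
	return int64(key) * slotSize
}

func main() {
	var h [32]byte
	h[0], h[1], h[2] = 0x01, 0x02, 0x03
	fmt.Println(slotOffset(h)) // 1056816 (0x010203 * 16)
}
```

An all-zero slot at that offset means no block with that key exists; otherwise the slot either holds the height/CFilter data directly or points to an overflow table for colliding keys, as described above.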
Appendix 2: Use of Syscalls, MMap, and io_uring
The natural implementation for this DB would use standard syscalls for file reads and writes. The writes are small enough and (apart from IBD) infrequent enough that memory mapped files, io_uring or other more advanced IO APIs should not be necessary (especially considering the final writes need to be fsynced for durability).
However, it would be ideal if the data layer of the DB could be abstracted enough to support any IO API. This will need to be investigated.
Appendix 3: Use of Chain Data by Multiple Wallets
The same chain data could easily be used by multiple wallets (though not simultaneously): during DB startup, the LTA and STA files would be locked by the running instance, ensuring only a single wallet instance can use them.
Then, after the wallet has gained exclusive access to these files, it can work as usual. During wallet startup (after DB startup), if it detects its rescan point (which is NOT stored in the chain DB) is earlier than the best chain header, this means some other wallet already updated the chain and the starting wallet should perform appropriate rescan and transaction sync duties.
dcrwallet could be configured to use a separate (global) dir for chain data, such that appdata and this new chaindata could be configured separately.
This would help multi-wallet apps (such as Decrediton) by reducing the total amount of space needed for each wallet (about 1GB with the current bbolt implementation on mainnet).
Appendix 4: Concurrent Access by Multiple DB Instances
This is a tricky subject, in particular because it would allow the chain to be updated "externally" from the point of view of a running wallet: wallet A could add a block to the chain that wallet B would have to be notified about.
This would require much larger changes in the wallet, so that it was driven by the chain DB (as a syncer) instead of the existing syncer packages (RPC and SPV).
If this is deemed as a worthy goal, we can discuss further requirements and challenges, but this is currently listed as an additional feature.
Appendix 5: Alignment of CFilter data in disk
Needed? Desired?
Appendix 6: Choice of checksum function
Ideally, this would be a rolling hash function so that on new blocks, the data for the last block is removed while the data for new block is added. This needs to be investigated.