
test: Adds tests showcasing existence of certain bugs on bp32 implementation - DO NOT MERGE #5442

Open
clockworkgr wants to merge 2 commits into feat/jae/bp32tree from
test/bp32tree-review

Conversation

@clockworkgr
Contributor

B+32 Tree Code Review — Bug Summary

Review of the tm2/pkg/bptree/ package, a versioned B+32 tree intended as a
drop-in replacement for IAVL. Each bug is cross-referenced against standard
tm2 blockchain usage patterns to assess real-world likelihood.

Reproducer tests for all bugs: bugs_test.go


Bug Summary Table

| # | Bug | Severity | Location | Likelihood | Impact |
|---|-----|----------|----------|------------|--------|
| 1 | Shallow clone shares mutable slices | Low (latent) | node.go:123 | None currently | Silent version corruption if future code mutates key bytes in-place |
| 2 | resolveValue returns hash as value | Medium | mutable_tree.go:112, immutable_tree.go:26 | None on normal store path | 32-byte hash returned instead of value; silent data corruption |
| 3 | SaveValue bypasses batch; Rollback leaks | Medium | nodedb.go:151 | Certain | Slow unbounded DB bloat from orphaned values |
| 4 | No bounds check on deserialized numKeys | Medium | node.go:209 | Low | Panic or undefined behavior on corrupted/malicious data |
| 5 | Exporter goroutine leak + version reader lock | Medium | export.go:26 | Low (dormant) | Goroutine leak; version permanently unprunable |
| 6 | Pruning deletes shared nodes across versions | Critical | prune.go:103-142 | Certain | Chain halt, node bricked on restart |

Bug Details

Bug 1 — Shallow Clone Shares Mutable []byte Slices

Location: node.go:123. Clone() does c := *n (struct copy).

Problem: The struct copy shares all []byte slice pointers between original
and clone. If any code path later mutates key bytes in-place (e.g. via
append() or direct byte writes), the mutation would silently corrupt the
original node in a different version.

Current status: Exhaustive audit of insert.go, remove.go, and split.go
confirms all key mutations use slot reassignment or copyKey(). No in-place
byte mutation exists today. The bug is latent.

Blockchain likelihood: None currently. One append() away from becoming a
silent cross-version corruption bug.

Tests: TestBug1_CloneSharesSlices, TestBug1_COWSafety,
TestBug1_COWRegressionTest


Bug 2 — resolveValue Returns 32-Byte Hash as Value

Location: mutable_tree.go:112, immutable_tree.go:26

Problem: B+32 stores values out-of-line: leaf nodes contain SHA256 hashes,
actual values live in separate DB entries. When resolveValue() is called
without a ValueResolver set, it silently falls back to returning the 32-byte
hash as if it were the value. No error is returned.

// immutable_tree.go:26
func (t *ImmutableTree) resolveValue(vh Hash) ([]byte, error) {
    if t.valueResolver != nil {
        return t.valueResolver(vh)
    }
    return vh[:], nil  // BUG: returns hash, not value
}

Blockchain likelihood: None on normal path. The store layer
(store/bptree/store.go:70) always sets valueResolver via
st.mtree.GetValueByHash(). Three layers of setup ensure it's wired. Only
exposed if someone constructs a tree manually (tests, external tooling).

Tests: TestBug2_ResolveValueReturnsHash, TestBug2_ImmutableResolveValue,
TestBug2_GetReturnsHashNotValue, TestBug2_StoreLayerSetsResolver


Bug 3 — SaveValue Bypasses Batch; Rollback Leaks Values

Location: nodedb.go:151

Problem: SaveValue() writes directly to the DB via ndb.db.Set(), while
all other writes (SaveNode, SaveRoot) use ndb.batch.Set(). This means
values are persisted immediately and irrevocably, even if the block is later
rolled back. Rollback() only restores the root pointer — it cannot undo the
direct DB writes.

// nodedb.go:151 — direct write, not batched
func (ndb *NodeDB) SaveValue(vh Hash, value []byte) {
    ndb.db.Set(vh[:], value)  // BUG: bypasses batch
}

// Compare with SaveNode, which uses the batch:
func (ndb *NodeDB) SaveNode(node Node) {
    ndb.batch.Set(key, data)  // correct: batched
}

Blockchain likelihood: Certain. Every Set() call during block execution
writes the value immediately. Rollback() after failed ABCI DeliverTx leaves
orphaned values. The codebase acknowledges this — PLAN.md notes values are
"never GC'd".

Impact: Slow DB bloat. Not a correctness bug (values are content-addressed
and deduped by hash), but disk usage grows monotonically with unreferenced
values. On a high-throughput chain, this could add GBs/year of dead data.

Tests: TestBug3_SaveValueBypassesBatch, TestBug3_RollbackLeavesOrphanedValues,
TestBug3_ValueVisibleBeforeCommit, TestBug3_OrphanAccumulation


Bug 4 — No Bounds Checking on Deserialized numKeys

Location: node.go:209. ReadNode() casts uvarint to int16 without
validation.

Problem: When deserializing a node from DB, numKeys is read as a uvarint
and cast to int16. No check ensures it falls within [0, B-1] (i.e.
[0, 31]). A value of 32 causes an array index panic. A value of 32768
overflows int16 to -32768 and is silently accepted, leading to undefined
iteration behavior.

Blockchain likelihood: Low but non-zero. Requires corrupted storage,
bit-flip, or malicious state sync snapshot. Standard operation always writes
valid numKeys. However, there is no defense against storage corruption — a
single corrupted byte can brick a node.

Tests: TestBug4_NumKeysOverflow, TestBug4_NegativeNumKeysOverflow,
TestBug4_ZeroNumKeys, TestBug4_MaxValidNumKeys


Bug 5 — Exporter Goroutine Leak and Permanent Version Reader Lock

Location: export.go:26. go e.run() with no context cancellation.

Problem: Export() spawns a goroutine that traverses the tree and sends
nodes on a channel. If the caller abandons the Exporter without calling
Close(), the goroutine blocks forever on channel send and the version reader
count is never decremented. This permanently prevents that version from being
pruned.

IAVL solves this with context.WithCancel(). B+32 has no cancellation
mechanism.

// export.go — no way to cancel the goroutine
func (t *ImmutableTree) Export(ndb *NodeDB) (*Exporter, error) {
    ndb.incrVersionReaders(t.version)
    e := &Exporter{...}
    go e.run()  // BUG: no context, no cancellation
    return e, nil
}

Blockchain likelihood: Low currently. Export/Import is only used in tests,
not production state sync. But when state sync is implemented (the interface
exists), this becomes a real concern — any timeout, error, or interruption
during export leaks a goroutine and locks the version.

Tests: TestBug5_ExporterGoroutineLeak, TestBug5_VersionReaderLeak


Bug 6 — Pruning Deletes Nodes Shared Across Versions (CRITICAL)

Location: prune.go:103-142, walkAndPrune().

Problem: The pruning algorithm walks old-version and new-version trees in
parallel. For each old inner node, it finds the "corresponding" node in the new
tree and builds a set of the new node's child hashes. Any old child whose hash
is not in this set is deemed orphaned and deleted.

The flaw: after an inner node split, children from the old node may be
distributed across two or more nodes in the new version. The algorithm only
checks one corresponding node, so it treats children that moved to a sibling
(due to the split) as orphaned and deletes them — even though they are actively
referenced by the new version.

// prune.go:128-142 — only checks ONE corresponding new node
newChildHashes := make(map[Hash]bool)
for i := int16(0); i <= newNode.numKeys; i++ {
    newChildHashes[newNode.childHashes[i]] = true  // misses split siblings
}
for i := int16(0); i <= oldNode.numKeys; i++ {
    if !newChildHashes[oldNode.childHashes[i]] {
        // BUG: child may have moved to a split sibling, not orphaned
        ndb.deleteSubtree(oldNode.childHashes[i])
    }
}

Blockchain likelihood: Certain. Default pruning strategy is PruneSyncable
(KeepRecent=705,600, KeepEvery=10). Pruning runs synchronously in every
Commit(). The first prune at block ~705,601 operates on a tree that has
undergone thousands of inner node splits. The bug is guaranteed to trigger.

Impact: Chain halt with three escalating failures:

  1. Next prune panics: findCorrespondingChild traverses into a deleted
    subtree and panics on missing node.
  2. Block execution fails: Get()/Has() on keys under deleted subtrees
    panic during ABCI DeliverTx.
  3. Node bricked on restart: LoadVersion() panics when encountering
    missing nodes — the node cannot start without state sync from peers.

Confirmed by tests:

  • TestBug6_SingleVersionPruneCorruptsTree: Panic at block ~98 with per-block
    pruning, 302 cascading errors.
  • TestBug6_PruneCorruptsNewerVersions: Latest version corrupted; only
    4,686 of 18,000 keys readable after bulk prune.
  • TestBug6_PruneBricksNodeOnRestart: LoadVersion() panics on cold restart
    after pruning.

Tests: TestBug6_SingleVersionPruneCorruptsTree,
TestBug6_PruneCorruptsNewerVersions, TestBug6_PruneBricksNodeOnRestart


Usage Patterns Checked

| Pattern | How tm2 Uses It | Bug Exposure |
|---------|-----------------|--------------|
| Commit() | SaveVersion() then DeleteVersionsTo(latest - KeepRecent) | Bug #6 on every prune cycle |
| Default pruning | PruneSyncable: keep 705,600 versions, waypoint every 10th | First prune at block ~705,601; massive tree maximizes Bug #6 surface |
| Rollback() | Called on failed ABCI block; restores root pointer only | Bug #3: orphaned values accumulate |
| GetImmutable() | Store always calls SetValueResolver() | Bug #2: not triggered on normal path |
| Export/Import | Only in tests; state sync not yet in production | Bug #5: dormant until state sync ships |
| Cold restart | LoadVersion(latest) on node startup | Bug #6 aftermath: panics if pruning has corrupted the tree |

Bottom Line

Bug #6 is a ship-blocker. Any chain running B+32 with default pruning will
eventually panic and brick. The 705,600-block delay before first prune means it
passes all testing but fails catastrophically in production (roughly 8 days at
1-second blocks: 705,600 s ≈ 8.2 days).

Bug #3 is the only other bug guaranteed to manifest — it causes slow DB
bloat but not correctness failures.

Bugs #2, #4, #5 are dormant under current usage but represent landmines for
future features (state sync, external tooling, storage corruption recovery).

Bug #1 is not currently exploitable but is one append() away from becoming
a silent corruption bug.

@Gno2D2
Collaborator

Gno2D2 commented Apr 7, 2026

🛠 PR Checks Summary

🔴 Must not contain the "don't merge" label

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🔴 Must not contain the "don't merge" label

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Must not contain the "don't merge" label

If

🟢 Condition met
└── 🟢 A label matches this pattern: don't merge (label: don't merge)

Then

🔴 Requirement not satisfied
└── 🔴 On no pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

notJoon added a commit to notJoon/gno-core that referenced this pull request Apr 8, 2026
Add bound checks in deserialization to prevent OOM and OOB panics from corrupted node data. `readBytes` now rejects lengths exceeding the reader's remaining bytes, and `readInnerNode`/`readLeafNode` reject `numKeys` beyond the fixed array capacity.

Ref: gnolang#5442 (Bug no.4: no bounds check on deserialized numKeys)
@notJoon notJoon mentioned this pull request Apr 8, 2026

Labels

don't merge: Please don't merge this functionality temporarily
📦 🌐 tendermint v2: Issues or PRs tm2 related
