Skip to content

Disable embedded-blob ingestion in crash test (T277310719)#14899

Open
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D110100542
Open

Disable embedded-blob ingestion in crash test (T277310719)#14899
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D110100542

Conversation

@anand1976

@anand1976 anand1976 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Summary:
Disables the experimental embedded-blob SST ingestion path in the crash test (ingest_external_file_with_embedded_blobs is forced to 0) to unblock the fbcode_blackbox_crash_test CI signal tracked in T277310719. The feature still has unresolved correctness bugs, so this keeps it out of the crash test until they are fixed. Re-enable by reverting the one-line change; the gating logic in finalize_and_sanitize is left in place.

Culprit

The embedded-blob feature was introduced by D108564468 (commit 5f01ffb1d40c, "[rocksdb][PR] Add experimental embedded blob SST support"), which also enabled ingest_external_file_with_embedded_blobs in db_crashtest.py. The first crash-test failure on T277310719 appeared the same day, shortly after that commit. D109796428 ("Fix use-after-free in EmbeddedBlobResolvingIterator when key() called before value()") fixed one of the bugs below but the corruption persists, so disabling in the crash test is still required.

Bugs found in the embedded-blob feature

These all stem from one root cause: EmbeddedBlobResolvingIterator resolves a same-file blob payload into a buffer (resolved_pinned_value_ / resolved_value_) that is invalidated on the next iterator reposition, but consumers expect the value to outlive a Next().

  1. heap-use-after-free (FIXED by D109796428): calling key() before value() on a whole-value blob entry made value() -> MaterializeValue() move-assign resolved_internal_key_, freeing the buffer the earlier key() Slice still pointed to. Surfaced under ASAN during compaction (CompactionIterator::NextFromInput -> ParseInternalKey).

  2. IsValuePinned() assertion (NOT fixed): EmbeddedBlobResolvingIterator::IsValuePinned() only reports a resolved value as pinned when a PinnedIteratorsManager is active, but DBIter::FindValueForCurrentKeyUsingSeek (db_iter.cc:1464) asserts iter_.iter()->IsValuePinned() without setting one. Reproduces under ASAN via the user-iteration path.

  3. IsValueBaseValid(value_base) assertion / value corruption (NOT fixed): this is the failure in T277310719. With use_merge=1 plus embedded blobs, a value persisted to the DB comes back with a garbage value base, tripping the assert in db_stress (expected_value.cc:102, reached from VerifyDb -> VerifyOrSyncValue and from TestGet). Reproduced reliably on the post-D109796428 tree by forcing use_merge=1 + embedded-blob ingestion (read-heavy, no iteration): fails on the first crash-test iteration; without merge, 36 iterations passed. Same value-lifetime root cause as (2), surfacing through the merge/compaction write path. Exact line not yet pinned down; needs the feature author.

Differential Revision: D110100542

@meta-cla meta-cla Bot added the CLA Signed label Jun 30, 2026
@meta-codesync

meta-codesync Bot commented Jun 30, 2026

Copy link
Copy Markdown

@anand1976 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110100542.

@github-actions

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 0.0s.

Summary:
Disables the experimental embedded-blob SST ingestion path in the crash test (`ingest_external_file_with_embedded_blobs` is forced to 0) to unblock the `fbcode_blackbox_crash_test` CI signal tracked in T277310719. The feature still has unresolved correctness bugs, so this keeps it out of the crash test until they are fixed. Re-enable by reverting the one-line change; the gating logic in `finalize_and_sanitize` is left in place.

## Culprit

The embedded-blob feature was introduced by D108564468 (commit `5f01ffb1d40c`, "[rocksdb][PR] Add experimental embedded blob SST support"), which also enabled `ingest_external_file_with_embedded_blobs` in `db_crashtest.py`. The first crash-test failure on T277310719 appeared the same day, shortly after that commit. D109796428 ("Fix use-after-free in EmbeddedBlobResolvingIterator when key() called before value()") fixed one of the bugs below but the corruption persists, so disabling in the crash test is still required.

## Bugs found in the embedded-blob feature

These all stem from one root cause: `EmbeddedBlobResolvingIterator` resolves a same-file blob payload into a buffer (`resolved_pinned_value_` / `resolved_value_`) that is invalidated on the next iterator reposition, but consumers expect the value to outlive a `Next()`.

1. heap-use-after-free (FIXED by D109796428): calling `key()` before `value()` on a whole-value blob entry made `value()` -> `MaterializeValue()` move-assign `resolved_internal_key_`, freeing the buffer the earlier `key()` Slice still pointed to. Surfaced under ASAN during compaction (`CompactionIterator::NextFromInput` -> `ParseInternalKey`).

2. `IsValuePinned()` assertion (NOT fixed): `EmbeddedBlobResolvingIterator::IsValuePinned()` only reports a resolved value as pinned when a `PinnedIteratorsManager` is active, but `DBIter::FindValueForCurrentKeyUsingSeek` (`db_iter.cc:1464`) asserts `iter_.iter()->IsValuePinned()` without setting one. Reproduces under ASAN via the user-iteration path.

3. `IsValueBaseValid(value_base)` assertion / value corruption (NOT fixed): this is the failure in T277310719. With `use_merge=1` plus embedded blobs, a value persisted to the DB comes back with a garbage value base, tripping the assert in `db_stress` (`expected_value.cc:102`, reached from `VerifyDb` -> `VerifyOrSyncValue` and from `TestGet`). Reproduced reliably on the post-D109796428 tree by forcing `use_merge=1` + embedded-blob ingestion (read-heavy, no iteration): fails on the first crash-test iteration; without merge, 36 iterations passed. Same value-lifetime root cause as (2), surfacing through the merge/compaction write path. Exact line not yet pinned down; needs the feature author.

Differential Revision: D110100542
@anand1976 anand1976 force-pushed the export-D110100542 branch from 94d2252 to 7e8736b Compare June 30, 2026 18:24
@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 7e8736b


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase

@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 7e8736b


Summary

Clean, minimal change that disables an experimental feature (ingest_external_file_with_embedded_blobs) in the crash test to unblock CI. The diff is a single config value change (random.choice([0, 1]) -> 0) plus a well-written explanatory comment linking to the tracking task. No correctness, performance, or compatibility concerns.

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

None.

🟢 LOW / NIT

L1. Redundant finalize_and_sanitize guard — db_crashtest.py:1513
  • Issue: The existing guard at line 1513 already forces ingest_external_file_with_embedded_blobs = 0 when ingest_external_file_one_in == 0 or format_version < 7. With the default now hardcoded to 0, this guard is redundant but harmless. No action needed — when the feature is re-enabled, both the default and the guard should be revisited together.
L2. Comment style is good
  • Issue: Not an issue — the comment clearly explains why the feature is disabled, references the tracking task (T277310719), and notes how to re-enable. This is the right pattern for temporary disablement (cf. the similar # Temporarily disable hash index comment at line 217).

Cross-Component Analysis

The change is entirely within db_crashtest.py (test infrastructure). It does not touch any production code, headers, build files, or serialization formats. The finalize_and_sanitize function at line 1513 already has a complementary guard that forces the flag to 0 under certain conditions, so there is no path where the crash test could accidentally re-enable the feature.

Context Affected? Notes
Production code No No production files changed
Public API No No API changes
Serialization format No No format changes
Other test suites No Only db_crashtest.py default_params affected
db_stress binary No The gflag still exists; it just won't be set to 1 by the crash test

Positive Observations

  • The PR description is exceptionally thorough — it documents three specific bugs, identifies the root cause pattern (value-lifetime in EmbeddedBlobResolvingIterator), references the introducing commit, and explains exactly how to re-enable.
  • The change is minimal and surgical: one value change + comment, no unnecessary refactoring.
  • The lambda: 0 pattern (rather than removing the key entirely) preserves the parameter in the crash test infrastructure, making re-enablement trivial.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant