Skip to content

SMT2 parser with explicit work stack#9040

Draft
tautschnig wants to merge 7 commits into
diffblue:developfrom
tautschnig:smt2-rework
Draft

SMT2 parser with explicit work stack#9040
tautschnig wants to merge 7 commits into
diffblue:developfrom
tautschnig:smt2-rework

Conversation

@tautschnig

@tautschnig tautschnig commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Problem

Deeply nested SMT-LIB expressions make the recursive descent in smt2_parsert::expression() recurse once per nesting level, overflowing the default 8 MB C++ stack and crashing (SIGSEGV) at around depth 2500–3000. Two shapes trigger this and both occur in real benchmarks / solver output:

  • a chain of nested (let ((xN xN-1)) ...) bindings, and
  • a right-associated operand chain such as (bvand x (bvand x ... x)).

Approach

expression() is reworked to consume these chains with explicit work stacks instead of the C++ call stack. Building on the value-class tokent from #9003, this is considerably smaller and cleaner than the original revision of this PR (which routed everything through smt2irep and a serialize-back fallback):

  • Nested lets are collected on a frame stack (bindings plus the identifiers they hide); the innermost body is parsed once and the let_exprt chain is folded back outward, restoring id_map scoping.
  • Function-application operands are collected on an explicit stack. For a "simple" head — anything other than the binders let/lambda/exists/ forall and the token-reading heads _ and ! — each completed application
    is dispatched through the existing expressions[] / id_map logic, with the already-parsed operands handed over via a new precollected_operands member that operands() returns in lieu of reading from the tokenizer. To enable
    this, the head-symbol dispatch of function_application() is factored into function_application_with_id().

Heads that read additional structure (binders, indexed identifiers, term attributes) remain recursive; those forms do not give rise to the deep chains that overflow the stack.

Note: smt2irep already parses S-expressions iteratively on develop, so no change is needed there — this PR only addresses the expression parser that builds exprts.

Performance

Parsing keeps the same O(n) complexity; the explicit stacks trade cheap native call frames for heap-allocated frame/operand vectors, a small constant-factor cost. Measured with two smt2_solver binaries built from the same tree (differing only in smt2_parser.{cpp,h}):

  • Pathologically parse-bound input (19 MB, 200k shallow operator applications): the iterative parser is ~3% slower — 2.98 s vs 2.90 s (median of 9 runs).
  • Inputs dominated by expression construction or solving: within noise (ratio ≈ 1.00).

So inputs the recursive parser handled are essentially unaffected, in exchange for removing the stack-overflow crash.

Testing

regression/smt2_solver/deep-nesting/ adds a depth-3500 nested-let test (unsat) and a depth-3500 operand-chain test (sat); both crashed the recursive parser and now pass. The existing regression/smt2_solver suite and the [smt2] unit tests remain green.

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • n/a The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • n/a White-space or formatting changes outside the feature-related changed lines are in commits of their own.

@tautschnig tautschnig self-assigned this Jun 16, 2026
Copilot AI review requested due to automatic review settings June 16, 2026 03:49
@tautschnig tautschnig changed the title Smt2 rework SMT2 parser with explicit work stack Jun 16, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reworks the SMT-LIB v2 expression parsing in src/solvers/smt2/smt2_parser.* to avoid C++ call-stack overflows caused by deeply nested let chains and right-associated operand chains seen in real solver output/benchmarks. The parser now linearises those patterns using explicit work stacks while preserving the existing expression-dispatch machinery.

Changes:

  • Replaced recursive descent for nested let chains with an explicit let frame stack and a fold-back reconstruction of let_exprt.
  • Added iterative parsing for deeply right-nested operand chains by collecting operands on an explicit application stack and reusing existing dispatch via precollected_operands.
  • Added regression tests under regression/smt2_solver/deep-nesting/ for deep nested-let and deep operand-chain inputs.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/solvers/smt2/smt2_parser.h Adds function_application_with_id() API and precollected_operands to support iterative operand collection.
src/solvers/smt2/smt2_parser.cpp Implements iterative parsing in expression() and factors function-application dispatch for reuse.
regression/smt2_solver/deep-nesting/deep-operand-chain.smt2 New regression input for deep right-associated operand chains.
regression/smt2_solver/deep-nesting/deep-operand-chain.desc New expected-output spec for the operand-chain regression.
regression/smt2_solver/deep-nesting/deep-let.smt2 New regression input for deep nested let chains.
regression/smt2_solver/deep-nesting/deep-let.desc New expected-output spec for the deep let regression.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/solvers/smt2/smt2_parser.cpp
Comment thread regression/smt2_solver/deep-nesting/deep-operand-chain.desc Outdated
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.75786% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.70%. Comparing base (9919381) to head (8f7a294).

Files with missing lines Patch % Lines
src/solvers/smt2/smt2_parser.cpp 87.41% 38 Missing ⚠️
src/solvers/flattening/boolbv_let.cpp 90.09% 10 Missing ⚠️
src/solvers/smt2/smt2_solver.cpp 93.54% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9040      +/-   ##
===========================================
+ Coverage    80.68%   80.70%   +0.01%     
===========================================
  Files         1714     1714              
  Lines       189508   189806     +298     
  Branches        73       73              
===========================================
+ Hits        152911   153182     +271     
- Misses       36597    36624      +27     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@tautschnig tautschnig force-pushed the smt2-rework branch 2 times, most recently from fb54847 to 206e3f3 Compare June 16, 2026 07:04
Deeply nested SMT-LIB expressions -- e.g. a chain of thousands of nested
(let ((xN xN-1)) ...) or a right-associated (bvand x (bvand x ...)) --
made smt2_parsert::expression() recurse once per nesting level and
overflow the default 8 MB C++ stack (segfault around depth 2500-3000).

This reworks expression() to consume such chains with explicit work
stacks rather than the call stack, building on the value-class tokent
introduced in diffblue#9003:

- Nested lets are collected on a frame stack (bindings plus the ids they
  hide); the innermost body is parsed once and the let_exprt chain is
  folded back outward, restoring id_map scoping.
- A function application whose head is "simple" -- anything other than
  the binders let/lambda/exists/forall and the token-reading heads _ and
  ! -- has its operands collected on an explicit stack; each completed
  application is dispatched through the existing expressions[]/id_map
  logic. To reuse that logic without re-reading tokens, the already
  parsed operands are handed over via the new precollected_operands,
  which operands() returns in lieu of reading from the tokenizer.

To enable the latter, the head-symbol dispatch of function_application()
is factored into function_application_with_id(), which expression() calls
with a head symbol it has already consumed.

Heads that read additional structure (binders, indexed identifiers, term
attributes) are still handled recursively; those forms do not give rise
to the deep chains that overflow the stack.

Performance: parsing keeps the same O(n) complexity, trading cheap native
call frames for heap-allocated frame/operand vectors -- a small
constant-factor cost. On a pathologically parse-bound input (19 MB of
200k shallow operator applications) the iterative parser is ~3% slower
than the recursive one (2.98s vs 2.90s, median of 9 runs); on inputs
where expression construction or solving dominates the difference is in
the noise. Inputs the recursive parser did not crash on are therefore
essentially unaffected.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@tautschnig tautschnig marked this pull request as draft June 17, 2026 11:37
tautschnig and others added 6 commits June 17, 2026 11:41
Pin the iterative SMT2 parser against regression: a deeply nested let
chain and a deeply right-associated operand chain, both of a depth at
which the previous recursive parser overflowed the 8 MB C++ stack, now
parse and solve (unsat and sat respectively).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
expand_function_applications() is a post-order rewrite: for every node,
first rewrite the operands, then — if the current node is a
function_application with a known definition — substitute it with the
(already-rewritten) body. The recursive form overflowed the call stack
on deeply-nested inputs produced by the iterative SMT2 parser
(e.g. 50k-deep bvand chains in the try5_*.flanagansaxe.smt2 family),
which turned into stack overflows in the solver phase for files that
the parser now successfully accepts.

Replace the recursion with a two-pass iterative walk:

1. gather all nodes in post-order into an explicit work list, and
2. process them back-to-front, by which point every operand has already
   been rewritten.

The only residual recursion is on the inlined body of a user-defined
function, which is typically shallow; this matches the pre-existing
behaviour.

Regression: bf13.smt2 (the single remaining crash in the QF_ABV run
with the iterative parser) now runs into the 2 s solver timeout
instead of dumping core. The full QF_ABV sweep (15,148 files, 2 s
timeout, 2 GB RLIMIT_AS, 32-way parallel) goes from 1 crash to 0.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Move the have_to_replace(dest) pre-check from the top of
replace_symbolt::replace(exprt&) to guard only the code paths that
would break expression sharing: Forall_operands (which detaches shared
operand vectors) and to_let_expr/to_binding_expr (which detach shared
let/binding nodes).

The symbol check (dest.id() == ID_symbol) is O(1) and does not detach,
so it needs no guard. This avoids the O(n²) behavior where
have_to_replace traversed the entire subtree at every non-symbol node,
while still preserving sharing for subtrees that contain no replaceable
symbols.

Apply the same pattern to address_of_aware_replace_symbolt::replace.

Test that replace_symbolt respects variable bindings in let and forall
expressions: bound variables must not be replaced in the binding scope,
while free variables are still replaced. Also test that the return
value is correct when no replacement occurs.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Deeply nested ireps (e.g. a deep SMT-LIB let chain) overflow the call stack
when the recursive destructor unwinds them. Bound the destruction recursion:
remove_ref still deletes recursively -- which is as cheap as the natural
recursive destructor for the shallow trees that dominate -- but tracks the
recursion depth in a thread-local counter and, once it reaches a fixed bound,
collapses the remainder of the subtree with an explicit work stack instead.
Only the deep tail pays the cost of detaching children and allocating a work
stack; shallow destruction (the overwhelmingly common case) is unaffected.

This keeps destruction depth-safe on a small (~1 MiB) stack while avoiding the
~8% regression on destruction-heavy workloads (symbolic execution and
simplification churn) that an always-iterative destructor incurs.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
A deeply right- or left-associated chain of bit-vector operators
(e.g. (bvand x (bvand x ... x)) of depth thousands) makes boolbvt's
convert_bitvector recurse once per nesting level and overflow the call
stack, even though the SMT2 parser, expand_function_applications and
replace_symbolt now handle such input iteratively.

Add a cache-priming pre-pass: at the outermost convert_bv entry, convert the
sub-expressions bottom-up with an explicit work stack so that the recursive
convert_bitwise dispatch only ever meets already-cached operands and does not
recurse deeply.

boolbvt conversion is not a pure function of the expression: array-theory
axioms, uninterpreted-function congruence and let/quantifier scoping all make
it stateful and order-sensitive, so converting arbitrary sub-expressions
eagerly and out of order changes the generated constraints (and regresses the
array/UF/let tests). The pre-pass is therefore restricted to a conservative
whitelist of operations whose conversion is a pure function of the operand
bit-vectors -- exactly the per-bit gates handled by convert_bitwise
(ID_bitand/bitor/bitxor/bitnand/bitnor/bitxnor/bitnot). For any other
outermost expression the pre-pass is a no-op and the normal recursive path is
used unchanged.

This clears the deep bitwise operand-chain regression test under a 1 MiB
stack. Deeply nested let chains still overflow, because convert_let renames
the bound symbols into a freshly constructed 'where' expression at each level
and recurses on it -- those exprs cannot be primed ahead of time; making that
iterative requires rewriting convert_let to process the let-chain in a loop.

Perf: the pre-pass runs one extra (iterative) walk per outermost convert_bv,
but only descends into pure-bitwise nodes, so it is a no-op for formulas
without large bitwise chains. The whitelist can be extended to other
provably-pure operators after per-operator review.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
convert_let converted the 'where' expression by recursion (convert_bv ->
convert_bitvector -> convert_let -> ...), one stack frame per nesting level,
and renamed each frame's bound variables across the whole remaining 'where'
with a replace_symbolt -- O(depth) per frame, O(depth^2) overall. A deeply
nested chain of let-expressions therefore both overflows the call stack and
scales quadratically.

For chains of nested lets at or beyond a depth threshold, convert them
iteratively:

- Alpha-rename every bound variable in the whole let-expression to a
  globally-fresh identifier in a single iterative, scope-aware pass
  (let_chain_alpha_rename). Fresh, unique names are what make it sound to
  drop the per-frame replace_symbolt: because no bound name recurs, the
  conversions cached in bv_cache for sub-expressions of a let body cannot
  collide with other occurrences of the original name. (This is exactly the
  soundness pitfall of resolving original names via a scope stack consulted
  in convert_bv: bv_cache is keyed on structure, not scope, so cached body
  sub-expressions would leak scope-dependent results.)
- Walk the now alpha-renamed let-chain in a loop, setting up each frame's
  bindings (values converted in the outer scope, then made visible) and
  tearing them down in reverse, converting the innermost where-expression
  exactly once.

Shallow lets (the common case, including essentially all of CBMC's own
usage) keep using the unchanged recursive implementation, which additionally
renames bound variables occurring in types; the iterative pass does not
rewrite type-position occurrences, which do not arise for SMT-LIB let
bindings (the only source of deep chains).

Validated: with the threshold lowered to 0 (all lets iterative) the full
smt2_solver suite passes, including the let and let-array tests; the deep
nested-let regression test now passes under a 1 MiB stack.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants