
Conversation

cfallin commented Oct 24, 2025

(Stacked on top of #11921)

This repurposes the code from #11826 to "inject calls": when in a signal
handler, we can update the register state to redirect execution upon
signal-handler return to a special hand-written trampoline, and this
trampoline can save all registers and enter the host, just as if a
hostcall had occurred.

As before, this is Linux-only in its current draft; I still need to add macOS and Windows support. Putting this up to show how a few loose ends in #11921 get used.

…roperly in Pulley.

This is a followup to bytecodealliance#11895, where I had disabled a test that failed to
emit a debug event for a hostcall-generated trap on a divide-by-zero in
Pulley. This PR allows that test to pass and brings Pulley back to
parity with native Cranelift's current debug support.

This was a bit of a "start to pull the thread and the entire finished
mechanism materializes" PR; happy to consider ways to split it up if
needed. In short, disabling signal-based traps on a Pulley configuration
still relies on Pulley opcodes (e.g., divide) actually trapping, in a
way that looks more like a "native ISA trap"; so I had to start to build
out the actual trap-handling mechanisms. In any case, this will all be
needed for followup work soon that will handle traps on native platforms
(redirecting from signals by injecting calls), so this is not a
distraction.

This PR includes, ranked in decreasing order of "may scare other
Wasmtime maintainers" score:

- A raw `NonNull<dyn VMStore>` in the `CallThreadState`, with a long
  comment about provenance and mut-borrow exclusivity. This is needed
  right now to allow the interpreter to invoke the debug event handler,
  but will soon be needed when injecting hostcalls on signals, because a
  signal context also has no state available from the Wasm code other
  than what is in TLS. Hence, we need a way to get the store back from
  the Wasm when we do something that is "morally a hostcall" at a
  trapping instruction.

  I do believe this is sound, or at least close to it (please
  scrutinize carefully!); the basic idea is that the Wasm acts as an
  opaque blob in the middle, and the store pointer comes out of it one way
  or another: the normal way, as the first arg to a hostcall, or the weird
  way, via TLS and the CallThreadState during a trap.
  Exclusive ownership is still clear at any given point and only one
  `&mut` ever exists in the current frame at a time (see the sketch after
  this list). That said, I haven't tested with miri yet.

  This does require careful thought about the Wasm compilation, too; we
  need the moral equivalent of a `&mut self` reborrow as-if we were
  making a hostcall on each trapping instruction. It turns out that we
  already treat them as memory-fence instructions, so nothing loaded
  from the store can be moved or cached across them, and I've added a
  comment now about how this is load-bearing.

- Updates to `CallThreadState`'s "exit state", normally set by the exit
  trampoline, that we now also set when we invoke a debug event handler
  during a trap context[^1] so that `Store::debug_frames` properly sees
  the current activation. This is a little more awkward than it could be
  because we store the *trampoline* FP, not last Wasm FP, and there is
  no trampoline frame in this case, so I've added a flag and some
  conditionals. I'm happy to refactor instead to go (back) to storing
  the last Wasm FP instead, with the extra load in the exit trampoline
  to compute that.

- A whole bunch of plumbing, creating a large but mechanical diff, in
  the code translator to actually add debug tags on all traps and calls
  to `raise`. It turns out that once I got all of the above working in
  Pulley, the test disagreed about current Wasm PC between native and
  Pulley, and Pulley was right; native was getting it wrong because the
  `raise` libcall was sunk to the bottom in a cold block and, without
  tags, we scanned backward to pick up the last Wasm PC in the function.
  This new plumbing and addition of tags in all the appropriate places
  fixes that.
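
To make the first bullet's ownership story concrete, here is a contrived
sketch; the trait and method names (`VMStore::debug_event`, `store_mut`) are
hypothetical stand-ins, not the real Wasmtime APIs. The point is that the raw
pointer in `CallThreadState` is turned back into a `&mut dyn VMStore` only at
a trapping instruction, where no other `&mut` to the store can be live:

```rust
use std::ptr::NonNull;

// Hypothetical stand-in for the real store trait.
trait VMStore {
    fn debug_event(&mut self, wasm_pc: usize);
}

struct CallThreadState {
    // Raw pointer rather than a borrow: Wasm execution is an opaque blob in
    // the middle, so this pointer's provenance must cover the whole activation.
    store: NonNull<dyn VMStore>,
}

impl CallThreadState {
    /// Safety: callable only while the activation is suspended at a trapping
    /// instruction, so no other `&mut` to the store exists in any live frame.
    unsafe fn store_mut(&self) -> &mut dyn VMStore {
        &mut *self.store.as_ptr()
    }
}

/// Morally equivalent to a hostcall: temporarily regain `&mut` access to the
/// store and invoke the debug event handler.
unsafe fn on_trap(state: &CallThreadState, wasm_pc: usize) {
    state.store_mut().debug_event(wasm_pc);
}
```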

[^1]: I keep saying "during a trap context" here, but to avoid any
      signal-safety scares, note that when this is done for native
      signals in a followup PR, we will inject a hostcall by modifying
      stack/register state and returning from the actual signal context,
      so it really is as-if we did a hostcall from a trapping
      instruction.
cfallin commented Oct 24, 2025

I'll note, for brainstorming purposes, that the current problem in front of me is how to rework macOS's Mach-ports-based exception handler to work with this. To recap what the requirements are on each platform, and how this "call injection" works:

  • All three of our main platforms (Linux, macOS, Windows) give us the ability to catch traps and edit register state before resuming.
  • Linux lets us do this in a "signal context" where we really shouldn't do much of anything if we can help it -- no allocation, etc. We run on a sigaltstack and we can push more to the guest stack if we want.
  • Windows lets us do this in a vectored exception handler, where we run on the guest's stack and cannot push anything to the stack.
  • macOS lets us do this from a separate thread reading exceptions from a Mach port, where we can do anything a normal thread can do, except we don't have the guest's TLS because we're a separate thread.

The basic need is to inject enough state into the register context, along with redirecting PC, that a stub can take control, find the Store state, invoke any debug event handler, then restore all context and return to the guest if it's a resumable trap (which this PR doesn't have, but we will have in a few more PRs for breakpoints).

One can see how this is a little tricky. The approach I've taken, which is at least Windows- and Linux-compatible, is to update only registers, not the stack (because of Windows); inject args into the registers; save off the original register values and PC to the VMStoreContext (which we have via TLS in the signal handler); and then, in the trampoline, save all regs to the stack and copy the original values of the injected registers back from the store into the stack save-frame.
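
A minimal sketch of the Linux/x86-64 half of that, using the libc crate's ucontext accessors; the `VMStoreContext` fields (`injected_pc`, `injected_arg0`) and the `host_trap_trampoline` stub are hypothetical stand-ins for whatever the real layout and hand-written asm end up being:

```rust
/// Hypothetical slots in the store context for the values the injection clobbers.
#[repr(C)]
struct VMStoreContext {
    injected_pc: usize,
    injected_arg0: usize,
}

/// Placeholder for the hand-written asm stub that saves all registers, calls
/// into the host, restores the saved state, and resumes; a plain fn here only
/// so the sketch compiles.
extern "C" fn host_trap_trampoline() {
    unreachable!("reached only via signal-handler redirection")
}

/// Called from the signal handler: edit the saved register state so that
/// returning from the signal resumes in the trampoline, not at the faulting
/// instruction.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
unsafe fn inject_call(
    ucontext: *mut libc::ucontext_t,
    store_cx: *mut VMStoreContext, // reached via TLS in the real handler
    trap_code: u64,
) {
    let gregs = &mut (*ucontext).uc_mcontext.gregs;

    // Save the faulting PC and the argument register we're about to clobber,
    // so the trampoline can copy them into its register save-frame.
    (*store_cx).injected_pc = gregs[libc::REG_RIP as usize] as usize;
    (*store_cx).injected_arg0 = gregs[libc::REG_RDI as usize] as usize;

    // Pass the trap code in the first argument register and redirect PC.
    gregs[libc::REG_RDI as usize] = trap_code as i64;
    gregs[libc::REG_RIP as usize] = host_trap_trampoline as usize as i64;
}
```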

macOS inverts most of the "can do" and "can't do" bits: we can push to the stack (unlike Windows) but we can't read TLS, so the only place to save the state we clobber when redirecting is the stack itself. So probably the best we can do is to push the original register values onto the guest stack ourselves from the exception handler thread.
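
By contrast, a schematic of that macOS variant (no real Mach APIs here; `GuestThreadState` and `write_guest_stack` are hypothetical stand-ins for the thread-state structs and the mach_vm_write-style plumbing a real implementation would use):

```rust
/// Hypothetical snapshot of the stopped guest thread's registers, as read on
/// the exception-handler thread.
struct GuestThreadState {
    pc: u64,
    sp: u64,
    arg0: u64, // x0 on aarch64, rdi on x86-64
}

/// Hypothetical helper that writes into the stopped guest thread's memory.
fn write_guest_stack(_addr: u64, _words: &[u64]) {
    // elided
}

/// Runs on the exception-handler thread: since we can't reach the guest's TLS,
/// push the values we clobber onto the guest stack for the macOS-specific stub
/// to pop and restore.
fn inject_call_macos(state: &mut GuestThreadState, trampoline: u64, trap_code: u64) {
    // Push the original PC and first-argument register; keep 16-byte alignment.
    state.sp -= 16;
    write_guest_stack(state.sp, &[state.pc, state.arg0]);

    // Pass the trap code in the first argument register and redirect PC.
    state.arg0 = trap_code;
    state.pc = trampoline;
}
```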

Of course this means that we need a slightly different stub for macOS (for x86-64 and aarch64 both); and we'll need a slightly different stub for Windows/x86-64 too because of fastcall when we call the host code.

One more thing about the riscv64 stub: it saves all of the V-extension state, because vector registers are separate from float registers; but unlike on our other three architectures, we don't unconditionally assume that vector registers are present. So technically, to run with V disabled and debugging enabled (if we care about that), we need an alternate riscv64 stub that elides that part. Note that we need to care about this at all only because we have to save everything, not just the ABI callee-saves, since we're "interrupting" with no regalloc cooperation.

All of this to say: I am starting to think that the efficiency advantage of "trap-based implicit hostcalls", with all that entails (breakpoints that are just break instructions we can patch in), may not be worth the complexity and maintenance burden. The alternative is to go with hostcall-based-traps universally. (We still do need the wonky raw *mut dyn VMStore for the Pulley case, because Pulley does seem to unconditionally rely on interpreter traps on at least the divide instruction.)

Partly that would make me sad, but on the other hand, it would make me quite happy too: it would mean that we are one PR away from breakpoints if we go with the bitmask scheme, or two if we still patch in a call (self-modifying code but not trapping).

I'm happy to go either way, and these stubs were quite fun to write, but with my "not impossible to maintain" hat on, I think I know the better answer...

(cc @alexcrichton and @fitzgen for thoughts)

@cfallin
Copy link
Member Author

cfallin commented Oct 24, 2025

Quick napkin math on efficiency if we abandon call injection on signals:

  • Execution efficiency takes about a 1.5-2x hit, mainly from explicit bounds-checking. (This is on top of the ~2.5x for debug instrumentation.) That's tolerable if not great.
  • Instead of a two-byte (ud2 on x86-64) or four-byte (brk on aarch64) breakpoint, we can do patchable calls in five bytes (call + riprel32 on x86-64) or four bytes (bl + PCrel26 on aarch64). The key here is to define a new callconv that is all callee-save, and use our normal trampoline machinery to emit a trampoline for it. The call at the CLIF level would take vmctx. We'd use a new opcode patchable_call, and the only difference from a normal call opcode would be that emission places the byte range of the instruction and the instruction bytes themselves in metadata, and an equal-length nop in the code (see the sketch after this list). This should have fairly small perf impact: fetch bandwidth but nothing more for the nops, and vmctx will already likely be in the first-arg reg, so no additional moves.
  • For single-stepping, rather than the trap-on-null-load trick to enable all breakpoints that I described in yesterday's Cranelift weekly, I think I would go with the "enable all in func on entry to func" approach; and func entry/exit hostcalls themselves can be guarded by a flag to minimize that overhead.
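
A rough sketch of what a patchable_call site could record and how the runtime side might patch the nop back into the original call encoding; the types and metadata shape here are hypothetical illustrations, not Cranelift's actual metadata format:

```rust
/// One patchable call site, as recorded in compilation metadata.
struct PatchableCallSite {
    /// Byte offset of the instruction within the function body.
    offset: usize,
    /// The call encoding that was replaced by an equal-length nop at emission
    /// time: 5 bytes on x86-64 (call + rip-rel32), 4 on aarch64 (bl + rel26).
    call_bytes: Vec<u8>,
}

/// Patch the nop at `site` back into a call. A real implementation must also
/// make the code writable, flush the instruction cache, and follow each
/// platform's rules for cross-modifying code.
fn enable_breakpoint(code: &mut [u8], site: &PatchableCallSite) {
    let end = site.offset + site.call_bytes.len();
    code[site.offset..end].copy_from_slice(&site.call_bytes);
}
```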

The upshot of all that is that it's much more portable and easier to reason about, and the latter, at least, is otherwise in short supply with everything else we're adding for debugging. One could see this as "hostcalls everywhere", as in debug RFC v1, except with SMC to avoid overhead until a call is patched in.


cfallin commented Nov 1, 2025

Closing as this is pushed to "post-MVP debugging" due to all the above complexities; will keep the branch around for mining for the good bits later as needed.
