
Conversation

cfallin commented Oct 24, 2025

(Stacked on top of #11921)

This repurposes the code from #11826 to "inject calls": when in a signal
handler, we can update the register state to redirect execution upon
signal-handler return to a special hand-written trampoline, and this
trampoline can save all registers and enter the host, just as if a
hostcall had occurred.

As before, this is Linux-only in its current draft; I still need to add macOS and Windows support. Putting this up to show how a few loose ends in #11921 get used.

…roperly in Pulley.

This is a followup to bytecodealliance#11895, where I had disabled a test that failed to
emit a debug event for a hostcall-generated trap on a divide-by-zero in
Pulley. This PR allows that test to pass and brings Pulley back to
parity with native Cranelift's current debug support.

This was a bit of a "start to pull the thread and the entire finished
mechanism materializes" PR; happy to consider ways to split it up if
needed. In short, disabling signal-based traps on a Pulley configuration
still relies on Pulley opcodes (e.g., divide) actually trapping, in a
way that looks more like a "native ISA trap"; so I had to start to build
out the actual trap-handling mechanisms. In any case, this will all be
needed for followup work soon that will handle traps on native platforms
(redirecting from signals by injecting calls), so this is not a
distraction.

This PR includes, ranked in decreasing order of "may scare other
Wasmtime maintainers" score:

- A raw `NonNull<dyn VMStore>` in the `CallThreadState`, with a long
  comment about provenance and mut-borrow exclusivity. This is needed
  right now to allow the interpreter to invoke the debug event handler,
  but will soon be needed when injecting hostcalls on signals, because a
  signal context also has no state available from the Wasm code other
  than what is in TLS. Hence, we need a way to get the store back from
  the Wasm when we do something that is "morally a hostcall" at a
  trapping instruction.

  I do believe this is sound, or at least close to it (please
  scrutinize carefully!); the basic idea is that the Wasm acts as an
  opaque blob in the middle, and the store pointer comes out of it one way
  or another: the normal way, as the first arg to a hostcall, or the weird
  way, via TLS and the CallThreadState during a trap.
  Exclusive ownership is still clear at any given point and only one
  `&mut` ever exists in the current frame at a time (see the sketch after
  this list). That said, I haven't tested with miri yet.

  This does require careful thought about the Wasm compilation, too; we
  need the moral equivalent of a `&mut self` reborrow as-if we were
  making a hostcall on each trapping instruction. It turns out that we
  already treat them as memory-fence instructions, so nothing loaded
  from the store can be moved or cached across them, and I've added a
  comment now about how this is load-bearing.

- Updates to `CallThreadState`'s "exit state", normally set by the exit
  trampoline, that we now also set when we invoke a debug event handler
  during a trap context[^1] so that `Store::debug_frames` properly sees
  the current activation. This is a little more awkward than it could be
  because we store the *trampoline* FP, not last Wasm FP, and there is
  no trampoline frame in this case, so I've added a flag and some
  conditionals. I'm happy to refactor instead to go (back) to storing
  the last Wasm FP instead, with the extra load in the exit trampoline
  to compute that.

- A whole bunch of plumbing, creating a large but mechanical diff, in
  the code translator to actually add debug tags on all traps and calls
  to `raise`. It turns out that once I got all of the above working in
  Pulley, the test disagreed about current Wasm PC between native and
  Pulley, and Pulley was right; native was getting it wrong because the
  `raise` libcall was sunk to the bottom in a cold block and, without
  tags, we scanned backward to pick up the last Wasm PC in the function.
  This new plumbing and addition of tags in all the appropriate places
  fixes that.
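
To make the first bullet's ownership story concrete, here is a contrived
sketch; the trait and method names (`VMStore::debug_event`, `store_mut`) are
hypothetical stand-ins, not the real Wasmtime APIs. The point is that the raw
pointer in `CallThreadState` is turned back into a `&mut dyn VMStore` only at
a trapping instruction, where no other `&mut` to the store can be live:

```rust
use std::ptr::NonNull;

// Hypothetical stand-in for the real store trait.
trait VMStore {
    fn debug_event(&mut self, wasm_pc: usize);
}

struct CallThreadState {
    // Raw pointer rather than a borrow: Wasm execution is an opaque blob in
    // the middle, so this pointer's provenance must cover the whole activation.
    store: NonNull<dyn VMStore>,
}

impl CallThreadState {
    /// Safety: callable only while the activation is suspended at a trapping
    /// instruction, so no other `&mut` to the store exists in any live frame.
    unsafe fn store_mut(&self) -> &mut dyn VMStore {
        &mut *self.store.as_ptr()
    }
}

/// Morally equivalent to a hostcall: temporarily regain `&mut` access to the
/// store and invoke the debug event handler.
unsafe fn on_trap(state: &CallThreadState, wasm_pc: usize) {
    state.store_mut().debug_event(wasm_pc);
}
```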

[^1]: I keep saying "during a trap context" here, but to avoid any
      signal-safety scares, note that when this is done for native
      signals in a followup PR, we will inject a hostcall by modifying
      stack/register state and returning from the actual signal context,
      so it really is as-if we did a hostcall from a trapping
      instruction.
cfallin commented Oct 24, 2025

I'll note, for brainstorming purposes, that the current problem in front of me is how to rework macOS's Mach-ports-based exception handler to work with this. To recap what the requirements are on each platform, and how this "call injection" works:

  • All three of our main platforms (Linux, macOS, Windows) give us the ability to catch traps and edit register state before resuming.
  • Linux lets us do this in a "signal context" where we really shouldn't do much of anything if we can help it -- no allocation, etc. We run on a sigaltstack and we can push more to the guest stack if we want.
  • Windows lets us do this in a vectored exception handler, where we run on the guest's stack and cannot push anything to the stack.
  • macOS lets us do this from a separate thread reading exceptions from a Mach port, where we can do anything a normal thread can do, except we don't have the guest's TLS because we're a separate thread.

The basic need is to inject enough state into the register context, along with redirecting PC, that a stub can take control, find the Store state, invoke any debug event handler, then restore all context and return to the guest if it's a resumable trap (which this PR doesn't have, but we will have in a few more PRs for breakpoints).

One can see how this is a little tricky. The approach I've taken, which is at least Windows- and Linux-compatible, is to update only registers, not the stack (because of Windows); inject args into the registers; save off the original register values and PC to the VMStoreContext (which we have via TLS in the signal handler); and then, in the trampoline, save all regs to the stack and copy the original values of the injected registers back from the store into the stack save-frame.
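
A minimal sketch of the Linux/x86-64 half of that, using the libc crate's ucontext accessors; the `VMStoreContext` fields (`injected_pc`, `injected_arg0`) and the `host_trap_trampoline` stub are hypothetical stand-ins for whatever the real layout and hand-written asm end up being:

```rust
/// Hypothetical slots in the store context for the values the injection clobbers.
#[repr(C)]
struct VMStoreContext {
    injected_pc: usize,
    injected_arg0: usize,
}

/// Placeholder for the hand-written asm stub that saves all registers, calls
/// into the host, restores the saved state, and resumes; a plain fn here only
/// so the sketch compiles.
extern "C" fn host_trap_trampoline() {
    unreachable!("reached only via signal-handler redirection")
}

/// Called from the signal handler: edit the saved register state so that
/// returning from the signal resumes in the trampoline, not at the faulting
/// instruction.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
unsafe fn inject_call(
    ucontext: *mut libc::ucontext_t,
    store_cx: *mut VMStoreContext, // reached via TLS in the real handler
    trap_code: u64,
) {
    let gregs = &mut (*ucontext).uc_mcontext.gregs;

    // Save the faulting PC and the argument register we're about to clobber,
    // so the trampoline can copy them into its register save-frame.
    (*store_cx).injected_pc = gregs[libc::REG_RIP as usize] as usize;
    (*store_cx).injected_arg0 = gregs[libc::REG_RDI as usize] as usize;

    // Pass the trap code in the first argument register and redirect PC.
    gregs[libc::REG_RDI as usize] = trap_code as i64;
    gregs[libc::REG_RIP as usize] = host_trap_trampoline as usize as i64;
}
```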

macOS inverts most of the "can do" and "can't do" bits: we can push to the stack (unlike Windows) but we can't read TLS, so the only place to save the state we clobber when redirecting is the stack itself. So probably the best we can do is to push the original register values onto the guest stack ourselves from the exception handler thread.
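
By contrast, a schematic of that macOS variant (no real Mach APIs here; `GuestThreadState` and `write_guest_stack` are hypothetical stand-ins for the thread-state structs and the mach_vm_write-style plumbing a real implementation would use):

```rust
/// Hypothetical snapshot of the stopped guest thread's registers, as read on
/// the exception-handler thread.
struct GuestThreadState {
    pc: u64,
    sp: u64,
    arg0: u64, // x0 on aarch64, rdi on x86-64
}

/// Hypothetical helper that writes into the stopped guest thread's memory.
fn write_guest_stack(_addr: u64, _words: &[u64]) {
    // elided
}

/// Runs on the exception-handler thread: since we can't reach the guest's TLS,
/// push the values we clobber onto the guest stack for the macOS-specific stub
/// to pop and restore.
fn inject_call_macos(state: &mut GuestThreadState, trampoline: u64, trap_code: u64) {
    // Push the original PC and first-argument register; keep 16-byte alignment.
    state.sp -= 16;
    write_guest_stack(state.sp, &[state.pc, state.arg0]);

    // Pass the trap code in the first argument register and redirect PC.
    state.arg0 = trap_code;
    state.pc = trampoline;
}
```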

Of course this means that we need a slightly different stub for macOS (for x86-64 and aarch64 both); and we'll need a slightly different stub for Windows/x86-64 too because of fastcall when we call the host code.

One more thing about the riscv64 stub: it saves all of the V-extension state, because vector registers are separate from float registers; but unlike on our other three architectures, we don't unconditionally assume that vector registers are present. So technically, to run with V disabled and debugging enabled (if we care about that), we need an alternate riscv64 stub that elides that part. Note that we need to care about this at all only because we have to save everything, not just the ABI callee-saves, since we're "interrupting" with no regalloc cooperation.

All of this to say: I am starting to think that the efficiency advantage of "trap-based implicit hostcalls", with all that entails (breakpoints that are just break instructions we can patch in), may not be worth the complexity and maintenance burden. The alternative is to go with hostcall-based-traps universally. (We still do need the wonky raw *mut dyn VMStore for the Pulley case, because Pulley does seem to unconditionally rely on interpreter traps on at least the divide instruction.)

Partly that would make me sad, but on the other hand, it would make me quite happy too: it would mean that we are one PR away from breakpoints if we go with the bitmask scheme, or two if we still patch in a call (self-modifying code but not trapping).

I'm happy to go either way, and these stubs were quite fun to write, but with my "not impossible to maintain" hat on, I think I know the better answer...

(cc @alexcrichton and @fitzgen for thoughts)

@cfallin
Copy link
Member Author

cfallin commented Oct 24, 2025

Quick napkin math on efficiency if we abandon call injection on signals:

  • Execution efficiency takes about a 1.5-2x hit, mainly from explicit bounds-checking. (This is on top of the ~2.5x for debug instrumentation.) That's tolerable if not great.
  • Instead of a two-byte (ud2 on x86-64) or four-byte (brk on aarch64) breakpoint, we can do patchable calls in five bytes (call + riprel32 on x86-64) or four bytes (bl + PCrel26 on aarch64). The key here is to define a new callconv that is all callee-save, and use our normal trampoline machinery to emit a trampoline for it. The call at the CLIF level would take vmctx. We'd use a new opcode patchable_call, and the only difference from a normal call opcode would be that emission places the byte range of the instruction and the instruction bytes themselves in metadata, and an equal-length nop in the code (see the sketch after this list). This should have fairly small perf impact: fetch bandwidth but nothing more for the nops, and vmctx will already likely be in the first-arg reg, so no additional moves.
  • For single-stepping, rather than the trap-on-null-load trick to enable all breakpoints that I described in yesterday's Cranelift weekly, I think I would go with the "enable all in func on entry to func" approach; and func entry/exit hostcalls themselves can be guarded by a flag to minimize that overhead.
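
A rough sketch of what a patchable_call site could record and how the runtime side might patch the nop back into the original call encoding; the types and metadata shape here are hypothetical illustrations, not Cranelift's actual metadata format:

```rust
/// One patchable call site, as recorded in compilation metadata.
struct PatchableCallSite {
    /// Byte offset of the instruction within the function body.
    offset: usize,
    /// The call encoding that was replaced by an equal-length nop at emission
    /// time: 5 bytes on x86-64 (call + rip-rel32), 4 on aarch64 (bl + rel26).
    call_bytes: Vec<u8>,
}

/// Patch the nop at `site` back into a call. A real implementation must also
/// make the code writable, flush the instruction cache, and follow each
/// platform's rules for cross-modifying code.
fn enable_breakpoint(code: &mut [u8], site: &PatchableCallSite) {
    let end = site.offset + site.call_bytes.len();
    code[site.offset..end].copy_from_slice(&site.call_bytes);
}
```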

The upshot of all that is that it's much more portable and easier to reason about, and the latter, at least, is otherwise in short supply with everything else we're adding for debugging. One could see this as "hostcalls everywhere", as in debug RFC v1, except with SMC to avoid overhead until a call is patched in.


cfallin commented Nov 1, 2025

Closing as this is pushed to "post-MVP debugging" due to all the above complexities; will keep the branch around for mining for the good bits later as needed.
