Conversation

@cfallin cfallin commented Nov 20, 2025

This will allow patching code to implement e.g. breakpoints. (That is, for now the copies are redundant, but soon they will not be.)

This change follows the discussion here and offline to define a few types that better encapsulate the distinction we want to enforce. Basically, there is almost never a bare CodeMemory; they are always wrapped in an EngineCode or StoreCode, the latter being a per-store instance of the former. Accessors are moved to the relevant place so that, for example, one cannot get a pointer to a Wasm function's body without being in the context of a Store where the containing module has been registered. The registry then returns a ModuleWithCode that boxes up a Module reference and StoreCode together for cases where we need both the metadata from the module and the raw code to derive something.

The only cases where we return raw code pointers to the EngineCode directly have to do with Wasm-to-array trampolines: in some cases, e.g. InstancePre pre-creating data structures with references to host functions, it breaks our expected performance characteristics to make the function pointers store-specific. This is fine as long as the Wasm-to-array trampolines never bake in direct calls to Wasm functions; the strong invariant is that Wasm functions never execute from EngineCode directly. Some parts of the component runtime would also have to be substantially refactored if we wanted to do away with this exception.

The per-Store module registry is substantially refactored in this PR. I got rid of the modules-without-code distinction (the case where a module has only trampolines and no defined functions still works fine), and reorganized the BTreeMaps to key on start address rather than end address, which I find a little more intuitive: to find the entry containing a PC, one queries the range from 0 up to the PC and takes the last entry found.
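A small stdlib sketch of the start-address lookup scheme described above (the map shape here is illustrative, not the registry's actual types): query the range `0..=pc`, take the last entry, and check that the PC falls within that entry's extent.

```rust
use std::collections::BTreeMap;

// Map from a code region's start address to its length (illustrative).
// Returns the start address of the region containing `pc`, if any.
fn containing_region(regions: &BTreeMap<usize, usize>, pc: usize) -> Option<usize> {
    // Highest start address <= pc...
    let (&start, &len) = regions.range(..=pc).next_back()?;
    // ...then verify that pc actually lies inside that region.
    (pc < start + len).then_some(start)
}

fn main() {
    let mut regions = BTreeMap::new();
    regions.insert(0x1000, 0x100); // [0x1000, 0x1100)
    regions.insert(0x2000, 0x080); // [0x2000, 0x2080)
    assert_eq!(containing_region(&regions, 0x10ff), Some(0x1000));
    assert_eq!(containing_region(&regions, 0x1100), None); // just past the end
    assert_eq!(containing_region(&regions, 0x0fff), None); // before any region
    println!("ok");
}
```

Keying on start addresses means no-overlap checking on insertion plus this one `range(..=pc).next_back()` query covers all PC lookups.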

@cfallin cfallin requested a review from a team as a code owner November 20, 2025 00:36
@cfallin cfallin requested review from fitzgen and removed request for a team November 20, 2025 00:36
@cfallin cfallin force-pushed the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch 3 times, most recently from 1db1437 to 36e1bc6 on November 20, 2025 01:14
@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Nov 20, 2025
@alexcrichton alexcrichton left a comment


I've left a few comments here and there, but a high-level thought I had here is that I was originally hoping that it would be possible to use Arc::get_mut to dynamically prove that it's safe to mutate the memory. In reading over this though we have quite a lot of clones in quite a lot of places so I'm realizing that it may not be quite as applicable as previously thought.
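For reference, the `Arc::get_mut` behavior alluded to here: it returns `Some` only while the reference count is exactly one, so any outstanding clone makes the dynamic uniqueness proof fail. A minimal stdlib illustration:

```rust
use std::sync::Arc;

fn main() {
    let mut code = Arc::new(vec![0u8; 16]);
    // Unique owner: mutation is provably safe.
    assert!(Arc::get_mut(&mut code).is_some());

    let alias = Arc::clone(&code);
    // A second handle exists: uniqueness can no longer be proven.
    assert!(Arc::get_mut(&mut code).is_none());
    drop(alias);

    // Sole owner again: the proof succeeds once more.
    assert!(Arc::get_mut(&mut code).is_some());
    println!("ok");
}
```

This is why "quite a lot of clones in quite a lot of places" defeats the approach: every live clone turns `get_mut` into `None`.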

Another high-level thought though is that there's a fair bit of juggling of trying to make sure we're using the right code memory in the right context and it feels relatively cumbersome to maintain that everywhere.

Would it be possible to maybe do the deep clone of-a-sort on the Module itself to avoid having to juggle one-vs-the-other? I'd ideally like to keep the self-contained nature of Module as-is where callers don't have to worry about passing in the right object or making sure they're dynamically selecting the right object. One semi-rough idea is that we could change the representation of Module to be an enum which is either "pointer to the stuff" or "private code plus pointer to the stuff"
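A minimal sketch of that enum idea, with all names hypothetical stand-ins (not Wasmtime's actual types), just to make the two variants concrete:

```rust
use std::sync::Arc;

// Stand-in for Wasmtime's real CodeMemory; illustrative only.
struct CodeMemory { _text: Vec<u8> }

// One shape of the enum idea: a module either points at the shared
// engine-level code, or additionally carries its own private copy.
enum ModuleCode {
    Shared(Arc<CodeMemory>),
    Private {
        // The store-local, patchable copy that actually executes.
        private: Arc<CodeMemory>,
        // Kept so metadata lookups still resolve against the original.
        shared: Arc<CodeMemory>,
    },
}

impl ModuleCode {
    // Callers never choose; the enum dispatches to the right copy,
    // preserving the self-contained nature of the module handle.
    fn executable(&self) -> &Arc<CodeMemory> {
        match self {
            ModuleCode::Shared(c) => c,
            ModuleCode::Private { private, .. } => private,
        }
    }
}

fn main() {
    let shared = Arc::new(CodeMemory { _text: vec![0; 4] });
    let m = ModuleCode::Shared(Arc::clone(&shared));
    assert!(Arc::ptr_eq(m.executable(), &shared));
    println!("ok");
}
```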


cfallin commented Nov 21, 2025

Thanks for your comments @alexcrichton! A high-level discussion point:

Another high-level thought though is that there's a fair bit of juggling of trying to make sure we're using the right code memory in the right context and it feels relatively cumbersome to maintain that everywhere.

Would it be possible to maybe do the deep clone of-a-sort on the Module itself to avoid having to juggle one-vs-the-other? I'd ideally like to keep the self-contained nature of Module as-is where callers don't have to worry about passing in the right object or making sure they're dynamically selecting the right object. One semi-rough idea is that we could change the representation of Module to be an enum which is either "pointer to the stuff" or "private code plus pointer to the stuff"

I think this is the ModuleRuntimeInfo with my changes -- it goes from a simple Module to a Module plus Arc<CodeMemory>.

In a little more detail: I did explore the "clone all the way up" approach before taking the current direction, and the main issue I came across was that (to say it bluntly) "it's a lie": in the presence of introspection APIs -- e.g. the ability to get a module from an Instance handle, and to get all instances via debugging introspection APIs -- we want the same module to compare as the same module. It also obviously adds a lot more cloning cost, which can be mitigated with a bunch of internal Arcs, but that's not great either. (It puts most of the cost on the debugging case rather than the base case, but does mean there are more indirections in the base case.)

My thought was that the ModuleRuntimeInfo would be the place where a module is joined together with a particular copy of its code. The most proper ("most truthful") refactor from there would be to remove all code-related accessors from Module and put them on ModuleRuntimeInfo instead, but that's also a large refactor that affects a bunch of callsites. Actually now that I think about it, though -- I had been worried about the accessors that fetch data from the image rather than return pointers to code (Wasm data, exception table, ...) and what they would do, but we could create an ad-hoc ModuleRuntimeInfo at that point. Or describe the abstraction as a new ModuleWithCode, replace ModuleRuntimeInfo::Module's payload with that, and hang all the accessors that need code off ModuleWithCode. Then we can get a "module with default code" from the Module itself if we know we aren't going to return pointers to it.

What do you think?


cfallin commented Nov 21, 2025

Then we can get a "module with default code" from the Module itself if we know we aren't going to return pointers to it.

And, actually, as a refinement on that: we could do some type-system shenanigans so that a "module with default code" is not allowed to return raw code pointers, only other (meta)data from the image, while a "module with code" obtained from an instance is.
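One hedged sketch of those type-system shenanigans (all type names here are illustrative, not Wasmtime's actual API): only the store-level pairing exposes an accessor that yields raw code pointers, so the "default code" pairing is metadata-only by construction.

```rust
// Illustrative stand-ins for the real types.
struct Metadata { name: &'static str }
struct EngineCode { _text: Vec<u8> }
struct StoreCode { text: Vec<u8> }

// A module paired with the engine's shared (default) code: metadata only.
struct ModuleWithDefaultCode<'a> { meta: &'a Metadata, _code: &'a EngineCode }

// A module paired with store-registered code: may hand out raw pointers.
struct ModuleWithCode<'a> { meta: &'a Metadata, code: &'a StoreCode }

impl<'a> ModuleWithDefaultCode<'a> {
    // No accessor here can return a pointer into the text.
    fn name(&self) -> &str { self.meta.name }
}

impl<'a> ModuleWithCode<'a> {
    fn name(&self) -> &str { self.meta.name }
    // Raw code pointers are reachable only through the store-level pairing.
    fn body_ptr(&self) -> *const u8 { self.code.text.as_ptr() }
}

fn main() {
    let meta = Metadata { name: "m" };
    let ec = EngineCode { _text: vec![0; 4] };
    let sc = StoreCode { text: vec![0; 4] };
    let default = ModuleWithDefaultCode { meta: &meta, _code: &ec };
    let with_code = ModuleWithCode { meta: &meta, code: &sc };
    assert_eq!(default.name(), "m");
    assert!(!with_code.body_ptr().is_null());
    println!("ok");
}
```

The compiler then enforces the invariant: code that holds only a `ModuleWithDefaultCode` simply has no way to name an executable address.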


fitzgen commented Nov 21, 2025

but that's also a large refactor that affects a bunch of callsites.

But it also sounds really nice...

Spitballing without having dug into the PR yet: is it possible that some lsp actions could help make this mass refactor easier?


cfallin commented Nov 21, 2025

Yeah, I guess "large refactor that affects a lot of callsites" was my reason for not doing it so far, but I'm happy to do it to get the right conceptual model.

To be clear, that's ModuleWithCode and moving image-related accessors onto that (and getting at them from the ModuleRuntimeInfo), not clone-Module-all-the-way-up. @alexcrichton does this sound good to you? I want to make sure you're onboard before I spend a day doing this :-)


@fitzgen fitzgen left a comment


To briefly repeat some comments from DM: I am in favor of the refactor discussed in this PR's main thread where

  • we create a type separation between the shared code at the engine level and the (potentially private copy) code used at the store level
  • all accessors for function pointers and such move to the store-level code type (or a Module and store-level code pair type)
  • the act of creating the store-level code is the one place where we either make a private copy or reuse the shared copy

bikeshedding names for these two new types:

  • EngineCode and StoreCode
  • CodeModule and CodeInstance
  • CodeFactory and CodeInstance lol


cfallin commented Nov 26, 2025

Just an update on progress here: I'm midway through the refactor, and it's proving quite a lot more invasive than I had hoped. On the positive side, I've worked out a way to have truly unique ownership of a StoreCode, so mutation can be proved safe given a mut borrow of the Store. But in terms of impact, this is very deeply invasive, and I worry that (for example) indirect function calls are going to be penalized: lazily setting up the funcref now requires looking up the StoreCode for the module in question, so rather than a few fast array accesses in the CompiledModule, we have a BTree lookup. The impact on the guest profiler (which handles all its metadata regarding PCs at the module level) and on InstancePre is also pretty deep; the latter may be OK only if we allow executing Wasm-to-array trampolines from the engine code rather than the store code, allowing pre-resolution of host funcrefs, though that breaks a desirable invariant that we never execute engine code directly.

I'll keep plugging away at this, but just wanted to surface the cost of the above requested refactor :-) I will say that with the types as now defined, I feel much more reasonable about having not missed any place that would execute the wrong copy of code.

cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 29, 2025
This will allow patching code to implement e.g. breakpoints. (That is,
for now the copies are redundant, but soon they will not be.)

This change follows the discussion [here] and offline to define a few
types that better encapsulate the distinction we want to enforce.
Basically, there is almost never a bare `CodeMemory`; they are always
wrapped in an `EngineCode` or `StoreCode`, the latter being a per-store
instance of the former. Accessors are moved to the relevant place so
that, for example, one cannot get a pointer to a Wasm function's body
without being in the context of a `Store` where the containing module
has been registered. The registry then returns a `ModuleWithCode` that
boxes up a `Module` reference and `StoreCode` together for cases where
we need both the metadata from the module and the raw code to derive
something.

The only cases where we return raw code pointers to the `EngineCode`
directly have to do with Wasm-to-array trampolines: in some cases, e.g.
`InstancePre` pre-creating data structures with references to host
functions, it breaks our expected performance characteristics to make
the function pointers store-specific. This is fine as long as the
Wasm-to-array trampolines never bake in direct calls to Wasm functions;
the strong invariant is that Wasm functions never execute from
`EngineCode` directly. Some parts of the component runtime would also
have to be substantially refactored if we wanted to do away with this
exception.

The per-`Store` module registry is substantially refactored in this PR.
I got rid of the modules-without-code distinction (the case where a
module only has trampolines and no defined functions still works fine),
and organized the BTreeMaps to key on start address rather than end
address, which I find a little more intuitive (one then queries with the
dual to the range -- 0-up-to-PC and last entry found).

[here]: bytecodealliance#12051 (review)
@cfallin cfallin force-pushed the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch from 36e1bc6 to 6af982c on November 29, 2025 23:53
@cfallin cfallin changed the title from "Debugging: implement private copies of CodeMemory." to "Debug: create private code memories per store when debugging is enabled." Nov 29, 2025
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 30, 2025
@cfallin cfallin force-pushed the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch from 6af982c to 9339956 on November 30, 2025 00:01
@cfallin cfallin force-pushed the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch from 9339956 to cfdfea1 on November 30, 2025 00:05

cfallin commented Nov 30, 2025

OK, I've now refactored in the direction we discussed above -- I feel a lot better about the types ensuring we're not leaking the wrong kind of function pointers (i.e. not executing the private copy) now!

I've updated the initial comment based on the new comment message above -- please see for some of the caveats and reasoning. Let me know what you think!


@alexcrichton alexcrichton left a comment


To me this all looks great, thanks for pushing on the refactor!

this is very deeply invasive, and I worry that (for example) indirect function calls are going to be penalized

I didn't run across this myself, so I wanted to clarify -- is this still present or did this end up getting resolved during refactoring?

Comment on lines 54 to 76
 #[derive(Default)]
 pub struct ModuleRegistry {
-    // Keyed by the end address of a `CodeObject`.
-    //
-    // The value here is the start address and the information about what's
-    // loaded at that address.
-    loaded_code: BTreeMap<usize, (usize, LoadedCode)>,
-
-    // Preserved for keeping data segments alive or similar
-    modules_without_code: Vec<Module>,
+    /// StoreCode and Modules associated with it.
+    ///
+    /// Keyed by the start address of the `StoreCode`. We maintain the
+    /// invariant of no overlaps on insertion. We use a range query to
+    /// find the StoreCode for a given PC: take the range `0..=pc`,
+    /// then take the last element of the range. That picks the
+    /// highest start address <= the query, and we can check whether
+    /// it contains the address.
+    loaded_code: BTreeMap<StoreCodePC, LoadedCode>,
+
+    /// Map from EngineCodePC start to StoreCodePC start. We use this
+    /// to memoize the store-code creation process: each EngineCode is
+    /// instantiated to a StoreCode only once per store.
+    store_code: BTreeMap<EngineCodePC, StoreCodePC>,
+
+    /// Modules instantiated in this registry.
+    ///
+    /// Every module is placed in this map, but not every module will
+    /// be in a LoadedCode entry, because the module may have no text.
+    modules: BTreeMap<RegisteredModuleId, Module>,
 }

Since this is on the hot path of instantiation, can you rerun the instantiation benchmarks we have in-repo (or some interesting subset of them) to see the impact of the extra maps here?


Sure -- looks like about a 3% hit on a small module, sequential instantiation:

% cargo bench --bench instantiation -- sequential/pooling/wasi.wasm

[ baseline run ]

sequential/pooling/wasi.wasm
                        time:   [1.4748 µs 1.4770 µs 1.4795 µs]

[ run with this PR's changes ]

sequential/pooling/wasi.wasm
                        time:   [1.5173 µs 1.5184 µs 1.5196 µs]
                        change: [+2.9095% +3.0865% +3.2436%] (p = 0.00 < 0.05)
                        Performance has regressed.

Since this is constant-overhead-per-module, the operative number is the extra ~40 ns; that seems OK to me?


cfallin commented Dec 2, 2025

[indirect function calls]

I didn't run across this myself, so I wanted to clarify -- is this still present or did this end up getting resolved during refactoring?

I was worried about the additional overhead of the EngineCode -> StoreCode lookup in Instance::get_func_ref, but that's one BTreeMap lookup in a map likely to have only one or a few entries, and it happens only once per table slot invoked, so perhaps not too bad? Actually, let's benchmark -- Sightglass says:

$ RUST_LOG=info taskset 1 target/release/sightglass-cli benchmark -e ../wasmtime/old.so -e ../wasmtime/new.so --iterations-per-process 5 --processes 2 benchmarks/default.suite
[ ... ]
execution :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [71002288 72007774.20 72896266] new.so
  [70196726 71271162.70 73367632] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [449806273 459920466.40 477878694] new.so
  [455205740 461003924.50 470907233] old.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4736966 4816051.60 5011779] new.so
  [4713058 4819706.60 5088233] old.so

@cfallin cfallin force-pushed the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch from 458629f to ca496be on December 2, 2025 03:38
…abled: this always happens so we cannot error out.
@cfallin cfallin force-pushed the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch from ca496be to b22a38e on December 2, 2025 03:38

cfallin commented Dec 2, 2025

OK, I think this is ready for another look -- thanks @alexcrichton!

@github-actions github-actions bot added the wasmtime:c-api Issues pertaining to the C API. label Dec 2, 2025
@cfallin cfallin added the wasmtime:debugging Issues related to debugging of JIT'ed code label Dec 2, 2025

fitzgen commented Dec 2, 2025

Actually, let's benchmark -- Sightglass says:

2 processes x 5 iterations per process = 10 total samples per group. You generally can't do significance tests with fewer than 30 samples per group (and you probably want more than that for the cycles measure, since it is so noisy).

If you're going to benchmark, then you should do it properly :-p

If the amount of time taken to run the benchmarks was a concern, then you could pass --benchmark-phase execution to avoid re-compiling on each iteration.

Also, fwiw, we have some call-indirect specific micro benchmarks here which might be better for measuring this specific thing.


cfallin commented Dec 2, 2025

OK, cool, I'll rerun this in a bit. Indeed I only had a small sliver of time last night and didn't want to wait hours for results. Good to know that one can re-use compilations, thanks.

My intent with using the real-program benchmarks rather than microbenchmarks was to measure the effect of call-indirects in proper context, but I can do the microbenchmarks too.


cfallin commented Dec 2, 2025

Here's the data with 100 data points:

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 50870.29 ± 39152.71 (confidence = 99%)

  old.so is 1.00x to 1.02x faster than new.so!

  [4490103 4689894.33 5160645] new.so
  [4502186 4639024.04 4995740] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3902443.50 ± 1435000.11 (confidence = 99%)

  old.so is 1.01x to 1.01x faster than new.so!

  [452564035 458920590.84 468422822] new.so
  [447394704 455018147.34 464078661] old.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [69507823 70879740.15 73834440] new.so
  [69568496 71123992.19 74071155] old.so

So a slight slowdown in SpiderMonkey (1%), for example, which makes heavier use of virtual calls.

I might have some ideas for refactoring the ModuleRegistry::store_code lookup to return a thin wrapper over EngineCode when we aren't configured to do private cloning, which would eliminate the extra lookup -- let me see if that helps...


cfallin commented Dec 2, 2025

A little more experimentation: I made a small tweak so that we don't do the StoreCode lookup from EngineCode when debugging is not enabled, but rather directly use the EngineCode's image. (The safety type wrappers there could use some bikeshedding, but the point is the perf experiment.)

It turns out that this doesn't help: old.so is baseline, new.so is this PR, new2.so is the above commit (two separate pairwise runs; 10 processes and 50 data points per process):

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3293570.64 ± 801397.15 (confidence = 99%)

  old.so is 1.01x to 1.01x faster than new2.so!

  [444994186 457634102.82 472147181] new2.so
  [444580053 454340532.18 464774917] old.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1711875.58 ± 601308.43 (confidence = 99%)

  new.so is 1.00x to 1.01x faster than new2.so!

  [443090662 452795699.65 464339929] new.so
  [445620051 454507575.23 465311385] new2.so

(aside: Sightglass has a bug where with new.so and new2.so, it strips the common prefix and reports results for .so and 2.so; manually edited the above. No energy to go fix that at the moment)

So perhaps the ~1% overhead is in the additional plumbing in general on the table funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) on debug and no-debug versions I think. Maybe we eat the 1%. What do you think?
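For concreteness, a hedged sketch of what "monomorphizing on debug" could look like, using a const-generic flag so the non-debug instantiation compiles the lookup away entirely. The names and map shape are hypothetical, not Wasmtime's actual funcref path:

```rust
use std::collections::BTreeMap;

// Hypothetical: resolve the executable address for a funcref. With
// DEBUG = false the StoreCode lookup branch is compiled out entirely;
// with DEBUG = true we map the engine-level address to the store's
// private copy (falling back to the engine address if unregistered).
fn resolve_funcref<const DEBUG: bool>(
    engine_addr: usize,
    store_code: &BTreeMap<usize, usize>, // engine start -> store start
) -> usize {
    if DEBUG {
        *store_code.get(&engine_addr).unwrap_or(&engine_addr)
    } else {
        engine_addr
    }
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert(0x1000, 0x9000);
    assert_eq!(resolve_funcref::<false>(0x1000, &map), 0x1000);
    assert_eq!(resolve_funcref::<true>(0x1000, &map), 0x9000);
    println!("ok");
}
```

The cost the thread is weighing is that this duplication would have to propagate through the whole funcref runtime (libcalls on down), not just one leaf function like this.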


fitzgen commented Dec 2, 2025

didn't want to wait hours for results

If I do a full run of default.suite it takes 27 mins on my old thinkpad, probably faster for you on a newer MBP, fwiw. Also faster if doing --benchmark-phase execution or if measuring instructions retired and you can drop to a small handful of iterations. But yeah 27 mins is not great.

So perhaps the ~1% overhead is in the additional plumbing in general on the table funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) on debug and no-debug versions I think. Maybe we eat the 1%. What do you think?

This is only the case for lazy table init, right? If it isn't lazy then we'd still be hitting these branches, but at instantiation time instead, and we would be doing them all at once, so presumably the branch predictor would be more useful in that case, no?

If this is only for lazy table init, then yeah seems fine by me. Although maybe Alex has more ideas here. I originally was thinking that the module registry could possibly contain a hash map instead of a btree map for the new member, but it sounds like the costs are not related to the data structure.

Slightly more concerning if this is a 1% slowdown regardless of lazy table init...


cfallin commented Dec 2, 2025

This is only the case for lazy table init, right?

It seems so, yeah -- flipping the default to eager table init in tunables, I get (100 data points, 10 proc * 10/proc):

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [453510422 463605372.90 509169192] new-eager-init.so
  [453961879 467730247.66 531244575] old-eager-init.so


@alexcrichton alexcrichton left a comment


Code all looks good to me, thanks for the updates!

Questions on perf though, so it sounds like:

  • Instantiation-wise this is measured as a 3% regression.
  • call_indirect-wise it's measured as a "no meaningful change"

Is that right?

I would view this as adding relatively low-hanging fruit to the "perf tree" to pick in the future, and to me it's mostly a question of whether we pick that fruit here or not. The extra data structures can likely be optimized with fewer lookups/insertions or something like that when debugging enabled is my thinking.

We're also not really measuring heap impact (IIRC B-Trees are relatively bad for that?) which might be another consideration here. Basically I would be surprised if these exact data structures lived as-is for a long period of time past perf testing and such. I'd be more comfortable if the addition of debugging-related features didn't impact the data structures when debugging wasn't turned on, but it's a pretty minor impact here.

Basically I'm ok with this personally, but I think if we land this as-is it's worth opening an issue describing the possible optimization opportunity. More-or-less it feels like we shouldn't need 4 btree insertions when a single module is instantiated in a store in the fast path. That feels like it's pretty overkill relative to "lightweight instantiation". I realize we probably already have 3 insertions or something like that today, but this is basically a ripe area for optimization in the future if someone's interested.


cfallin commented Dec 3, 2025

Yep, that's right, and I'm happy to file an issue for followup. The refactor I did above to remove the StoreCode lookup for the get_func_ref path could probably be carried further to actually not materialize a true StoreCode at instantiation (instead building a view of text/range in ModuleWithCode when queried) in the non-debugging case; that should let us get back to status quo. It may also be worth optimizing the one-module case (?) -- BTree lookups in a tree of one entry are fast but there's still the allocation and initialization cost.


cfallin commented Dec 3, 2025

Followup filed at #12111.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 3, 2025
@cfallin cfallin added this pull request to the merge queue Dec 3, 2025
Merged via the queue into bytecodealliance:main with commit 99ecf72 Dec 3, 2025
49 checks passed
@cfallin cfallin deleted the you-get-a-code-segment-you-get-a-code-segment-everyone-gets-a-code-segment branch December 3, 2025 01:41
cfallin added a commit to cfallin/wasmtime that referenced this pull request Dec 6, 2025
This is a PR that puts together a bunch of earlier pieces (patchable
calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all
the prior debug event and instrumentation infrastructure) to implement
breakpoints in the guest debugger.

These are implemented in the way we have planned in bytecodealliance#11964: each
sequence point (location prior to a Wasm opcode) is now a patchable call
instruction, patched out (replaced with NOPs) by default. When patched
in, the breakpoint callsite calls a trampoline with the `patchable` ABI
which then invokes the `breakpoint` hostcall. That hostcall emits the
debug event and nothing else.

A few of the interesting bits in this PR include:
- Implementations of "unpublish" (switch permissions back to read/write
  from read/execute) for mmap'd code memory on all our platforms.
- Infrastructure in the frame-tables (debug info) metadata producer and
  parser to record "breakpoint patches".
- A tweak to the NOP metadata packaged with the `MachBuffer` to allow
  multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for
  example (did you know x86-64 had these?!) rather than five 1-byte
  NOPs.
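As a rough illustration of the patching mechanism described above (the NOP and call encodings are real x86-64, but the helper, layout, and offsets are hypothetical): a recorded patch site is either a single 5-byte NOP or a 5-byte rel32 call.

```rust
// 5-byte x86-64 NOP: nopl 0x0(%rax,%rax,1).
const NOP5: [u8; 5] = [0x0f, 0x1f, 0x44, 0x00, 0x00];

// Toggle one breakpoint patch site between "NOP" and "call rel32".
// `site` is the byte offset of the patchable call within the text section;
// the text must be mapped read/write ("unpublished") while patching.
fn patch_site(text: &mut [u8], site: usize, enable: bool, rel32: i32) {
    if enable {
        text[site] = 0xe8; // call rel32 opcode
        text[site + 1..site + 5].copy_from_slice(&rel32.to_le_bytes());
    } else {
        text[site..site + 5].copy_from_slice(&NOP5);
    }
}

fn main() {
    let mut text = vec![0u8; 16];
    patch_site(&mut text, 3, true, 0x100); // arm the breakpoint
    assert_eq!(text[3], 0xe8);
    patch_site(&mut text, 3, false, 0); // disarm: back to a single NOP
    assert_eq!(&text[3..8], &NOP5);
    println!("ok");
}
```

Using one 5-byte NOP rather than five 1-byte NOPs keeps the patched-out site a single instruction, so the disarmed path decodes and retires cheaply.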

This PR also implements single-stepping with a global-per-`Store` flag,
because at this point why not; it's a small additional bit of logic to
do *all* patches in all modules registered in the `Store` when that flag
is enabled.

A few realizations for future work:
- The need for an introspection API available to a debugger to see the
  modules within a component is starting to become clear; either that,
  or the "module and PC" location identifier for a breakpoint switches
  to a "module or component" sum type. Right now, the tests for this
  feature use only core modules. Extending to components should not
  actually be hard at all, we just need to build the API for it.
- The interaction between inlining and `patchable_call` is interesting:
  what happens if we inline a `patchable_call` at a `try_call` callsite?
  Right now, we do *not* update the `patchable_call` to a `try_call`,
  because there is no `patchable_try_call`; this is fine in the Wasmtime
  embedding in practice because we never (today!) throw exceptions from
  a breakpoint handler. This does suggest to me that maybe we should
  make patchability a property of any callsite, and allow try-calls to
  be patchable too (with no return values as the only restriction); but
  happy to discuss that one further.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Dec 6, 2025
cfallin added a commit to cfallin/wasmtime that referenced this pull request Dec 7, 2025

Labels

- `wasmtime:api`: Related to the API of the `wasmtime` crate itself
- `wasmtime:c-api`: Issues pertaining to the C API.
- `wasmtime:debugging`: Issues related to debugging of JIT'ed code
