Debug: create private code memories per store when debugging is enabled. #12051
Conversation
Force-pushed from 1db1437 to 36e1bc6
alexcrichton left a comment
I've left a few comments here and there, but a high-level thought I had here is that I was originally hoping that it would be possible to use Arc::get_mut to dynamically prove that it's safe to mutate the memory. In reading over this though we have quite a lot of clones in quite a lot of places so I'm realizing that it may not be quite as applicable as previously thought.
Another high-level thought though is that there's a fair bit of juggling of trying to make sure we're using the right code memory in the right context and it feels relatively cumbersome to maintain that everywhere.
Would it be possible to maybe do the deep-clone-of-a-sort on the `Module` itself to avoid having to juggle one-vs-the-other? I'd ideally like to keep the self-contained nature of `Module` as-is, where callers don't have to worry about passing in the right object or making sure they're dynamically selecting the right object. One semi-rough idea is that we could change the representation of `Module` to be an enum which is either "pointer to the stuff" or "private code plus pointer to the stuff".
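For illustration, a minimal sketch of that enum idea -- all names and fields here are hypothetical stand-ins, not the PR's actual types:

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the shared compiled artifacts and the
// executable mapping; the real types in Wasmtime differ.
struct ModuleInner { /* metadata, shared code memory, ... */ }
struct CodeMemory { /* private, patchable executable mapping */ }

// The "semi-rough idea" above: `Module` is either a pointer to the shared
// stuff, or a private code copy plus a pointer to the shared stuff.
enum Module {
    Shared(Arc<ModuleInner>),
    PrivateCode {
        code: Box<CodeMemory>,
        shared: Arc<ModuleInner>,
    },
}
```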
Thanks for your comments @alexcrichton! A high-level discussion point:
I think this is the key design question. In a little more detail: I did explore the "clone all the way up" approach before taking the current direction, and the main issue I came across was that (to say it bluntly) "it's a lie": in the presence of introspection APIs, e.g. the ability to get a module back from an instance, a deep-cloned `Module` would present itself as the same module while actually being backed by different code. My thought was that keeping the per-store code as its own thing makes that distinction explicit rather than hidden. What do you think?
And, actually, as a refinement on that: we could do some type-system shenanigans so that a "module with default code" is not allowed to return raw code pointers, only other (meta)data from the image, while a "module with code" obtained from an instance is.
But it also sounds really nice... Spitballing without having dug into the PR yet: is it possible that some LSP actions could help make this mass refactor easier?
Yeah, I guess "large refactor that affects a lot of callsites" was my reason for not doing it so far, but I'm happy to do it to get the right conceptual model.
fitzgen left a comment
To briefly repeat some comments from DM: I am in favor of the refactor discussed in this PR's main thread where

- we create a type separation between the shared code at the engine level and the (potentially private copy) code used at the store level
- all accessors for function pointers and such move to the store-level code type (or a `Module` and store-level code pair type)
- the act of creating the store-level code is the one place where we are making a private copy or reusing the shared copy

bikeshedding names for these two new types:

- `EngineCode` and `StoreCode`
- `CodeModule` and `CodeInstance`
- `CodeFactory` and `CodeInstance` lol
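For concreteness, here is a minimal sketch of the separation described in the list above, using the `EngineCode`/`StoreCode` names from the bikeshed; everything else (fields, methods, the `debug` flag) is an assumption for illustration, not the PR's implementation:

```rust
use std::sync::Arc;

/// Stand-in for the executable mapping (mmap'd text section, etc.).
struct CodeMemory { /* ... */ }

/// Shared, engine-level code: one copy per compiled artifact.
struct EngineCode {
    code: Arc<CodeMemory>,
}

/// Store-level code: either the shared mapping or a private, patchable copy.
struct StoreCode {
    code: Arc<CodeMemory>,
}

impl EngineCode {
    /// The one place where we decide between reusing the shared copy and
    /// making a private one (here keyed off a hypothetical `debug` flag).
    fn instantiate_for_store(&self, debug: bool) -> StoreCode {
        if debug {
            // The real implementation would copy and re-publish the code
            // memory; this placeholder just allocates a fresh mapping.
            StoreCode { code: Arc::new(CodeMemory { /* private copy */ }) }
        } else {
            StoreCode { code: Arc::clone(&self.code) }
        }
    }
}

impl StoreCode {
    /// Function-body pointers are only reachable from the store-level type,
    /// so callers must be in the context of a store's registered code.
    fn func_body_ptr(&self, _offset: usize) -> *const u8 {
        std::ptr::null() // placeholder; real code indexes into the text section
    }
}
```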
Just an update on progress here: I'm midway through the refactor, and it's proving quite a lot more invasive than I had hoped. On the positive side, I've worked out a way to have truly unique ownership of the store-level code copy; on the other hand, this is very deeply invasive, and I worry that (for example) indirect function calls are going to be penalized.

I'll keep plugging away at this, but just wanted to surface the cost of the above requested refactor :-) I will say that with the types as now defined, I feel much more confident about not having missed any place that would execute the wrong copy of code.
This will allow patching code to implement e.g. breakpoints. (That is, for now the copies are redundant, but soon they will not be.)

This change follows the discussion [here] and offline to define a few types that better encapsulate the distinction we want to enforce. Basically, there is almost never a bare `CodeMemory`; they are always wrapped in an `EngineCode` or `StoreCode`, the latter being a per-store instance of the former. Accessors are moved to the relevant place so that, for example, one cannot get a pointer to a Wasm function's body without being in the context of a `Store` where the containing module has been registered. The registry then returns a `ModuleWithCode` that boxes up a `Module` reference and `StoreCode` together for cases where we need both the metadata from the module and the raw code to derive something.

The only case where we return raw code pointers from the `EngineCode` directly has to do with Wasm-to-array trampolines: in some cases, e.g. `InstancePre` pre-creating data structures with references to host functions, it breaks our expected performance characteristics to make the function pointers store-specific. This is fine as long as the Wasm-to-array trampolines never bake in direct calls to Wasm functions; the strong invariant is that Wasm functions never execute from `EngineCode` directly. Some parts of the component runtime would also have to be substantially refactored if we wanted to do away with this exception.

The per-`Store` module registry is substantially refactored in this PR. I got rid of the modules-without-code distinction (the case where a module only has trampolines and no defined functions still works fine), and organized the BTreeMaps to key on start address rather than end address, which I find a little more intuitive (one then queries with the dual to the range -- 0-up-to-PC and last entry found).

[here]: bytecodealliance#12051 (review)
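To make the `ModuleWithCode` idea above concrete, a hedged sketch of what such a pairing might look like -- the field layout and the `func_offsets`/`func_body_ptr` accessors are invented for illustration, not the PR's definitions:

```rust
// Stub types standing in for the real ones; only the shape matters here.
struct Module { func_offsets: Vec<usize> }
struct StoreCode { /* this store's (possibly private) code mapping */ }

impl StoreCode {
    fn func_body_ptr(&self, _offset: usize) -> *const u8 { std::ptr::null() }
}

/// A borrowed pairing of module metadata with the store-level code backing
/// it, as returned by the per-store registry lookup.
struct ModuleWithCode<'a> {
    module: &'a Module,
    code: &'a StoreCode,
}

impl<'a> ModuleWithCode<'a> {
    /// Anything that needs both metadata and raw code hangs off this pair,
    /// e.g. resolving a function's entry point within this store's copy.
    fn func_entry(&self, index: usize) -> *const u8 {
        self.code.func_body_ptr(self.module.func_offsets[index])
    }
}
```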
Force-pushed from 36e1bc6 to 6af982c
Force-pushed from 6af982c to 9339956
Force-pushed from 9339956 to cfdfea1
OK, I've now refactored in the direction we discussed above -- I feel a lot better now that the types ensure we're not leaking the wrong kind of function pointers (i.e. pointers that would bypass the private copy). I've updated the initial comment based on the new commit message above -- please see it for some of the caveats and reasoning. Let me know what you think!
alexcrichton left a comment
To me this all looks great, thanks for pushing on the refactor!
> this is very deeply invasive, and I worry that (for example) indirect function calls are going to be penalized
I didn't run across this myself, so I wanted to clarify -- is this still present or did this end up getting resolved during refactoring?
```diff
 #[derive(Default)]
 pub struct ModuleRegistry {
-    // Keyed by the end address of a `CodeObject`.
-    //
-    // The value here is the start address and the information about what's
-    // loaded at that address.
-    loaded_code: BTreeMap<usize, (usize, LoadedCode)>,
-
-    // Preserved for keeping data segments alive or similar
-    modules_without_code: Vec<Module>,
+    /// StoreCode and Modules associated with it.
+    ///
+    /// Keyed by the start address of the `StoreCode`. We maintain the
+    /// invariant of no overlaps on insertion. We use a range query to
+    /// find the StoreCode for a given PC: take the range `0..=pc`,
+    /// then take the last element of the range. That picks the
+    /// highest start address <= the query, and we can check whether
+    /// it contains the address.
+    loaded_code: BTreeMap<StoreCodePC, LoadedCode>,
+
+    /// Map from EngineCodePC start to StoreCodePC start. We use this
+    /// to memoize the store-code creation process: each EngineCode is
+    /// instantiated to a StoreCode only once per store.
+    store_code: BTreeMap<EngineCodePC, StoreCodePC>,
+
+    /// Modules instantiated in this registry.
+    ///
+    /// Every module is placed in this map, but not every module will
+    /// be in a LoadedCode entry, because the module may have no text.
+    modules: BTreeMap<RegisteredModuleId, Module>,
 }
```
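The doc comment on `loaded_code` above describes the PC lookup; a minimal sketch of that range query, assuming `StoreCodePC` is just an address-ordered newtype and `LoadedCode` knows its region length (both are stubs here, not the real definitions):

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct StoreCodePC(usize);

struct LoadedCode {
    len: usize, // length of the code region; the real type carries much more
}

/// Find the code region containing `pc`: take everything starting at or
/// below `pc`, look at the last (highest) start address, and check that
/// the region actually contains `pc`.
fn lookup(map: &BTreeMap<StoreCodePC, LoadedCode>, pc: usize) -> Option<&LoadedCode> {
    let (&StoreCodePC(start), code) = map.range(..=StoreCodePC(pc)).next_back()?;
    (pc < start + code.len).then_some(code)
}
```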
Since this is on the hot path of instantiation, can you rerun the instantiation benchmarks we have in-repo (or some interesting subset of them) to see the impact of the extra maps here?
Sure -- looks like about a 3% hit on a small module, sequential instantiation:
```
% cargo bench --bench instantiation -- sequential/pooling/wasi.wasm

[ baseline run ]
sequential/pooling/wasi.wasm
                        time:   [1.4748 µs 1.4770 µs 1.4795 µs]

[ run with this PR's changes ]
sequential/pooling/wasi.wasm
                        time:   [1.5173 µs 1.5184 µs 1.5196 µs]
                        change: [+2.9095% +3.0865% +3.2436%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
Since this is constant-overhead-per-module, the operative number is the extra ~40 ns; that seems OK to me?
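(From the medians above: 1.5184 µs − 1.4770 µs ≈ 0.041 µs, i.e. roughly 41 ns of extra work per instantiation, consistent with the ~40 ns figure.)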
Force-pushed from 458629f to ca496be
Force-pushed from ca496be to b22a38e
OK, I think this is ready for another look -- thanks @alexcrichton!
2 processes x 5 iterations per process = 10 total samples per group. You can't generally do significance tests with less than 30 samples per group (and you probably want more than that). If you're going to benchmark, then you should do it properly :-p

If the amount of time taken to run the benchmarks was a concern, then you could reuse the already-compiled modules instead of recompiling them on every run.

Also, fwiw, we have some call-indirect specific micro benchmarks here which might be better for measuring this specific thing.
OK, cool, I'll rerun this in a bit. Indeed I only had a small sliver of time last night and didn't want to wait hours for results. Good to know that one can re-use compilations, thanks. My intent with using the real-program benchmarks rather than microbenchmarks was to measure the effect of call-indirects in proper context, but I can do the microbenchmarks too.
Here's the data with 100 data points: a slight slowdown in SpiderMonkey (1%), for example, which makes heavier use of virtual calls. I might have some ideas for refactoring the `StoreCode` lookup to reduce that.
A little more experimentation: I made a small tweak so that we don't do the StoreCode lookup from EngineCode when debugging is not enabled, but rather directly use the EngineCode's image. (Bikeshed the safety type wrappers maybe, but the point is the perf experiment.) It turns out that this doesn't help.

So perhaps the ~1% overhead is in the additional plumbing in general on the table funcref init path; that's unavoidable unless we monomorphize the whole funcref-related runtime (libcalls on down) into debug and no-debug versions, I think. Maybe we eat the 1%. What do you think?
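A rough sketch of the shape of that tweak, with all types and methods here invented as stand-ins rather than taken from the PR:

```rust
// Stand-in types: `image` is the executable mapping we hand out pointers into.
struct CodeMemory;
struct EngineCode { image: CodeMemory }
struct StoreCode { image: CodeMemory }

struct ModuleRegistry { /* per-store EngineCode -> StoreCode mapping */ }
impl ModuleRegistry {
    fn store_code_for<'a>(&'a self, _engine: &EngineCode) -> &'a StoreCode {
        unimplemented!("look up this store's private copy")
    }
}

/// When debugging is off there is no private copy to diverge from, so we can
/// skip the per-store lookup and use the engine-level image directly.
fn image_for_funcref<'a>(
    engine: &'a EngineCode,
    registry: &'a ModuleRegistry,
    debug_enabled: bool,
) -> &'a CodeMemory {
    if debug_enabled {
        // Debugging: route through the registry to this store's private copy.
        &registry.store_code_for(engine).image
    } else {
        // No debugging: the shared engine image is the one we execute anyway.
        &engine.image
    }
}
```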
This is only the case for lazy table init, right? If it isn't lazy then we'd still be hitting these branches, but at instantiation time instead, and we would be doing them all at once, so presumably the branch predictor would be more useful in that case, no?

If this is only for lazy table init, then yeah seems fine by me. Although maybe Alex has more ideas here. I originally was thinking that the module registry could possibly contain a hash map instead of a btree map for the new member, but it sounds like the costs are not related to the data structure.

Slightly more concerning if this is a 1% slowdown regardless of lazy table init...
It seems so, yeah -- flipping the default to eager table init in tunables, I get essentially the same result (100 data points, 10 proc * 10/proc).
alexcrichton left a comment
Code all looks good to me, thanks for the updates!
Questions on perf though, so it sounds like:
- Instantiation-wise this is measured as a 3% regression.
- call_indirect-wise it's measured as a "no meaningful change"
Is that right?
I would view this as adding relatively low-hanging fruit to the "perf tree" to pick in the future, and to me it's mostly a question of whether we pick that fruit here or not. My thinking is that the extra data structures can likely be optimized with fewer lookups/insertions or something like that when debugging is enabled.
We're also not really measuring heap impact (IIRC B-Trees are relatively bad for that?) which might be another consideration here. Basically I would be surprised if these exact data structures lived as-is for a long period of time past perf testing and such. I'd be more comfortable if the addition of debugging-related features didn't impact the data structures when debugging wasn't turned on, but it's a pretty minor impact here.
Basically I'm ok with this personally, but I think if we land this as-is it's worth opening an issue describing the possible optimization opportunity. More-or-less it feels like we shouldn't need 4 btree insertions when a single module is instantiated in a store in the fast path. That feels like it's pretty overkill relative to "lightweight instantiation". I realize we probably already have 3 insertions or something like that today, but this is basically a ripe area for optimization in the future if someone's interested.
Yep, that's right, and I'm happy to file an issue for followup.
Followup filed at #12111.
This is a PR that puts together a bunch of earlier pieces (patchable calls in bytecodealliance#12061 and bytecodealliance#12101, private copies of code in bytecodealliance#12051, and all the prior debug event and instrumentation infrastructure) to implement breakpoints in the guest debugger. These are implemented in the way we have planned in bytecodealliance#11964: each sequence point (location prior to a Wasm opcode) is now a patchable call instruction, patched out (replaced with NOPs) by default. When patched in, the breakpoint callsite calls a trampoline with the `patchable` ABI which then invokes the `breakpoint` hostcall. That hostcall emits the debug event and nothing else.

A few of the interesting bits in this PR include:

- Implementations of "unpublish" (switch permissions back to read/write from read/execute) for mmap'd code memory on all our platforms.
- Infrastructure in the frame-tables (debug info) metadata producer and parser to record "breakpoint patches".
- A tweak to the NOP metadata packaged with the `MachBuffer` to allow multiple NOP sizes. This lets us use one 5-byte NOP on x86-64, for example (did you know x86-64 had these?!) rather than five 1-byte NOPs.

This PR also implements single-stepping with a global-per-`Store` flag, because at this point why not; it's a small additional bit of logic to do *all* patches in all modules registered in the `Store` when that flag is enabled.

A few realizations for future work:

- The need for an introspection API available to a debugger to see the modules within a component is starting to become clear; either that, or the "module and PC" location identifier for a breakpoint switches to a "module or component" sum type. Right now, the tests for this feature use only core modules. Extending to components should not actually be hard at all, we just need to build the API for it.
- The interaction between inlining and `patchable_call` is interesting: what happens if we inline a `patchable_call` at a `try_call` callsite? Right now, we do *not* update the `patchable_call` to a `try_call`, because there is no `patchable_try_call`; this is fine in the Wasmtime embedding in practice because we never (today!) throw exceptions from a breakpoint handler. This does suggest to me that maybe we should make patchability a property of any callsite, and allow try-calls to be patchable too (with the same restriction about no return values as the only restriction); but happy to discuss that one further.