[AOT] Memory Info Restore Mechanism with Better Performance #4113
Conversation
Thank you for the keen observation and the intriguing analysis of the root cause. IIUC, the rationale behind storing the address of the memory base address is visible in `aot_check_memory_overflow()`:

```c
// in aot_check_memory_overflow()
/* Get memory base address and memory data size */
if (func_ctx->mem_space_unchanged
#if WASM_ENABLE_SHARED_MEMORY != 0
|| is_shared_memory
#endif
) {
mem_base_addr = func_ctx->mem_info[0].mem_base_addr; // This branch should be used in the majority of cases.
}
else {
if (!(mem_base_addr = LLVMBuildLoad2(
comp_ctx->builder, OPQ_PTR_TYPE,
func_ctx->mem_info[0].mem_base_addr, "mem_base"))) {
aot_set_last_error("llvm build load failed.");
goto fail;
}
}
```

Using the address of the memory base address does not allow optimization passes to recognize the pattern and decide to eliminate superfluous load instructions. However, I concur regarding the conditions for setting `mem_space_unchanged`:

```c
// in create_memory_info
bool mem_space_unchanged = true; // (!func->has_op_memory_grow && !func->has_op_func_call) || (!module->possible_memory_grow);
```

Therefore, I believe the concept of "reloading the base address when the memory might change" is excellent. If you're in agreement with my perspective, we can begin refactoring the PR by concentrating on "reloading the base address when the memory might change" and eliminating "keeping the address of the memory base address."
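To make that direction concrete, here is a rough sketch of what such a reload could look like, reusing the patterns from the snippets above; `refresh_mem_base_addr` and the `mem_base_addr_ptr` field are hypothetical names for illustration, not existing WAMR code:

```c
/* Hypothetical sketch, not actual WAMR code: after an opcode that can move
 * linear memory (memory.grow, or a call whose callee may grow memory),
 * re-read the base address from the instance so later accesses see the
 * fresh value. */
static bool
refresh_mem_base_addr(AOTCompContext *comp_ctx, AOTFuncContext *func_ctx)
{
    LLVMValueRef fresh;

    /* mem_base_addr_ptr: hypothetical field holding the address of the
       memory base address inside the AOT instance */
    if (!(fresh = LLVMBuildLoad2(comp_ctx->builder, OPQ_PTR_TYPE,
                                 func_ctx->mem_info[0].mem_base_addr_ptr,
                                 "mem_base_addr_fresh"))) {
        aot_set_last_error("llvm build load failed.");
        return false;
    }
    /* overwrite the cached base address that load/store opcodes read */
    LLVMBuildStore(comp_ctx->builder, fresh,
                   func_ctx->mem_info[0].mem_base_addr);
    return true;
}
```

In this scheme, ordinary load/store opcodes never touch the instance; they only read the cached slot, and the slot is rewritten at the few points where the memory might have moved.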
@lum1n0us Completely agree.
> If you're in agreement with my perspective, we can begin refactoring the PR by concentrating on "reloading the base address when the memory might change" and eliminating "keeping the address of the memory base address."

I agree. Let me know what I can do to help with the refactoring.
I believe the approach should be:
You're correct; having readable names for AOT functions in the IR and generated code would indeed make debugging more user-friendly. However, the reality is that .wasm files often lack a name section. Typically, in the interest of minimizing binary size, debug information is stripped, which means we need at least two naming schemes to handle scenarios with and without a name section. On the other hand, generated function names are a common assumption across AOT and JIT running modes and their supporting tools. Therefore, unless there's a comprehensive solution available, we might prefer to stick with additional scripts for the time being.
```diff
         if (!(mem_base_addr = LLVMBuildLoad2(
                   comp_ctx->builder, OPQ_PTR_TYPE,
-                  func_ctx->mem_info[0].mem_base_addr, "mem_base"))) {
+                  func_ctx->mem_info[0].mem_base_addr, "mem_base_addr"))) {
```
I assume the main purpose of this PR is to minimize or eliminate load instructions in memory operations. However, the changes eliminated all the fast accesses (the `if` branch) but retained the slow ones (the `else` branch). Does this actually address the original problem?
These two `else` branches are different. In this PR, `mem_info[0].mem_base_addr` comes from `LLVMBuildAlloca`, which allocates memory on the stack (often promoted to registers by the `mem2reg` optimization, or kept in cache), whereas the previous version loads from global memory.
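For concreteness, here is a minimal sketch (assuming the LLVM C API; `cache_mem_base` and `mem_base_addr_offset` are illustrative placeholders, not the PR's actual variables) of the alloca-based caching being described:

```c
/* Illustrative sketch of the alloca-based cache described above;
 * names are placeholders, not the PR's actual code. */
#include <llvm-c/Core.h>

static LLVMValueRef
cache_mem_base(LLVMBuilderRef builder, LLVMTypeRef opq_ptr_type,
               LLVMValueRef mem_base_addr_offset)
{
    /* entry block: one stack slot caching the memory base address */
    LLVMValueRef slot =
        LLVMBuildAlloca(builder, opq_ptr_type, "mem_base_addr");

    /* read the real base address out of the instance and cache it */
    LLVMValueRef base = LLVMBuildLoad2(builder, opq_ptr_type,
                                       mem_base_addr_offset, "mem_base_addr1");
    LLVMBuildStore(builder, base, slot);

    /* each memory access then reads the cached value back from the slot;
       mem2reg later turns these loads/stores into plain SSA values */
    return LLVMBuildLoad2(builder, opq_ptr_type, slot, "mem_base_addr9");
}
```

This maps one-to-one onto the `%mem_base_addr` / `%mem_base_addr1` / `%mem_base_addr9` IR shown later in the thread.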
I believe it's become more confusing.

There is a critical need to pass the base address of linear memory to various functions, so it's necessary to have some sort of global variable to hold this address.

In the original code, the process is as follows: the base address is always retrieved from the global variable, stored in a temporary variable `mem_base_addr`, and then this temporary variable is used to compute the final address.

```llvm
%mem_base_addr_offset = getelementptr inbounds i8, ptr %aot_inst, i32 376
%mem_base_addr = load ptr, ptr %mem_base_addr_offset, align 8
;; when using
%maddr = getelementptr inbounds i8, ptr %mem_base_addr, i64 %offset1
```

In the PR, a new local variable, `mem_base_addr`, is introduced to hold the base address after obtaining it from the temporary variable `mem_base_addr1`. Although loading from a local variable isn't a significant issue, it does raise a small question as to why this is necessary; after all, the base address is already present in the temporary variable (if using the original design).

```llvm
%mem_base_addr = alloca ptr, align 8
%mem_base_addr_offset = getelementptr inbounds i8, ptr %aot_inst, i32 376
%mem_base_addr1 = load ptr, ptr %mem_base_addr_offset, align 8
store ptr %mem_base_addr1, ptr %mem_base_addr, align 8
%mem_base_addr9 = load ptr, ptr %mem_base_addr, align 8
%maddr = getelementptr inbounds i8, ptr %mem_base_addr9, i64 %offset1
```

And after changes to linear memory, such as `memory.grow`, it is still necessary to reload the value from the global variable, then from the temporary variable, and finally save it to the local variable. This doesn't appear to be more optimized, in my view.
Your thought is correct, but you have overlooked the optimization of `LLVMBuildAlloca` variables by `mem2reg`.
> In this PR, mem_info[0].mem_base_addr comes from LLVMBuildAlloca, which allocates memory on the stack (often in registers due to mem2reg optimization or in cache).
For example:

```llvm
define i32 @foo(i32 %x) {
entry:
%y = alloca i32
store i32 %x, i32* %y
%val = load i32, i32* %y
ret i32 %val
}
```

The load/store of `%y` can be optimized away:

```llvm
define i32 @foo(i32 %x) {
entry:
ret i32 %x
}
```

Similarly,

```llvm
%mem_base_addr = alloca ptr, align 8
%mem_base_addr_offset = getelementptr inbounds i8, ptr %aot_inst, i32 376
%mem_base_addr1 = load ptr, ptr %mem_base_addr_offset, align 8
store ptr %mem_base_addr1, ptr %mem_base_addr, align 8
%mem_base_addr9 = load ptr, ptr %mem_base_addr, align 8
%maddr = getelementptr inbounds i8, ptr %mem_base_addr9, i64 %offset1
```

can be optimized to:

```llvm
%mem_base_addr_offset = getelementptr inbounds i8, ptr %aot_inst, i32 376
%mem_base_addr1 = load ptr, ptr %mem_base_addr_offset, align 8
%maddr = getelementptr inbounds i8, ptr %mem_base_addr1, i64 %offset1
```

In this PR, if you dump the IR of `substr.wasm` above, you will see that the load/store operations of `%mem_base_addr` are also optimized away. As a result, the final outcome remains the same as in the original code.
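If you want to check this promotion in isolation, here is a minimal sketch (my own example, not WAMR code; it assumes LLVM 13+ for the new pass manager C API) that runs `mem2reg` over a module and dumps the result:

```c
/* Standalone sketch: run the mem2reg pass over a module via the LLVM C API's
 * new pass manager, then dump the promoted IR. */
#include <stdio.h>
#include <llvm-c/Core.h>
#include <llvm-c/Error.h>
#include <llvm-c/Transforms/PassBuilder.h>

static void run_mem2reg(LLVMModuleRef module)
{
    LLVMPassBuilderOptionsRef options = LLVMCreatePassBuilderOptions();

    /* no target machine is needed for this IR-level transform */
    LLVMErrorRef err = LLVMRunPasses(module, "mem2reg", NULL, options);
    if (err) {
        char *msg = LLVMGetErrorMessage(err);
        fprintf(stderr, "mem2reg failed: %s\n", msg);
        LLVMDisposeErrorMessage(msg);
    }
    LLVMDisposePassBuilderOptions(options);

    /* the alloca/store/load pattern should now be promoted to SSA values */
    LLVMDumpModule(module);
}
```

After the pass runs, the dumped IR should match the "can be optimized to" form shown above.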
I've devised a straightforward test case to assess the enhancement. Here is the code:

```wasm
(module
(memory 5 10)
(func $store32 (export "store32") (param i32 i32)
(i32.store (local.get 0) (local.get 1))
)
(func $load32 (export "load32") (param i32) (result i32)
(i32.load (local.get 0))
)
(func (export "load_store") (param i32 i32) (result i32)
(local i32)
(i32.load (local.get 0))
(local.tee 2)
(i32.store (local.get 1))
(i32.load (local.get 1))
(local.get 2)
(i32.eq)
)
(func (export "load_grow_store") (param i32 i32) (result i32)
(local i32)
(i32.load (local.get 0))
(local.tee 2)
(i32.store (local.get 1))
(memory.grow (i32.const 1))
(drop)
(i32.load (local.get 1))
(local.get 2)
(i32.eq)
)
(func (export "load_store_w_func") (param i32 i32) (result i32)
(local i32)
(local.get 0)
(call $load32)
(local.tee 2)
(local.get 1)
(call $store32)
(i32.load (local.get 1))
(local.get 2)
(i32.eq)
)
(func (export "load_grow_store_w_func") (param i32 i32) (result i32)
(local i32)
(local.get 0)
(call $load32)
(local.tee 2)
(local.get 1)
(call $store32)
(memory.grow (i32.const 1))
(drop)
(i32.load (local.get 1))
(local.get 2)
(i32.eq)
)
)
```

And I used the following command-line options, `--bounds-checks=1 --format=llvmir-opt`, to create the optimized LLVM IR. The `--bounds-checks=1` option is employed to apply the `noinline` attribute.
Several intriguing findings emerged from the comparison of the before and after scenarios:
- Look at f0, f1, and f2. These are elementary cases involving load and store. As previously mentioned, the mem2reg optimization refines alloca variables, allowing the revised version to produce no additional IR compared to the original version.
- Now, consider f4 and f5. I believed they presented issues that this PR aims to address. Clearly, as seen in f5, there is no necessity to reload the memory base address after calling f1, as there is no memory growth in f4. This PR should eliminate that redundant loading. However, the modified version maintains the status quo.
🆙 If I'm mistaken, please correct me.
This leads to my confusion: if there is no difference for basic cases and no enhancement for redundant loading, what is the rationale for this change?
This PR is to eliminate redundant memory info loads between multiple load/store instructions when the memory remains unchanged.

I use the following WAST example to compare the (optimized) IR generated by this PR with the original code, identifying specific scenarios where this optimization applies. I don't need the `noinline` attribute, so I only used the `--format=llvmir-opt` option.

```wasm
(module
(memory 5 10)
(func (export "load_load") (param i32 i32) (result i32)
(i32.load (local.get 0))
(i32.load (local.get 1))
(i32.eq)
(memory.grow (i32.const 1))
(drop)
)
(func (export "load_store") (param i32 i32)
(i32.load (local.get 0))
(i32.store (local.get 1))
(memory.grow (i32.const 1))
(drop)
)
(func (export "store_store") (param i32 i32)
(i32.store (local.get 0) (i32.const 42))
(i32.store (local.get 1) (i32.const 42))
(memory.grow (i32.const 1))
(drop)
)
(func (export "store_load") (param i32 i32) (result i32)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
(memory.grow (i32.const 1))
(drop)
)
)
```

In f3 and f4, the IR generated by this PR is different. This PR primarily optimizes the store-store and store-load scenarios. As for the load-load and load-store scenarios, I believe they might already have been optimized.

In your example, f0 and f1 show that this PR produces no additional IR compared to the original version. The memory in f2 is unchanged. f3 is the load-store case. f4 and f5 contain two calls and one load instruction. Therefore, their IRs remain the same.
Let me summarize it for us. IMU, there are several scenarios we need to examine closely: single-load, single-store, load-load, load-store, store-store, store-load, load-store-grow-load-store, load-store-call-load-store, and load-store-call-grow-load-store. The rationale for the last three scenarios is that in the original implementation, the key condition for controlling the reloading of the memory base address is `mem_space_unchanged`.

After merging both test scripts, I believe we can address all the aforementioned cases. For your information, since the cases are quite straightforward, particularly the functions being called, the compilation process will likely inline these functions, causing the last two cases to lose the `call`. This is the reason I recommend using `--bounds-checks=1` to turn off the inline optimization.
N means no redundant load/store. The PR has improved two cases and left two cases unchanged.

BEFORE:

AFTER:
FYI: Test cases:

```wasm
(module
(memory 5 10)
(func $store32 (export "store32") (param i32 i32)
(i32.store (local.get 0) (local.get 1))
)
(func $load32 (export "load32") (param i32) (result i32)
(i32.load (local.get 0))
)
;; 2
(func (export "load_load") (param i32 i32) (result i32)
(i32.load (local.get 0))
(i32.load (local.get 1))
(i32.eq)
(memory.grow (i32.const 1))
(drop)
)
(func (export "load_store") (param i32 i32)
(i32.load (local.get 0))
(i32.store (local.get 1))
(memory.grow (i32.const 1))
(drop)
)
(func (export "store_store") (param i32 i32)
(i32.store (local.get 0) (i32.const 42))
(i32.store (local.get 1) (i32.const 42))
(memory.grow (i32.const 1))
(drop)
)
(func (export "store_load") (param i32 i32) (result i32)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
(memory.grow (i32.const 1))
(drop)
)
;; 6
(func (export "load_store_grow_load_store") (param i32 i32) (result i32)
(local i32)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
(local.set 2)
(memory.grow (i32.const 1))
(drop)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
)
(func (export "load_store_call_load_store") (param i32 i32) (result i32)
(local i32)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
(local.set 2)
(local.get 0)
(call $load32)
(local.tee 2)
(local.get 1)
(call $store32)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
)
(func (export "load_store_call_grow_load_store") (param i32 i32) (result i32)
(local i32)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
(local.set 2)
(local.get 0)
(call $load32)
(local.tee 2)
(local.get 1)
(call $store32)
(memory.grow (i32.const 1))
(drop)
(i32.store (local.get 0) (i32.const 42))
(i32.load (local.get 1))
)
)
```
IIUC, the key condition is `mem_space_unchanged`:
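For reference, this is the snippet quoted earlier from `create_memory_info`:

```c
// in create_memory_info
bool mem_space_unchanged = true; // (!func->has_op_memory_grow && !func->has_op_func_call) || (!module->possible_memory_grow);
```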
I found that my program runs slower in WAMR AOT mode compared to other WASM runtimes, e.g. WAVM. I compared their LLVM IRs and found that WAMR emits more load operations of the memory base.
In WAMR, functions with `mem_space_unchanged` keep the memory base address in the `mem_base_addr` field of `AOTMemInfo`, while other functions keep the address of the memory base address in that field. When emitting instructions like load/store, the former use the base address directly, while the latter must first load the base address from its address. This reload is redundant when there is no possibility of the memory changing between two consecutive load/store instructions:

Optimization passes won't recognize this redundancy because the reloaded memory base is accessed within the context.
In WAVM, the base address is reloaded when the memory possibly changes, e.g. after calling another function or after `memory.grow`. This can be redundant if there are no subsequent load/store instructions, but the dead code elimination pass handles this:

Performance
Here is a sample C++ program, `substr.cc`:

Compiled with emcc (version: 3.1.59 (0e4c5994eb5b8defd38367a416d0703fd506ad81))

Then ran wamrc and iwasm (Linux) and compared the performance:
product-mini/platforms/posix/main.c:
result:
commit e3dcf4f
commit 3f268e5
IR (optimized) comparison: