vectorize and parallelize do not provide speedups on Raspberry Pi ARM Cortex A72 64 bit target #8561
It looks like your generator is memcpy-like in terms of its performance profile. The single multiply-add instruction isn't significant compared to the loads and stores, so it may just be bandwidth-limited, already running at the maximum possible speed for moving that much memory around. How does the performance compare to a memcpy of the same total amount of data?

You could also try adding a lot more math on the same data (e.g. a bunch of sqrts) and seeing if that makes vectorization and parallelization give a speed-up. If so, that's a sign that the original is memory-bandwidth limited. If you're worried that Halide isn't actually vectorizing or parallelizing the code, run it under a profiler like perf and check whether the hot loop uses vector instructions and whether all CPU cores are used.
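A minimal sketch of that "extra math" experiment (the names input, f, and the loop bounds here are hypothetical, not taken from the pipeline in this issue):

Func f("f");
Var x("x"), y("y"), z("z");
// Same loads and stores as the original, but with 16 extra sqrt ops
// per element. If vectorize/parallel start helping now, the original
// pipeline was memory-bandwidth limited.
Expr v = input(x, y, z);
for (int i = 0; i < 16; i++) {
    v = sqrt(v);
}
f(x, y, z) = v;
f.vectorize(x, 8).parallel(z);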
Hi Andrew,

Below are the results of additional experiments I did today. Do you think this is a Halide issue, or is there something we can do from our side?

I followed your suggestions to try memcpy, extra math (trigonometry and square root), and also tried the two-stage blur. Adding vectorization gave a 1x speedup in all of them, except the memcpy option, which gave a 1.3x speedup (with high variance). Still, 1.3x is far below the roughly 9x I get on AMD with SIMD. Also, the speedup I get by adding vectorization to the original algorithm on Raspberry Pi 32 bit is 1.93x; I would expect something in that range or better for 64 bit. For parallelization I got a 3.81x speedup for the extra math option, but close to 1x for the memcpy and two-stage blur, which is strange.

I can't use the perf tool in these experiments because the generated Halide code is compiled and executed within Simulink; I don't have a standalone executable. I am disabling LLVM loop optimizations in all experiments below.

Original algorithm (input size = 300x200x100):
Memcpy option (input size = 300x200x100):
Extra math option (input size = 300x200x100):
Two-stage blur (input size = 640x480 with 1x3 and 3x1 kernels; here I used your autoscheduler and removed vectorization and parallelization):
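For reference, the two-stage blur presumably follows the classic Halide example; a minimal hand-scheduled sketch, assuming a float input Func named input:

Func blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y"), yi("yi");
// 3x1 horizontal pass followed by 1x3 vertical pass.
blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;
// Classic schedule: parallelize over strips of rows, vectorize across x,
// and compute blur_x in slices that stay in cache.
blur_y.split(y, y, yi, 32).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);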
Thank you,
In the memcpy option, I meant calling memcpy instead of any Halide-generated code, on the same total amount of data, to get a sense of the peak memory bandwidth on the device. I suspect you've hit it. It looks like you're loading 72 MB of data from DRAM and sending 24 MB back, which is roughly equivalent to a 64 MB memcpy. This link indicates you can hit about 5500 MB/s on a Raspberry Pi 4 when doing something memcpy-like: https://forums.raspberrypi.com/viewtopic.php?t=271121 So 64 MB would take around 12 ms. You report 1*10^7 but I'm not sure of the units. If it's nanoseconds, that would make sense, and this is as fast as it's going to get. If it's microseconds, then something is badly wrong.

You would not expect a memcpy, or anything that looks sort of like a memcpy, to get faster with parallelism or vectorization on many ARM machines. It's limited by the memory bus, which is shared between the cores, not by the amount of math done, so parallelizing and vectorizing does nothing. To optimize something that's bandwidth-limited, it's the total amount of memory traffic that has to shrink, not the amount of math.

In the extra math option, atan, sin, and cos are not vectorizable by Halide, so you wouldn't expect a speed-up from vectorization. Try sticking to *, +, -, sqrt.
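A self-contained way to measure that bandwidth ceiling on the device is a memcpy micro-benchmark like the minimal sketch below (the 64 MB size and rep count are arbitrary choices, not from this thread):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // 64 MB: roughly the memory traffic the pipeline above moves per run.
    const size_t n = 64ull << 20;
    std::vector<char> src(n, 1), dst(n, 0);
    std::memcpy(dst.data(), src.data(), n);  // warm-up / page-fault pass
    const int reps = 20;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; i++) {
        std::memcpy(dst.data(), src.data(), n);
    }
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count() / reps;
    std::printf("%.0f MB/s\n", (n / (1024.0 * 1024.0)) / sec);
    return 0;
}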
The units I am using are Simulink software-in-the-loop ticks: https://www.mathworks.com/help/ecoder/ref/coder.profile.executiontimesection.executiontimeinticks.html

Does Halide perform well on Raspberry Pi 4 with 64 bit, leveraging its SIMD and parallelization features? If yes, please let me know if you have any recommended Halide generator flags (different from what I showed above) or any compilation flag/environment variable that could help.

The experiments I shared above are small on purpose, because I was trying to figure out why Halide on Raspberry Pi ARM 64 bit was not performing as well as on our AMD and Intel host machines on a set of convolutional neural networks. We are using Halide through Simulink Halide code generation and comparing its performance with the Simulink predict block. Halide outperformed the alternative in some cases on Intel/AMD, but never on ARM.

To figure out what was happening on ARM, I built a small convolution-relu-maxpooling chain of size 256x256x64. Through ablation experiments I found that on AMD the vectorize and parallel primitives gave speedups of 7.51x and 9.06x respectively. For this model on AMD I got a Halide speedup over the alternative (the Simulink predict block) of 1.67x; on Raspberry Pi 4 ARM 64 bit the speedup is 1.18x, and on Raspberry Pi 3 ARM 32 bit it is 0.67x. I could not run the same ablation experiment with this model on ARM due to a separate deployment issue. I therefore ran the ablation experiments mentioned in my first message, where I found it strange that the SIMD speedup was at best 1.3x on ARM 64, when on AMD it was around 8x. Even on ARM 32 bit I got a SIMD speedup of 1.93x; I was expecting the ARM 64 speedup to be better than ARM 32.

As mentioned in my previous message, the Raspberry Pi 4 64 bit instruction set was labeled as ASIMD and the Raspberry Pi 3 32 bit as NEON. Not sure if this makes a difference.
Yes, Halide should perform well on Cortex A72 cores using the arm-64 instruction set. We have lots of production usage of Halide on arm-64 targets, and I believe our two Linux ARM buildbots are Orange Pis with Cortex A72s. It's just that vectorizing and parallelizing memory-bandwidth-limited code doesn't do anything to help performance.

If you're trying to schedule a convolution-relu-maxpool, that's more complicated than just vectorizing and parallelizing the code. The goal there is to tile so that each loaded value is reused as many times as possible, without running out of registers. apps/conv_layer has a good arm-64 schedule for conv+relu which may be useful as a reference.
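To illustrate the shape of such a schedule (this is only a sketch in the spirit of apps/conv_layer, not the actual schedule from that app; CI, K, bias, weights, input, and the tile sizes are hypothetical, and natural_vector_size assumes this lives inside a Generator):

Var c("c"), x("x"), y("y"), n("n");
Var co("co"), ci("ci"), xi("xi"), yi("yi");

// Reduction over CI input channels and a KxK kernel window.
RDom r(0, CI, 0, K, 0, K);
Func conv("conv"), relu("relu");
conv(c, x, y, n) = bias(c);
conv(c, x, y, n) += weights(r.y, r.z, r.x, c) * input(r.x, x + r.y, y + r.z, n);
relu(c, x, y, n) = max(0.0f, conv(c, x, y, n));

// Tile so a block of output accumulators stays in registers:
// vectorize across output channels, unroll a small spatial tile,
// parallelize the outer loops.
relu.split(c, co, ci, natural_vector_size<float>())
    .tile(x, y, xi, yi, 4, 4)
    .reorder(ci, xi, yi, co, x, y, n)
    .vectorize(ci)
    .unroll(xi)
    .unroll(yi)
    .parallel(y);

// Compute the reduction per register tile, so each loaded input value
// and weight is reused across the whole tile before being evicted.
conv.compute_at(relu, co);
conv.update()
    .reorder(c, x, y)
    .vectorize(c, natural_vector_size<float>())
    .unroll(x)
    .unroll(y);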
The Halide code at the bottom of this message does not give any speedup when adding the vectorize and parallel constructs on the Raspberry Pi ARM Cortex A72 64 bit target, although it does on a Raspberry Pi ARM Cortex A53 32 bit target.
The command I am using to compile is this (where rtb_ImpAsg_InsertedFor_O_halide is the compiled generator class):
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-64-linux-enable_llvm_loop_opt-no_runtime-strict_float
When adding the vectorize scheduling primitive (replacing the schedule method with the one below), I get a speedup of 1.00x. That is, there is no difference.
Taking the vectorize out, and adding the parallel scheduling primitive (using the schedule method below), I get a slowdown: 0.83x.
If I remove the LLVM loop optimizations (by taking out the enable_llvm_loop_opt flag), the vectorize and parallel speedups become 1.03x and 0.83x, which is basically the same. The resulting command is:
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-64-linux-no_runtime-strict_float
I did the same experiments on a Raspberry Pi ARM Cortex A53 32 bit target, changing arm-64 to arm-32, which gives the command below. There I got a vectorize speedup of 1.27x and a parallel speedup of 1.48x.
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-32-linux-enable_llvm_loop_opt-no_runtime-strict_float
If the LLVM loop optimizations are disabled (removing enable_llvm_loop_opt, which yields the command below), the vectorize speedup is 1.93x and the parallel speedup is 2.61x.
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-32-linux-no_runtime-strict_float
With or without LLVM loop optimizations, it is obvious that vectorize and parallel make a difference on the Raspberry Pi ARM Cortex A53 32 bit target, unlike on the 64 bit target.
One thing to note is that the Raspberry Pi ARM Cortex A53 32 bit target had NEON in its instruction set list, while the Raspberry Pi ARM Cortex A72 64 bit target lists ASIMD rather than NEON. This should not make a difference, and it also does not explain why parallelization does not kick in. Both devices (the 32 and 64 bit Raspberry Pis) have 4 cores.
Is this an issue with Halide, or is there a missing flag above?
Halide code:
#include "Halide.h"
#include <stdio.h>

using namespace Halide;

class rtb_ImpAsg_InsertedFor_O_halide_generator : public Halide::Generator<rtb_ImpAsg_InsertedFor_O_halide_generator> {
    // (class body stripped in formatting; see the sketch below)
};

HALIDE_REGISTER_GENERATOR(rtb_ImpAsg_InsertedFor_O_halide_generator, rtb_ImpAsg_InsertedFor_O_halide_gen)
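Based on the discussion above (a single multiply-add over 300x200x100 float data, with optional vectorize/parallel schedule methods, moving roughly 72 MB in and 24 MB out), the stripped class body presumably looked something like this hypothetical reconstruction; the I/O names and the tile of schedule directives are assumptions, not the actual code from this issue:

class rtb_ImpAsg_InsertedFor_O_halide_generator : public Halide::Generator<rtb_ImpAsg_InsertedFor_O_halide_generator> {
public:
    // Hypothetical I/O: three 3-D float inputs and one output, which would
    // match the ~72 MB loaded / ~24 MB stored figures discussed above.
    Input<Buffer<float>> a{"a", 3};
    Input<Buffer<float>> b{"b", 3};
    Input<Buffer<float>> c{"c", 3};
    Output<Buffer<float>> out{"out", 3};

    Var x{"x"}, y{"y"}, z{"z"};

    void generate() {
        // One multiply-add per element: three loads, one store.
        out(x, y, z) = a(x, y, z) * b(x, y, z) + c(x, y, z);
    }

    void schedule() {
        // The two ablations discussed above (tested separately):
        out.vectorize(x, natural_vector_size<float>());
        out.parallel(z);
    }
};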