vectorize and parallelize do not provide speedups on Raspberry Pi ARM Cortex A72 64 bit target #8561
It looks like your generator is memcpy-like in terms of its performance profile. The single multiply-add instruction isn't significant compared to the loads and stores, so it may just be bandwidth-limited, already running at the maximum possible speed for moving that much memory around. How does the performance compare to a memcpy of the same total amount of data?

You could also try adding a lot more math on the same data (e.g. a bunch of sqrts) and seeing if that makes vectorization and parallelization give a speed-up. If so, that's a sign that the original is memory-bandwidth limited. If you're worried that Halide isn't actually vectorizing or parallelizing the code, run it under a profiler like perf and check whether the hot loop uses vector instructions and whether all CPU cores are used.
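A minimal sketch of that "extra math" experiment (the names input, f, and the loop bounds here are hypothetical, not taken from the pipeline in this issue):

Func f("f");
Var x("x"), y("y"), z("z");
// Same loads and stores as the original, but with 16 extra sqrt ops
// per element. If vectorize/parallel start helping now, the original
// pipeline was memory-bandwidth limited.
Expr v = input(x, y, z);
for (int i = 0; i < 16; i++) {
    v = sqrt(v);
}
f(x, y, z) = v;
f.vectorize(x, 8).parallel(z);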
Hi Andrew,

Below are the results of additional experiments I did today. Do you think this is a Halide issue, or is there something we can do from our side?

I followed your suggestions to try memcpy, extra math (trigonometry and square root), and also tried the two-stage blur. Adding vectorization gave a 1x speedup in all of them, except the memcpy option, which gave a 1.3x speedup (with high variance). Still, 1.3x is far below the roughly 9x I get on AMD with SIMD. Also, the speedup I get by adding vectorization to the original algorithm on Raspberry Pi 32 bit is 1.93x; I would expect something in that range or better for 64 bit. For parallelization I got a 3.81x speedup for the extra math option, but close to 1x for the memcpy and two-stage blur, which is strange.

I can't use the perf tool in these experiments because the generated Halide code is compiled and executed within Simulink; I don't have a standalone executable. I am disabling LLVM loop optimizations in all experiments below.

Original algorithm (input size = 300x200x100):
Memcpy option (input size = 300x200x100):
Extra math option (input size = 300x200x100):
Two-stage blur (input size = 640x480 with 1x3 and 3x1 kernels; here I used your autoscheduler and removed vectorization and parallelization):
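For reference, the two-stage blur presumably follows the classic Halide example; a minimal hand-scheduled sketch, assuming a float input Func named input:

Func blur_x("blur_x"), blur_y("blur_y");
Var x("x"), y("y"), yi("yi");
// 3x1 horizontal pass followed by 1x3 vertical pass.
blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;
// Classic schedule: parallelize over strips of rows, vectorize across x,
// and compute blur_x in slices that stay in cache.
blur_y.split(y, y, yi, 32).parallel(y).vectorize(x, 8);
blur_x.store_at(blur_y, y).compute_at(blur_y, yi).vectorize(x, 8);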
Thank you,
In the memcpy option, I meant calling memcpy instead of any Halide-generated code, on the same total amount of data, to get a sense of the peak memory bandwidth on the device. I suspect you've hit it. It looks like you're loading 72 MB of data from DRAM and sending 24 MB back, which is roughly equivalent to a 64 MB memcpy. This link indicates you can hit about 5500 MB/s on a Raspberry Pi 4 when doing something memcpy-like: https://forums.raspberrypi.com/viewtopic.php?t=271121 So 64 MB would take around 12 ms. You report 1*10^7 but I'm not sure of the units. If it's nanoseconds, that would make sense, and this is as fast as it's going to get. If it's microseconds, then something is badly wrong.

You would not expect a memcpy, or anything that looks sort of like a memcpy, to get faster with parallelism or vectorization on many ARM machines. It's limited by the memory bus, which is shared between the cores, not by the amount of math done, so parallelizing and vectorizing does nothing. To optimize something that's bandwidth-limited, it's the total amount of memory traffic that has to shrink, not the amount of math.

In the extra math option, atan, sin, and cos are not vectorizable by Halide, so you wouldn't expect a speed-up from vectorization. Try sticking to *, +, -, sqrt.
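A self-contained way to measure that bandwidth ceiling on the device is a memcpy micro-benchmark like the minimal sketch below (the 64 MB size and rep count are arbitrary choices, not from this thread):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // 64 MB: roughly the memory traffic the pipeline above moves per run.
    const size_t n = 64ull << 20;
    std::vector<char> src(n, 1), dst(n, 0);
    std::memcpy(dst.data(), src.data(), n);  // warm-up / page-fault pass
    const int reps = 20;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; i++) {
        std::memcpy(dst.data(), src.data(), n);
    }
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count() / reps;
    std::printf("%.0f MB/s\n", (n / (1024.0 * 1024.0)) / sec);
    return 0;
}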
The units I am using are Simulink software-in-the-loop ticks: https://www.mathworks.com/help/ecoder/ref/coder.profile.executiontimesection.executiontimeinticks.html

Does Halide perform well on Raspberry Pi 4 with 64 bit, leveraging its SIMD and parallelization features? If yes, please let me know if you have any recommended Halide generator flags (different from what I showed above) or any compilation flag/environment variable that could help.

The experiments I shared above are small on purpose, because I was trying to figure out why Halide on Raspberry Pi ARM 64 bit was not performing as well as on our AMD and Intel host machines on a set of convolutional neural networks. We are using Halide through Simulink Halide code generation and comparing its performance with the Simulink predict block. Halide outperformed the alternative in some cases on Intel/AMD, but never on ARM.

To figure out what was happening on ARM, I built a small convolution-relu-maxpooling chain of size 256x256x64. Through ablation experiments I found that on AMD the vectorize and parallel primitives gave speedups of 7.51x and 9.06x respectively. For this model on AMD I got a Halide speedup over the alternative (the Simulink predict block) of 1.67x; on Raspberry Pi 4 ARM 64 bit the speedup is 1.18x, and on Raspberry Pi 3 ARM 32 bit it is 0.67x. I could not run the same ablation experiment with this model on ARM due to a separate deployment issue. I therefore ran the ablation experiments mentioned in my first message, where I found it strange that the SIMD speedup was at best 1.3x on ARM 64, when on AMD it was around 8x. Even on ARM 32 bit I got a SIMD speedup of 1.93x; I was expecting the ARM 64 speedup to be better than ARM 32.

As mentioned in my previous message, the Raspberry Pi 4 64 bit instruction set was labeled as ASIMD and the Raspberry Pi 3 32 bit as NEON. Not sure if this makes a difference.
Yes, Halide should perform well on Cortex A72 cores using the arm-64 instruction set. We have lots of production usage of Halide on arm-64 targets, and I believe our two Linux ARM buildbots are Orange Pis with Cortex A72s. It's just that vectorizing and parallelizing memory-bandwidth-limited code doesn't do anything to help performance.

If you're trying to schedule a convolution-relu-maxpool, that's more complicated than just vectorizing and parallelizing the code. The goal there is to tile so that each loaded value is reused as many times as possible, without running out of registers. apps/conv_layer has a good arm-64 schedule for conv+relu which may be useful as a reference.
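To illustrate the shape of such a schedule (this is only a sketch in the spirit of apps/conv_layer, not the actual schedule from that app; CI, K, bias, weights, input, and the tile sizes are hypothetical, and natural_vector_size assumes this lives inside a Generator):

Var c("c"), x("x"), y("y"), n("n");
Var co("co"), ci("ci"), xi("xi"), yi("yi");

// Reduction over CI input channels and a KxK kernel window.
RDom r(0, CI, 0, K, 0, K);
Func conv("conv"), relu("relu");
conv(c, x, y, n) = bias(c);
conv(c, x, y, n) += weights(r.y, r.z, r.x, c) * input(r.x, x + r.y, y + r.z, n);
relu(c, x, y, n) = max(0.0f, conv(c, x, y, n));

// Tile so a block of output accumulators stays in registers:
// vectorize across output channels, unroll a small spatial tile,
// parallelize the outer loops.
relu.split(c, co, ci, natural_vector_size<float>())
    .tile(x, y, xi, yi, 4, 4)
    .reorder(ci, xi, yi, co, x, y, n)
    .vectorize(ci)
    .unroll(xi)
    .unroll(yi)
    .parallel(y);

// Compute the reduction per register tile, so each loaded input value
// and weight is reused across the whole tile before being evicted.
conv.compute_at(relu, co);
conv.update()
    .reorder(c, x, y)
    .vectorize(c, natural_vector_size<float>())
    .unroll(x)
    .unroll(y);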
The Halide code at the bottom of this message does not give any speedup when adding the vectorize and parallel constructs on the Raspberry Pi ARM Cortex A72 64 bit target, although it does on a Raspberry Pi ARM Cortex A53 32 bit target.
The command I am using to compile is this (where rtb_ImpAsg_InsertedFor_O_halide is the compiled generator class):
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-64-linux-enable_llvm_loop_opt-no_runtime-strict_float
When adding the vectorize scheduling primitive (replacing the schedule method with the one below), I get a speedup of 1.00x. That is, there is no difference.
Taking the vectorize out, and adding the parallel scheduling primitive (using the schedule method below), I get a slowdown: 0.83x.
If I remove the LLVM loop optimizations (by taking out the enable_llvm_loop_opt flag), the vectorize and parallel speedups become 1.03x and 0.83x, which is basically the same. The resulting command is:
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-64-linux-no_runtime-strict_float
I did the same experiments on a Raspberry Pi ARM Cortex A53 32 bit target, changing arm-64 to arm-32, which gives the command below. There I got a vectorize speedup of 1.27x and a parallel speedup of 1.48x.
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-32-linux-enable_llvm_loop_opt-no_runtime-strict_float
If the LLVM loop optimizations are disabled (removing enable_llvm_loop_opt, which yields the command below), the vectorize speedup is 1.93x and the parallel speedup is 2.61x.
rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-32-linux-no_runtime-strict_float
With or without LLVM loop optimizations, it is obvious that vectorize and parallel make a difference on the Raspberry Pi ARM Cortex A53 32 bit target, unlike on the 64 bit target.
One thing to note is that the Raspberry Pi ARM Cortex A53 32 bit target had NEON in its instruction set list, while the Raspberry Pi ARM Cortex A72 64 bit target lists ASIMD rather than NEON. This should not make a difference, and it also does not explain why parallelization does not kick in. Both devices (the 32 and 64 bit Raspberry Pis) have 4 cores.
Is this an issue with Halide, or is there a missing flag above?
Halide code:
#include "Halide.h"
#include <stdio.h>

using namespace Halide;

class rtb_ImpAsg_InsertedFor_O_halide_generator : public Halide::Generator<rtb_ImpAsg_InsertedFor_O_halide_generator> {
    // (class body stripped in formatting; see the sketch below)
};

HALIDE_REGISTER_GENERATOR(rtb_ImpAsg_InsertedFor_O_halide_generator, rtb_ImpAsg_InsertedFor_O_halide_gen)
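Based on the discussion above (a single multiply-add over 300x200x100 float data, with optional vectorize/parallel schedule methods, moving roughly 72 MB in and 24 MB out), the stripped class body presumably looked something like this hypothetical reconstruction; the I/O names and the tile of schedule directives are assumptions, not the actual code from this issue:

class rtb_ImpAsg_InsertedFor_O_halide_generator : public Halide::Generator<rtb_ImpAsg_InsertedFor_O_halide_generator> {
public:
    // Hypothetical I/O: three 3-D float inputs and one output, which would
    // match the ~72 MB loaded / ~24 MB stored figures discussed above.
    Input<Buffer<float>> a{"a", 3};
    Input<Buffer<float>> b{"b", 3};
    Input<Buffer<float>> c{"c", 3};
    Output<Buffer<float>> out{"out", 3};

    Var x{"x"}, y{"y"}, z{"z"};

    void generate() {
        // One multiply-add per element: three loads, one store.
        out(x, y, z) = a(x, y, z) * b(x, y, z) + c(x, y, z);
    }

    void schedule() {
        // The two ablations discussed above (tested separately):
        out.vectorize(x, natural_vector_size<float>());
        out.parallel(z);
    }
};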