
vectorize and parallel do not provide speedups on Raspberry Pi ARM Cortex-A72 64-bit target #8561


Description

@ivangarcia44

The Halide code at the bottom of this message does not give any speedup when adding the vectorize and parallel constructs on the Raspberry Pi ARM Cortex-A72 64-bit target, although it does on a Raspberry Pi ARM Cortex-A53 32-bit target.

The command I am using to compile is this (where rtb_ImpAsg_InsertedFor_O_halide is the compiled generator binary):

rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-64-linux-enable_llvm_loop_opt-no_runtime-strict_float

When adding the vectorize scheduling primitive (replacing the schedule method with the one below), I get a speedup of 1.00x. That is, there is no difference.

    void schedule() {
        rtb_ImpAsg_InsertedFor_Out1_at_fcn
            .split(d3, d3, d3i, 13, TailStrategy::ShiftInwards)
            .split(d1, d1, d1i, 4, TailStrategy::ShiftInwards)
            .vectorize(d1i)
            .compute_root()
            .reorder({d1i, d1, d2, d3i, d3});
    }
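
For completeness, one way to double-check that d1i really comes out as a vectorized loop is to dump the loop nest from the generator. The sketch below simply repeats the schedule above and appends a print_loop_nest() call (print_loop_nest is a standard Func method; this is my illustration, not part of the measurements):

    // Sketch: same vectorized schedule, plus a loop-nest dump so the
    // "vectorized" annotation on d1i can be verified at generation time.
    void schedule() {
        rtb_ImpAsg_InsertedFor_Out1_at_fcn
            .split(d3, d3, d3i, 13, TailStrategy::ShiftInwards)
            .split(d1, d1, d1i, 4, TailStrategy::ShiftInwards)
            .vectorize(d1i)
            .compute_root()
            .reorder({d1i, d1, d2, d3i, d3});
        rtb_ImpAsg_InsertedFor_Out1_at_fcn.print_loop_nest();
    }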

Taking vectorize out and adding the parallel scheduling primitive instead (using the schedule method below), I get 0.83x, i.e., a slowdown.

    void schedule() {
        rtb_ImpAsg_InsertedFor_Out1_at_fcn
            .split(d3, d3, d3i, 13, TailStrategy::ShiftInwards)
            .split(d1, d1, d1i, 4, TailStrategy::ShiftInwards)
            .compute_root()
            .reorder({d1i, d1, d2, d3i, d3})
            .parallel(d3);
    }
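
Since the target uses no_runtime, the Halide runtime that provides the thread pool behind parallel has to be generated and linked separately. A minimal host-side sketch for pinning the worker count explicitly (illustrative only; halide_set_num_threads is the standard runtime call):

    // Sketch: host-side check/pin of the Halide thread pool size.
    // A Halide runtime built for the same target must be linked in,
    // because the pipeline itself was compiled with no_runtime.
    #include "HalideRuntime.h"
    #include <cstdio>

    int main() {
        int previous = halide_set_num_threads(4);  // both Pi boards have 4 cores
        printf("thread pool size changed from %d to 4\n", previous);
        // ... call rtb_ImpAsg_InsertedFor_O_halide_pipeline(...) here ...
        return 0;
    }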

If I remove the LLVM loop optimizations (by taking out the enable_llvm_loop_opt flag), the vectorize and parallel speedups become 1.03x and 0.83x respectively, which is essentially the same. The resulting command is:

rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-64-linux-no_runtime-strict_float

I ran the same experiments on a Raspberry Pi ARM Cortex-A53 32-bit target by changing arm-64 to arm-32, which gives the command below. There I got a vectorize speedup of 1.27x and a parallel speedup of 1.48x.

rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-32-linux-enable_llvm_loop_opt-no_runtime-strict_float

If the LLVM loop optimizations are disabled (removing enable_llvm_loop_opt, which yields the command below), the vectorize speedup is 1.93x and the parallel speedup is 2.61x.

rtb_ImpAsg_InsertedFor_O_halide -f rtb_ImpAsg_InsertedFor_O_halide_pipeline -g rtb_ImpAsg_InsertedFor_O_halide_gen -e h,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=arm-32-linux-no_runtime-strict_float

With or without LLVM loop optimizations, vectorize and parallel clearly make a difference on the Raspberry Pi ARM Cortex-A53 32-bit target, unlike on the 64-bit target.

One thing to note is that the Raspberry Pi ARM Cortex-A53 32-bit target has neon in its instruction set list, while the Raspberry Pi ARM Cortex-A72 64-bit target lists asimd rather than neon. This should not matter, since asimd is just the AArch64 name for the same Advanced SIMD (NEON) unit, and it also does not explain why parallelization does not kick in. Both devices (the 32-bit and 64-bit Raspberry Pis) have 4 cores.
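
On the Halide side, as far as I know NEON/Advanced SIMD is assumed on arm targets unless the no_neon feature is added to the target string, and the vector width Halide will use can be queried directly. A small sketch (illustrative, separate from the measurements above):

    // Sketch: inspect the two target strings used above.
    #include "Halide.h"
    #include <iostream>

    int main() {
        Halide::Target t64("arm-64-linux-no_runtime-strict_float");
        Halide::Target t32("arm-32-linux-no_runtime-strict_float");
        std::cout << "arm-64 NEON disabled? "
                  << t64.has_feature(Halide::Target::NoNEON) << "\n";
        std::cout << "arm-64 float lanes: " << t64.natural_vector_size<float>() << "\n";
        std::cout << "arm-32 float lanes: " << t32.natural_vector_size<float>() << "\n";
        return 0;
    }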

Is this an issue with Halide, or is there a missing flag above?

Halide code:

#include "Halide.h"
#include <stdio.h>
using namespace Halide;

class rtb_ImpAsg_InsertedFor_O_halide_generator : public Halide::Generator <rtb_ImpAsg_InsertedFor_O_halide_generator> {

public:
    Input<Buffer<float>> imageinput1_in{"imageinput1_in", 3};
    Input<Buffer<float>> imageinput2_in{"imageinput2_in", 3};
    Input<Buffer<float>> imageinput3_in{"imageinput3_in", 3};
    Output<Buffer<float>> rtb_ImpAsg_InsertedFor_Out1_at_fcn{"rtb_ImpAsg_InsertedFor_Out1_at_fcn", 3};

    void generate() {
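        // Elementwise multiply-add: out = in1 * in2 + in3 at every (d1, d2, d3).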
        rtb_ImpAsg_InsertedFor_Out1_at_fcn(d1, d2, d3) = imageinput1_in(d1, d2, d3) * imageinput2_in(d1, d2, d3) + imageinput3_in(d1, d2, d3);
    }

    void schedule() {
        rtb_ImpAsg_InsertedFor_Out1_at_fcn
            .split(d3, d3, d3i, 13, TailStrategy::ShiftInwards)
            .split(d1, d1, d1i, 4, TailStrategy::ShiftInwards)
            .compute_root()
            .reorder({d1i, d1, d2, d3i, d3});
    }

private:
    Var d1{"d1"};
    Var d2{"d2"};
    Var d3{"d3"};
    Var d3i{"d3i"};
    Var d1i{"d1i"};

};
HALIDE_REGISTER_GENERATOR(rtb_ImpAsg_InsertedFor_O_halide_generator, rtb_ImpAsg_InsertedFor_O_halide_gen)
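
A host-side harness for timing the generated pipeline could look like the sketch below (buffer extents and header name are assumptions on my part; a Halide runtime generated for the same target must be linked separately because of no_runtime):

    // Sketch: time the AOT-compiled pipeline with the stock halide_benchmark.h helper.
    #include "rtb_ImpAsg_InsertedFor_O_halide_pipeline.h"
    #include "HalideBuffer.h"
    #include "halide_benchmark.h"
    #include <cstdio>

    int main() {
        const int d1 = 64, d2 = 64, d3 = 52;  // hypothetical extents
        Halide::Runtime::Buffer<float> in1(d1, d2, d3), in2(d1, d2, d3), in3(d1, d2, d3);
        Halide::Runtime::Buffer<float> out(d1, d2, d3);
        in1.fill(1.25f);
        in2.fill(0.5f);
        in3.fill(2.0f);

        // Best average seconds per call over 10 samples of 10 iterations each.
        double seconds = Halide::Tools::benchmark(10, 10, [&]() {
            rtb_ImpAsg_InsertedFor_O_halide_pipeline(in1, in2, in3, out);
        });
        printf("%g s per call\n", seconds);
        return 0;
    }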
