
[Arm] Enable Gather MatMul with KleidiAI Microkernels#34303

Open
abhijain1204fujitsu wants to merge 9 commits intoopenvinotoolkit:masterfrom
MonakaResearch:GatherMatmul-on-ARM-with-Kleidiai

Conversation

Contributor

@abhijain1204fujitsu abhijain1204fujitsu commented Feb 25, 2026

[ About ]

  • Enable GatherMatmul and GatherMatmul-Compressed on ARM.
  • Bug fix: require reorders before packing when the weight matrix is transposed; this relates to KleidiAI execution of MatMul in F32 precision (see the sketch below).
  • Additional memory consumption due to weight duplication is managed.
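
A minimal sketch of the transposed-weights fix, assuming a plain row-major F32 buffer (the actual change goes through the plugin's reorder/packing infrastructure; reorder_transposed_weights is an illustrative name, not from the diff):

#include <cstddef>
#include <vector>

// Illustrative only: assuming the RHS packing routine expects an (N x K) layout,
// a transposed (K x N) weight buffer is first copied into that layout before packing.
static std::vector<float> reorder_transposed_weights(const float* src, std::size_t K, std::size_t N) {
    std::vector<float> dst(N * K);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            dst[n * K + k] = src[k * N + n];  // transpose copy: (k, n) -> (n, k)
        }
    }
    return dst;  // hand this buffer to the packing micro-kernel afterwards
}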

[ Design ]

  • Scratchpad control is moved from the GatherMatmul node level to the executor level, to support NUMA-based Expert Parallelism in the future (see the sketch below).
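
A rough sketch of what executor-owned scratchpad state can look like (the class and member names are assumptions for illustration, not the PR's actual types):

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: the scratch buffer lives in the executor rather than in the GatherMatmul node,
// so a per-expert executor can later allocate it on the NUMA node it is pinned to.
class GatherMatmulExecutorSketch {
public:
    void execute(std::size_t requiredScratchBytes) {
        if (m_scratch.size() < requiredScratchBytes) {
            m_scratch.resize(requiredScratchBytes);  // (re)allocated locally to this executor
        }
        // ... run packing / micro-kernels using m_scratch.data() ...
    }

private:
    std::vector<std::uint8_t> m_scratch;  // executor-level scratchpad
};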

[Benchmark Results]
(Two benchmark charts are attached as images in the PR description.)

** Results are measured on a single-socket Graviton4 machine (96 cores).
KleidiAI support is enabled and tested for F32, INT8, and INT4 precisions. For F32, oneDNN is made the default.

This work is contributed by @ashwins990 and @abhijain1204fujitsu

@abhijain1204fujitsu abhijain1204fujitsu requested review from a team as code owners February 25, 2026 03:14
@github-actions github-actions Bot added the category: CPU OpenVINO CPU plugin label Feb 25, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Feb 25, 2026
@alvoron alvoron self-assigned this Mar 4, 2026
@maxnick maxnick modified the milestones: 2026.0, 2026.1 Mar 4, 2026
@maxnick maxnick self-assigned this Mar 4, 2026
Comment on lines +80 to +94
#else

ov::element::Type getRuntimePrecision() const override;
Algorithm algorithm = Algorithm::GatherMatmulDefault;
size_t numExperts = 0;

std::vector<ExecutorPtr> executor;
std::vector<MemoryArgs> memArgsFC;

MemoryPtr m_weightsMemory = nullptr;
MemoryPtr m_tmpInpBuffer = nullptr;
MemoryDescPtr m_tmpInputDesc = nullptr;
MemoryDescPtr m_tmpOutputDesc = nullptr;

#endif

Contributor

Some fields are clearly duplicated between if and else branches. Should we narrow the scope?
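
One possible shape of the narrowed scope (a sketch only; which fields are actually duplicated is visible only in the full diff, and the guarding macro should be whatever the header already uses for the architecture split):

// Fields shared by both branches stay unconditional.
ov::element::Type getRuntimePrecision() const override;
Algorithm algorithm = Algorithm::GatherMatmulDefault;
size_t numExperts = 0;

// Only the ARM-specific executor state stays behind the conditional.
#if defined(OPENVINO_ARCH_ARM64)
std::vector<ExecutorPtr> executor;
std::vector<MemoryArgs> memArgsFC;
MemoryPtr m_weightsMemory = nullptr;
MemoryPtr m_tmpInpBuffer = nullptr;
MemoryDescPtr m_tmpInputDesc = nullptr;
MemoryDescPtr m_tmpOutputDesc = nullptr;
#endif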

Contributor

I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.

continue;
}

parallel_for(num_valid_rows, [&](size_t m) {

Contributor

It's better to use the CpuParallel class in such contexts to align the implementation with the x64 approach.

Contributor

Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?

@ashwins990
Contributor

Hi @maxnick. Thanks for the comment.

I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file "gathermatmul.cpp" and reused the existing x86 code.

I will update this PR with the new refactored logic in the coming week once it's approved internally.

Will move the relevant tests to common scope as well.

@praasz praasz modified the milestones: 2026.1, 2026.2 Mar 20, 2026
@abhijain1204fujitsu abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from c4f4a52 to 71109bc Compare March 24, 2026 11:47
@ashwins990
Contributor

Hi @maxnick, I have made the requested changes and fixed the test cases. Please review the PR. Thanks!


TEST_P(MoECompressedWeightsSubgraphTest, CompareWithRefs) {
SKIP_IF_CURRENT_TEST_IS_DISABLED()
#ifndef OPENVINO_ARCH_X86

Contributor

OPENVINO_ARCH_X86_64 should be checked here as well.
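
A sketch of the combined guard the comment asks for (the body of the block is elided here, since only the opening of the hunk is quoted above):

#if !defined(OPENVINO_ARCH_X86) && !defined(OPENVINO_ARCH_X86_64)
    // non-x86 (e.g. ARM) branch from the original hunk goes here
#endif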

reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_node);
new_reduce_sum =
squeeze_node->clone_with_new_inputs({new_reduce_node->output(0), new_reduce_node->input_value(1)});

Contributor

Are we assuming that squeeze axes equal reduce axes, so we can use new_reduce_node->input_value(1) in new_reduce_sum initialization? Shouldn't we use squeeze_node->input_value(1) here?

Contributor

Hi @alvoron, I have made the changes as per your suggestion. Thanks!

Yes, it was based on the assumption that the squeeze axis and the reduce axis are the same. But since we do not intend to make any modification to the original pattern here, I think it makes sense not to rely on this assumption.

@abhijain1204fujitsu abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from 4a59819 to 6aa7b53 Compare April 6, 2026 09:11
@maxnick maxnick added the platform: arm OpenVINO on ARM / ARM64 label Apr 10, 2026
@abhijain1204fujitsu abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from 6aa7b53 to c48e78b Compare April 15, 2026 15:44
@abhijain1204fujitsu abhijain1204fujitsu requested a review from a team as a code owner April 15, 2026 15:44
@github-actions github-actions Bot added the category: transformations OpenVINO Runtime library - Transformations label Apr 15, 2026
@ashwins990
Contributor

Hi @maxnick @alvoron. The conflict with master is resolved. Please review the code. Thanks!

@maxnick maxnick requested a review from Copilot April 16, 2026 08:23

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Enables GatherMatmul / GatherMatmul-Compressed on ARM by wiring the MoE-to-GatherMatmul transformation through the common CPU pass pipeline and adding a KleidiAI-backed execution path for ARM64.

Changes:

  • Register MoE→GatherMatmul and GatherMatmul→Compressed conversions for non-x86 CPU builds.
  • Add ARM64 KleidiAI execution path for GatherMatmul (including compressed weights constraints) and fix handling of transposed weights before packing.
  • Extend MoE transformation + tests to also match ReduceSum implemented as keep_dims=true followed by Squeeze.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Summary per file:

  • src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/x64/moe.cpp: Adjusts ARM-specific thresholds, skips, and config/parameter generation for compressed MoE tests.
  • src/plugins/intel_cpu/src/transformations/cpu_opset/convert_to_cpu_specific_opset.hpp: Registers MoE conversion passes for all CPU arches (not only x64).
  • src/plugins/intel_cpu/src/nodes/gathermatmul.h: Adds kernel-type selection (oneDNN vs KleidiAI) and ARM-only executor state.
  • src/plugins/intel_cpu/src/nodes/gathermatmul.cpp: Implements ARM64 KleidiAI primitive creation/prepare/execute paths and ARM compressed-type gating.
  • src/plugins/intel_cpu/src/nodes/executors/kleidiai/kleidiai_mm.hpp: Adds a GatherMatmul-specific mode and an API to provide the gather/scatter index map.
  • src/plugins/intel_cpu/src/nodes/executors/kleidiai/kleidiai_mm.cpp: Implements the GatherMatmul gather/scatter flow and fixes the repack path for transposed weights.
  • src/common/transformations/tests/common_optimizations/convert_tiled_moe_block_to_gather_matmuls_test.cpp: Expands the test matrix to cover the ReduceSum keep-dims + Squeeze form.
  • src/common/transformations/src/transformations/common_optimizations/convert_tiled_moe_block_to_gather_matmuls.cpp: Updates the pattern to match both ReduceSum forms and clones reduce + squeeze appropriately.

Comment on lines +260 to +276
if (pm.find(p.reduceSum_squeeze) != pm.end()) {
const auto reduce_node = pm.at(p.reduceSum_keepDims).get_node_shared_ptr();
const auto squeeze_node = pm.at(p.reduceSum_squeeze).get_node_shared_ptr();
const auto new_reduce_node =
reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_node);
new_reduce_sum =
squeeze_node->clone_with_new_inputs({new_reduce_node->output(0), squeeze_node->input_value(1)});
ov::copy_runtime_info(squeeze_node, new_reduce_sum);
} else {
const auto reduce_node = pm.at(p.reduceSum_noKeepDims).get_node_shared_ptr();
new_reduce_sum =
reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_sum);
}

new_reduce_sum->set_friendly_name(p.reduce_sum->get_friendly_name());

Copilot AI Apr 16, 2026

p.reduce_sum is a pattern node (an OR of two alternatives), so p.reduce_sum->get_friendly_name() is not guaranteed to match the friendly name of the actually matched runtime node. Prefer taking the friendly name from the matched node (pm.at(p.reduceSum_squeeze) or pm.at(p.reduceSum_noKeepDims)), and (in the keep-dims path) also consider preserving the ReduceSum node name separately if name-stability is important for debugging/telemetry.

Suggested change
if (pm.find(p.reduceSum_squeeze) != pm.end()) {
const auto reduce_node = pm.at(p.reduceSum_keepDims).get_node_shared_ptr();
const auto squeeze_node = pm.at(p.reduceSum_squeeze).get_node_shared_ptr();
const auto new_reduce_node =
reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_node);
new_reduce_sum =
squeeze_node->clone_with_new_inputs({new_reduce_node->output(0), squeeze_node->input_value(1)});
ov::copy_runtime_info(squeeze_node, new_reduce_sum);
} else {
const auto reduce_node = pm.at(p.reduceSum_noKeepDims).get_node_shared_ptr();
new_reduce_sum =
reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_sum);
}
new_reduce_sum->set_friendly_name(p.reduce_sum->get_friendly_name());
std::string new_reduce_sum_friendly_name;
if (pm.find(p.reduceSum_squeeze) != pm.end()) {
const auto reduce_node = pm.at(p.reduceSum_keepDims).get_node_shared_ptr();
const auto squeeze_node = pm.at(p.reduceSum_squeeze).get_node_shared_ptr();
const auto new_reduce_node =
reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_node);
new_reduce_node->set_friendly_name(reduce_node->get_friendly_name());
new_reduce_sum =
squeeze_node->clone_with_new_inputs({new_reduce_node->output(0), squeeze_node->input_value(1)});
ov::copy_runtime_info(squeeze_node, new_reduce_sum);
new_reduce_sum_friendly_name = squeeze_node->get_friendly_name();
} else {
const auto reduce_node = pm.at(p.reduceSum_noKeepDims).get_node_shared_ptr();
new_reduce_sum =
reduce_node->clone_with_new_inputs({new_final_mul->output(0), reduce_node->input_value(1)});
ov::copy_runtime_info(reduce_node, new_reduce_sum);
new_reduce_sum_friendly_name = reduce_node->get_friendly_name();
}
new_reduce_sum->set_friendly_name(new_reduce_sum_friendly_name);

Contributor

Incorporated these changes in the latest commit.

Comment on lines +120 to +126
p.reduceSum_keepDims = pattern::wrap_type<ov::op::v1::ReduceSum>({p.mul3, pattern::any_input()},
pattern::consumers_count(1),
{{"keep_dims", true}});
p.reduceSum_squeeze = pattern::wrap_type<ov::op::v0::Squeeze>({p.reduceSum_keepDims, pattern::any_input()});
p.reduceSum_noKeepDims = pattern::wrap_type<ov::op::v1::ReduceSum>({p.mul3, pattern::any_input()},
pattern::consumers_count(1),
{{"keep_dims", false}});

Copilot AI Apr 16, 2026

The added pattern::consumers_count(1) constraint can make the transformation stop matching if ReduceSum has >1 consumer in real graphs (even if the subgraph is otherwise valid). If the single-consumer requirement is not strictly necessary for correctness, consider removing/relaxing it to avoid unintentionally disabling the MoE→GatherMatmul optimization.

Suggested change
p.reduceSum_keepDims = pattern::wrap_type<ov::op::v1::ReduceSum>({p.mul3, pattern::any_input()},
pattern::consumers_count(1),
{{"keep_dims", true}});
p.reduceSum_squeeze = pattern::wrap_type<ov::op::v0::Squeeze>({p.reduceSum_keepDims, pattern::any_input()});
p.reduceSum_noKeepDims = pattern::wrap_type<ov::op::v1::ReduceSum>({p.mul3, pattern::any_input()},
pattern::consumers_count(1),
{{"keep_dims", false}});
p.reduceSum_keepDims =
pattern::wrap_type<ov::op::v1::ReduceSum>({p.mul3, pattern::any_input()}, {{"keep_dims", true}});
p.reduceSum_squeeze = pattern::wrap_type<ov::op::v0::Squeeze>({p.reduceSum_keepDims, pattern::any_input()});
p.reduceSum_noKeepDims =
pattern::wrap_type<ov::op::v1::ReduceSum>({p.mul3, pattern::any_input()}, {{"keep_dims", false}});

Contributor

We have not made this change, since we intentionally match only the single-consumer case (consumers_count(1)).

void moveMemToNumaNode(int numaNodeID) override;

void setKaiExecutorImplAsGatherMatmul();
void set_gather_idx(std::vector<std::pair<int32_t, int32_t>> idxMap);

Copilot AI Apr 16, 2026

set_gather_idx takes the vector by value, which forces an extra copy when the caller passes an lvalue. Prefer taking std::vector<...>&& (and std::move at the callsite) or const std::vector<...>& (and copying only if needed) to reduce per-iteration overhead in GatherMatmul execution.

Suggested change
void set_gather_idx(std::vector<std::pair<int32_t, int32_t>> idxMap);
void set_gather_idx(const std::vector<std::pair<int32_t, int32_t>>& idxMap);
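
For completeness, the rvalue-reference alternative mentioned above could look like this (a sketch only; m_gather_idx is an assumed member name, not taken from the PR):

void set_gather_idx(std::vector<std::pair<int32_t, int32_t>>&& idxMap) {
    m_gather_idx = std::move(idxMap);  // take ownership of the index map, no copy
}
// The call site would then pass: set_gather_idx(std::move(kai_gather_idx));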

Contributor

Incorporated these changes in the latest commit.

Comment on lines +1092 to +1096
auto gather_idx_expertOffset = gather_idx_map.begin() + gather_axis_index * M;
std::vector<std::pair<int32_t, int32_t>> kai_gather_idx(gather_idx_expertOffset,
gather_idx_expertOffset + num_valid_rows);
executor[gather_axis_index]->set_gather_idx(kai_gather_idx);
executor[gather_axis_index]->execute(memArgs[gather_axis_index]);

Copilot AI Apr 16, 2026

This allocates and copies kai_gather_idx on every execute() call for every expert. To reduce overhead, consider reusing a per-expert buffer (e.g., store a vector in the GatherMatmul node, clear() + reserve() + assign()), and then pass/move it into the executor (especially if set_gather_idx is changed to accept an rvalue-reference).
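
A sketch of the suggested reuse (m_gatherIdxScratch is an assumed per-expert member, not taken from the PR); the vector stays alive across execute() calls and is refilled in place, so steady-state calls avoid reallocation:

auto& kai_gather_idx = m_gatherIdxScratch[gather_axis_index];  // assumed per-expert cache
kai_gather_idx.clear();  // keeps capacity from the previous call
kai_gather_idx.insert(kai_gather_idx.end(),
                      gather_idx_expertOffset,
                      gather_idx_expertOffset + num_valid_rows);
executor[gather_axis_index]->set_gather_idx(kai_gather_idx);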

Contributor

Maybe not required.

Comment on lines +346 to +348
const auto srcPrc = dstDesc->getPrecision();
m_tmpInputDesc = creatorsMap.at(LayoutType::ncsp)->createSharedDesc(srcPrc, Shape({M, K}));
m_tmpOutputDesc = creatorsMap.at(LayoutType::ncsp)->createSharedDesc(srcPrc, Shape({M, N}));

Copilot AI Apr 16, 2026

For GatherMatmul, the temporary input buffer represents gathered activations, so its precision should be derived from the source tensor (ARG_SRC), not from the destination descriptor. Using dstDesc->getPrecision() could silently create mismatched temp buffers if output precision diverges from input; prefer memory.at(ARG_SRC)->getDescPtr()->getPrecision() here.

Suggested change
const auto srcPrc = dstDesc->getPrecision();
m_tmpInputDesc = creatorsMap.at(LayoutType::ncsp)->createSharedDesc(srcPrc, Shape({M, K}));
m_tmpOutputDesc = creatorsMap.at(LayoutType::ncsp)->createSharedDesc(srcPrc, Shape({M, N}));
const auto srcPrc = memory.at(ARG_SRC)->getDescPtr()->getPrecision();
const auto dstPrc = dstDesc->getPrecision();
m_tmpInputDesc = creatorsMap.at(LayoutType::ncsp)->createSharedDesc(srcPrc, Shape({M, K}));
m_tmpOutputDesc = creatorsMap.at(LayoutType::ncsp)->createSharedDesc(dstPrc, Shape({M, N}));

Contributor

Incorporated these changes in the latest commit.

Comment on lines +73 to +75
namespace {
// TODO: OffsetHelper is common util function. Move it to some common location
class OffsetHelper {

Copilot AI Apr 16, 2026

OffsetHelper logic is duplicated (there is a very similar helper in GatherMatmul). Since correct offset computation is easy to get subtly wrong over time (broadcasting, bitwidth, stride semantics), it would be safer to centralize this helper in a shared utility and reuse it in both places.

Contributor

Currently kept as is. Please suggest a common location if it needs to be moved.

@maxnick maxnick requested a review from praasz April 21, 2026 12:38
@ashwins990
Contributor

Hi @maxnick, I have fixed the clang-format issues that were failing earlier. Thanks.

@abhijain1204fujitsu abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from ca6191b to 80491e5 Compare May 5, 2026 09:43
@abhijain1204fujitsu abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from 80491e5 to d16cc01 Compare May 7, 2026 09:54

Labels

  • category: CPU (OpenVINO CPU plugin)
  • category: transformations (OpenVINO Runtime library - Transformations)
  • ExternalPR (External contributor)
  • platform: arm (OpenVINO on ARM / ARM64)

7 participants