[Metal backend] Optimize `While/BatchMatMul` OP via 4x4 Register Blocking by rainyl · Pull Request #4282 · alibaba/MNN

rainyl · 2026-03-18T14:51:29Z

Environment

Mac mini M4, 10-core CPU/GPU
macOS Sequoia 15.7.1 arm64
MNN master branch, hash: 296c949

Motivation

When benchmarking YOLOv8-Worldv2 models, I found that latency on the Metal backend (80 ms) was much higher than on the CPU backend (35 ms).

The timeProfile tool shows that the While OP (BatchMatMul actually runs through While) takes 59 ms on the Metal backend, while the CPU backend takes only 7 ms:

// CPU
Sort by time cost !
Node Type       Avg(ms)         %               Called times    Flops Rate
Interp          0.065950        0.222374        2.000000        0.014636
Pooling         0.099480        0.335433        3.000000        0.045738
Softmax         0.500190        1.686569        1.000000        0.006403
Reduction       0.513350        1.730942        4.000000        0.000172
BinaryOp        0.526186        1.774224        22.000000       0.145324
Raster          2.501564        8.434916        65.000000       0.512043
While           7.246778        24.435104       15.000000       37.211399
UnaryOp         7.369541        24.849043       62.000000       0.231227
Convolution     10.834553       36.532574       68.000000       61.830284
total time : 29.657242 ms, total mflops : 8006.625000
main, 171, cost time: 3107.464111 ms

// Metal
Sort by time cost !
Node Type       Avg(ms)         %               Called times    Flops Rate
Softmax         0.213490        0.196185        1.000000        0.006403
Interp          0.362900        0.333485        2.000000        0.014636
Pooling         0.499050        0.458599        3.000000        0.045738
Reduction       1.042630        0.958118        4.000000        0.000172
BinaryOp        4.154682        3.817919        22.000000       0.145324
UnaryOp         10.564444       9.708127        62.000000       0.231227
Raster          14.910487       13.701895       65.000000       0.512043
Convolution     17.441622       16.027864       68.000000       61.830284
While           59.631500       54.797977       15.000000       37.211399
total time : 108.820618 ms, total mflops : 8006.625000
main, 171, cost time: 11018.358398 ms

Then I asked Gemini and found that the implementation of loop_matmul in source/backend/metal/MetalLoop.mm is naive and not well optimized.

Solution

Optimize loop_matmul using 4x4 register blocking and a 2D dispatch grid, so that each thread computes a 4x4 tile of the output instead of a single element.
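To illustrate the idea, here is a minimal CPU-side C++ sketch of 4x4 register blocking. It is not the Metal kernel from this PR: the function name `matmul_4x4_blocked`, the row-major layout, and the loop structure are assumptions made for illustration. Each (i0, j0) pair stands in for one GPU thread of a 2D dispatch grid, and the 16 accumulators stand in for per-thread registers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: C (MxN) = A (MxK) * B (KxN), all row-major.
// Each (i0, j0) iteration plays the role of one thread in a 2D dispatch grid.
std::vector<float> matmul_4x4_blocked(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N);
    constexpr std::size_t T = 4;  // tile edge: 4x4 outputs per "thread"
    std::vector<float> C(M * N, 0.0f);

    for (std::size_t i0 = 0; i0 < M; i0 += T) {
        for (std::size_t j0 = 0; j0 < N; j0 += T) {
            // Number of valid rows/columns in this tile (less than 4 at the edges).
            const std::size_t rows = std::min(T, M - i0);
            const std::size_t cols = std::min(T, N - j0);

            float acc[T][T] = {};  // 16 accumulators kept local to the tile
            for (std::size_t k = 0; k < K; ++k) {
                // Load 4 values of A and 4 values of B once per k step;
                // out-of-range slots stay zero and contribute nothing.
                float a[T] = {};
                float b[T] = {};
                for (std::size_t r = 0; r < rows; ++r) a[r] = A[(i0 + r) * K + k];
                for (std::size_t c = 0; c < cols; ++c) b[c] = B[k * N + j0 + c];

                // 8 loads feed 16 multiply-adds (outer product of a and b).
                for (std::size_t r = 0; r < T; ++r)
                    for (std::size_t c = 0; c < T; ++c)
                        acc[r][c] += a[r] * b[c];
            }

            // Write back only the valid part of the tile.
            for (std::size_t r = 0; r < rows; ++r)
                for (std::size_t c = 0; c < cols; ++c)
                    C[(i0 + r) * N + j0 + c] = acc[r][c];
        }
    }
    return C;
}
```

The point of the blocking is data reuse: per k step a tile reads 8 input values and performs 16 multiply-adds, whereas a one-output-per-thread kernel reads 2 values per multiply-add.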

Benchmarking

// Metal with optimized loop_matmul
Sort by time cost !
Node Type       Avg(ms)         %               Called times    Flops Rate
Softmax         0.205580        0.373207        1.000000        0.006403
Interp          0.355230        0.644879        2.000000        0.014636
Pooling         0.486690        0.883529        3.000000        0.045738
Reduction       1.013280        1.839493        4.000000        0.000172
BinaryOp        4.007656        7.275434        22.000000       0.145324
While           7.431837        13.491636       15.000000       37.211399
UnaryOp         10.142984       18.413410       62.000000       0.231227
Raster          14.494930       26.313866       65.000000       0.512043
Convolution     16.947138       30.765564       68.000000       61.830284
total time : 55.084766 ms, total mflops : 8006.625000
main, 171, cost time: 5642.733887 ms

The cost of the While OP drops from 59.63 ms to 7.43 ms, a speedup of 59.63/7.43, or about 8x. Total profiled time drops from 108.82 ms to 55.08 ms.

The matrix multiplication kernel implements a safe path and an unsafe path. The unsafe path skips bounds checks for performance and is used only when all indices are guaranteed to be in bounds. The safe path adds conditional checks and zero-pads boundary elements to prevent invalid memory access.
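The two paths can be sketched in C++ as follows. This is an illustrative sketch under stated assumptions, not the code merged in this PR: the helper `load_tile4`, the `Tile4` alias, and the per-tile bounds test are hypothetical, and the actual kernel may select the path differently (for example per dispatch rather than per tile).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Tile4 = std::array<float, 16>;  // hypothetical 4x4 tile, row-major

// Hypothetical helper: load the 4x4 tile whose top-left corner is (row0, col0)
// from a row-major rows x cols matrix.
Tile4 load_tile4(const std::vector<float>& m, std::size_t rows, std::size_t cols,
                 std::size_t row0, std::size_t col0) {
    assert(m.size() == rows * cols);
    Tile4 t{};  // zero-initialized, so skipped elements are zero-padded

    const bool fullyInside = (row0 + 4 <= rows) && (col0 + 4 <= cols);
    if (fullyInside) {
        // "Unsafe" path: no per-element checks, valid only because the whole
        // tile was proven to be in bounds above.
        for (std::size_t r = 0; r < 4; ++r)
            for (std::size_t c = 0; c < 4; ++c)
                t[r * 4 + c] = m[(row0 + r) * cols + (col0 + c)];
    } else {
        // "Safe" path: check every index; out-of-range elements stay zero.
        for (std::size_t r = 0; r < 4; ++r)
            for (std::size_t c = 0; c < 4; ++c)
                if (row0 + r < rows && col0 + c < cols)
                    t[r * 4 + c] = m[(row0 + r) * cols + (col0 + c)];
    }
    return t;
}
```

Zero-padding keeps the inner multiply-add loop uniform: padded elements contribute zero to the accumulators, so only the final write-back needs to know the real matrix edges.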

rainyl added 2 commits March 18, 2026 17:38

optimze While/BatchMatMul OP for metal backend

3eca53a

wangzhaode assigned bitxsw93 Mar 19, 2026

bitxsw93 approved these changes Mar 19, 2026

wangzhaode merged commit 160c758 into alibaba:master Mar 19, 2026
7 checks passed

rainyl deleted the metal-while-optimize branch March 19, 2026 09:04
