[Metal backend] Optimize While/BatchMatMul OP via 4x4 Register Blocking#4282

Merged
wangzhaode merged 2 commits into alibaba:master from rainyl:metal-while-optimize
Mar 19, 2026

@rainyl rainyl commented Mar 18, 2026

Environment

  • MacMini M4 10 cores CPU/GPU
  • macOS Sequoia 15.7.1 arm64
  • MNN master branch, hash: 296c949

Motivation

When benchmarking YOLOv8-Worldv2 models, I found that their latency on the Metal backend (80 ms) was much higher than on CPU (35 ms).

The timeProfile tool shows that the While OP (BatchMatMul is actually calling While) takes 59 ms on the Metal backend, while it takes only 7 ms on CPU:

// CPU
Sort by time cost !
Node Type       Avg(ms)         %               Called times    Flops Rate
Interp          0.065950        0.222374        2.000000        0.014636
Pooling         0.099480        0.335433        3.000000        0.045738
Softmax         0.500190        1.686569        1.000000        0.006403
Reduction       0.513350        1.730942        4.000000        0.000172
BinaryOp        0.526186        1.774224        22.000000       0.145324
Raster          2.501564        8.434916        65.000000       0.512043
While           7.246778        24.435104       15.000000       37.211399
UnaryOp         7.369541        24.849043       62.000000       0.231227
Convolution     10.834553       36.532574       68.000000       61.830284
total time : 29.657242 ms, total mflops : 8006.625000
main, 171, cost time: 3107.464111 ms
// Metal
Sort by time cost !
Node Type       Avg(ms)         %               Called times    Flops Rate
Softmax         0.213490        0.196185        1.000000        0.006403
Interp          0.362900        0.333485        2.000000        0.014636
Pooling         0.499050        0.458599        3.000000        0.045738
Reduction       1.042630        0.958118        4.000000        0.000172
BinaryOp        4.154682        3.817919        22.000000       0.145324
UnaryOp         10.564444       9.708127        62.000000       0.231227
Raster          14.910487       13.701895       65.000000       0.512043
Convolution     17.441622       16.027864       68.000000       61.830284
While           59.631500       54.797977       15.000000       37.211399
total time : 108.820618 ms, total mflops : 8006.625000
main, 171, cost time: 11018.358398 ms

Then I asked Gemini and found that the implementation of loop_matmul in source/backend/metal/MetalLoop.mm is naive and not well optimized.

Solution

Optimize loop_matmul using 4x4 register blocking and a 2D dispatch grid.
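
For readers unfamiliar with the technique, here is a minimal CPU-side C++ sketch of 4x4 register blocking. It is not the Metal kernel from this PR: the names matmulTile4x4 and matmulBlocked, the row-major layout, and the requirement that M and N be multiples of 4 are all assumptions made for illustration. Each (tileRow, tileCol) pair stands in for one thread of a 2D dispatch grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not MNN code: computes one 4x4 tile of C = A * B.
// (tileRow, tileCol) stands in for a thread's position in a 2D dispatch grid.
// Matrices are row-major: A is M x K, B is K x N, C is M x N.
void matmulTile4x4(const float* A, const float* B, float* C,
                   int K, int N, int tileRow, int tileCol) {
    float acc[4][4] = {};  // 16 accumulators, small enough to stay in registers
    const int row0 = tileRow * 4;
    const int col0 = tileCol * 4;
    for (int k = 0; k < K; ++k) {
        float a[4];
        float b[4];
        // 8 loads per k step ...
        for (int i = 0; i < 4; ++i) a[i] = A[(row0 + i) * K + k];
        for (int j = 0; j < 4; ++j) b[j] = B[k * N + col0 + j];
        // ... feed 16 multiply-adds.
        for (int i = 0; i < 4; ++i) {
            for (int j = 0; j < 4; ++j) {
                acc[i][j] += a[i] * b[j];
            }
        }
    }
    // Write the finished tile back once, after the k loop.
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            C[(row0 + i) * N + col0 + j] = acc[i][j];
        }
    }
}

// Assumption for this sketch: M and N are multiples of 4.
std::vector<float> matmulBlocked(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int M, int K, int N) {
    assert(M % 4 == 0 && N % 4 == 0);
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    // On a GPU these two loops would be the 2D grid of threads.
    for (int tileRow = 0; tileRow < M / 4; ++tileRow) {
        for (int tileCol = 0; tileCol < N / 4; ++tileCol) {
            matmulTile4x4(A.data(), B.data(), C.data(), K, N, tileRow, tileCol);
        }
    }
    return C;
}
```

The point of the blocking is the ratio of arithmetic to memory traffic: a one-output-per-thread loop does 2 loads per multiply-add, while the tiled loop above does 8 loads per 16 multiply-adds.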

Benchmarking

// Metal with optimized loop_matmul
Sort by time cost !
Node Type       Avg(ms)         %               Called times    Flops Rate
Softmax         0.205580        0.373207        1.000000        0.006403
Interp          0.355230        0.644879        2.000000        0.014636
Pooling         0.486690        0.883529        3.000000        0.045738
Reduction       1.013280        1.839493        4.000000        0.000172
BinaryOp        4.007656        7.275434        22.000000       0.145324
While           7.431837        13.491636       15.000000       37.211399
UnaryOp         10.142984       18.413410       62.000000       0.231227
Raster          14.494930       26.313866       65.000000       0.512043
Convolution     16.947138       30.765564       68.000000       61.830284
total time : 55.084766 ms, total mflops : 8006.625000
main, 171, cost time: 5642.733887 ms

The cost of the While OP drops from 59.63 ms to 7.43 ms, about 8x faster (59.63 / 7.43 ≈ 8.0), and the total profiled time falls from 108.82 ms to 55.08 ms.

rainyl added 2 commits March 18, 2026 17:38
Implement safe and unsafe paths in matrix multiplication kernel. The unsafe path is used when indices are guaranteed to be within bounds for performance, while the safe path includes conditional checks and zero-padding for boundary elements to prevent invalid memory access.
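
As a rough illustration of the safe/unsafe split described in this commit message, the CPU-side C++ sketch below picks a path per 4x4 tile. It is an assumption-laden example, not the code in MetalLoop.mm: the function name matmulZeroPadded, the row-major layout, and the per-tile `interior` test are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not MNN code: C = A * B with 4x4 tiles for any M, N.
// Row-major: A is M x K, B is K x N, C is M x N.
std::vector<float> matmulZeroPadded(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int M, int K, int N) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int row0 = 0; row0 < M; row0 += 4) {
        for (int col0 = 0; col0 < N; col0 += 4) {
            // True when the whole 4x4 tile lies inside the matrices.
            const bool interior = (row0 + 4 <= M) && (col0 + 4 <= N);
            float acc[4][4] = {};
            for (int k = 0; k < K; ++k) {
                float a[4];
                float b[4];
                if (interior) {
                    // "Unsafe" path: indices are known to be in bounds,
                    // so no per-element checks are needed.
                    for (int i = 0; i < 4; ++i) a[i] = A[(row0 + i) * K + k];
                    for (int j = 0; j < 4; ++j) b[j] = B[k * N + col0 + j];
                } else {
                    // Safe path: check every index and zero-pad anything
                    // that falls outside the matrix.
                    for (int i = 0; i < 4; ++i) {
                        a[i] = (row0 + i < M) ? A[(row0 + i) * K + k] : 0.0f;
                    }
                    for (int j = 0; j < 4; ++j) {
                        b[j] = (col0 + j < N) ? B[k * N + col0 + j] : 0.0f;
                    }
                }
                for (int i = 0; i < 4; ++i) {
                    for (int j = 0; j < 4; ++j) {
                        acc[i][j] += a[i] * b[j];
                    }
                }
            }
            // Stores are guarded too, so padded lanes are never written.
            for (int i = 0; i < 4; ++i) {
                for (int j = 0; j < 4; ++j) {
                    if (interior || (row0 + i < M && col0 + j < N)) {
                        C[(row0 + i) * N + col0 + j] = acc[i][j];
                    }
                }
            }
        }
    }
    return C;
}
```

Only tiles on the right and bottom edges pay for the bounds checks; interior tiles take the unchecked path.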
@wangzhaode wangzhaode merged commit 160c758 into alibaba:master Mar 19, 2026
7 checks passed
@rainyl rainyl deleted the metal-while-optimize branch March 19, 2026 09:04