I compiled llama.cpp for Android and ran it on a REDMI K80 Pro, using taskset 0f (a CPU affinity mask for cores 0-3) to limit it to 4 CPU cores. The model is qwen2.5-0.5b-instruct-fp16.gguf, i.e. the weights are stored in FP16 while activations are computed in FP32.
I modified the ggml_graph_compute_thread function in ggml/src/ggml-cpu/ggml-cpu.c to log each operator's name, shape, and execution time.
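The patch itself is not reproduced here; roughly, and assuming the per-node loop in ggml_graph_compute_thread calls ggml_compute_forward for each graph node, the hook amounts to something like this sketch:

// Sketch only, not the actual patch. Each worker thread computes its slice of
// the node; thread 0's duration is taken as an approximation of the node time.
struct ggml_tensor * node = cgraph->nodes[node_n];
const int64_t t_start_us = ggml_time_us();
ggml_compute_forward(&params, node);
const int64_t t_end_us = ggml_time_us();
if (params.ith == 0 && node->op == GGML_OP_MUL_MAT) {
    // print src1 (activation) and src0 (weight) shapes, matching the log format shown below
    const struct ggml_tensor * s0 = node->src[0];
    const struct ggml_tensor * s1 = node->src[1];
    printf("X*Y (%lld %lld %lld %lld) * (%lld %lld %lld %lld): %.6f ms\n",
           (long long) s1->ne[0], (long long) s1->ne[1], (long long) s1->ne[2], (long long) s1->ne[3],
           (long long) s0->ne[0], (long long) s0->ne[1], (long long) s0->ne[2], (long long) s0->ne[3],
           (t_end_us - t_start_us) / 1000.0);
}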
I wrote a Python script to analyze the output (ana_log.py, invoked as python ana_log.py op.txt avg), which shows:
X*Y (896 1 1 1) * (896 151936 1 1): 4.616333 ms
That is, multiplying the FP32 activation tensor of shape (896 1 1 1) by the FP16 weight matrix of shape (896 151936 1 1) (the output projection from the 896-dimensional hidden state to the 151936-entry vocabulary) takes 4.616333 ms on average.
To reproduce this operation in isolation, I built a minimal test graph with GGML in tests/mytest.cpp (registered in tests/CMakeLists.txt):
#include <ggml-cpu.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>
#include "ggml.h"
#include "time_helper.hpp"

void test_f16_matmul() {
    ggml_backend_load_all();

    struct ggml_init_params params = {
        .mem_size   = 1280 * 1024 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t neC[2] = {896, 896};
    const int64_t neD[2] = {896, 1};
    const int64_t neA[2] = {896, 151936};

    struct ggml_tensor * C = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, neC[0], neC[1]);
    struct ggml_tensor * D = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, neD[0], neD[1]);
    struct ggml_tensor * A = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, neA[0], neA[1]);

    // Fill tensors with random values
    for (int i = 0; i < ggml_nelements(C); ++i) {
        ggml_set_f32_1d(C, i, ((float)rand() / RAND_MAX) * 2.0f - 1.0f);
    }
    for (int i = 0; i < ggml_nelements(D); ++i) {
        ggml_set_f32_1d(D, i, ((float)rand() / RAND_MAX) * 2.0f - 1.0f);
    }
    for (int i = 0; i < ggml_nelements(A); ++i) {
        ggml_set_f32_1d(A, i, ((float)rand() / RAND_MAX) * 2.0f - 1.0f);
    }

    struct ggml_tensor * B = ggml_mul_mat(ctx, C, D);
    struct ggml_tensor * E = ggml_mul_mat(ctx, A, B);

    // Build and compute the graph
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, E);
    ggml_graph_compute_with_ctx(ctx, gf, 4);

    ggml_free(ctx);
}

int main() {
    for (int i = 0; i < 6; i++) {
        test_f16_matmul();
        sleep(1);
    }
    return 0;
}
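Independently of the patched ggml-cpu.c log, the graph evaluation can also be timed from the caller's side as a cross-check. A minimal sketch (it would wrap the ggml_graph_compute_with_ctx call inside test_f16_matmul; note that it measures the small C*D matmul together with the large A*B one):

// Sketch: wall-clock timing around the whole graph computation.
// CLOCK_MONOTONIC is not affected by wall-clock adjustments; <time.h> and
// <stdio.h> are already included above.
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
ggml_graph_compute_with_ctx(ctx, gf, 4);
clock_gettime(CLOCK_MONOTONIC, &t1);
const double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
printf("graph compute: %.3f ms\n", elapsed_ms);

Comparing this caller-side number with the per-op log shows how much of the time is spent inside the mul_mat kernels themselves versus around them.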
Run Command on Device via ADB:
cd /data/local/tmp/ggml/build-ggml && chmod +x mytest && export LD_LIBRARY_PATH=:/data/local/tmp/ggml/build-ggml && taskset 0f ./mytest > ggml_op.txt
The result from running this standalone test shows:
X*Y (896 1 1 1) * (896 151936 1 1): 9.072333 ms
Multiple runs consistently show around 8–9 ms, whereas the same operation inside llama.cpp takes only ~4.6 ms. I verified through debugging that both versions execute the same underlying code path.
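For scale, a back-of-the-envelope estimate (my own numbers, not from the logs): a single-token mat-vec has to stream essentially the whole FP16 weight matrix once, so the two timings imply rather different effective memory bandwidth:

896 × 151936 × 2 bytes ≈ 272 MB of FP16 weights
272 MB / 4.6 ms ≈ 59 GB/s        272 MB / 9.1 ms ≈ 30 GB/s

If the operation is memory-bound, as large FP16 mat-vecs usually are, the question is effectively why the standalone test reaches only about half the weight-streaming bandwidth.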
Why does the mul_mat operation in llama.cpp run significantly faster than in my standalone test?
This issue has been puzzling me for two weeks now. I would greatly appreciate any insight you can offer!
Thank you for taking the time to read through this.