MlasTranspose multi-threads support. #24261
Conversation
/azp run Big Models, Linux CPU Minimal Build E2E CI Pipeline, Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
@microsoft-github-policy-service agree company="Fujitsu Ltd."
##[error]D:\a_work\onnxruntime\onnxruntime\onnxruntime\core\mlas\lib\transpose.cpp(986,5): error C2664: 'void MlasExecuteThreaded(MLAS_THREADED_ROUTINE (__cdecl *),void *,ptrdiff_t,MLAS_THREADPOOL *)': cannot convert argument 1 from 'void (__stdcall *)(void *,ptrdiff_t)' to 'MLAS_THREADED_ROUTINE (__cdecl *)' [D:\a_work\onnxruntime\onnxruntime\build\RelWithDebInfo\onnxruntime_mlas.vcxproj]
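For context on the failure above: C2664 here is a calling-convention mismatch, and the expected parameter type can be read directly from the error text. The sketch below is a minimal, hedged illustration of the kind of declaration that satisfies it; the worker name `TransposeThreaded` and its body are illustrative assumptions, not the actual PR code.

```cpp
#include <cstddef>

// MlasExecuteThreaded expects a MLAS_THREADED_ROUTINE(__cdecl*), i.e. a plain
// void(void*, ptrdiff_t) function, so an explicitly __stdcall-qualified worker
// is rejected:
//
//   static void __stdcall TransposeThreaded(void* Context, ptrdiff_t Index);  // -> C2664
//
// Declaring the worker with the default (__cdecl) convention matches the typedef:
static void TransposeThreaded(void* Context, ptrdiff_t Index)
{
    // ... handle the slice of the transpose assigned to thread `Index` ...
    (void)Context;
    (void)Index;
}
```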
Warning (CLANGFORMAT) format
OK, I'll apply the patch and push again.
@amarin16 please test this out and review. thanks!
The code changes look good to me. Waiting for the pipelines to pass.
@msy-kato Could you please provide some details about how you ran the performance tests? Did you use onnxruntime-genai?
Thanks for the review.
Sure! I converted the HF model with the following script (convert.py):
```python
import torch
from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

input_texts = [' '.join(['Hello']) * 32] * 2
inputs = dict(tokenizer(input_texts, return_tensors="pt"))

torch.onnx.export(
    model,
    inputs,
    "model.onnx",
    input_names=list(inputs.keys()),
    output_names=['last_hidden_state', 'pooler_output'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'max_input_length'},
        'attention_mask': {0: 'batch_size', 1: 'max_input_length'},
    }
)
```
and ran the benchmark with this script (run.py):

```python
import onnxruntime
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
input_texts = [' '.join(['Hello']) * 510] * 4

options = onnxruntime.SessionOptions()
options.inter_op_num_threads = 1
options.intra_op_num_threads = 16
ort_session = onnxruntime.InferenceSession("model.onnx", sess_options=options)

batch_dict = dict(tokenizer(input_texts, max_length=512, return_tensors="pt"))
batch_dict = {name: tensor.numpy() for name, tensor in batch_dict.items()}

# warmup
_ = ort_session.run(['last_hidden_state'], batch_dict)

start_time = time.time()
for i in range(10):
    _ = ort_session.run(['last_hidden_state'], batch_dict)
end_time = time.time()
print('step duration(avg) = {:.7f} sec/step'.format((end_time - start_time) / 10))
```

Commands:

```sh
$ python3 convert.py
$ numactl -C 0-15 python3 run.py
```
@amarin16 Thank you for approving my PR. I noticed that the CI/CD pipeline hasn't completed yet. Could you advise if there's anything I can do?
Could try closing the PR and re-opening it |
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline |
Azure Pipelines successfully started running 5 pipeline(s).
Force-pushed from 1a22f09 to 09aade9
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline |
Azure Pipelines successfully started running 5 pipeline(s).
Description
MlasTranspose was previously running single-threaded, which resulted in suboptimal performance on multi-core CPUs. To address this, I modified it to utilize multi-threading.
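A minimal sketch of the idea, assuming the rows of the matrix are divided into chunks and each chunk is handed to the MLAS thread pool through MlasExecuteThreaded (the entry point named in the build log above). The `TransposeWork` struct, the `TransposeChunk` worker, and the chunking scheme are illustrative assumptions, not the exact code in this PR:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-call context shared by all worker invocations.
struct TransposeWork {
    const float* Input;   // M x N row-major source
    float* Output;        // N x M row-major destination
    size_t M;             // rows of the input
    size_t N;             // columns of the input
    size_t RowsPerChunk;  // rows processed by one worker invocation
};

// Worker with the void(void*, ptrdiff_t) shape that MlasExecuteThreaded expects;
// invocation `Index` transposes one contiguous band of rows.
static void TransposeChunk(void* Context, ptrdiff_t Index)
{
    auto* Work = static_cast<TransposeWork*>(Context);
    const size_t RowBegin = static_cast<size_t>(Index) * Work->RowsPerChunk;
    const size_t RowEnd = std::min(RowBegin + Work->RowsPerChunk, Work->M);

    for (size_t m = RowBegin; m < RowEnd; m++) {
        for (size_t n = 0; n < Work->N; n++) {
            Work->Output[n * Work->M + m] = Work->Input[m * Work->N + n];
        }
    }
}
```

The dispatch would then look something like `MlasExecuteThreaded(TransposeChunk, &Work, ChunkCount, ThreadPool)`, where `ChunkCount` is derived from the tensor size and the number of available intra-op threads.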
Motivation and Context
We encountered this issue while running multilingual-e5-large, which was converted to ONNX format and executed on a multi-core CPU (Xeon 6338). Below are the performance metrics before and after the modification:
[Performance table: Condition — values not preserved]
This change resulted in a performance improvement of approximately 14% end to end. The stand-alone improvement is as follows: MlasTranspose itself is roughly 15~16x faster.