
[Performance] the root cause of onnxruntime-genai's being much faster than onnxruntime? #22805

Open
idruker-cerence opened this issue Nov 11, 2024 · 0 comments
Labels
performance issues related to performance regressions

Comments

@idruker-cerence

Describe the issue

CUDA creates an execution pipeline on the Device (that is, where the GPU runs). For efficiency, it is best to keep operations in Device memory whenever possible. The CUDA API provides functions for explicitly copying data from Host to Device, Device to Host, and Device to Device: cudaMemcpy() and its derivatives. These functions are widely used within onnxruntime and even more so in onnxruntime-genai.

The performance bottleneck of generative LLM models is the huge KV-cache. When using onnxruntime, even with CUDA, the output KV-cache has to be copied to the Host, reordered according to the beam indices, and then fed back to the LLM model as input for generating the next token. By contrast, onnxruntime-genai has a special function that performs all of those operations in Device memory, using cudaMemcpyAsync. This, to my understanding, is the root cause of onnxruntime-genai's much better performance over onnxruntime.
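To make the Host-side round trip concrete, here is a minimal sketch of the per-token beam reordering described above when it has to happen in Host memory. The function name, shapes, and layout are hypothetical illustrations, not part of the onnxruntime API:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: kv holds [num_beams][cache_len] values flattened
// into one Host buffer, and beam_indices[i] names the source beam whose
// cache should back beam i for the next generation step. With plain
// onnxruntime the cache must first be copied Device -> Host, gathered
// like this, then copied Host -> Device again before the next token.
std::vector<float> reorder_kv_on_host(const std::vector<float>& kv,
                                      const std::vector<std::size_t>& beam_indices,
                                      std::size_t cache_len) {
    std::vector<float> out(beam_indices.size() * cache_len);
    for (std::size_t i = 0; i < beam_indices.size(); ++i) {
        // Gather the whole cache slice of the selected source beam.
        const float* src = kv.data() + beam_indices[i] * cache_len;
        std::copy(src, src + cache_len, out.data() + i * cache_len);
    }
    return out;
}
```

onnxruntime-genai performs the equivalent gather directly in Device memory (for example via cudaMemcpyAsync with the cudaMemcpyDeviceToDevice kind), so the cache never has to cross the PCIe bus between tokens.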

The CUDA Toolkit is available to onnxruntime when it is compiled with the "--use_cuda" flag. However, an application that is built against the CUDA-enabled onnxruntime does not have access to cudaMemcpy, because the onnxruntime API does not expose it. The only way for the application to achieve performance comparable to onnxruntime-genai's is to compile it against both the CUDA-enabled onnxruntime and the CUDA Toolkit.

Am I correct in my understanding of (1) the root cause, and (2) the inability to get maximum performance when building solely against onnxruntime + CUDA?

To reproduce

Nothing to reproduce; this is an analysis.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime 1.18.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

@idruker-cerence idruker-cerence added the performance issues related to performance regressions label Nov 11, 2024
@idruker-cerence idruker-cerence changed the title [Performance] the root cause of onnxruntime-genai's being much faster than onnxruntime? [Performance] [CUDA] the root cause of onnxruntime-genai's being much faster than onnxruntime? Nov 12, 2024
@idruker-cerence idruker-cerence changed the title [Performance] [CUDA] the root cause of onnxruntime-genai's being much faster than onnxruntime? [Performance] the root cause of onnxruntime-genai's being much faster than onnxruntime? Nov 12, 2024