Describe the issue
CUDA executes work in a pipeline on the Device (that is, the GPU). For efficiency, it is best to keep data and operations on the Device whenever possible rather than round-tripping through the Host. The CUDA API provides explicit copies between memory spaces (Host to Device, Device to Host, and Device to Device) via cudaMemcpy() and its variants. This function is used extensively within onnxruntime, and even more so in onnxruntime-genai.
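For reference, a minimal sketch of the three explicit copy directions (the buffer names and sizes are illustrative, not taken from onnxruntime):

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
  const size_t n = 1024;
  std::vector<float> host(n, 1.0f);

  float* dev_a = nullptr;
  float* dev_b = nullptr;
  cudaMalloc(&dev_a, n * sizeof(float));
  cudaMalloc(&dev_b, n * sizeof(float));

  // Host -> Device
  cudaMemcpy(dev_a, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  // Device -> Device: the data never leaves the GPU
  cudaMemcpy(dev_b, dev_a, n * sizeof(float), cudaMemcpyDeviceToDevice);
  // Device -> Host
  cudaMemcpy(host.data(), dev_b, n * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(dev_a);
  cudaFree(dev_b);
  return 0;
}
```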
The performance bottleneck of generative LLM models is the large KV-cache. With onnxruntime, even when using the CUDA execution provider, the output KV-cache has to be copied to the Host, reordered according to the beam indices, and then fed back to the model as input for generating the next token. In contrast, onnxruntime-genai has a dedicated function that performs all of those operations entirely in Device memory, using cudaMemcpyAsync. To my understanding, this is the root cause of onnxruntime-genai's much better performance compared to onnxruntime.
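To illustrate the idea (a hedged sketch of the general technique, not onnxruntime-genai's actual code; the contiguous per-beam cache layout and all names are assumptions), the beam reorder can be done with Device-to-Device cudaMemcpyAsync calls queued on a stream, so the cache never touches Host memory:

```cpp
#include <cuda_runtime.h>
#include <vector>

// src and dst are Device buffers laid out as [num_beams][beam_bytes].
// For each output beam, copy the source beam selected by beam_indices.
void ReorderKvOnDevice(float* dst, const float* src,
                       const std::vector<int>& beam_indices,
                       size_t beam_bytes, cudaStream_t stream) {
  const size_t elems_per_beam = beam_bytes / sizeof(float);
  for (size_t beam = 0; beam < beam_indices.size(); ++beam) {
    const float* src_beam = src + beam_indices[beam] * elems_per_beam;
    float* dst_beam = dst + beam * elems_per_beam;
    // Device -> Device copy on the stream; no Host round trip.
    cudaMemcpyAsync(dst_beam, src_beam, beam_bytes,
                    cudaMemcpyDeviceToDevice, stream);
  }
}
```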
The CUDA Toolkit is available to onnxruntime itself when it is compiled with the "--use_cuda" flag. However, an application built against onnxruntime with CUDA has no access to cudaMemcpy, because the onnxruntime API does not expose it. The only way for such an application to achieve performance comparable to onnxruntime-genai's is to compile it against both onnxruntime with CUDA and the CUDA Toolkit.
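As a hedged sketch of what such an application might do (the session, shapes, and the input/output names past_key_values / present_key_values are hypothetical), it could allocate the KV-cache with the CUDA Toolkit and hand the raw device pointer to onnxruntime via Ort::IoBinding, so the runtime reads and writes the cache in place on the Device:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>

void RunWithDeviceKvCache(Ort::Session& session, float* device_kv,
                          const int64_t* shape, size_t shape_len,
                          size_t element_count) {
  // Describe memory that lives on CUDA device 0.
  Ort::MemoryInfo cuda_mem("Cuda", OrtArenaAllocator, /*device_id=*/0,
                           OrtMemTypeDefault);

  // Wrap the existing Device buffer without copying it to the Host.
  Ort::Value kv_tensor = Ort::Value::CreateTensor<float>(
      cuda_mem, device_kv, element_count, shape, shape_len);

  Ort::IoBinding binding(session);
  binding.BindInput("past_key_values", kv_tensor);     // hypothetical name
  binding.BindOutput("present_key_values", cuda_mem);  // output stays on Device

  session.Run(Ort::RunOptions{nullptr}, binding);
}
```

This only works if the application is compiled and linked against the CUDA Toolkit in addition to onnxruntime, which is exactly the constraint described above.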
Am I correct in my understanding of (1) the root cause, and (2) that the maximum performance cannot be achieved by building solely against onnxruntime + CUDA?
To reproduce
Nothing to reproduce; this is an analysis.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime 1.18.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes