
TensorRT backend does not work when device_id is greater than 0 #19467

Closed
jax11235 opened this issue Feb 8, 2024 · 15 comments
Assignees
Labels
ep:TensorRT issues related to TensorRT execution provider

Comments

@jax11235

jax11235 commented Feb 8, 2024

Describe the issue

The CUDA backend runs fine on multiple GPUs, but TensorRT fails when device_id is specified as a value greater than 0.
The only successful workaround I have found so far is to use multiple processes, each with a different CUDA_VISIBLE_DEVICES=device_id environment variable.

To reproduce

...

Urgency

Urgent: the project is organized into multiple threads, and no workaround works.

Platform

Linux

OS Version

20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3 and 1.17.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

TensorRT 8.6

@tianleiwu tianleiwu added the ep:TensorRT issues related to TensorRT execution provider label Feb 9, 2024
@chilo-ms
Contributor

chilo-ms commented Feb 9, 2024

There is some related conversation here:
#16274

@chilo-ms
Contributor

chilo-ms commented Feb 9, 2024

Can you share your C++ code of how you set device id?

We suggest using session_options.AppendExecutionProvider_TensorRT_V2(tensorrt_options);
please see here for reference:
https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#click-below-for-c-api-example

@jax11235
Author

This is the part of my code that appends the TensorRT backend to sess_options:

    OrtApi const& ortApi = Ort::GetApi(); // Uses ORT_API_VERSION
    OrtTensorRTProviderOptionsV2* tensorrt_options = nullptr;
    Ort::ThrowOnError(ortApi.CreateTensorRTProviderOptions(&tensorrt_options));
    std::vector<const char*> keys{"device_id", "trt_max_workspace_size", "trt_fp16_enable", "trt_engine_cache_enable", "trt_engine_cache_path", "trt_timing_cache_enable"};
    std::vector<const char*> values{device_id.c_str(), memory_limit.c_str(), "1", "1", tensorrt_engine_cache_path.c_str(), "1"};
    Ort::ThrowOnError(ortApi.UpdateTensorRTProviderOptions(tensorrt_options, keys.data(), values.data(), keys.size()));
    Ort::ThrowOnError(ortApi.SessionOptionsAppendExecutionProvider_TensorRT_V2(so, tensorrt_options));
    ortApi.ReleaseTensorRTProviderOptions(tensorrt_options);

@chilo-ms
Contributor

chilo-ms commented Feb 10, 2024

Thanks, the code to append TRT EP to the session options looks okay.

Could you also share

  • the application code where it creates multiple sessions and specifies a different device_id to initialize each session with TRT EP, and where you set CUDA_VISIBLE_DEVICES?
  • the error message

Please note that users can only specify device_id at session initialization time, and it is one GPU per session.
Once the session is created, you can't change the device. That means when you call session.Run(), TRT EP always uses that device to perform inference; I think even if you set CUDA_VISIBLE_DEVICES=device_id before session.Run(), it won't have any effect.

I'm still not sure what your application code looks like, but we tested on our side: TRT EP can run on a different device within one session through the provider option "device_id":
onnxruntime_perf_test -e tensorrt -r 1 -i "device_id|2" model.onnx

@chilo-ms
Contributor

Replying to the other discussion thread here:

> What I mean is that different threads use different sessions, but if the first thread initializes its session with the environment variable CUDA_VISIBLE_DEVICES, subsequent threads cannot change the device by modifying CUDA_VISIBLE_DEVICES; they use the same device as the first thread.

"Subsequent threads cannot change the device by modifying CUDA_VISIBLE_DEVICES": do the other threads try to change the device at session creation, or at session.Run()?

@jax11235
Author

> Could you also share the application code where it creates multiple sessions and specifies different device_id ... and the error message

Without setting CUDA_VISIBLE_DEVICES, the way I create sessions is equivalent to creating an array of sessions, each with a different device_id appended to its own session_options.
The error message is "TensorRT EP execution context enqueue failed", the same as in #16274.

@jax11235
Author

> "subsequent threads can not change the device by modifying CUDA_VISIBLE_DEVICES" ... do the other threads try to change the device at session creation, or at session.Run()?

My previous description was inaccurate; I create these sessions sequentially in a single thread, like:

for (int i...) {
    setenv("CUDA_VISIBLE_DEVICES", device_ids[i], 1);
    sessions[i] = ...; // create a new session with a different session_options, all with device_id=0
    sessions[i].run(...); // warmup
}

The above code can use a device with device_id > 0, but all sessions end up on the same device, device_ids[0].

@chilo-ms
Contributor

chilo-ms commented Feb 13, 2024

> Without setting CUDA_VISIBLE_DEVICES, the way I create sessions is equivalent to creating a session array, each session with a different device_id appended to a new session_option. The error message is TensorRT EP execution context enqueue failed, the same as #16274.

That's a bit strange; I can't repro on my side with device_id specified by provider option:

for (int i...) {
    std::vector<const char*> keys{"device_id"};
    std::vector<const char*> values{device_id[i].c_str()};
    Ort::ThrowOnError(ortApi.UpdateTensorRTProviderOptions(tensorrt_options, keys.data(), values.data(), keys.size()));
    sessions[i] = ...; // create a new session with a different session_options with different device id
    sessions[i].run(...); // warmup
}

> The error message is TensorRT EP execution context enqueue failed, the same as #16274.

Is that all of the error message?
The linked issue has another error message: Error Code 1: Cuda Runtime (invalid resource handle)

Could you enable the verbose log by adding the line of code below, so we can see TRT EP's full log?

Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "test");

You will see the device_id configured by the TRT EP provider options (you should see a different device_id for each session):

...
2024-02-13 18:45:22.629147053 [V:onnxruntime:test, tensorrt_execution_provider.cc:1700 TensorrtExecutionProvider] [TensorRT EP] TensorRT provider options: device_id: 2, trt_max_partition_iterations: 1000, trt_min_subgraph_size: 1, trt_max_workspace_size: 1073741824, trt_fp16_enable: 0, trt_int8_enable: 0, trt_int8_calibration_cache_name: , int8_calibration_cache_available: 0, trt_int8_use_native_tensorrt_calibration_table: 0, trt_dla_enable: 0, trt_dla_core: 0, trt_dump_subgraphs: 0, trt_engine_cache_enable: 0, trt_cache_path: , trt_global_cache_path: , trt_engine_decryption_enable: 0, trt_engine_decryption_lib_path: , trt_force_sequential_engine_build: 0, trt_context_memory_sharing_enable: 0, trt_layer_norm_fp32_fallback: 0, trt_build_heuristics_enable: 0, trt_sparsity_enable: 0, trt_builder_optimization_level: 3, trt_auxiliary_streams: -1, trt_tactic_sources: , trt_profile_min_shapes: , trt_profile_max_shapes: , trt_profile_opt_shapes: , trt_cuda_graph_enable: 0, trt_dump_ep_context_model: 0, trt_ep_context_file_path: , trt_ep_context_embed_mode: 0, trt_cache_prefix:
...

@chilo-ms
Contributor

chilo-ms commented Feb 13, 2024

> My previous description was inaccurate, I create these sessions sequentially in a single thread ... all sessions use the same device with device_id=device_ids[0].

I can't repro on my side using CUDA_VISIBLE_DEVICES either.

@chilo-ms
Contributor

chilo-ms commented Feb 13, 2024

Follow-up questions:

  1. How do you check which device_id each session is using?
  2. Can you create one session with device_id > 0 via TRT EP provider options and run it successfully?
     Or can the session only run successfully with device_id = 0?

@jax11235
Author

> That's a bit strange; I can't repro on my side with device_id specified by provider option ... Could you enable the verbose log ... to see TRT EP's full log?

Thanks for your reply. I have switched to the multi-process version, and it runs fine on multiple CUDA devices.
Since you can't reproduce it, I currently think it may be a non-code issue; I will do more tests when I have time.

@jax11235
Author

> 1. How do you check which device_id each session is using?
> 2. Can you create one session with device_id > 0 via TRT EP provider options and run it successfully? Or can the session only run successfully with device_id = 0?

1. The CUDA EP works on the correct device with the same code; I just switch the EP to TensorRT.
2. No. If I set CUDA_VISIBLE_DEVICES, I can use a device with device_id > 0, but the device_id specified in session_options is still 0.

@jax11235
Author

I've decided to close this issue and will reopen it when there is new progress.

@chilo-ms
Contributor

chilo-ms commented Mar 14, 2025

@jax11235
We recently root-caused a multithreading issue that involves running on a GPU device > 0 and fixed it for TRT EP:
#24010
It seems to be the issue you encountered here.

Could you please try this fix? It should resolve your problem.

@jax11235
Author

Good job! I plan to test it soon.
