Significant Performance Gap: Python version (10ms) vs Rust version (30ms) on same CUDA device

### Search before asking

- [x] I have searched the Ultralytics [issues](https://github.com/ultralytics/PROJECT_NAME/issues) and [discussions](https://github.com/orgs/ultralytics/discussions) and found no similar questions.


### Question

## Description

I observed a significant performance difference between the Python and Rust implementations of Ultralytics YOLO on the same machine with the same model and video input using CUDA.

- Python: ~9-10ms per frame (Inference)
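- Rust: ~28-32ms elapsed per frame by the end of the inference step (the timings are cumulative; the inference call itself takes about 25-26ms in the logs below) / ~40ms (Total)

I would like to know the reason for this gap and whether my Rust code can be optimized.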

## Environment

- GPU: RTX 4060
- OS: Windows
- Model: YOLO26l
- Library: `ultralytics` (Python) vs `ultralytics-inference = { git = "https://github.com/ultralytics/inference.git", features = ["cuda", "tensorrt"] }` (Rust)

## Comparison Data

Python logs:

```bash
......
video 1/1 (frame 228/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.2ms
video 1/1 (frame 229/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.2ms
video 1/1 (frame 230/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 10.5ms
video 1/1 (frame 231/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 17.3ms
video 1/1 (frame 232/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.3ms
video 1/1 (frame 233/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 13.2ms
video 1/1 (frame 234/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.6ms
video 1/1 (frame 235/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.9ms
video 1/1 (frame 236/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 self_run, 13.9ms
video 1/1 (frame 237/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_run, 8.3ms
video 1/1 (frame 238/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_run, 1 self_stand, 9.2ms
video 1/1 (frame 239/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 self_run, 9.5ms
video 1/1 (frame 240/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_cry, 8.8ms
video 1/1 (frame 241/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 9.3ms
video 1/1 (frame 242/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_stand, 9.2ms
video 1/1 (frame 243/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_stand, 9.3ms
....
```

Rust logs:

```bash
===========================================
Frame read done======1.4575ms
Resize done======1.8232ms
BGR-to-RGB done======1.9506ms
Image conversion done======2.0977ms
Inference done======28.5842ms
Display done======38.1044ms
===========================================
Frame read done======1.6994ms
Resize done======2.1734ms
BGR-to-RGB done======2.3352ms
Image conversion done======2.6267ms
Inference done======27.9236ms
Display done======39.3473ms
===========================================
Frame read done======2.0785ms
Resize done======2.5626ms
BGR-to-RGB done======2.7047ms
Image conversion done======2.8463ms
Inference done======29.1376ms
Display done======39.8454ms
===========================================
Frame read done======1.6215ms
Resize done======2.1005ms
BGR-to-RGB done======2.238ms
Image conversion done======2.4213ms
Inference done======28.4747ms
Display done======39.8682ms
===========================================
```

> Built and run with `cargo run --release`. The times above are cumulative from the start of each frame.
> Most of the time is spent in inference.
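Because every line in the Rust logs is printed from the same `start.elapsed()` timer, the values are running totals rather than per-stage costs. A minimal sketch of converting one frame's readings into per-stage durations follows; the helper name `stage_deltas` and the stage labels are hypothetical and only serve this illustration.

```rust
/// Convert cumulative timestamps (ms since the start of the frame)
/// into per-stage durations by subtracting the previous reading.
fn stage_deltas(cumulative_ms: &[f64]) -> Vec<f64> {
    let mut prev = 0.0;
    cumulative_ms
        .iter()
        .map(|&t| {
            let delta = t - prev;
            prev = t;
            delta
        })
        .collect()
}

fn main() {
    // Readings copied from the first frame in the Rust logs above.
    let stages = ["read frame", "resize", "BGR to RGB", "to image", "inference", "display"];
    let cumulative = [1.4575, 1.8232, 1.9506, 2.0977, 28.5842, 38.1044];

    for (name, ms) in stages.iter().zip(stage_deltas(&cumulative)) {
        println!("{name:<12} {ms:>8.4} ms");
    }
}
```

For that frame, the inference call accounts for roughly 26.5ms and everything after it (result handling, drawing, `imshow`, `wait_key`) for roughly 9.5ms, so the figure comparable to the Python log is closer to 26ms than to 28ms.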

## Source Code

Python Script:

```Python
from ultralytics import YOLO
model = YOLO("yolo_boss.pt")

results = model(
    "boss.mp4",
    show=True,  # display the annotated frames
    conf=0.25,  # confidence threshold
    device=0,  # use the GPU
    half=True,  # FP16 to speed up GPU inference
)
```

Rust Script:

```toml
[package]
name = "yolo_boss"
version = "0.1.0"
edition = "2024"

[dependencies]
image = "0.25.9"
opencv = "0.98.1"
ultralytics-inference = { git = "https://github.com/ultralytics/inference.git", features = ["cuda", "tensorrt"] }
```

```rust
use image::{DynamicImage, RgbImage};
use opencv::{core, highgui, imgproc, prelude::*, videoio};
use std::time::Instant;
use ultralytics_inference::{Device, InferenceConfig, YOLOModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize the model (GPU)
    let config = InferenceConfig::new().with_device(Device::Cuda(0));
    // .with_half(true);
    let mut model = YOLOModel::load_with_config("yolo_boss.onnx", config)?;
    // Assume the model input size is 640x640 (or adjust it to 640x384 to match boss.mp4)
    let model_size = core::Size::new(640, 384);

    // 2. Open the video file
    let mut cam = videoio::VideoCapture::from_file("wukong.mp4", videoio::CAP_FFMPEG)?;
    if !videoio::VideoCapture::is_opened(&cam)? {
        panic!("Failed to open the video file");
    }

    // Create the display window
    let window = "YOLO Real-time Detection";
    highgui::named_window(window, highgui::WINDOW_AUTOSIZE)?;

    println!("==========================Starting real-time detection, press 'q' to quit...");

    let mut frame = core::Mat::default();
    loop {
        // Timer: every println below reports the cumulative time since this point
        let start = Instant::now();

        // 3. Read one frame
        cam.read(&mut frame)?;
        if frame.empty() {
            break;
        }
        println!("Frame read done======{:?}", &start.elapsed());

        // 4. Resize in OpenCV. This is optimized C++, more than 10x faster than the Rust image crate
        let mut resized = core::Mat::default();
        imgproc::resize(
            &frame,
            &mut resized,
            model_size,
            0.0,
            0.0,
            imgproc::INTER_LINEAR,
        )?;

        println!("Resize done======{:?}", &start.elapsed());

        // 5. Convert the OpenCV Mat (BGR) to a DynamicImage (RGB) for the model
        // First convert BGR to RGB
        let mut rgb_frame = core::Mat::default();
        imgproc::cvt_color(
            &resized,
            &mut rgb_frame,
            imgproc::COLOR_BGR2RGB,
            0,
            core::AlgorithmHint::ALGO_HINT_DEFAULT,
        )?;

        println!("BGR-to-RGB done======{:?}", &start.elapsed());

        // Convert to an image-crate object (this may add a small performance overhead)
        let width = rgb_frame.cols() as u32;
        let height = rgb_frame.rows() as u32;
        let data = rgb_frame.data_bytes()?;
        let rgb_img =
            RgbImage::from_raw(width, height, data.to_vec()).ok_or("Failed to create RgbImage")?;
        let dyn_img = DynamicImage::ImageRgb8(rgb_img);
        println!("Image conversion done======{:?}", &start.elapsed());

        // 6. Inference
        let results = model.predict_image(&dyn_img, "webcam_frame".to_string())?; // note: uses predict_image
        println!("Inference done======{:?}", &start.elapsed());

        // dbg!(&results);
        // 7. Process the results and draw boxes
        // Note: results is usually a Vec; for single-image inference the first result is enough
        if let Some(result) = results.get(0) {
            if let Some(ref boxes) = result.boxes {
                // Store the whole coordinate matrix in a variable first
                let xyxy_matrix = boxes.xyxy();
                for i in 0..boxes.len() {
                    // Use .row(i) to get row i (x1, y1, x2, y2)
                    let b = xyxy_matrix.row(i);
                    let conf = boxes.conf()[i];
                    if conf < 0.3 {
                        continue;
                    } // confidence filter

                    let cls = boxes.cls()[i] as usize;
                    let name = result
                        .names
                        .get(&cls)
                        .map(|s| s.as_str())
                        .unwrap_or("unknown");
                    let label = format!("{}: {:.2}", name, conf);

                    // Draw the box with OpenCV (Mat uses BGR)
                    let p1 = core::Point::new(b[0] as i32, b[1] as i32);
                    let p2 = core::Point::new(b[2] as i32, b[3] as i32);
                    imgproc::rectangle(
                        &mut resized,
                        core::Rect::from_points(p1, p2),
                        core::Scalar::new(0.0, 255.0, 0.0, 0.0),
                        2,
                        8,
                        0,
                    )?;

                    // Draw the label text (class name and confidence)
                    imgproc::put_text(
                        &mut resized,
                        &label,
                        core::Point::new(b[0] as i32, (b[1] - 10.0) as i32), // place above the box
                        imgproc::FONT_HERSHEY_SIMPLEX,
                        0.6,
                        core::Scalar::new(0.0, 255.0, 0.0, 0.0), // green text
                        2,
                        imgproc::LINE_8,
                        false,
                    )?;
                }
            }
        }

        // Compute and display FPS
        let fps = 1.0 / start.elapsed().as_secs_f32();
        imgproc::put_text(
            &mut resized,
            &format!("FPS: {:.1}", fps),
            core::Point::new(20, 40),
            imgproc::FONT_HERSHEY_SIMPLEX,
            1.0,
            core::Scalar::new(0.0, 0.0, 255.0, 0.0),
            2,
            imgproc::LINE_8,
            false,
        )?;

        // 8. Display the frame
        highgui::imshow(window, &resized)?;

        // Poll for a key press; wait_key also lets HighGUI process window events.
        // Idea (not implemented): refresh the display only every 2 frames to call this less often
        let key = highgui::wait_key(1)?;
        if key == 'q' as i32 {
            break;
        }
        println!("Display done======{:?}", &start.elapsed());
        println!("===========================================");
    }

    Ok(())
}
```
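The script above prints running totals from a single timer, which makes it easy to misread the cost of an individual stage. A small lap timer that restarts after each stage is one way to log each stage on its own. This is only a sketch using the standard library; `StageTimer` is a hypothetical helper and the sleeps stand in for the real read, inference and display steps.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: records how long each stage took since the previous lap.
struct StageTimer {
    last: Instant,
    laps: Vec<(&'static str, Duration)>,
}

impl StageTimer {
    fn new() -> Self {
        Self { last: Instant::now(), laps: Vec::new() }
    }

    /// Store the time elapsed since the previous lap under `name`, then restart the clock.
    fn lap(&mut self, name: &'static str) -> Duration {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last);
        self.last = now;
        self.laps.push((name, elapsed));
        elapsed
    }

    /// Sum of all recorded laps.
    fn total(&self) -> Duration {
        self.laps.iter().map(|(_, d)| *d).sum()
    }
}

fn main() {
    let mut timer = StageTimer::new();

    // Placeholders for the real stages; replace the sleeps with the actual calls.
    std::thread::sleep(Duration::from_millis(2));
    timer.lap("read");
    std::thread::sleep(Duration::from_millis(5));
    timer.lap("inference");
    std::thread::sleep(Duration::from_millis(3));
    timer.lap("display");

    for (name, d) in &timer.laps {
        println!("{name:<10} {d:?}");
    }
    println!("total      {:?}", timer.total());
}
```

Calling `lap` immediately before and after `predict_image` would isolate that call from the frame conversion and display work around it.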

## Questions

- Why is the Rust version so slow?
- Is there something wrong with my code?
- I have run out of ideas for optimizing my Rust code any further 😂😂


### Additional

_No response_
