
Significant Performance Gap: Python version (10ms) vs Rust version (30ms) on same CUDA device #87

@georgetime1970

Search before asking

  • I have searched the Ultralytics issues and discussions and found no similar questions.

Question

Description

I observed a significant performance difference between the Python and Rust implementations of Ultralytics YOLO on the same machine with the same model and video input using CUDA.

  • Python: ~9-10ms per frame (Inference)
  • Rust: ~28-32ms (Inference step) / ~40ms (Total)

I would like to know the reason for this gap and whether my Rust code can be optimized.

Environment

Comparison Data

Python logs:

......
video 1/1 (frame 228/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.2ms
video 1/1 (frame 229/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.2ms
video 1/1 (frame 230/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 10.5ms
video 1/1 (frame 231/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 17.3ms
video 1/1 (frame 232/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.3ms
video 1/1 (frame 233/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 13.2ms
video 1/1 (frame 234/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.6ms
video 1/1 (frame 235/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.9ms
video 1/1 (frame 236/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 self_run, 13.9ms
video 1/1 (frame 237/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_run, 8.3ms
video 1/1 (frame 238/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_run, 1 self_stand, 9.2ms
video 1/1 (frame 239/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 self_run, 9.5ms
video 1/1 (frame 240/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_cry, 8.8ms
video 1/1 (frame 241/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 9.3ms
video 1/1 (frame 242/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_stand, 9.2ms
video 1/1 (frame 243/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_stand, 9.3ms
....
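
The per-frame times in the Python log above range from roughly 8 ms to 17 ms, so a single line is a weak basis for comparison. As a minimal sketch using only the standard library, the trailing millisecond value of each line can be parsed and summarized. The helper names `parse_ms` and `summarize` are made up for illustration and are not part of any Ultralytics API; the sample input is the first four log lines shown above.

```rust
/// Extract the trailing per-frame time (e.g. "9.2ms") from one log line.
fn parse_ms(line: &str) -> Option<f64> {
    // rsplit yields the last space-separated token first.
    let last = line.trim_end().rsplit(' ').next()?;
    last.strip_suffix("ms")?.parse().ok()
}

/// Return (mean, min, max) over all samples, or None if there are none.
fn summarize(samples: &[f64]) -> Option<(f64, f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let sum: f64 = samples.iter().sum();
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    Some((sum / samples.len() as f64, min, max))
}

fn main() {
    // First four lines copied from the Python log in this issue.
    let log = [
        r"video 1/1 (frame 228/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.2ms",
        r"video 1/1 (frame 229/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.2ms",
        r"video 1/1 (frame 230/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 10.5ms",
        r"video 1/1 (frame 231/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 17.3ms",
    ];
    // Lines without a trailing "<number>ms" token are skipped.
    let samples: Vec<f64> = log.iter().filter_map(|l| parse_ms(l)).collect();
    if let Some((mean, min, max)) = summarize(&samples) {
        println!("frames={} mean={mean:.2}ms min={min:.1}ms max={max:.1}ms", samples.len());
    }
}
```

Feeding the full log through the same helpers would give a steadier average than reading individual lines.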

Rust logs:

===========================================
Frame read done======1.4575ms
Resize done======1.8232ms
BGR conversion done======1.9506ms
Image conversion done======2.0977ms
Inference done======28.5842ms
Display done======38.1044ms
===========================================
Frame read done======1.6994ms
Resize done======2.1734ms
BGR conversion done======2.3352ms
Image conversion done======2.6267ms
Inference done======27.9236ms
Display done======39.3473ms
===========================================
Frame read done======2.0785ms
Resize done======2.5626ms
BGR conversion done======2.7047ms
Image conversion done======2.8463ms
Inference done======29.1376ms
Display done======39.8454ms
===========================================
Frame read done======1.6215ms
Resize done======2.1005ms
BGR conversion done======2.238ms
Image conversion done======2.4213ms
Inference done======28.4747ms
Display done======39.8682ms
===========================================

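Built and run with cargo run --release. Each value is the time elapsed since the start of that frame's loop iteration, so the stages are cumulative.
Most of the time is spent on inference.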
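
Because every log line prints the time elapsed since the start of the loop iteration, the values are cumulative rather than per-stage. The sketch below, using the first frame's numbers from the Rust log, subtracts consecutive values to get the time spent in each stage. The function name `stage_durations` and the English stage names are chosen here for illustration only.

```rust
/// Convert cumulative elapsed times into per-stage durations
/// by subtracting each value from the one before it.
fn stage_durations(cumulative_ms: &[f64]) -> Vec<f64> {
    let mut prev = 0.0;
    cumulative_ms
        .iter()
        .map(|&t| {
            let d = t - prev;
            prev = t;
            d
        })
        .collect()
}

fn main() {
    // Stage names paraphrase the log labels; the last stage covers
    // box drawing, the FPS overlay, imshow and wait_key.
    let stages = [
        "read frame",
        "resize",
        "BGR to RGB",
        "to DynamicImage",
        "predict_image",
        "draw + display",
    ];
    // Cumulative values (ms) of the first frame in the Rust log.
    let cumulative = [1.4575, 1.8232, 1.9506, 2.0977, 28.5842, 38.1044];

    for (name, ms) in stages.iter().zip(stage_durations(&cumulative)) {
        println!("{name:<16} {ms:>8.4} ms");
    }
    // Same FPS formula as the program: 1 / total time per frame.
    let total_ms = cumulative[cumulative.len() - 1];
    println!("approx FPS       {:>8.1}", 1000.0 / total_ms);
}
```

For this frame the subtraction gives about 26.5 ms inside the `predict_image` call and about 9.5 ms for drawing and display, while reading, resizing and the two conversions together take about 2.1 ms.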

Source Code

Python Script:

from ultralytics import YOLO
model = YOLO("yolo_boss.pt")

results = model(
    "boss.mp4",
    show=True,  # show the annotated video window
    conf=0.25,  # confidence threshold
    device=0,  # run on GPU 0
    half=True,  # use FP16 to speed up GPU inference
)

Rust Script (Cargo.toml):

[package]
name = "yolo_boss"
version = "0.1.0"
edition = "2024"

[dependencies]
image = "0.25.9"
opencv = "0.98.1"
ultralytics-inference = { git = "https://github.com/ultralytics/inference.git", features = ["cuda", "tensorrt"] }

Rust Script (main.rs):

use image::{DynamicImage, RgbImage};
use opencv::{core, highgui, imgproc, prelude::*, videoio};
use std::time::Instant;
use ultralytics_inference::{Device, InferenceConfig, YOLOModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize the model (GPU)
    let config = InferenceConfig::new().with_device(Device::Cuda(0));
    // .with_half(true);
    let mut model = YOLOModel::load_with_config("yolo_boss.onnx", config)?;
    // Assume the model input size is 640x640 (or adjust it to 640x384 to match boss.mp4)
    let model_size = core::Size::new(640, 384);

    // 2. Open the video file
    let mut cam = videoio::VideoCapture::from_file("wukong.mp4", videoio::CAP_FFMPEG)?;
    if !videoio::VideoCapture::is_opened(&cam)? {
        panic!("Failed to open the video file");
    }

    // Create the display window
    let window = "YOLO Real-time Detection";
    highgui::named_window(window, highgui::WINDOW_AUTOSIZE)?;

    println!("==========================Starting real-time detection, press 'q' to quit...");

    let mut frame = core::Mat::default();
    loop {
        let start = Instant::now(); // timer; all logged times below are measured from here

        // 3. Read one frame
        cam.read(&mut frame)?;
        if frame.empty() {
            break;
        }
        println!("Frame read done======{:?}", &start.elapsed());

        // Do the resize in OpenCV. It is C++-optimized, more than 10x faster than the Rust image crate
        let mut resized = core::Mat::default();
        imgproc::resize(
            &frame,
            &mut resized,
            model_size,
            0.0,
            0.0,
            imgproc::INTER_LINEAR,
        )?;

        println!("Resize done======{:?}", &start.elapsed());

        // 4. Convert the OpenCV Mat (BGR) to a DynamicImage (RGB) for the model
        // First convert BGR to RGB
        let mut rgb_frame = core::Mat::default();
        imgproc::cvt_color(
            &resized,
            &mut rgb_frame,
            imgproc::COLOR_BGR2RGB,
            0,
            core::AlgorithmHint::ALGO_HINT_DEFAULT,
        )?;

        println!("BGR conversion done======{:?}", &start.elapsed());

        // Convert to an image-crate object (this copy may add a small overhead)
        let width = rgb_frame.cols() as u32;
        let height = rgb_frame.rows() as u32;
        let data = rgb_frame.data_bytes()?;
        let rgb_img =
            RgbImage::from_raw(width, height, data.to_vec()).ok_or("Failed to create RgbImage")?;
        let dyn_img = DynamicImage::ImageRgb8(rgb_img);
        println!("Image conversion done======{:?}", &start.elapsed());

        // 5. Inference
        let results = model.predict_image(&dyn_img, "webcam_frame".to_string())?; // note: uses predict_image
        println!("Inference done======{:?}", &start.elapsed());

        // dbg!(&results);
        // 6. Process the results and draw the boxes
        // Note: results is usually a Vec; for single-image inference the first result is enough
        if let Some(result) = results.get(0) {
            if let Some(ref boxes) = result.boxes {
                // Store the whole coordinate matrix in a variable first
                let xyxy_matrix = boxes.xyxy();
                for i in 0..boxes.len() {
                    // Use .row(i) to get row i (x1, y1, x2, y2)
                    let b = xyxy_matrix.row(i);
                    let conf = boxes.conf()[i];
                    if conf < 0.3 {
                        continue;
                    } // confidence filter

                    let cls = boxes.cls()[i] as usize;
                    let name = result
                        .names
                        .get(&cls)
                        .map(|s| s.as_str())
                        .unwrap_or("unknown");
                    let label = format!("{}: {:.2}", name, conf);

                    // Draw the box with OpenCV (Mat uses BGR)
                    let p1 = core::Point::new(b[0] as i32, b[1] as i32);
                    let p2 = core::Point::new(b[2] as i32, b[3] as i32);
                    imgproc::rectangle(
                        &mut resized,
                        core::Rect::from_points(p1, p2),
                        core::Scalar::new(0.0, 255.0, 0.0, 0.0),
                        2,
                        8,
                        0,
                    )?;

                    // Draw the label text so the class and confidence are readable
                    imgproc::put_text(
                        &mut resized,
                        &label,
                        core::Point::new(b[0] as i32, (b[1] - 10.0) as i32), // placed above the box
                        imgproc::FONT_HERSHEY_SIMPLEX,
                        0.6,
                        core::Scalar::new(0.0, 255.0, 0.0, 0.0), // green text
                        2,
                        imgproc::LINE_8,
                        false,
                    )?;
                }
            }
        }

        // Compute and display FPS
        let fps = 1.0 / start.elapsed().as_secs_f32();
        imgproc::put_text(
            &mut resized,
            &format!("FPS: {:.1}", fps),
            core::Point::new(20, 40),
            imgproc::FONT_HERSHEY_SIMPLEX,
            1.0,
            core::Scalar::new(0.0, 0.0, 255.0, 0.0),
            2,
            imgproc::LINE_8,
            false,
        )?;

        // 7. Show the frame
        highgui::imshow(window, &resized)?;

        // Possible optimization (not implemented here): refresh the display only every 2 frames
        let key = highgui::wait_key(1)?; // i.e. run this call less often
        if key == 'q' as i32 {
            break;
        }
        println!("Display done======{:?}", &start.elapsed());
        println!("===========================================");
    }

    Ok(())
}
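
The program above logs cumulative times for a single frame at a time. A minimal sketch of an alternative, assuming that the first frames may include one-time initialization cost and should be excluded, is a small lap timer that records each stage separately and averages over many frames. `StageTimer` and its methods are hypothetical helpers written for this illustration with `std::time` only. The `sleep` calls stand in for the real preprocessing and `predict_image` calls.

```rust
use std::time::{Duration, Instant};

/// Accumulates per-stage wall-clock time across frames,
/// ignoring the first `warmup_frames` frames.
struct StageTimer {
    warmup_frames: usize,
    frames_seen: usize,
    last: Instant,
    totals: Vec<(&'static str, Duration)>,
}

impl StageTimer {
    fn new(warmup_frames: usize) -> Self {
        Self { warmup_frames, frames_seen: 0, last: Instant::now(), totals: Vec::new() }
    }

    /// Call at the top of each loop iteration.
    fn start_frame(&mut self) {
        self.frames_seen += 1;
        self.last = Instant::now();
    }

    /// Call right after a stage; records the time since the previous lap.
    fn lap(&mut self, stage: &'static str) {
        let now = Instant::now();
        let dt = now - self.last;
        self.last = now;
        self.record(stage, dt);
    }

    /// Add one measurement for `stage` unless the frame is a warm-up frame.
    fn record(&mut self, stage: &'static str, dt: Duration) {
        if self.frames_seen <= self.warmup_frames {
            return;
        }
        match self.totals.iter_mut().find(|(s, _)| *s == stage) {
            Some((_, total)) => *total += dt,
            None => self.totals.push((stage, dt)),
        }
    }

    fn measured_frames(&self) -> usize {
        self.frames_seen.saturating_sub(self.warmup_frames)
    }

    /// Mean time per measured frame for `stage`, in milliseconds.
    fn mean_ms(&self, stage: &str) -> Option<f64> {
        let n = self.measured_frames();
        if n == 0 {
            return None;
        }
        self.totals
            .iter()
            .find(|(s, _)| *s == stage)
            .map(|(_, t)| t.as_secs_f64() * 1000.0 / n as f64)
    }
}

fn main() {
    let mut timer = StageTimer::new(1); // skip the first frame
    for _ in 0..3 {
        timer.start_frame();
        std::thread::sleep(Duration::from_millis(2)); // stand-in for resize + conversion
        timer.lap("preprocess");
        std::thread::sleep(Duration::from_millis(5)); // stand-in for model.predict_image(...)
        timer.lap("inference");
    }
    for stage in ["preprocess", "inference"] {
        if let Some(ms) = timer.mean_ms(stage) {
            println!("{stage:<12} mean {ms:.2} ms over {} frames", timer.measured_frames());
        }
    }
}
```

In the real loop, `start_frame` would replace `let start = Instant::now();` and a `lap` call would follow each stage, so that the printed inference figure covers only the `predict_image` call.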

Questions

  • Why is it so slow?
  • Is there something wrong with my code?
  • I have run out of ideas for optimizing my Rust code any further 😂😂

Additional

No response

Labels: question (Further information is requested)
