Open
Labels
question (Further information is requested)
Description
Search before asking
- I have searched the Ultralytics issues and discussions and found no similar questions.
Question
Description
I observed a significant performance difference between the Python and Rust implementations of Ultralytics YOLO on the same machine with the same model and video input using CUDA.
- Python: ~9-10ms per frame (Inference)
- Rust: ~28-32ms (Inference step) / ~40ms (Total)
I would like to know the reason for this gap and whether my Rust code can be optimized.
Environment
- GPU: RTX 4060
- OS: Windows
- Model: YOLO26l
- Library: ultralytics (Python) vs ultralytics-inference = { git = "https://github.com/ultralytics/inference.git", features = ["cuda", "tensorrt"] } (Rust)
Comparison Data
Python logs:
......
video 1/1 (frame 228/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.2ms
video 1/1 (frame 229/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.2ms
video 1/1 (frame 230/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 10.5ms
video 1/1 (frame 231/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 17.3ms
video 1/1 (frame 232/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.3ms
video 1/1 (frame 233/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 13.2ms
video 1/1 (frame 234/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 8.6ms
video 1/1 (frame 235/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_feet_attack, 1 self_run, 9.9ms
video 1/1 (frame 236/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 self_run, 13.9ms
video 1/1 (frame 237/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_run, 8.3ms
video 1/1 (frame 238/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_run, 1 self_stand, 9.2ms
video 1/1 (frame 239/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 self_run, 9.5ms
video 1/1 (frame 240/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_cry, 8.8ms
video 1/1 (frame 241/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 9.3ms
video 1/1 (frame 242/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_stand, 9.2ms
video 1/1 (frame 243/12299) C:\Users\george\Documents\gitbub\yolo26\boss.mp4: 384x640 1 boss_boom_attack, 1 self_stand, 9.3ms
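....
Rust logs (each value is start.elapsed(), i.e. cumulative time since the start of the frame; labels in order: frame read, resize done, BGR-to-RGB done, image conversion done, inference done, display done):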
===========================================
读取一帧完成======1.4575ms
读Resize完成======1.8232ms
转BGR完成======1.9506ms
转image完成======2.0977ms
推理完成======28.5842ms
画面显示完成======38.1044ms
===========================================
读取一帧完成======1.6994ms
读Resize完成======2.1734ms
转BGR完成======2.3352ms
转image完成======2.6267ms
推理完成======27.9236ms
画面显示完成======39.3473ms
===========================================
读取一帧完成======2.0785ms
读Resize完成======2.5626ms
转BGR完成======2.7047ms
转image完成======2.8463ms
推理完成======29.1376ms
画面显示完成======39.8454ms
===========================================
读取一帧完成======1.6215ms
读Resize完成======2.1005ms
转BGR完成======2.238ms
转image完成======2.4213ms
推理完成======28.4747ms
画面显示完成======39.8682ms
===========================================
I run it with cargo run --release.
Most of the time is spent on inference.
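Every `println!` in the Rust loop prints `start.elapsed()`, so each logged value includes all the stages before it. The following is a minimal sketch of subtracting consecutive values to get per-stage times. It uses the numbers from the first Rust log block; `stage_durations` and the English stage names are illustrative and not part of any library.

```rust
/// Convert cumulative timestamps (ms since the start of a frame) into
/// per-stage durations by subtracting the previous timestamp from each one.
fn stage_durations(cumulative_ms: &[f64]) -> Vec<f64> {
    let mut prev = 0.0;
    cumulative_ms
        .iter()
        .map(|&t| {
            let delta = t - prev;
            prev = t;
            delta
        })
        .collect()
}

fn main() {
    // Values copied from the first Rust log block; stage names are English labels
    // chosen for this sketch.
    let stages = [
        "read frame",
        "resize",
        "BGR->RGB",
        "to image",
        "inference",
        "draw + display",
    ];
    let cumulative = [1.4575, 1.8232, 1.9506, 2.0977, 28.5842, 38.1044];

    for (name, ms) in stages.iter().zip(stage_durations(&cumulative)) {
        println!("{name:<16}{ms:>8.3} ms");
    }
}
```

For that block, inference alone comes out at about 26.5 ms, and drawing plus display (including `wait_key(1)`) at about 9.5 ms, so the ~40 ms total is not all model time.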
Source Code
Python Script:
from ultralytics import YOLO

model = YOLO("yolo_boss.pt")
results = model(
    "boss.mp4",
    show=True,  # whether to display the video
    conf=0.25,  # confidence threshold
    device=0,  # use the GPU
    half=True,  # FP16 to speed up GPU inference
)
Rust Script:
[package]
name = "yolo_boss"
version = "0.1.0"
edition = "2024"
[dependencies]
image = "0.25.9"
opencv = "0.98.1"
ultralytics-inference = { git = "https://github.com/ultralytics/inference.git", features = ["cuda", "tensorrt"] }

use image::{DynamicImage, RgbImage};
use opencv::{core, highgui, imgproc, prelude::*, videoio};
use std::time::Instant;
use ultralytics_inference::{Device, InferenceConfig, YOLOModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Initialize the model (GPU)
    let config = InferenceConfig::new().with_device(Device::Cuda(0));
    // .with_half(true);
    let mut model = YOLOModel::load_with_config("yolo_boss.onnx", config)?;
    // Assume the model input size is 640x640 (or adjust to 640x384 to match boss.mp4)
    let model_size = core::Size::new(640, 384);
    // 2. Open the video file
    let mut cam = videoio::VideoCapture::from_file("wukong.mp4", videoio::CAP_FFMPEG)?;
    if !videoio::VideoCapture::is_opened(&cam)? {
        panic!("Failed to open the file");
    }
    // Create the display window
    let window = "YOLO Real-time Detection";
    highgui::named_window(window, highgui::WINDOW_AUTOSIZE)?;
    println!("==========================Starting real-time detection, press 'q' to quit...");
    let mut frame = core::Mat::default();
    loop {
        let start = Instant::now(); // timer
        // 3. Read one frame
        cam.read(&mut frame)?;
        if frame.empty() {
            break;
        }
        println!("读取一帧完成======{:?}", &start.elapsed()); // "frame read"
        // Do the resize in OpenCV. It is optimized C++, more than 10x faster than the Rust image crate
        let mut resized = core::Mat::default();
        imgproc::resize(
            &frame,
            &mut resized,
            model_size,
            0.0,
            0.0,
            imgproc::INTER_LINEAR,
        )?;
        println!("读Resize完成======{:?}", &start.elapsed()); // "resize done"
        // 4. Convert the OpenCV Mat (BGR) to a DynamicImage (RGB) for the model
        // First convert BGR to RGB
        let mut rgb_frame = core::Mat::default();
        imgproc::cvt_color(
            &resized,
            &mut rgb_frame,
            imgproc::COLOR_BGR2RGB,
            0,
            core::AlgorithmHint::ALGO_HINT_DEFAULT,
        )?;
        println!("转BGR完成======{:?}", &start.elapsed()); // "BGR-to-RGB done"
        // Convert to an image-crate object (this part may add a small performance overhead)
        let width = rgb_frame.cols() as u32;
        let height = rgb_frame.rows() as u32;
        let data = rgb_frame.data_bytes()?;
        let rgb_img =
            RgbImage::from_raw(width, height, data.to_vec()).ok_or("Failed to create RgbImage")?;
        let dyn_img = DynamicImage::ImageRgb8(rgb_img);
        println!("转image完成======{:?}", &start.elapsed()); // "image conversion done"
        // 5. Inference
        let results = model.predict_image(&dyn_img, "webcam_frame".to_string())?; // note: use predict_image
        println!("推理完成======{:?}", &start.elapsed()); // "inference done"
        // dbg!(&results);
        // 6. Process the results and draw boxes
        // Note: results is usually a Vec; for single-image inference, take the first result
        if let Some(result) = results.get(0) {
            if let Some(ref boxes) = result.boxes {
                // Store the whole coordinate matrix in a variable first
                let xyxy_matrix = boxes.xyxy();
                for i in 0..boxes.len() {
                    // Use .row(i) to get row i (x1, y1, x2, y2)
                    let b = xyxy_matrix.row(i);
                    let conf = boxes.conf()[i];
                    if conf < 0.3 {
                        continue;
                    } // confidence filter
                    let cls = boxes.cls()[i] as usize;
                    let name = result
                        .names
                        .get(&cls)
                        .map(|s| s.as_str())
                        .unwrap_or("unknown");
                    let label = format!("{}: {:.2}", name, conf);
                    // Draw the box with OpenCV (Mat uses BGR)
                    let p1 = core::Point::new(b[0] as i32, b[1] as i32);
                    let p2 = core::Point::new(b[2] as i32, b[3] as i32);
                    imgproc::rectangle(
                        &mut resized,
                        core::Rect::from_points(p1, p2),
                        core::Scalar::new(0.0, 255.0, 0.0, 0.0),
                        2,
                        8,
                        0,
                    )?;
                    // Draw the label text
                    imgproc::put_text(
                        &mut resized,
                        &label,
                        core::Point::new(b[0] as i32, (b[1] - 10.0) as i32), // placed above the box
                        imgproc::FONT_HERSHEY_SIMPLEX,
                        0.6,
                        core::Scalar::new(0.0, 255.0, 0.0, 0.0), // green text
                        2,
                        imgproc::LINE_8,
                        false,
                    )?;
                }
            }
        }
        // Compute and display FPS
        let fps = 1.0 / start.elapsed().as_secs_f32();
        imgproc::put_text(
            &mut resized,
            &format!("FPS: {:.1}", fps),
            core::Point::new(20, 40),
            imgproc::FONT_HERSHEY_SIMPLEX,
            1.0,
            core::Scalar::new(0.0, 0.0, 255.0, 0.0),
            2,
            imgproc::LINE_8,
            false,
        )?;
        // 7. Show the frame
        highgui::imshow(window, &resized)?;
        // ... inference logic ...
        // Idea: refresh the display only once every 2 frames
        let key = highgui::wait_key(1)?; // i.e. run this line less often
        if key == 'q' as i32 {
            break;
        }
        println!("画面显示完成======{:?}", &start.elapsed()); // "display done"
        println!("===========================================");
    }
    Ok(())
}
Questions
- Why is it so slow?
- Is there something wrong with my code?
- I really can't optimize my Rust code any further 😂😂
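To narrow down where the time goes, it may help to time each stage separately and to leave the first frames out of the average, in case they include one-time GPU initialization. The following is a minimal sketch using only the standard library. `StageTimer`, the stage names, and `WARMUP_FRAMES` are hypothetical, and the `sleep` call is a stand-in for the real `predict_image` call.

```rust
use std::time::{Duration, Instant};

/// Records the time spent in each stage separately (not cumulatively).
struct StageTimer {
    last: Instant,
    laps: Vec<(&'static str, Duration)>,
}

impl StageTimer {
    fn start() -> Self {
        Self { last: Instant::now(), laps: Vec::new() }
    }

    /// Close the current stage under `name` and start timing the next one.
    fn lap(&mut self, name: &'static str) {
        let now = Instant::now();
        self.laps.push((name, now - self.last));
        self.last = now;
    }
}

fn main() {
    // Assumption: the first few frames are treated as warm-up and not averaged.
    const WARMUP_FRAMES: usize = 10;
    let mut total_inference = Duration::ZERO;
    let mut measured = 0u32;

    for frame_idx in 0..30 {
        let mut timer = StageTimer::start();
        // Placeholder for read + resize + color conversion.
        timer.lap("preprocess");
        // Placeholder for the model call; sleep stands in for real work.
        std::thread::sleep(Duration::from_millis(1));
        timer.lap("inference");
        // Placeholder for drawing, imshow and wait_key.
        timer.lap("display");

        if frame_idx >= WARMUP_FRAMES {
            total_inference += timer.laps[1].1; // index 1 is the "inference" lap
            measured += 1;
        }
    }

    println!(
        "mean inference: {:?} over {} frames",
        total_inference / measured,
        measured
    );
}
```

Comparing the mean of the "inference" lap against the Python per-frame figure keeps preprocessing and display out of the comparison on the Rust side.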
Additional
No response