Real-Time Object Detection Meets DINOv3


DEIMv2 is an evolution of the DEIM framework that leverages the rich features of DINOv3. It is designed in a range of model sizes, from ultra-light variants up to S, M, L, and X, so it adapts to a wide range of scenarios. Across these variants, DEIMv2 achieves state-of-the-art performance, with the S model notably surpassing 50 AP on the challenging COCO benchmark.


Shihua Huang¹*,   Yongjie Hou¹,²*,   Longfei Liu¹*,   Xuanlong Yu¹,   Xi Shen¹†

1. Intellindust AI Lab    2. Xiamen University  
* Equal Contribution    † Corresponding Author

If you like our work, please give us a ⭐!



1. Model Zoo

| Model | Dataset | AP | #Params | GFLOPs | Latency (ms) | Config | Checkpoint | Log |
|-------|---------|------|---------|--------|--------------|--------|----------------|----------------|
| Atto  | COCO | 23.8 | 0.5M  | 0.8   | 1.10  | yml | Google / Quark | Google / Quark |
| Femto | COCO | 31.0 | 1.0M  | 1.7   | 1.45  | yml | Google / Quark | Google / Quark |
| Pico  | COCO | 38.5 | 1.5M  | 5.2   | 2.13  | yml | Google / Quark | Google / Quark |
| N     | COCO | 43.0 | 3.6M  | 6.8   | 2.32  | yml | Google / Quark | Google / Quark |
| S     | COCO | 50.9 | 9.7M  | 25.6  | 5.78  | yml | Google / Quark | Google / Quark |
| M     | COCO | 53.0 | 18.1M | 52.2  | 8.80  | yml | Google / Quark | Google / Quark |
| L     | COCO | 56.0 | 32.2M | 96.7  | 10.47 | yml | Google / Quark | Google / Quark |
| X     | COCO | 57.8 | 50.3M | 151.6 | 13.75 | yml | Google / Quark | Google / Quark |

2. Quick start

Setup

conda create -n deimv2 python=3.11 -y
conda activate deimv2
pip install -r requirements.txt

Data Preparation

COCO2017 Dataset
  1. Download COCO2017 from OpenDataLab or COCO.

  2. Modify paths in coco_detection.yml

    train_dataloader:
        img_folder: /data/COCO2017/train2017/
        ann_file: /data/COCO2017/annotations/instances_train2017.json
    val_dataloader:
        img_folder: /data/COCO2017/val2017/
        ann_file: /data/COCO2017/annotations/instances_val2017.json
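
    Before launching a run, you can confirm these paths resolve; a trivial sketch using the example paths above:

    import os

    # Check every path referenced by coco_detection.yml before training starts.
    paths = [
        '/data/COCO2017/train2017/',
        '/data/COCO2017/annotations/instances_train2017.json',
        '/data/COCO2017/val2017/',
        '/data/COCO2017/annotations/instances_val2017.json',
    ]
    for p in paths:
        print(p, '->', 'found' if os.path.exists(p) else 'MISSING')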
Custom Dataset

To train on your custom dataset, you need to organize it in the COCO format. Follow the steps below to prepare your dataset; a quick validation sketch follows the list.

  1. Set remap_mscoco_category to False:

    This prevents the automatic remapping of category IDs to match the MSCOCO categories.

    remap_mscoco_category: False
  2. Organize Images:

    Structure your dataset directories as follows:

    dataset/
    ├── images/
    │   ├── train/
    │   │   ├── image1.jpg
    │   │   ├── image2.jpg
    │   │   └── ...
    │   ├── val/
    │   │   ├── image1.jpg
    │   │   ├── image2.jpg
    │   │   └── ...
    └── annotations/
        ├── instances_train.json
        ├── instances_val.json
        └── ...
    • images/train/: Contains all training images.
    • images/val/: Contains all validation images.
    • annotations/: Contains COCO-formatted annotation files.
  3. Convert Annotations to COCO Format:

    If your annotations are not already in COCO format, you'll need to convert them. You can use the following Python skeleton as a reference, or use existing tools:

    import json

    def convert_to_coco(input_annotations, output_annotations):
        # Minimal skeleton: build the three required top-level COCO fields
        # from your source format, then write them out.
        coco = {"images": [], "annotations": [], "categories": []}
        # TODO: populate coco["images"], coco["annotations"] (bbox as
        # [x, y, width, height]) and coco["categories"] from input_annotations.
        with open(output_annotations, "w") as f:
            json.dump(coco, f)

    if __name__ == "__main__":
        convert_to_coco('path/to/your_annotations.json', 'dataset/annotations/instances_train.json')
  4. Update Configuration Files:

    Modify your custom_detection.yml.

    task: detection
    
    evaluator:
      type: CocoEvaluator
      iou_types: ['bbox', ]
    
    num_classes: 777 # your dataset classes
    remap_mscoco_category: False
    
    train_dataloader:
      type: DataLoader
      dataset:
        type: CocoDetection
        img_folder: /data/yourdataset/train
        ann_file: /data/yourdataset/train/train.json
        return_masks: False
        transforms:
          type: Compose
          ops: ~
      shuffle: True
      num_workers: 4
      drop_last: True
      collate_fn:
        type: BatchImageCollateFunction
    
    val_dataloader:
      type: DataLoader
      dataset:
        type: CocoDetection
        img_folder: /data/yourdataset/val
        ann_file: /data/yourdataset/val/ann.json
        return_masks: False
        transforms:
          type: Compose
          ops: ~
      shuffle: False
      num_workers: 4
      drop_last: False
      collate_fn:
        type: BatchImageCollateFunction
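
Before launching training, it can help to sanity-check the converted annotations. A minimal sketch using pycocotools (the paths match the example config above; adjust as needed):

from pycocotools.coco import COCO

# Load the converted training annotations and report basic statistics.
coco = COCO('/data/yourdataset/train/train.json')
print(f"{len(coco.getImgIds())} images, {len(coco.getAnnIds())} annotations")
print("categories:", [c['name'] for c in coco.loadCats(coco.getCatIds())])

The category count printed here should match num_classes in custom_detection.yml.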

Backbone Checkpoints

For DINOv3 S and S+, download the checkpoints following the guide at https://github.com/facebookresearch/dinov3

For our distilled ViT-Tiny and ViT-Tiny+, you can download them from ViT-Tiny and ViT-Tiny+.

Then place them into ./ckpts as:

ckpts/
├── dinov3_vits16.pth
├── vitt_distill.pt
├── vittplus_distill.pt
└── ...
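
To confirm a downloaded checkpoint deserializes before training, a quick hedged check (key names depend on how each checkpoint was exported):

import torch

# Load on CPU purely to verify the file is readable; on recent PyTorch you may
# need torch.load(..., weights_only=False) if the file contains non-tensor objects.
state = torch.load('ckpts/dinov3_vits16.pth', map_location='cpu')
if isinstance(state, dict):
    print(f"{len(state)} top-level entries; sample keys: {list(state)[:5]}")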

3. Usage

COCO2017
  1. Training
# for ViT-based variants
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7777 --nproc_per_node=4 train.py -c configs/deimv2/deimv2_dinov3_${model}_coco.yml --use-amp --seed=0

# for HGNetv2-based variants
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7777 --nproc_per_node=4 train.py -c configs/deimv2/deimv2_hgnetv2_${model}_coco.yml --use-amp --seed=0
  2. Testing
# for ViT-based variants
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7777 --nproc_per_node=4 train.py -c configs/deimv2/deimv2_dinov3_${model}_coco.yml --test-only -r model.pth

# for HGNetv2-based variants
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7777 --nproc_per_node=4 train.py -c configs/deimv2/deimv2_hgnetv2_${model}_coco.yml --test-only -r model.pth
  3. Tuning
# for ViT-based variants
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7777 --nproc_per_node=4 train.py -c configs/deimv2/deimv2_dinov3_${model}_coco.yml --use-amp --seed=0 -t model.pth

# for HGNetv2-based variants
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7777 --nproc_per_node=4 train.py -c configs/deimv2/deimv2_hgnetv2_${model}_coco.yml --use-amp --seed=0 -t model.pth
Customizing Batch Size

For example, to train DEIMv2-S on COCO2017 with the total batch size doubled to 64, follow these steps:

  1. Modify your deimv2_dinov3_s_coco.yml to increase the total_batch_size:

    train_dataloader:
      total_batch_size: 64 
      dataset: 
        transforms:
          ops:
            ...
    
      collate_fn:
        ...
  2. In the same deimv2_dinov3_s_coco.yml, adjust the key optimizer, EMA, and warmup parameters as follows (a sketch that reproduces these numbers appears after this list):

    optimizer:
      type: AdamW
    
      params: 
        -
          # except norm/bn/bias in self.dinov3
          params: '^(?=.*.dinov3)(?!.*(?:norm|bn|bias)).*$'  
          lr: 0.00005  # doubled, linear scaling law
        -
          # including all norm/bn/bias in self.dinov3
          params: '^(?=.*.dinov3)(?=.*(?:norm|bn|bias)).*$'    
          lr: 0.00005   # doubled, linear scaling law
          weight_decay: 0.
        - 
          # including all norm/bn/bias except for the self.dinov3
          params: '^(?=.*(?:sta|encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
          weight_decay: 0.
    
      lr: 0.0005   # linear scaling law if needed
      betas: [0.9, 0.999]
      weight_decay: 0.0001
    
    ema:  # added EMA settings
      decay: 0.9998  # adjusted by 1 - (1 - decay) * 2
      warmups: 500  # halved
    
    lr_warmup_scheduler:
      warmup_duration: 250  # halved
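
All of the adjustments above follow from the batch-size factor. A minimal sketch (not part of the repo) that reproduces these numbers, assuming base values inferred from the comments (backbone lr 0.000025, lr 0.00025, EMA decay 0.9999, 1000 EMA warmup steps, 500 warmup iterations at batch size 32):

def scale_for_batch(base, factor=2):
    """Rescale schedule knobs when the total batch size grows by `factor` (linear scaling law)."""
    return {
        "backbone_lr": base["backbone_lr"] * factor,           # doubled
        "lr": base["lr"] * factor,                             # doubled
        "ema_decay": 1 - (1 - base["ema_decay"]) * factor,     # 1 - (1 - decay) * factor
        "ema_warmups": base["ema_warmups"] // factor,          # halved
        "warmup_duration": base["warmup_duration"] // factor,  # halved
    }

# Doubling the total batch size (32 -> 64) reproduces the values above.
print(scale_for_batch({"backbone_lr": 0.000025, "lr": 0.00025, "ema_decay": 0.9999,
                       "ema_warmups": 1000, "warmup_duration": 500}))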
Customizing Input Size

If you'd like to train DEIMv2-S on COCO2017 with an input size of 320x320, follow these steps:

  1. Modify your deimv2_dinov3_s_coco.yml:

    eval_spatial_size: [320, 320]
    
    train_dataloader:
      # Here we set the total_batch_size to 64 as an example.
      total_batch_size: 64 
      dataset: 
        transforms:
          ops:
            # For Mosaic augmentation, it is recommended that output_size = input_size / 2.
            - {type: Mosaic, output_size: 160, rotation_range: 10, translation_range: [0.1, 0.1], scaling_range: [0.5, 1.5],
               probability: 1.0, fill_value: 0, use_cache: True, max_cached_images: 50, random_pop: True}
            ...
            - {type: Resize, size: [320, 320], }
            ...
        collate_fn:
          base_size: 320
          ...
    
    val_dataloader:
      dataset:
        transforms:
          ops:
            - {type: Resize, size: [320, 320], }
            ...
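
The derived values above all follow from the chosen input side. A small sketch (hypothetical helper, not repo API) that computes them:

def input_size_knobs(side=320):
    """Derive the size-dependent config values for a square input of `side` pixels."""
    return {
        "eval_spatial_size": [side, side],
        "mosaic_output_size": side // 2,  # recommended: half the input size
        "resize_size": [side, side],
        "collate_base_size": side,
    }

print(input_size_knobs(320))  # matches the 320x320 settings above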
Customizing Epoch

If you want to finetune DEIMv2-S for 20 epochs, follow these steps (for reference only; feel free to adjust them according to your needs):

epoches: 32  # total epochs = 20 (training) + 4n (EMA), where 4n = 12; n refers to the model size in the matched config

flat_epoch: 14    # 4 + 20 // 2
no_aug_epoch: 12  # 4n

train_dataloader:
  dataset: 
    transforms:
      ops:
        ...
      policy:
        epoch: [4, 14, 20]   # [start_epoch, flat_epoch, epoches - no_aug_epoch]

  collate_fn:
    ...
    mixup_epochs: [4, 14]  # [start_epoch, flat_epoch]
    stop_epoch: 20  # epoches - no_aug_epoch
    copyblend_epochs: [4, 20]  # [start_epoch, epoches - no_aug_epoch]
  
DEIMCriterion:
  matcher:
    ...
    matcher_change_epoch: 18  # ~90% of (epoches - no_aug_epoch)
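
The schedule values above are all derived from a few inputs. A hedged helper (illustrative only, not repo API) that reproduces them:

def epoch_schedule(train_epochs=20, no_aug_epoch=12, start_epoch=4):
    """Reproduce the derived schedule knobs from the comments above."""
    epoches = train_epochs + no_aug_epoch         # 20 + 12 = 32
    flat_epoch = start_epoch + train_epochs // 2  # 4 + 10 = 14
    stop_epoch = epoches - no_aug_epoch           # 32 - 12 = 20
    return {
        "epoches": epoches,
        "flat_epoch": flat_epoch,
        "no_aug_epoch": no_aug_epoch,
        "policy_epoch": [start_epoch, flat_epoch, stop_epoch],
        "mixup_epochs": [start_epoch, flat_epoch],
        "stop_epoch": stop_epoch,
        "copyblend_epochs": [start_epoch, stop_epoch],
        "matcher_change_epoch": int(0.9 * stop_epoch),  # ~90% of (epoches - no_aug_epoch)
    }

print(epoch_schedule())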

4. Tools

Deployment
  1. Setup
pip install onnx onnxsim
  2. Export ONNX
python tools/deployment/export_onnx.py --check -c configs/deimv2/deimv2_dinov3_${model}_coco.yml -r model.pth
  3. Export TensorRT
trtexec --onnx="model.onnx" --saveEngine="model.engine" --fp16
Inference (Visualization)
  1. Setup
pip install -r tools/inference/requirements.txt
  2. Inference (onnxruntime / tensorrt / torch)

Inference on images and videos is now supported.

python tools/inference/onnx_inf.py --onnx model.onnx --input image.jpg  # video.mp4
python tools/inference/trt_inf.py --trt model.engine --input image.jpg
python tools/inference/torch_inf.py -c configs/deimv2/deimv2_dinov3_${model}_coco.yml -r model.pth --input image.jpg --device cuda:0
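
If an exported model misbehaves, inspecting its I/O signature with ONNX Runtime is a quick first step (pure introspection, so no assumptions about tensor names):

import onnxruntime as ort

# Open the exported model on CPU and list its declared inputs and outputs.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for t in sess.get_inputs():
    print("input: ", t.name, t.shape, t.type)
for t in sess.get_outputs():
    print("output:", t.name, t.shape, t.type)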
Benchmark
  1. Setup
pip install -r tools/benchmark/requirements.txt
  2. Model FLOPs, MACs, and Params
python tools/benchmark/get_info.py -c configs/deimv2/deimv2_dinov3_${model}_coco.yml
  3. TensorRT Latency
python tools/benchmark/trt_benchmark.py --COCO_dir path/to/COCO2017 --engine_dir model.engine
FiftyOne Visualization
  1. Setup
pip install fiftyone
  2. Voxel51 FiftyOne Visualization (fiftyone)
python tools/visualization/fiftyone_vis.py -c configs/deimv2/deimv2_dinov3_${model}_coco.yml -r model.pth
Others
  1. Auto Resume Training
bash reference/safe_training.sh
  2. Converting Model Weights
python reference/convert_weight.py model.pth

5. Citation

If you use DEIMv2 or its methods in your work, please cite the following BibTeX entry:

@article{huang2025deimv2,
  title={Real-Time Object Detection Meets DINOv3},
  author={Huang, Shihua and Hou, Yongjie and Liu, Longfei and Yu, Xuanlong and Shen, Xi},
  journal={arXiv},
  year={2025}
}
  

6. Acknowledgement

Our work is built upon D-FINE, RT-DETR, DEIM, and DINOv3. Thanks for their great work!

✨ Feel free to contribute and reach out if you have any questions! ✨

