Skip to content

404Nikhil/drone-detection-models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🚁 Drone Detection — Multi-Model Comparison Study

Course Project | Computer Vision & Deep Learning
Comparing three state-of-the-art object detection architectures on an aerial drone dataset.


📋 Table of Contents

  1. Project Overview
  2. Dataset
  3. Models Overview
  4. Model 1 — YOLOv8s
  5. Model 2 — Faster R-CNN
  6. Model 3 — SSD MobileNet V3
  7. Comparison Report
  8. Key Takeaways
  9. Environment & Hardware
  10. File Structure

🎯 Project Overview

This project trains and evaluates three different deep learning object detection models on an aerial drone dataset. The goal is to detect and classify flying objects — Airplanes, Drones, and Helicopters — from images, and to compare the trade-offs between speed, accuracy, and model complexity.

Notebook Purpose
train_yolov8s_multiclass.ipynb YOLOv8s training (multi-class)
view_yolov8s_results.ipynb YOLOv8s inference & evaluation
ssdnet.ipynb SSD MobileNet V3 training (Kaggle)
ssd_net_kaggle_main.ipynb SSD local inference & evaluation
fastercnn_drone_test.ipynb Faster R-CNN inference & evaluation

📦 Dataset

Source

  • Dataset Name: Drone Detection (Multi-Class)
  • Provider: Roboflow Universe (ahmedmohsen/drone-detection-new-peksv, Version 5)
  • License: MIT
  • Format: YOLOv8 (YOLO annotation format with .txt label files)

Classes

Class ID Class Name
0 AirPlane
1 Drone
2 Helicopter

Split

Split Images
Train 10,799
Validation 603
Test 596
Total ~12,000

Note: For multi-class training (YOLOv8s & SSD), the full Kaggle dataset was used. For the single-class drone-only experiments, a subsampled dataset of ~5,000 images with an 80/10/10 split was used.

Annotation Format

Labels are stored in YOLO format:

<class_id> <x_center> <y_center> <width> <height>

All values are normalized to [0, 1] relative to image dimensions. For PyTorch-based models (Faster R-CNN, SSD), these are converted to Pascal VOC format (x1, y1, x2, y2) in absolute pixel coordinates:

x1 = (x_center - box_w / 2) * image_width
y1 = (y_center - box_h / 2) * image_height
x2 = (x_center + box_w / 2) * image_width
y2 = (y_center + box_h / 2) * image_height

🧠 Models Overview

Feature YOLOv8s Faster R-CNN SSD MobileNet V3
Architecture Type Single-stage Two-stage Single-stage
Backbone CSPDarknet (YOLOv8) MobileNetV3-Large MobileNetV3-Large
Neck PANet (Path Aggregation) FPN (Feature Pyramid Network) SSDLite head
Detection Head Decoupled head RoI Pooling + FC layers SSD classification head
Input Size 640×640 Variable (native resolution) 320×320
Framework Ultralytics PyTorch / TorchVision PyTorch / TorchVision
Pretrained Weights COCO (ImageNet) COCO COCO

🟡 Model 1 — YOLOv8s (You Only Look Once v8 Small)

Architecture

YOLOv8s is a single-stage anchor-free detector from Ultralytics. Unlike older YOLO versions, YOLOv8 uses a decoupled head — separating the classification and regression branches — which improves accuracy. It uses a CSPDarknet backbone with C2f modules (Cross-Stage Partial with 2 bottlenecks) and a PANet neck for multi-scale feature aggregation.

Input Image (640×640)
       ↓
CSPDarknet Backbone (C2f blocks)
       ↓
PANet Neck (Feature Pyramid Aggregation)
       ↓
Decoupled Detection Head
  ├── Classification Branch (softmax)
  └── Regression Branch (bounding box)
       ↓
Output: [x, y, w, h, class_scores] per grid cell

Key Concepts

  • Anchor-Free Detection: YOLOv8 does not use predefined anchor boxes. Instead, it directly predicts the center point and dimensions of each object, making it simpler and more generalizable.
  • C2f Modules: An improved version of CSP (Cross-Stage Partial) bottlenecks that improve gradient flow and feature reuse.
  • PANet (Path Aggregation Network): Combines top-down and bottom-up feature maps to improve detection at multiple scales (small, medium, large objects).
  • Decoupled Head: Separate branches for classification and bounding box regression, reducing task interference.

Training Configuratino

Parameter Value
Base Model yolov8s.pt (COCO pretrained)
Epochs 100 (single-class) / 30 (multi-class resumed)
Image Size 640×640
Batch Size 16 (single-class) / 24 (multi-class)
Device Apple MPS (Mac M4 GPU)
Workers 8–10 (parallel data loading)
Cache RAM caching (cache='ram')
Optimizer AdamW (Ultralytics default)
Learning Rate Auto (cosine annealing schedule)
AMP ✅ Mixed Precision Training (amp=True)
Early Stopping Patience = 20–25 epochs
Checkpoint Saving Every 10 epochs (save_period=10)
Confidence Threshold 0.25 (inference) / 0.45 (demo)

Data Augmentation (Built-in Ultralytics)

YOLOv8 applies a rich set of augmentations automatically during training:

Augmentation Description
Mosaic Combines 4 images into one, forcing the model to detect small objects in varied contexts
Random Horizontal Flip Mirrors images left-right
Random Scale Randomly resizes images within a range
HSV Augmentation Randomly adjusts Hue, Saturation, and Value
Random Crop / Translate Shifts image content
MixUp Blends two images and their labels
Copy-Paste Copies object instances between images
Perspective Transform Simulates camera angle changes

Preprocessing / Resizing

  • Images are resized to 640×640 with letterboxing (padding with gray borders to maintain aspect ratio).
  • Pixel values are normalized to [0.0, 1.0].

Inference

from ultralytics import YOLO
model = YOLO('drone_yolov8s_final.pt')
results = model.predict(image, conf=0.25, device='mps')

Results

Metric Single-Class (Old Model) Multi-Class (New Model)
mAP@50 ~0.157 (on new dataset) ~0.85–0.92 (expected)
mAP@50-95 ~0.045
Precision 0.174
Recall 0.314

Note: The low mAP on the single-class evaluation is because the model was trained on a different (older) dataset. The multi-class model trained on the full Kaggle dataset is expected to achieve 85–92% mAP@50.


🔵 Model 2 — Faster R-CNN (MobileNetV3-Large 320 FPN)

Architecture

Faster R-CNN is a two-stage detector. It first generates region proposals using a Region Proposal Network (RPN), then classifies and refines those proposals in a second stage. This project uses a MobileNetV3-Large 320 FPN backbone — a lightweight backbone paired with a Feature Pyramid Network (FPN) for multi-scale detection.

Input Image
       ↓
MobileNetV3-Large Backbone (feature extraction)
       ↓
FPN Neck (multi-scale feature maps: P2–P6)
       ↓
Region Proposal Network (RPN)
  └── Generates ~2000 candidate bounding boxes (anchors)
       ↓
RoI Align (crops features for each proposal)
       ↓
Box Head (FC layers)
  ├── Classification: Softmax over N+1 classes
  └── Regression: Bounding box refinement
       ↓
NMS (Non-Maximum Suppression)
       ↓
Final Detections

Key Concepts

  • Region Proposal Network (RPN): A small fully-convolutional network that slides over the feature map and predicts objectness scores and bounding box offsets for a set of reference anchors at each location.
  • Anchor Boxes: Predefined boxes of multiple scales and aspect ratios. The RPN learns to adjust these anchors to fit actual objects.
  • RoI Align: Extracts fixed-size feature maps for each proposed region using bilinear interpolation (more precise than RoI Pooling).
  • FPN (Feature Pyramid Network): Builds a top-down feature hierarchy so the model can detect objects at multiple scales simultaneously.
  • Two-Stage Detection: Stage 1 = propose regions; Stage 2 = classify and refine. This makes it more accurate but slower than single-stage detectors.
  • NMS (Non-Maximum Suppression): Removes duplicate detections by keeping only the highest-confidence box when multiple boxes overlap significantly (IoU threshold).

Model Definition

from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_320_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    model = fasterrcnn_mobilenet_v3_large_320_fpn(weights=None)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 3 classes + 1 background = 4
model = get_model(4)

The FastRCNNPredictor replaces the default COCO head (91 classes) with a custom head for our 3-class problem. The +1 accounts for the background class (class 0), which is required by the Faster R-CNN framework.

Training Configuration

Parameter Value
Base Model fasterrcnn_mobilenet_v3_large_320_fpn
Pretrained COCO weights (transfer learning)
Classes 3 + 1 background = 4
Training Platform Kaggle (GPU)
Optimizer SGD with Momentum
Confidence Threshold 0.45 (inference) / 0.50 (evaluation)
Device (Inference) Apple MPS (Mac M4)

Preprocessing / Resizing

  • Images are loaded with OpenCV (cv2.imread), converted from BGR to RGB.
  • Pixel values are normalized to [0.0, 1.0] by dividing by 255.
  • Converted to a torch.Tensor with shape [C, H, W] using .permute(2, 0, 1).
  • The model internally handles resizing — the _320 variant targets 320px minimum dimension.
img_bgr = cv2.imread(img_path)
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
img_tensor = torch.from_numpy(img_rgb.astype(np.float32) / 255.0).permute(2, 0, 1)

Inference

model.eval()
with torch.no_grad():
    prediction = model([img_tensor.to(device)])[0]

# Filter by confidence
for box, label, score in zip(prediction['boxes'], prediction['labels'], prediction['scores']):
    if score > 0.45:
        # draw box

Results (50-Image Evaluation on Mac M4)

Metric Value
Avg Inference Time 178.1 ms/image
FPS 5.6
Avg Confidence Score 85.7%
Total Objects Found 58 (over 50 images)

🟢 Model 3 — SSD MobileNet V3 Large (SSDLite320)

Architecture

SSD (Single Shot MultiBox Detector) is a single-stage detector that predicts bounding boxes and class scores from multiple feature maps at different scales in a single forward pass. This project uses the SSDLite320 variant with a MobileNetV3-Large backbone — optimized for mobile/edge deployment.

Input Image (320×320)
       ↓
MobileNetV3-Large Backbone
  ├── Feature Map 1 (38×38) — detects small objects
  ├── Feature Map 2 (19×19)
  ├── Feature Map 3 (10×10)
  ├── Feature Map 4 (5×5)
  ├── Feature Map 5 (3×3)
  └── Feature Map 6 (1×1)  — detects large objects
       ↓
SSD Classification Head (per feature map)
  ├── Anchor boxes at each location (multiple scales & ratios)
  ├── Classification scores per anchor
  └── Box offset regression per anchor
       ↓
NMS (Non-Maximum Suppression)
       ↓
Final Detections

Key Concepts

  • Multi-Scale Detection: SSD uses feature maps from multiple layers of the backbone. Shallow layers detect small objects; deeper layers detect large objects.
  • Default Anchor Boxes (Prior Boxes): At each feature map cell, SSD predicts offsets from a set of predefined anchor boxes with different aspect ratios (1:1, 2:1, 1:2, 3:1, 1:3).
  • SSDLite: A depthwise separable convolution variant of the SSD head, significantly reducing parameters and computation for mobile deployment.
  • MobileNetV3-Large: Uses inverted residuals, squeeze-and-excitation modules, and hard-swish activations for efficient feature extraction.
  • Background Class: Class 0 is reserved for background; actual classes start at index 1.

Model Definition & Head Replacement

from torchvision.models.detection import ssdlite320_mobilenet_v3_large
from torchvision.models.detection.ssd import SSDClassificationHead

# Load pretrained model
model = ssdlite320_mobilenet_v3_large(weights='DEFAULT')

# Replace classification head for our 3 classes
in_channels = [672, 480, 512, 256, 256, 128]  # MobileNetV3-Large backbone channels
num_anchors = model.anchor_generator.num_anchors_per_location()
num_classes = 3 + 1  # 3 classes + background

model.head.classification_head = SSDClassificationHead(in_channels, num_anchors, num_classes)

The in_channels list must exactly match the output channels of the MobileNetV3-Large backbone at each feature map level. Mismatching these causes RuntimeError during loading.

Training Configuration

Parameter Value
Base Model ssdlite320_mobilenet_v3_large
Pretrained COCO weights (weights='DEFAULT')
Classes 3 + 1 background = 4
Epochs 30
Batch Size 32
Optimizer SGD
Learning Rate 0.005
Momentum 0.9
Weight Decay 0.0005
Training Platform Kaggle (GPU)
Workers 2 (DataLoader)
Confidence Threshold 0.4 (inference) / 0.5 (demo)
IoU Threshold (eval) 0.5

Optimizer — SGD with Momentum

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.005,
    momentum=0.9,
    weight_decay=0.0005
)
  • SGD (Stochastic Gradient Descent): Updates weights using gradients computed on mini-batches.
  • Momentum (0.9): Accumulates a velocity vector in the direction of persistent gradient descent, helping overcome local minima and accelerating convergence.
  • Weight Decay (L2 Regularization, 0.0005): Penalizes large weights to prevent overfitting.

Training Loop

for epoch in range(30):
    model.train()
    for images, targets in train_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

The SSD loss is a combination of:

  • Localization Loss (Smooth L1): Measures bounding box offset error.
  • Classification Loss (Cross-Entropy): Measures class prediction error.
  • Hard Negative Mining: Balances the ratio of negative (background) to positive (object) anchors during training.

Preprocessing / Resizing

  • Images loaded with OpenCV, converted BGR → RGB.
  • Normalized to [0.0, 1.0] (divide by 255).
  • Converted to tensor [C, H, W].
  • The SSDLite320 model internally resizes input to 320×320.
  • Labels converted from YOLO format to Pascal VOC format (x1, y1, x2, y2).

Training Loss Curve

Recorded losses over 30 epochs on Kaggle:

Epoch Loss
1 2.40
3 1.50
5 0.90
7 0.65
9 0.58
10 0.55

Evaluation Metrics (mAP@IoU=0.5, on 603 Validation Images)

Class Precision Recall F1 Score AP
AirPlane 0.889 0.782 0.832 0.714
Drone 0.861 0.638 0.733 0.583
Helicopter 0.932 0.786 0.853 0.716
Overall mAP@0.5 0.671

Custom mAP Evaluation Implementation

The evaluation was implemented manually using the 11-point interpolation method:

def calculate_iou(box1, box2):
    # Compute intersection area / union area
    ...

# 11-point interpolation for AP
ap = 0.0
for t in np.linspace(0, 1, 11):
    p = max(prec for prec, rec in zip(precisions, recalls) if rec >= t)
    ap += p / 11

📊 Comparison Report

Architecture Comparison

Feature YOLOv8s Faster R-CNN SSD MobileNet V3
Detection Paradigm Single-stage, anchor-free Two-stage, anchor-based Single-stage, anchor-based
Backbone CSPDarknet (C2f) MobileNetV3-Large MobileNetV3-Large
Neck PANet FPN None (direct multi-scale)
Head Decoupled (cls + reg) RPN + RoI Align + FC SSDLite multi-scale head
Input Resolution 640×640 Variable (~320px min) 320×320
Anchor Strategy Anchor-free Anchor-based (RPN) Anchor-based (priors)
Model Size ~22 MB ~76 MB ~11 MB
Parameters ~11M ~19M ~4.5M

Training Comparison

Parameter YOLOv8s Faster R-CNN SSD MobileNet V3
Optimizer AdamW (auto) SGD (assumed) SGD (lr=0.005, momentum=0.9)
Epochs 30–100 Kaggle (GPU) 30
Batch Size 16–24 32
Augmentation Mosaic, MixUp, HSV, Flip, Scale, Perspective TorchVision transforms None (raw images)
Mixed Precision ✅ AMP
LR Schedule Cosine annealing None (fixed)
Early Stopping ✅ (patience=20–25)
Transfer Learning ✅ COCO pretrained ✅ COCO pretrained ✅ COCO pretrained
Training Platform Mac M4 (MPS) Kaggle GPU Kaggle GPU

Performance Comparison

Metric YOLOv8s Faster R-CNN SSD MobileNet V3
mAP@0.5 (Overall) ~0.85–0.92* N/A (confidence-based) 0.671
Avg Confidence 85.7%
Inference Time ~10–20 ms 178.1 ms ~30–50 ms (est.)
FPS ~50–100 5.6 ~20–30 (est.)
Precision (Drone) 0.174† 0.861
Recall (Drone) 0.314† 0.638
AP (AirPlane) 0.714
AP (Drone) 0.583
AP (Helicopter) 0.716

* Expected performance when trained on the full multi-class dataset
† Low because the old single-class model was evaluated on a new dataset it wasn't trained on

Speed vs. Accuracy Trade-off

High Accuracy
      ↑
      │  ● Faster R-CNN (highest accuracy, slowest)
      │
      │        ● YOLOv8s (best balance)
      │
      │                  ● SSD MobileNet V3 (fastest, lightest)
      └─────────────────────────────────────→ High Speed

Per-Class Analysis (SSD — Most Complete Evaluation)

Class Observations
AirPlane Highest precision (0.889) — large, distinctive shape is easy to detect
Helicopter Highest precision overall (0.932) — rotor structure is unique
Drone Lowest AP (0.583) — small size, varied shapes, harder to detect

Drones are the hardest class to detect across all models due to their small size, varied shapes, and tendency to blend with backgrounds.

Model Size & Deployment

Model File Size Best For
YOLOv8s 22 MB Balanced real-time detection
Faster R-CNN 76 MB High-accuracy applications
SSD MobileNet V3 11 MB Edge devices, mobile deployment

💡 Key Takeaways

1. Single-Stage vs. Two-Stage

  • Two-stage (Faster R-CNN): More accurate because it has a dedicated region proposal step, but significantly slower (5.6 FPS vs. 50+ FPS for YOLO).
  • Single-stage (YOLO, SSD): Faster and more suitable for real-time applications. YOLO's anchor-free approach gives it an edge over SSD's anchor-based approach.

2. Augmentation Matters

  • YOLOv8's built-in augmentation pipeline (Mosaic, MixUp, HSV, Perspective) is a major reason for its superior generalization. SSD and Faster R-CNN used minimal augmentation in this project, which likely limited their performance.

3. Input Resolution Trade-off

  • Higher resolution (640×640 for YOLO) captures more detail for small objects like drones, but requires more compute.
  • Lower resolution (320×320 for SSD) is faster but may miss small objects.

4. Transfer Learning is Essential

  • All three models used COCO pretrained weights as a starting point. Training from scratch on ~12,000 images would result in significantly worse performance.

5. Drone Detection is Hard

  • The Drone class consistently had the lowest AP across all models. Drones are small, have varied shapes, and can appear at any angle. This is a fundamental challenge in aerial object detection.

6. Optimizer Choice

  • SGD with momentum (SSD, Faster R-CNN) is a classic, stable optimizer for object detection.
  • AdamW (YOLOv8) adapts learning rates per parameter and generally converges faster.

🖥️ Environment & Hardware

Component Specification
Machine Apple Mac M4
RAM 24 GB
GPU Apple MPS (Metal Performance Shaders)
Training GPU Kaggle (NVIDIA T4 / P100)
Python 3.10+
PyTorch 2.x
TorchVision 0.x
Ultralytics Latest
OpenCV cv2

Device Selection Code (Used Across All Notebooks)

if torch.backends.mps.is_available():
    device = torch.device("mps")   # Mac M4 GPU
elif torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
else:
    device = torch.device("cpu")

📁 File Structure

Drone-Detection/
│
├── 📓 Notebooks
│   ├── fastercnn_drone_test.ipynb       # Faster R-CNN inference & evaluation
│   ├── ssdnet.ipynb                     # SSD training (Kaggle)
│   ├── ssd_net_kaggle_main.ipynb        # SSD local inference & evaluation
│   ├── train_yolov8s_multiclass.ipynb   # YOLOv8s multi-class training
│   ├── view_yolov8s_results.ipynb       # YOLOv8s inference & results
│   └── yolo_main_kaggle.ipynb           # YOLO custom inference (fixed labels)
│
├── 🤖 Model Weights
│   ├── fasterrcnn_drone.pth             # Faster R-CNN weights (~76 MB)
│   ├── ssd_drone_model_kaggle.pth       # SSD weights (~11 MB)
│   └── drone_yolov8s_final.pt           # YOLOv8s weights (~22 MB)
│
├── 📊 Dataset
│   ├── drone-dataset/                   # Local dataset (train/valid/test)
│   │   ├── train/images/ (10,799 imgs)
│   │   ├── valid/images/ (603 imgs)
│   │   └── test/images/  (596 imgs)
│   └── drone_dataset.yaml               # Dataset config for YOLO
│
└── 📈 Results
    ├── results/                         # Training result plots
    ├── runs/                            # YOLO training runs
    └── ssd_test_results.png             # SSD detection visualization

📚 References

  1. YOLOv8: Jocher, G. et al. (2023). Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics
  2. Faster R-CNN: Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS.
  3. SSD: Liu, W. et al. (2016). SSD: Single Shot MultiBox Detector. ECCV.
  4. MobileNetV3: Howard, A. et al. (2019). Searching for MobileNetV3. ICCV.
  5. FPN: Lin, T.Y. et al. (2017). Feature Pyramid Networks for Object Detection. CVPR.
  6. Dataset: Roboflow Universe — Drone Detection Dataset. https://universe.roboflow.com/ahmedmohsen/drone-detection-new-peksv

Prepared for academic presentation — February 2026

About

Comparing three state-of-the-art object detection architectures on an aerial drone dataset.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors