Course Project | Computer Vision & Deep Learning
Comparing three state-of-the-art object detection architectures on an aerial drone dataset.
- Project Overview
- Dataset
- Models Overview
- Model 1 — YOLOv8s
- Model 2 — Faster R-CNN
- Model 3 — SSD MobileNet V3
- Comparison Report
- Key Takeaways
- Environment & Hardware
- File Structure
This project trains and evaluates three different deep learning object detection models on an aerial drone dataset. The goal is to detect and classify flying objects — Airplanes, Drones, and Helicopters — from images, and to compare the trade-offs between speed, accuracy, and model complexity.
| Notebook | Purpose |
|---|---|
| `train_yolov8s_multiclass.ipynb` | YOLOv8s training (multi-class) |
| `view_yolov8s_results.ipynb` | YOLOv8s inference & evaluation |
| `ssdnet.ipynb` | SSD MobileNet V3 training (Kaggle) |
| `ssd_net_kaggle_main.ipynb` | SSD local inference & evaluation |
| `fastercnn_drone_test.ipynb` | Faster R-CNN inference & evaluation |
- Dataset Name: Drone Detection (Multi-Class)
- Provider: Roboflow Universe (`ahmedmohsen/drone-detection-new-peksv`, Version 5)
- License: MIT
- Format: YOLOv8 (YOLO annotation format with `.txt` label files)
| Class ID | Class Name |
|---|---|
| 0 | AirPlane |
| 1 | Drone |
| 2 | Helicopter |
| Split | Images |
|---|---|
| Train | 10,799 |
| Validation | 603 |
| Test | 596 |
| Total | ~12,000 |
Note: For multi-class training (YOLOv8s & SSD), the full Kaggle dataset was used. For the single-class drone-only experiments, a subsampled dataset of ~5,000 images with an 80/10/10 split was used.
Labels are stored in YOLO format:
<class_id> <x_center> <y_center> <width> <height>
All values are normalized to [0, 1] relative to image dimensions. For PyTorch-based models (Faster R-CNN, SSD), these are converted to Pascal VOC format (x1, y1, x2, y2) in absolute pixel coordinates:
x1 = (x_center - box_w / 2) * image_width
y1 = (y_center - box_h / 2) * image_height
x2 = (x_center + box_w / 2) * image_width
y2 = (y_center + box_h / 2) * image_height

| Feature | YOLOv8s | Faster R-CNN | SSD MobileNet V3 |
|---|---|---|---|
| Architecture Type | Single-stage | Two-stage | Single-stage |
| Backbone | CSPDarknet (YOLOv8) | MobileNetV3-Large | MobileNetV3-Large |
| Neck | PANet (Path Aggregation) | FPN (Feature Pyramid Network) | SSDLite head |
| Detection Head | Decoupled head | RoI Pooling + FC layers | SSD classification head |
| Input Size | 640×640 | Variable (resized to ~320 px min side) | 320×320 |
| Framework | Ultralytics | PyTorch / TorchVision | PyTorch / TorchVision |
| Pretrained Weights | COCO | COCO | COCO |
YOLOv8s is a single-stage, anchor-free detector from Ultralytics. Unlike older YOLO versions, YOLOv8 uses a decoupled head that separates the classification and regression branches, which improves accuracy. It uses a CSPDarknet backbone with C2f modules (a faster CSP bottleneck block with two convolutions) and a PANet neck for multi-scale feature aggregation.
Input Image (640×640)
↓
CSPDarknet Backbone (C2f blocks)
↓
PANet Neck (Feature Pyramid Aggregation)
↓
Decoupled Detection Head
├── Classification Branch (per-class sigmoid)
└── Regression Branch (bounding box)
↓
Output: [x, y, w, h, class_scores] per grid cell
- Anchor-Free Detection: YOLOv8 does not use predefined anchor boxes. Instead, it directly predicts the center point and dimensions of each object, making it simpler and more generalizable.
- C2f Modules: An improved version of the CSP (Cross-Stage Partial) bottleneck that improves gradient flow and feature reuse.
- PANet (Path Aggregation Network): Combines top-down and bottom-up feature maps to improve detection at multiple scales (small, medium, large objects).
- Decoupled Head: Separate branches for classification and bounding box regression, reducing task interference.
| Parameter | Value |
|---|---|
| Base Model | yolov8s.pt (COCO pretrained) |
| Epochs | 100 (single-class) / 30 (multi-class resumed) |
| Image Size | 640×640 |
| Batch Size | 16 (single-class) / 24 (multi-class) |
| Device | Apple MPS (Mac M4 GPU) |
| Workers | 8–10 (parallel data loading) |
| Cache | RAM caching (cache='ram') |
| Optimizer | AdamW (Ultralytics default) |
| Learning Rate | Auto (cosine annealing schedule) |
| AMP | ✅ Mixed Precision Training (amp=True) |
| Early Stopping | Patience = 20–25 epochs |
| Checkpoint Saving | Every 10 epochs (save_period=10) |
| Confidence Threshold | 0.25 (inference) / 0.45 (demo) |
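A training invocation consistent with this configuration might look like the sketch below. The argument values are taken from the table above; the `train_args` dict is just a convenient way to collect them, and the commented-out call requires the `ultralytics` package and the dataset on disk.

```python
# Hyperparameters mirroring the configuration table above
train_args = dict(
    data='drone_dataset.yaml',  # dataset config (see File Structure)
    epochs=30,
    imgsz=640,
    batch=24,
    device='mps',        # Apple M4 GPU
    workers=8,
    cache='ram',         # cache dataset in RAM
    amp=True,            # mixed precision training
    patience=25,         # early stopping
    save_period=10,      # checkpoint every 10 epochs
)

# With ultralytics installed, training would be launched as:
# from ultralytics import YOLO
# YOLO('yolov8s.pt').train(**train_args)
print(train_args['imgsz'], train_args['amp'])
```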
YOLOv8 applies a rich set of augmentations automatically during training:
| Augmentation | Description |
|---|---|
| Mosaic | Combines 4 images into one, forcing the model to detect small objects in varied contexts |
| Random Horizontal Flip | Mirrors images left-right |
| Random Scale | Randomly resizes images within a range |
| HSV Augmentation | Randomly adjusts Hue, Saturation, and Value |
| Random Crop / Translate | Shifts image content |
| MixUp | Blends two images and their labels |
| Copy-Paste | Copies object instances between images |
| Perspective Transform | Simulates camera angle changes |
- Images are resized to 640×640 with letterboxing (padding with gray borders to maintain aspect ratio).
- Pixel values are normalized to [0.0, 1.0].
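The letterboxing step can be sketched as follows. This is a minimal illustration of the size/padding arithmetic, not Ultralytics' actual implementation, and the helper name `letterbox_size` is ours:

```python
def letterbox_size(w, h, target=640):
    """Compute resized dimensions and total padding for letterboxing an image."""
    scale = min(target / w, target / h)            # preserve aspect ratio
    new_w, new_h = round(w * scale), round(h * scale)
    pad_w, pad_h = target - new_w, target - new_h  # filled with gray borders
    return new_w, new_h, pad_w, pad_h

# A 1280×720 frame is scaled to 640×360, then padded 280 px vertically:
print(letterbox_size(1280, 720))  # (640, 360, 0, 280)
```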
from ultralytics import YOLO
model = YOLO('drone_yolov8s_final.pt')
results = model.predict(image, conf=0.25, device='mps')

| Metric | Single-Class (Old Model) | Multi-Class (New Model) |
|---|---|---|
| mAP@50 | ~0.157 (on new dataset) | ~0.85–0.92 (expected) |
| mAP@50-95 | ~0.045 | — |
| Precision | 0.174 | — |
| Recall | 0.314 | — |
Note: The low mAP on the single-class evaluation is because the model was trained on a different (older) dataset. The multi-class model trained on the full Kaggle dataset is expected to achieve 85–92% mAP@50.
Faster R-CNN is a two-stage detector. It first generates region proposals using a Region Proposal Network (RPN), then classifies and refines those proposals in a second stage. This project uses a MobileNetV3-Large 320 FPN backbone — a lightweight backbone paired with a Feature Pyramid Network (FPN) for multi-scale detection.
Input Image
↓
MobileNetV3-Large Backbone (feature extraction)
↓
FPN Neck (multi-scale feature maps: P2–P6)
↓
Region Proposal Network (RPN)
└── Generates ~2000 candidate proposals (by adjusting reference anchors)
↓
RoI Align (crops features for each proposal)
↓
Box Head (FC layers)
├── Classification: Softmax over N+1 classes
└── Regression: Bounding box refinement
↓
NMS (Non-Maximum Suppression)
↓
Final Detections
- Region Proposal Network (RPN): A small fully-convolutional network that slides over the feature map and predicts objectness scores and bounding box offsets for a set of reference anchors at each location.
- Anchor Boxes: Predefined boxes of multiple scales and aspect ratios. The RPN learns to adjust these anchors to fit actual objects.
- RoI Align: Extracts fixed-size feature maps for each proposed region using bilinear interpolation (more precise than RoI Pooling).
- FPN (Feature Pyramid Network): Builds a top-down feature hierarchy so the model can detect objects at multiple scales simultaneously.
- Two-Stage Detection: Stage 1 = propose regions; Stage 2 = classify and refine. This makes it more accurate but slower than single-stage detectors.
- NMS (Non-Maximum Suppression): Removes duplicate detections by keeping only the highest-confidence box when multiple boxes overlap significantly (IoU threshold).
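The NMS step described above can be illustrated with a minimal pure-Python sketch of the greedy algorithm (TorchVision's `torchvision.ops.nms` does this on tensors; the scalar version here is only for clarity):

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); IoU = intersection area / union area
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two heavily overlapping detections of the same object: only the stronger survives
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```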
from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_320_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
def get_model(num_classes):
    model = fasterrcnn_mobilenet_v3_large_320_fpn(weights=None)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 3 classes + 1 background = 4
model = get_model(4)

The `FastRCNNPredictor` replaces the default COCO head (91 classes) with a custom head for our 3-class problem. The `+1` accounts for the background class (class 0), which the Faster R-CNN framework requires.
| Parameter | Value |
|---|---|
| Base Model | fasterrcnn_mobilenet_v3_large_320_fpn |
| Pretrained | COCO weights (transfer learning) |
| Classes | 3 + 1 background = 4 |
| Training Platform | Kaggle (GPU) |
| Optimizer | SGD with Momentum |
| Confidence Threshold | 0.45 (inference) / 0.50 (evaluation) |
| Device (Inference) | Apple MPS (Mac M4) |
- Images are loaded with OpenCV (`cv2.imread`) and converted from BGR to RGB.
- Pixel values are normalized to [0.0, 1.0] by dividing by 255.
- Converted to a `torch.Tensor` with shape [C, H, W] using `.permute(2, 0, 1)`.
- The model internally handles resizing: the `_320` variant targets a 320 px minimum dimension.
img_bgr = cv2.imread(img_path)
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
img_tensor = torch.from_numpy(img_rgb.astype(np.float32) / 255.0).permute(2, 0, 1)

model.eval()
with torch.no_grad():
    prediction = model([img_tensor.to(device)])[0]

# Filter by confidence
for box, label, score in zip(prediction['boxes'], prediction['labels'], prediction['scores']):
    if score > 0.45:
        ...  # draw box

| Metric | Value |
|---|---|
| Avg Inference Time | 178.1 ms/image |
| FPS | 5.6 |
| Avg Confidence Score | 85.7% |
| Total Objects Found | 58 (over 50 images) |
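The FPS figure follows directly from the average per-image latency:

```python
avg_ms = 178.1           # average inference time per image, in milliseconds
fps = 1000 / avg_ms      # images processed per second
print(round(fps, 1))     # 5.6
```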
SSD (Single Shot MultiBox Detector) is a single-stage detector that predicts bounding boxes and class scores from multiple feature maps at different scales in a single forward pass. This project uses the SSDLite320 variant with a MobileNetV3-Large backbone — optimized for mobile/edge deployment.
Input Image (320×320)
↓
MobileNetV3-Large Backbone
├── Feature Map 1 (20×20) — detects small objects
├── Feature Map 2 (10×10)
├── Feature Map 3 (5×5)
├── Feature Map 4 (3×3)
├── Feature Map 5 (2×2)
└── Feature Map 6 (1×1) — detects large objects
↓
SSD Classification Head (per feature map)
├── Anchor boxes at each location (multiple scales & ratios)
├── Classification scores per anchor
└── Box offset regression per anchor
↓
NMS (Non-Maximum Suppression)
↓
Final Detections
- Multi-Scale Detection: SSD uses feature maps from multiple layers of the backbone. Shallow layers detect small objects; deeper layers detect large objects.
- Default Anchor Boxes (Prior Boxes): At each feature map cell, SSD predicts offsets from a set of predefined anchor boxes with different aspect ratios (1:1, 2:1, 1:2, 3:1, 1:3).
- SSDLite: A depthwise separable convolution variant of the SSD head, significantly reducing parameters and computation for mobile deployment.
- MobileNetV3-Large: Uses inverted residuals, squeeze-and-excitation modules, and hard-swish activations for efficient feature extraction.
- Background Class: Class 0 is reserved for background; actual classes start at index 1.
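Concretely, the background convention means the dataset's class IDs shift up by one inside the TorchVision models. A small sketch (the dict names are ours):

```python
# Dataset class IDs as stored in the YOLO labels
DATASET_CLASSES = {0: 'AirPlane', 1: 'Drone', 2: 'Helicopter'}

# Class 0 is reserved for background, so every dataset ID shifts up by one
MODEL_LABELS = {0: 'background'}
MODEL_LABELS.update({k + 1: v for k, v in DATASET_CLASSES.items()})
print(MODEL_LABELS)  # {0: 'background', 1: 'AirPlane', 2: 'Drone', 3: 'Helicopter'}
```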
from torchvision.models.detection import ssdlite320_mobilenet_v3_large
from torchvision.models.detection.ssd import SSDClassificationHead
# Load pretrained model
model = ssdlite320_mobilenet_v3_large(weights='DEFAULT')
# Replace classification head for our 3 classes
in_channels = [672, 480, 512, 256, 256, 128] # MobileNetV3-Large backbone channels
num_anchors = model.anchor_generator.num_anchors_per_location()
num_classes = 3 + 1 # 3 classes + background
model.head.classification_head = SSDClassificationHead(in_channels, num_anchors, num_classes)

The `in_channels` list must exactly match the output channels of the MobileNetV3-Large backbone at each feature map level. Mismatching these causes a `RuntimeError` during loading.
| Parameter | Value |
|---|---|
| Base Model | ssdlite320_mobilenet_v3_large |
| Pretrained | COCO weights (weights='DEFAULT') |
| Classes | 3 + 1 background = 4 |
| Epochs | 30 |
| Batch Size | 32 |
| Optimizer | SGD |
| Learning Rate | 0.005 |
| Momentum | 0.9 |
| Weight Decay | 0.0005 |
| Training Platform | Kaggle (GPU) |
| Workers | 2 (DataLoader) |
| Confidence Threshold | 0.4 (inference) / 0.5 (demo) |
| IoU Threshold (eval) | 0.5 |
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.005,
    momentum=0.9,
    weight_decay=0.0005
)

- SGD (Stochastic Gradient Descent): Updates weights using gradients computed on mini-batches.
- Momentum (0.9): Accumulates a velocity vector in the direction of persistent gradient descent, helping overcome local minima and accelerating convergence.
- Weight Decay (L2 Regularization, 0.0005): Penalizes large weights to prevent overfitting.
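The update rule these bullets describe can be written out explicitly. This is a plain-Python sketch of one SGD-with-momentum step in the common textbook formulation; PyTorch's implementation differs slightly in how weight decay and momentum interact:

```python
def sgd_momentum_step(w, v, grad, lr=0.005, momentum=0.9, weight_decay=0.0005):
    """One SGD step: weight decay folds into the gradient, momentum accumulates it."""
    g = grad + weight_decay * w   # L2 regularization term
    v = momentum * v + g          # velocity accumulates past gradients
    w = w - lr * v                # parameter update
    return w, v

# Repeated steps in a constant-gradient direction: momentum accelerates descent
w, v = 1.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=0.2)
print(round(w, 4))
```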
for epoch in range(30):
    model.train()
    for images, targets in train_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

The SSD loss is a combination of:
- Localization Loss (Smooth L1): Measures bounding box offset error.
- Classification Loss (Cross-Entropy): Measures class prediction error.
- Hard Negative Mining: Balances the ratio of negative (background) to positive (object) anchors during training.
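The two loss terms can be sketched in plain Python. These are scalar versions for illustration only; the real implementation operates on tensors over all anchors at once:

```python
import math

def smooth_l1(x):
    """Localization loss per coordinate offset (Huber loss with beta = 1)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def cross_entropy(probs, true_class):
    """Classification loss: negative log-probability of the correct class."""
    return -math.log(probs[true_class])

print(smooth_l1(0.5))   # 0.125 (quadratic region, small errors)
print(smooth_l1(2.0))   # 1.5   (linear region, large errors)
print(round(cross_entropy([0.1, 0.8, 0.05, 0.05], 1), 4))  # ≈ 0.2231
```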
- Images loaded with OpenCV, converted BGR → RGB.
- Normalized to [0.0, 1.0] (divide by 255).
- Converted to tensor [C, H, W].
- The SSDLite320 model internally resizes input to 320×320.
- Labels converted from YOLO format to Pascal VOC format (x1, y1, x2, y2).
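The YOLO → Pascal VOC label conversion uses the formulas from the Dataset section. A minimal sketch (the helper name `yolo_to_voc` is ours):

```python
def yolo_to_voc(x_center, y_center, box_w, box_h, image_width, image_height):
    """Convert a normalized YOLO box to absolute Pascal VOC (x1, y1, x2, y2)."""
    x1 = (x_center - box_w / 2) * image_width
    y1 = (y_center - box_h / 2) * image_height
    x2 = (x_center + box_w / 2) * image_width
    y2 = (y_center + box_h / 2) * image_height
    return x1, y1, x2, y2

# A label line like "1 0.5 0.5 0.2 0.1" (class Drone) on a 640×640 image:
print(yolo_to_voc(0.5, 0.5, 0.2, 0.1, 640, 640))  # (256.0, 288.0, 384.0, 352.0)
```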
Recorded losses over 30 epochs on Kaggle:
| Epoch | Loss |
|---|---|
| 1 | 2.40 |
| 3 | 1.50 |
| 5 | 0.90 |
| 7 | 0.65 |
| 9 | 0.58 |
| 10 | 0.55 |
| Class | Precision | Recall | F1 Score | AP |
|---|---|---|---|---|
| AirPlane | 0.889 | 0.782 | 0.832 | 0.714 |
| Drone | 0.861 | 0.638 | 0.733 | 0.583 |
| Helicopter | 0.932 | 0.786 | 0.853 | 0.716 |
| Overall mAP@0.5 | — | — | — | 0.671 |
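As a sanity check, the F1 scores in the table follow from precision and recall as F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class values from the table above
print(round(f1(0.889, 0.782), 3))  # AirPlane   -> 0.832
print(round(f1(0.861, 0.638), 3))  # Drone      -> 0.733
print(round(f1(0.932, 0.786), 3))  # Helicopter -> 0.853
```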
The evaluation was implemented manually using the 11-point interpolation method:
def calculate_iou(box1, box2):
    # Boxes are (x1, y1, x2, y2); IoU = intersection area / union area
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((box1[2] - box1[0]) * (box1[3] - box1[1])
             + (box2[2] - box2[0]) * (box2[3] - box2[1]) - inter)
    return inter / union if union > 0 else 0.0

# 11-point interpolation for AP
ap = 0.0
for t in np.linspace(0, 1, 11):
    # Highest precision at any recall >= t (0 if recall never reaches t)
    p = max((prec for prec, rec in zip(precisions, recalls) if rec >= t), default=0.0)
    ap += p / 11

| Feature | YOLOv8s | Faster R-CNN | SSD MobileNet V3 |
|---|---|---|---|
| Detection Paradigm | Single-stage, anchor-free | Two-stage, anchor-based | Single-stage, anchor-based |
| Backbone | CSPDarknet (C2f) | MobileNetV3-Large | MobileNetV3-Large |
| Neck | PANet | FPN | None (direct multi-scale) |
| Head | Decoupled (cls + reg) | RPN + RoI Align + FC | SSDLite multi-scale head |
| Input Resolution | 640×640 | Variable (~320px min) | 320×320 |
| Anchor Strategy | Anchor-free | Anchor-based (RPN) | Anchor-based (priors) |
| Model Size | ~22 MB | ~76 MB | ~11 MB |
| Parameters | ~11M | ~19M | ~4.5M |
| Parameter | YOLOv8s | Faster R-CNN | SSD MobileNet V3 |
|---|---|---|---|
| Optimizer | AdamW (auto) | SGD (assumed) | SGD (lr=0.005, momentum=0.9) |
| Epochs | 30–100 | — | 30 |
| Batch Size | 16–24 | — | 32 |
| Augmentation | Mosaic, MixUp, HSV, Flip, Scale, Perspective | TorchVision transforms | None (raw images) |
| Mixed Precision | ✅ AMP | ❌ | ❌ |
| LR Schedule | Cosine annealing | — | None (fixed) |
| Early Stopping | ✅ (patience=20–25) | ❌ | ❌ |
| Transfer Learning | ✅ COCO pretrained | ✅ COCO pretrained | ✅ COCO pretrained |
| Training Platform | Mac M4 (MPS) | Kaggle GPU | Kaggle GPU |
| Metric | YOLOv8s | Faster R-CNN | SSD MobileNet V3 |
|---|---|---|---|
| mAP@0.5 (Overall) | ~0.85–0.92* | N/A (confidence-based) | 0.671 |
| Avg Confidence | — | 85.7% | — |
| Inference Time | ~10–20 ms | 178.1 ms | ~30–50 ms (est.) |
| FPS | ~50–100 | 5.6 | ~20–30 (est.) |
| Precision (Drone) | 0.174† | — | 0.861 |
| Recall (Drone) | 0.314† | — | 0.638 |
| AP (AirPlane) | — | — | 0.714 |
| AP (Drone) | — | — | 0.583 |
| AP (Helicopter) | — | — | 0.716 |
* Expected performance when trained on the full multi-class dataset
† Low because the old single-class model was evaluated on a new dataset it wasn't trained on
High Accuracy
↑
│ ● Faster R-CNN (highest accuracy, slowest)
│
│ ● YOLOv8s (best balance)
│
│ ● SSD MobileNet V3 (fastest, lightest)
└─────────────────────────────────────→ High Speed
| Class | Observations |
|---|---|
| AirPlane | High precision (0.889) — large, distinctive shape is easy to detect |
| Helicopter | Highest precision overall (0.932) — rotor structure is unique |
| Drone | Lowest AP (0.583) — small size, varied shapes, harder to detect |
Drones are the hardest class to detect across all models due to their small size, varied shapes, and tendency to blend with backgrounds.
| Model | File Size | Best For |
|---|---|---|
| YOLOv8s | 22 MB | Balanced real-time detection |
| Faster R-CNN | 76 MB | High-accuracy applications |
| SSD MobileNet V3 | 11 MB | Edge devices, mobile deployment |
- Two-stage (Faster R-CNN): More accurate because it has a dedicated region proposal step, but significantly slower (5.6 FPS vs. 50+ FPS for YOLO).
- Single-stage (YOLO, SSD): Faster and more suitable for real-time applications. YOLO's anchor-free approach gives it an edge over SSD's anchor-based approach.
- YOLOv8's built-in augmentation pipeline (Mosaic, MixUp, HSV, Perspective) is a major reason for its superior generalization. SSD and Faster R-CNN used minimal augmentation in this project, which likely limited their performance.
- Higher resolution (640×640 for YOLO) captures more detail for small objects like drones, but requires more compute.
- Lower resolution (320×320 for SSD) is faster but may miss small objects.
- All three models used COCO pretrained weights as a starting point. Training from scratch on ~12,000 images would result in significantly worse performance.
- The Drone class consistently had the lowest AP across all models. Drones are small, have varied shapes, and can appear at any angle. This is a fundamental challenge in aerial object detection.
- SGD with momentum (SSD, Faster R-CNN) is a classic, stable optimizer for object detection.
- AdamW (YOLOv8) adapts learning rates per parameter and generally converges faster.
| Component | Specification |
|---|---|
| Machine | Apple Mac M4 |
| RAM | 24 GB |
| GPU | Apple MPS (Metal Performance Shaders) |
| Training GPU | Kaggle (NVIDIA T4 / P100) |
| Python | 3.10+ |
| PyTorch | 2.x |
| TorchVision | 0.x |
| Ultralytics | Latest |
| OpenCV | cv2 |
if torch.backends.mps.is_available():
    device = torch.device("mps")   # Mac M4 GPU
elif torch.cuda.is_available():
    device = torch.device("cuda")  # NVIDIA GPU
else:
    device = torch.device("cpu")

Drone-Detection/
│
├── 📓 Notebooks
│ ├── fastercnn_drone_test.ipynb # Faster R-CNN inference & evaluation
│ ├── ssdnet.ipynb # SSD training (Kaggle)
│ ├── ssd_net_kaggle_main.ipynb # SSD local inference & evaluation
│ ├── train_yolov8s_multiclass.ipynb # YOLOv8s multi-class training
│ ├── view_yolov8s_results.ipynb # YOLOv8s inference & results
│ └── yolo_main_kaggle.ipynb # YOLO custom inference (fixed labels)
│
├── 🤖 Model Weights
│ ├── fasterrcnn_drone.pth # Faster R-CNN weights (~76 MB)
│ ├── ssd_drone_model_kaggle.pth # SSD weights (~11 MB)
│ └── drone_yolov8s_final.pt # YOLOv8s weights (~22 MB)
│
├── 📊 Dataset
│ ├── drone-dataset/ # Local dataset (train/valid/test)
│ │ ├── train/images/ (10,799 imgs)
│ │ ├── valid/images/ (603 imgs)
│ │ └── test/images/ (596 imgs)
│ └── drone_dataset.yaml # Dataset config for YOLO
│
└── 📈 Results
├── results/ # Training result plots
├── runs/ # YOLO training runs
└── ssd_test_results.png # SSD detection visualization
- YOLOv8: Jocher, G. et al. (2023). Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics
- Faster R-CNN: Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS.
- SSD: Liu, W. et al. (2016). SSD: Single Shot MultiBox Detector. ECCV.
- MobileNetV3: Howard, A. et al. (2019). Searching for MobileNetV3. ICCV.
- FPN: Lin, T.Y. et al. (2017). Feature Pyramid Networks for Object Detection. CVPR.
- Dataset: Roboflow Universe — Drone Detection Dataset. https://universe.roboflow.com/ahmedmohsen/drone-detection-new-peksv
Prepared for academic presentation — February 2026