Description
Your repository and paper have been enormously useful to me, but I've come across a few issues. For context, I've been trying to use the detection branch to train multimodal object-detection (OD) models.
Issue 1: bounding boxes are sometimes inconsistent, missing, or incorrect.
https://github.com/user-attachments/assets/fd8e7809-b784-4958-a113-57a56571a481
The clip above is from Town07. In the first second, a few bicycles each have multiple bounding boxes. A few seconds later, motorcyclists appear without any bounding boxes, and towards the end of the clip there are cars either missing boxes entirely or with boxes that lag behind the actual object.
Inconsistent boxes aren't unique to this town; there's also at least one instance in the dataset of a motorcyclist bounding box that occupies 49% of the frame's pixels.
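As an aside, degenerate boxes like the 49%-of-frame one are easy to flag with a quick sanity check over the annotations. This is a hypothetical sketch (the function name, box format `(x1, y1, x2, y2)` in pixels, and the 25% threshold are all my assumptions, not anything from the SELMA codebase):

```python
def suspicious_boxes(boxes, frame_w, frame_h, max_frac=0.25):
    """Return indices of boxes covering more than max_frac of the frame area.

    boxes: iterable of (x1, y1, x2, y2) pixel coordinates.
    """
    frame_area = frame_w * frame_h
    flagged = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area / frame_area > max_frac:
            flagged.append(i)
    return flagged

# The second box covers ~48% of a 1280x720 frame, so it gets flagged.
boxes = [(10, 10, 50, 40), (0, 0, 630, 700)]
print(suspicious_boxes(boxes, 1280, 720))  # [1]
```

Running something like this over the whole dataset would give a quick count of how widespread the oversized-box problem is.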
Attempted fix
I've narrowed the issue down to the __getitem__ method of SelmaDrones. For a few of the objects missing boxes, I verified that the semantic segmentation map matches the class I see visually, so the boxes are being dropped somewhere in the translation from CARLA's 3D bounding boxes to 2D boxes in screen space.
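For what it's worth, a common cause of dropped boxes in 3D-to-2D conversion is rejecting any box with a corner outside the image instead of clipping it. Below is a minimal sketch of the check I'd expect after projecting the eight 3D corners to pixel coordinates; the function and variable names are mine, not the actual SelmaDrones code:

```python
import numpy as np

def corners_to_2d_box(uv, img_w, img_h, min_visible=1):
    """Convert projected 3D box corners to a clipped 2D box.

    uv: (8, 2) array of projected corner pixel coordinates.
    Returns (x1, y1, x2, y2) clipped to the image, or None if the
    visible part is degenerate. A box should only be discarded when it
    lies entirely off-screen, not because some corners project outside
    the frame.
    """
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    # Clip to image bounds instead of rejecting partially visible boxes.
    x1, x2 = np.clip([x1, x2], 0, img_w - 1)
    y1, y2 = np.clip([y1, y2], 0, img_h - 1)
    if x2 - x1 < min_visible or y2 - y1 < min_visible:
        return None
    return float(x1), float(y1), float(x2), float(y2)

# Corners spilling past the left edge still yield a valid clipped box:
uv = np.array([[-30, 100], [-30, 160], [80, 100], [80, 160],
               [-25, 110], [-25, 150], [75, 110], [75, 150]])
print(corners_to_2d_box(uv, 1280, 720))  # (0.0, 100.0, 80.0, 160.0)
```

If the dataset code rejects boxes whenever any corner falls outside the frustum or frame, that would explain both the missing boxes at frame edges and could interact badly with corners projected behind the camera (which can produce the huge, lagging boxes).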
Issue 2: Class numbers and mapping
Your paper mentions using 8 classes for object detection, but the code uses 9: ["Person", "Car", "Truck", "Bus", "Train", "Motorcycle", "Bicycle", "Motorcyclist", "Bicyclist"], mapped to the numbers 0-8. This is an issue because 0 is reserved for the background class in the torchvision models used. Line 75 in train.py instantiates the model as model = fasterrcnn_mobilenet_v3_large_fpn(num_classes=9), but the documentation for fasterrcnn_mobilenet_v3_large_fpn says that num_classes should be the number of real classes plus one (for background), so it should read model = fasterrcnn_mobilenet_v3_large_fpn(num_classes=10).
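Concretely, the fix is to shift the class ids up by one and pass the +1 count to the model. A sketch (the constant names here are mine):

```python
# torchvision detection models reserve label 0 for the implicit background
# class, so the 9 real classes should get ids 1..9 and num_classes = 10.
CLASSES = ["Person", "Car", "Truck", "Bus", "Train",
           "Motorcycle", "Bicycle", "Motorcyclist", "Bicyclist"]

# Map each class name to a 1-based label; 0 stays background.
LABEL_MAP = {name: i + 1 for i, name in enumerate(CLASSES)}
NUM_CLASSES = len(CLASSES) + 1  # 10: 9 real classes + background

print(LABEL_MAP["Person"], LABEL_MAP["Bicyclist"], NUM_CLASSES)  # 1 9 10
```

The model would then be built with fasterrcnn_mobilenet_v3_large_fpn(num_classes=NUM_CLASSES), and the training targets' labels need the same +1 shift, otherwise every "Person" annotation gets silently treated as background.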