Commit ff2374f: Add double anchor
1 parent: 797651e

6 files changed: +50 -6

README.md (+6)
@@ -105,8 +105,14 @@ semi-supervised training](http://openaccess.thecvf.com/content_CVPR_2019/papers/
  - [Adaptive NMS: Refining Pedestrian Detection in a Crowd](https://arxiv.org/abs/1904.03629) [[Notes](paper_notes/adaptive_nms.md)] <kbd>CVPR 2019 oral</kbd> [crowd detection, NMS]
  - [Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd](https://arxiv.org/abs/1807.08407) [[Notes](paper_notes/orcnn.md)] <kbd>ECCV 2018</kbd> [crowd detection]
  - [CrowdDet: Detection in Crowded Scenes: One Proposal, Multiple Predictions](https://arxiv.org/abs/2003.09163) [[Notes](paper_notes/crowd_det.md)] <kbd>CVPR 2020 oral</kbd> [crowd detection, Megvii]
+ - [RR-NMS: NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing](https://arxiv.org/abs/2003.12729) [[Notes](paper_notes/rr_nms.md)] <kbd>CVPR 2020</kbd>
+ - [Double Anchor R-CNN for Human Detection in a Crowd](https://arxiv.org/abs/1909.09998) [[Notes](paper_notes/double_anchor.md)] [head-body bundle]
+ - [Review: AP vs MR](paper_notes/ap_mr.md)
+ - [Precise Detection in Densely Packed Scenes](https://arxiv.org/abs/1904.00853) <kbd>CVPR 2019</kbd> [crowd detection, no occlusion]
  - [TLL: Small-scale Pedestrian Detection Based on Somatic Topology Localization and Temporal Feature Aggregation](https://arxiv.org/abs/1807.01438) <kbd>ECCV 2018</kbd>
  - [Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels](https://arxiv.org/abs/2010.03506) [mono3D, Daniel Cremers, TUM]
+ - [ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://openreview.net/forum?id=YicbFdNTTy) [[Notes](paper_notes/vit.md)] <kbd>ICLR 2021</kbd>
+ - [BYOL: Bootstrap your own latent: A new approach to self-supervised Learning](https://arxiv.org/abs/2006.07733) [self-supervised]
  - [SAFENet: Self-Supervised Monocular Depth Estimation with Semantic-Aware Feature Extraction](https://arxiv.org/abs/2010.02893) [Monodepth, semantics, Naver labs]
  - [Toward Interactive Self-Annotation For Video Object Bounding Box: Recurrent Self-Learning And Hierarchical Annotation Based Framework](https://openaccess.thecvf.com/content_WACV_2020/papers/Le_Toward_Interactive_Self-Annotation_For_Video_Object_Bounding_Box_Recurrent_Self-Learning_WACV_2020_paper.pdf) <kbd>WACV 2020</kbd>

paper_notes/adaptive_nms.md (+2 -1)
@@ -17,7 +17,8 @@ Both [RepLoss](rep_loss.md) and [Occlusion aware R-CNN](orcnn.md) proposes addit
  - On top of RPN for two-stage detectors, taking the objectness predictions, bounding box predictions and conv features as input.

  #### Technical details
- - Combining adaptive NMS and soft-NMS has minor or even negative improvements on the MR^-2 metric (0.01 to 1 FPPI). The reason may be that the benefit happens beyond 1 FPPI and thus does not improve the metric.
+ - [AP vs MR](ap_mr.md) in object detection.
+ - Combining adaptive NMS and soft-NMS has minor or even negative improvements on the MR^-2 metric (0.01 to 1 FPPI). The reason may be that the benefit happens beyond 1 FPPI and thus does not improve the metric.
  - Reasonable: Bare (0 to 0.1), Partial (0.1 to 0.35), Heavy (0.35 to 1).

  #### Notes

paper_notes/crowd_det.md (+3 -2)
@@ -28,8 +28,9 @@ Current works are either too complex or less effective for handling highly overl

  #### Technical details
  - Tested on COCO to verify that there is no performance degradation, rather than to show significant performance improvement.
- - AP is more sensitive to recall. MR is very sensitive to FP with high confidence.
+ - [AP vs MR](ap_mr.md) in object detection.
+ - AP is more sensitive to recall. MR is very sensitive to FP with high confidence.

  #### Notes
- - Questions and notes on how to improve/revise the current work
+ - [Pytorch code on Github](https://github.com/Purkialo/CrowdDet)

paper_notes/crowdhuman.md (+1 -2)
@@ -22,5 +22,4 @@ Previous datasets are more likely to annotate crowd human as a whole ignored reg
  - Previous datasets (CityPerson) annotate from the top of the head to the middle of the feet and generate a full bbox with a fixed aspect ratio of 0.41.

  #### Notes
- - Questions and notes on how to improve/revise the current work
-
+ - [AP vs MR](ap_mr.md) in object detection.

paper_notes/double_anchor.md (new file, +36)
@@ -0,0 +1,36 @@
# [Double Anchor R-CNN for Human Detection in a Crowd](https://arxiv.org/abs/1909.09998)

_October 2020_

tl;dr: Double Anchor RPN is developed to capture body and head parts in pairs.

#### Overall impression
Crowd occlusion is challenging for two reasons:

- When people overlap largely with each other, the semantic features of different instances interweave, making it hard for detectors to discriminate instance boundaries.
- Even when a detector successfully differentiates and detects the instances, the detections may be suppressed by NMS.

The intuition behind the paper is simple: compared with the human body, the head usually has a smaller scale, less overlap and a better view in real-world images, and thus is more robust to pose variations and crowd occlusions.

One main challenge in crowd detection is high-score false positives. --> However, safety-wise this does not seem to be an issue for autonomous driving.
#### Key ideas
- **Double Anchor RPN** outputs two regressed offsets (one for the body, one for the head) and a single score per anchor.
- Proposal Crossover:
    - Two branches: a head-body branch that regresses head and body from the head anchor, and a body-head branch that regresses head and body from the body anchor.
    - Body proposals from the head-body branch are of low quality. Thus, perform an IoU check of the body proposals between the two branches, and replace the body proposal from the head-body branch (lower quality) with the one from the body-head branch (higher quality).
- Feature aggregation:
    - Perform RoIAlign on the two proposals separately, then concatenate.
    - Predict head bbox loc/score and body bbox loc/score.
- Joint NMS (see the sketches after this list):
    - Weighted score from both the head bbox score and the body bbox score.
    - Suppress a pair if either its head IoU or its body IoU exceeds a certain threshold.
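Two minimal sketches of the ideas above, assuming the standard `torchvision` RoIAlign op and NumPy; the function names, the weight `alpha`, the 7x7 RoI size, the 1/16 feature stride, and the thresholds are illustrative assumptions, not values from the paper.

Feature aggregation for a batch of head-body proposal pairs:

```python
import torch
from torchvision.ops import roi_align

def aggregate_pair_features(fmap, head_rois, body_rois):
    """RoIAlign the head and body proposals of each pair separately,
    then concatenate along the channel dimension."""
    # fmap: [N, C, H, W]; rois: [K, 5] rows of (batch_idx, x1, y1, x2, y2)
    head_feat = roi_align(fmap, head_rois, output_size=(7, 7), spatial_scale=1.0 / 16)
    body_feat = roi_align(fmap, body_rois, output_size=(7, 7), spatial_scale=1.0 / 16)
    return torch.cat([head_feat, body_feat], dim=1)  # [K, 2C, 7, 7]
```

Joint NMS as greedy suppression on the weighted pair score:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box [x1, y1, x2, y2] and an array of boxes [K, 4]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def joint_nms(head_boxes, body_boxes, head_scores, body_scores,
              alpha=0.5, iou_thresh=0.5):
    """Rank head-body pairs by a weighted score, then suppress a pair if
    EITHER its head IoU or its body IoU with a kept pair is too high."""
    scores = alpha * head_scores + (1.0 - alpha) * body_scores
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        overlap = np.maximum(iou(head_boxes[i], head_boxes[rest]),
                             iou(body_boxes[i], body_boxes[rest]))
        order = rest[overlap <= iou_thresh]
    return keep
```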
#### Technical details
- [AP vs MR](ap_mr.md) in object detection.
- Soft-NMS keeps lots of long-tail detection results to improve recall, at the expense of bringing in more false positives, which leads to a negative impact on human detection, especially on the MR metric (where high-score FPs are the bottleneck).
- Note that in deployment, neither MR nor AP is a good metric, as we have to select one working point.

#### Notes
- [Review on Zhihu](https://zhuanlan.zhihu.com/p/95253096)

paper_notes/rep_loss.md (+2 -1)
@@ -25,7 +25,8 @@ Visualization before NMS seems to be a powerful debugging tool.
  - Pred bboxes are much denser than the GT boxes, so a pair of two pred bboxes is more likely to have a large overlap than a pair of one predicted box and one GT box. Thus RepBox is more likely to have outliers than RepGT.

  #### Technical details
- - Log-average miss rate over false positives per image (MR^-2) is usually the KPI for pedestrian detection. It looks like a FROC curve. Miss rate = 1 - recall. The MR curve is plotted with both log-x and log-y axes. The lower the better.
+ - [AP vs MR](ap_mr.md) in object detection.
+ - Log-average miss rate over false positives per image (MR^-2) is usually the KPI for pedestrian detection. It looks like a FROC curve. Miss rate = 1 - recall. The MR curve is plotted with both log-x and log-y axes. The lower the better (see the sketch after this list).
  - Occlusion: occ > 0.1. Occ is calculated as 1 - (visible bbox area / full bbox area). Crowd occlusion: occ > 0.1, IoU > 0.1.
  - Occlusion < 35%. [0, 10%]: bare, [10%, 35%]: partial, [35%, 1): heavy. Bare and partial occlusions are **reasonable** occlusions.
  - FP: background (0 GT under 0.1 IoU), localization error (1 GT), and crowd error (2+ GT).
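A minimal sketch of MR^-2, assuming the Caltech-style convention of nine FPPI reference points log-spaced on [1e-2, 1e0]; interpolation details vary across implementations, and both function names here are illustrative, not from the paper.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """MR^-2: sample the miss-rate curve at nine FPPI points evenly spaced
    in log space on [1e-2, 1e0], average in log space, and exponentiate
    back. Inputs trace the curve, sorted by increasing FPPI. Lower is better."""
    fppi, miss_rate = np.asarray(fppi), np.asarray(miss_rate)
    samples = []
    for ref in np.logspace(-2.0, 0.0, num=9):
        below = np.where(fppi <= ref)[0]
        # Use the miss rate at the largest FPPI not exceeding the reference;
        # fall back to the first curve point when none qualifies.
        samples.append(miss_rate[below[-1]] if below.size else miss_rate[0])
    return np.exp(np.mean(np.log(np.maximum(samples, 1e-10))))

def occlusion(visible_area, full_area):
    """occ = 1 - (visible bbox area / full bbox area), as defined above."""
    return 1.0 - visible_area / full_area
```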
