
Commit 41ad4f0

Update topic in crowd detection
1 parent b163398 commit 41ad4f0

9 files changed: +18 −15 lines changed

README.md

+6-4
@@ -39,7 +39,7 @@ Here is [a list of trustworthy sources of papers](trusty.md) in case I ran out o
 ## My Review Posts by Topics
 I regularly update [my blog in Toward Data Science](https://medium.com/@patrickllgc).
 
-- [Object detection in crowded scenes](??) ([related paper notes](topic_crowd_detection.md))
+- [Deep-Learning based Object detection in Crowded Scenes](https://towardsdatascience.com/deep-learning-based-object-detection-in-crowded-scenes-1c9fddbd7bc4) ([related paper notes](topic_crowd_detection.md))
 - [Monocular Bird’s-Eye-View Semantic Segmentation for Autonomous Driving](https://towardsdatascience.com/monocular-birds-eye-view-semantic-segmentation-for-autonomous-driving-ee2f771afb59)
 - [Deep Learning in Mapping for Autonomous Driving](https://towardsdatascience.com/deep-learning-in-mapping-for-autonomous-driving-9e33ee951a44)
 - [Monocular Dynamic Object SLAM in Autonomous Driving](https://towardsdatascience.com/monocular-dynamic-object-slam-in-autonomous-driving-f12249052bf1)
@@ -56,6 +56,8 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
 - [Scaled-YOLOv4: Scaling Cross Stage Partial Network](https://arxiv.org/abs/2011.08036) [[Notes](paper_notes/scaled_yolov4.md)] [yolo]
 - [PP-YOLO: An Effective and Efficient Implementation of Object Detector](https://arxiv.org/abs/2007.12099) [yolo, paddle-paddle, baidu]
 - [Sparse R-CNN: End-to-End Object Detection with Learnable Proposals](https://arxiv.org/abs/2011.12450) [Transformer head]
+- [Rethinking Transformer-based Set Prediction for Object Detection](https://arxiv.org/abs/2011.10881) [transformers, Kris Kitani]
+- [UP-DETR: Unsupervised Pre-training for Object Detection with Transformers](https://arxiv.org/abs/2011.09094) [transformers]
 - [Unsupervised Monocular Depth Learning in Dynamic Scenes](https://arxiv.org/abs/2010.16404) [[Notes](paper_notes/learn_depth_and_motion.md)] <kbd>CoRL 2020</kbd> [LearnK improved ver, Google]
 - [MoNet3D: Towards Accurate Monocular 3D Object Localization in Real Time](https://arxiv.org/abs/2006.16007) [[Notes](paper_notes/monet3d.md)] <kbd>ICML 2020</kbd> [Mono3D, pairwise relationship]
 - [Argoverse: 3D Tracking and Forecasting with Rich Maps](https://arxiv.org/abs/1911.02620) [[Notes](paper_notes/argoverse.md)] <kbd>CVPR 2019</kbd> [HD maps, dataset, CV lidar]
@@ -144,7 +146,7 @@ Feature Extraction](https://arxiv.org/abs/2010.02893) [monodepth, semantics, Nav
 - [Locating Objects Without Bounding Boxes](Locating Objects Without Bounding Boxes) <kbd>CVPR 2019</kbd>
 - [Deep Evidential Regression](https://papers.nips.cc/paper/2020/file/aab085461de182608ee9f607f3f7d18f-Paper.pdf) <kbd>NeurIPS 2020</kbd> [one-pass aleatoric/epistemic uncertainty]
 - [Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection](https://arxiv.org/abs/2011.12885) [focal loss]
-- [Rethinking Transformer-based Set Prediction for Object Detection](https://arxiv.org/abs/2011.10881) [transformers, Kris Kitani]
+
 
 ## 2020-10 (14)
 - [TSM: Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383) [[Notes](paper_notes/tsm.md)] <kbd>ICCV 2019</kbd> [Song Han, video, object detection]
@@ -580,7 +582,7 @@ Crosswalk Behavior](http://openaccess.thecvf.com/content_ICCV_2017_workshops/pap
 - [SfMLearner: Unsupervised Learning of Depth and Ego-Motion from Video](https://people.eecs.berkeley.edu/~tinghuiz/projects/SfMLearner/cvpr17_sfm_final.pdf) [[Notes](paper_notes/sfm_learner.md)] <kbd>CVPR 2017</kbd>
 - [Monodepth2: Digging Into Self-Supervised Monocular Depth Estimation](https://arxiv.org/abs/1806.01260) [[Notes](paper_notes/monodepth2.md)] <kbd>ICCV 2019</kbd> [Niantic]
 - [DeepSignals: Predicting Intent of Drivers Through Visual Signals](https://arxiv.org/pdf/1905.01333.pdf) [[Notes](paper_notes/deep_signals.md)] <kbd>ICRA 2019</kbd> (@Uber, turn signal detection)
-- [FCOS: Fully Convolutional One-Stage Object Detection](https://arxiv.org/pdf/1904.01355.pdf) [[Notes](paper_notes/fcos.md)] <kbd>ICCV 2019</kbd> [Chunhua Shen]
+- [FCOS: Fully Convolutional One-Stage Object Detection](https://arxiv.org/abs/1904.01355) [[Notes](paper_notes/fcos.md)] <kbd>ICCV 2019</kbd> [Chunhua Shen]
 - [Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving](https://arxiv.org/abs/1906.06310) [[Notes](paper_notes/pseudo_lidar++.md)] <kbd>ICLR 2020</kbd>
 - [MMF: Multi-Task Multi-Sensor Fusion for 3D Object Detection](http://www.cs.toronto.edu/~byang/papers/mmf.pdf) [[Notes](paper_notes/mmf.md)] <kbd>CVPR 2019</kbd> (@Uber, sensor fusion)

@@ -612,7 +614,7 @@ for Road Detection Algorithms](http://www.cvlibs.net/publications/Fritsch2013ITS
 - [MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving](https://arxiv.org/pdf/1612.07695.pdf) [[Notes](paper_notes/multinet_raquel.md)]
 - [Optimizing the Trade-off between Single-Stage and Two-Stage Object Detectors using Image Difficulty Prediction](https://arxiv.org/pdf/1803.08707.pdf) (Very nice illustration of 1 and 2 stage object detection)
 - [Light-Head R-CNN: In Defense of Two-Stage Object Detector](https://arxiv.org/pdf/1711.07264.pdf) [[Notes](paper_notes/lighthead_rcnn.md)] (from Megvii)
-- [CSP: High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection](https://arxiv.org/pdf/1904.02948.pdf) [[Notes](paper_notes/csp_pedestrian.md)] <kbd>CVPR 2019</kbd> [center and scale prediction, anchor-free, near SOTA pedestrian]
+- [CSP: High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection](https://arxiv.org/abs/1904.02948) [[Notes](paper_notes/csp_pedestrian.md)] <kbd>CVPR 2019</kbd> [center and scale prediction, anchor-free, near SOTA pedestrian]
 - [Review of Anchor-free methods (知乎Blog) 目标检测：Anchor-Free时代](https://zhuanlan.zhihu.com/p/62103812) [Anchor free深度学习的目标检测方法](https://zhuanlan.zhihu.com/p/64563186) [My Slides on CSP](https://docs.google.com/presentation/d/1_dUfxv63108bZXUnVYPIOAdEIkRZw5BR9-rOp-Ni0X0/)
 - [DenseBox: Unifying Landmark Localization with End to End Object Detection](https://arxiv.org/pdf/1509.04874.pdf)
 - [CornerNet: Detecting Objects as Paired Keypoints](https://arxiv.org/pdf/1808.01244.pdf) [[Notes](paper_notes/cornernet.md)] <kbd>ECCV 2018</kbd>

paper_notes/agg_loss.md

+1-1
@@ -11,7 +11,7 @@ Both [RepLoss](rep_loss.md) and [AggLoss](agg_loss.md) proposes additional penal
 
 #### Key ideas
 - **AggLoss**
-  - for GT subset associated with more than one anchor, enforce SL1 loss between the avg prediction of the anchors and the corresponding GT.
+  - If one GT bbox is associated with more than one anchor, this encourages the predictions from all these anchors to be the same: it enforces an SL1 loss between the avg prediction of the anchors and the corresponding GT. --> There seems to be something wrong in the paper's formulation. Shouldn't it take the avg of the abs (~SL1 loss) of the diffs, rather than the abs of the avg diff?
 - **PORoI** (Part occlusion aware RoI pooling)
   - A part based model: inductive bias to introduce prior structure information of the human body with visible prediction into the network.
   - The human body is divided into 5 parts, and each region U is compared with the visible region of the bbox (V) to find IoU (intersection over U) to generate a binary visibility score.
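
As a sanity check on the question raised in the AggLoss note above: the two formulations do differ in general. A sketch, writing $\ell$ for the SL1 loss and $p_1, \dots, p_n$ for the anchor predictions matched to a GT $g$ (notation mine, not the paper's):

```latex
% SL1 of the average diff (the formulation as described in the note):
\ell\Big(\tfrac{1}{n}\sum_{i=1}^{n} p_i - g\Big)
% Average of the per-anchor SL1 terms (the alternative the note suggests):
\tfrac{1}{n}\sum_{i=1}^{n} \ell\,(p_i - g)
```

Since SL1 is convex, Jensen's inequality makes the first at most the second; e.g. with $p_1 = g + d$ and $p_2 = g - d$ the first is zero while the second is not, so penalizing only the average diff leaves anchors scattered symmetrically around the GT unpenalized.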

paper_notes/crowd_det.md

+1-1
@@ -16,7 +16,7 @@ Current works are either too complex or less effective for handling highly overl
 #### Key ideas
 - Multiple instance prediction: The predictions of nearby proposals are expected to infer the **same set of instances**, rather than distinguishing individuals.
   - Some cases are inherently difficult and ambiguous to detect and differentiate, such as ◩ or ◪.
-  - this also greatly eases the learning ini crowded scene.
+  - this also greatly eases the learning in crowded scenes.
 - Each anchor predicts K (K=2) bboxes. When K=1, CrowdDet reduces to normal object detection.
 - EMD (earth mover's distance) loss
   - For all permutations of matching, select the best matching one with the smallest loss (see the sketch below).
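
A minimal sketch of that min-over-permutations matching for a single proposal's K predictions (hypothetical helper; a plain smooth-L1 box loss stands in for the paper's full classification + regression loss):

```python
import itertools

import torch
import torch.nn.functional as F

def emd_loss(preds: torch.Tensor, gts: torch.Tensor) -> torch.Tensor:
    """preds, gts: (K, 4) boxes for one proposal (pad gts with dummy
    targets if fewer than K instances are present); returns the loss of
    the best one-to-one assignment of predictions to ground-truth boxes."""
    K = preds.shape[0]
    best = None
    # Enumerate all K! matchings (cheap for K=2), keep the smallest loss.
    for perm in itertools.permutations(range(K)):
        loss = sum(F.smooth_l1_loss(preds[i], gts[j])
                   for i, j in enumerate(perm))
        best = loss if best is None else torch.minimum(best, loss)
    return best
```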

paper_notes/csp_pedestrian.md

+1-1
@@ -1,4 +1,4 @@
-# [High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection](https://arxiv.org/pdf/1904.02948.pdf) (center and scale prediction)
+# [High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection](https://arxiv.org/abs/1904.02948) (center and scale prediction)
 
 _April 2019_
 
paper_notes/fcos.md

+1-1
@@ -1,4 +1,4 @@
-# [FCOS: Fully Convolutional One-Stage Object Detection](https://arxiv.org/pdf/1904.01355.pdf)
+# [FCOS: Fully Convolutional One-Stage Object Detection](https://arxiv.org/abs/1904.01355)
 
 _June 2019_
 
paper_notes/r2_nms.md

+1-1
@@ -8,7 +8,7 @@ tl;dr: Predict both full bbox and visible region and use visible region for NMS.
 This paper resembles that of [Visibility guided NMS](vg_nms.md), which focuses on crowd vehicle detection. The contribution of this paper is the introduction of the PPFE (paired proposal feature extractor) module. This takes the feature aggregation in [Double Anchor](double_anchor.md) to a new level.
 
 #### Key ideas
-- Visible parts of pedestrians by definition **suffer much less from occlusio**n, a relative low IoU thresh sufficiently removes the redundant bboxes locating the same pedestrian, and meanwhile avoids the large number of FP. --> This has the same motivation as [Double Anchor](double_anchor.md).
+- Visible parts of pedestrians by definition **suffer much less from occlusion**: a relatively low IoU threshold sufficiently removes the redundant bboxes locating the same pedestrian, and meanwhile avoids a large number of FPs. --> This has the same motivation as [Double Anchor](double_anchor.md).
 - The IoU between the visible regions of two bboxes is a better indicator of whether two full-body bboxes belong to the same pedestrian.
 - The visible bbox and full bbox have high overlap and are not as different as head-body bboxes, and thus can more reliably be regressed from the same anchor, as compared to [Double Anchor](double_anchor.md).
 - NPM (Native pair model) + PPFE (paired proposal feature extractor / feature aggregator) = PBM (paired box model).

paper_notes/rep_loss.md

+1-1
@@ -19,7 +19,7 @@ Visualization before NMS seems to be a powerful debugging tool.
 - Occlusion: inter-class and intra-class. Intra-class occlusion is also named crowd occlusion, which happens when an object is occluded by objects of the same category.
 - Repulsion terms:
   - **RepGT**: Intersection over GT (to avoid the prediction cheating by enlarging the pred bbox) with smooth ln loss from [UnitBox](https://arxiv.org/abs/1608.01471). It penalizes overlap with non-target GT objects.
-  - **RepBox**: the IoU region between two predicted bboxes with different designated targets needs to be small. This means the predicted bboes with diff regression targets are more likely to be merged into one after NMS.
+  - **RepBox**: encourages the IoU between two predicted bboxes with different designated targets to be small, so that predicted bboxes with diff regression targets are less likely to be merged into one after NMS.
   - The selection of IoU or IoG is due to their boundedness within [0, 1].
   - Smooth ln loss: more robust to outliers (see the sketch below).
   - Since pred bboxes are much denser than GT boxes, a pair of pred bboxes is more likely to have a large overlap than a pair of one predicted box and one GT box. Thus RepBox is more likely to have outliers than RepGT.
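
For reference, a sketch of the smooth ln penalty these repulsion terms apply to an overlap $x \in [0, 1)$ (IoG for RepGT, IoU for RepBox), as I recall it from the RepLoss paper (verify the exact branches against the original; $\sigma \in [0, 1)$ trades off sensitivity to outliers):

```latex
\mathrm{Smooth}_{\ln}(x) =
\begin{cases}
-\ln(1 - x) & x \le \sigma \\
\frac{x - \sigma}{1 - \sigma} - \ln(1 - \sigma) & x > \sigma
\end{cases}
```

The boundedness of IoG/IoU within [0, 1] keeps $x$ in the domain of $-\ln(1 - x)$, and the linear branch beyond $\sigma$ is what tempers the outlier-heavy RepBox pairs.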

paper_notes/vg_nms.md

+2-2
@@ -1,4 +1,4 @@
-# [Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes](https://ml4ad.github.io/files/papers/Visibility%20Guided%20NMS:%20Efficient%20Boosting%20of%20Amodal%20Object%20Detection%20in%20Crowded%20Traffic%20Scenes.pdf)
+# [Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes](https://arxiv.org/abs/2006.08547)
 
 _June 2020_
 
@@ -14,7 +14,7 @@ This is very similar to [R2 NMS](r2_nms.md) in CVPR 2020, which focuses on crowd
 #### Key ideas
 - Training the object detector with 4 additional attributes. Thus it predicts both the visible part (pixel-based bbox) and the entire object (amodal bbox).
 - VG-NMS: NMS is performed on the pixel-based bboxes that describe the actually visible parts, but outputs the amodal bboxes that belong to the indices that are retained during pixel-based NMS (see the sketch below).
-  - Pixel based modal bbox can be generated from segmentation mask or ordered amodal bbox.
+  - Pixel-based modal bboxes can be generated from segmentation masks. --> Or they could be generated from the ordering of amodal bboxes based on geometric priors; for example, a bbox with a larger ymax is closer to the camera.
 
 #### Technical details
 - **don't care** objects: KITTI ignores 25x25 pixels, and Cityscapes ignores 10x10 pixels.
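
A minimal sketch of the VG-NMS selection step described above, assuming paired per-detection boxes and torchvision's standard NMS (function and argument names are mine, not the paper's):

```python
import torch
from torchvision.ops import nms

def visibility_guided_nms(amodal_boxes: torch.Tensor,
                          visible_boxes: torch.Tensor,
                          scores: torch.Tensor,
                          iou_thresh: float = 0.5):
    """amodal_boxes, visible_boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,)."""
    # Suppress duplicates on the visible (pixel-based) boxes: the visible
    # parts of mutually occluding objects overlap far less than their amodal
    # extents, so a normal IoU threshold separates them instead of merging.
    keep = nms(visible_boxes, scores, iou_thresh)
    # Report the amodal (full-extent) boxes for the surviving indices.
    return amodal_boxes[keep], scores[keep]
```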

topic_crowd_detection.md

+4-3
@@ -1,9 +1,10 @@
 # Detection in Crowded Scenes
 
 - [RepLoss: Repulsion Loss: Detecting Pedestrians in a Crowd](https://arxiv.org/abs/1711.07752) [[Notes](paper_notes/rep_loss.md)] <kbd>CVPR 2018</kbd> [crowd detection, Megvii]
-- [Adaptive NMS: Refining Pedestrian Detection in a Crowd](https://arxiv.org/abs/1904.03629) [[Notes](paper_notes/adaptive_nms.md)] <kbd>CVPR 2019 oral</kbd> [crowd detection, NMS]
 - [AggLoss: Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd](https://arxiv.org/abs/1807.08407) [[Notes](paper_notes/agg_loss.md)] <kbd>ECCV 2018</kbd> [crowd detection]
-- [CrowdDet: Detection in Crowded Scenes: One Proposal, Multiple Predictions](https://arxiv.org/abs/2003.09163) [[Notes](paper_notes/crowd_det.md)] <kbd>CVPR 2020 oral</kbd> [crowd detection, Megvii]
-- [R2-NMS: NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing](https://arxiv.org/abs/2003.12729) [[Notes](paper_notes/r2_nms.md)] <kbd>CVPR 2020</kbd>
+- [Adaptive NMS: Refining Pedestrian Detection in a Crowd](https://arxiv.org/abs/1904.03629) [[Notes](paper_notes/adaptive_nms.md)] <kbd>CVPR 2019 oral</kbd> [crowd detection, NMS]
 - [Double Anchor R-CNN for Human Detection in a Crowd](https://arxiv.org/abs/1909.09998) [[Notes](paper_notes/double_anchor.md)] [head-body bundle]
+- [R2-NMS: NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing](https://arxiv.org/abs/2003.12729) [[Notes](paper_notes/r2_nms.md)] <kbd>CVPR 2020</kbd>
 - [VG-NMS: Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes](https://arxiv.org/abs/2006.08547) [[Notes](paper_notes/vg_nms.md)] <kbd>NeurIPS 2019 workshop</kbd> [Crowded scene, NMS, Daimler]
+- [CrowdDet: Detection in Crowded Scenes: One Proposal, Multiple Predictions](https://arxiv.org/abs/2003.09163) [[Notes](paper_notes/crowd_det.md)] <kbd>CVPR 2020 oral</kbd> [crowd detection, Megvii]
+- [CSP: High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection](https://arxiv.org/abs/1904.02948) [[Notes](paper_notes/csp_pedestrian.md)] <kbd>CVPR 2019</kbd> [center and scale prediction, anchor-free, near SOTA pedestrian]
