
Commit f17631f: Add SDF label
1 parent 8c400d1

2 files changed: +55 -1 lines changed

README.md (+4 -1)
@@ -51,6 +51,9 @@ The list of resource in this [link](https://autonomous-driving.org/front/resourc
 
 ## CVPR 2020
 - [Online Depth Learning against Forgetting in Monocular Videos](http://openaccess.thecvf.com/content_CVPR_2020/papers/Zhang_Online_Depth_Learning_Against_Forgetting_in_Monocular_Videos_CVPR_2020_paper.pdf) <kbd>CVPR 2020</kbd>
+- [Leveraging Pre-Trained 3D Object Detection Models For Fast Ground Truth Generation](https://arxiv.org/abs/1807.06072) <kbd>ITSC 2018</kbd>
+- [DensePose: Dense Human Pose Estimation In The Wild](https://arxiv.org/abs/1802.00434) <kbd>CVPR 2018</kbd>
+- [Canonical Surface Mapping via Geometric Cycle Consistency](https://arxiv.org/abs/1907.10043) <kbd>ICCV 2019</kbd>
 - [Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume](http://openaccess.thecvf.com/content_CVPR_2020/papers/Johnston_Self-Supervised_Monocular_Trained_Depth_Estimation_Using_Self-Attention_and_Discrete_Disparity_CVPR_2020_paper.pdf) <kbd>CVPR 2020</kbd>
 - [Visual SLAM for Automated Driving: Exploring the Applications of Deep Learning](http://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w9/Milz_Visual_SLAM_for_CVPR_2018_paper.pdf)
 - [Just Go with the Flow: Self-Supervised Scene Flow Estimation](https://arxiv.org/abs/1912.00497) <kbd>CVPR 2020 oral</kbd> [Scene flow]
@@ -448,7 +451,7 @@ Crosswalk Behavior](http://openaccess.thecvf.com/content_ICCV_2017_workshops/pap
 - [BEV-IPM: Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image](https://ieeexplore.ieee.org/abstract/document/8814050) [[Notes](paper_notes/bev_od_ipm.md)] <kbd>IV 2019</kbd>
 - [ForeSeE: Task-Aware Monocular Depth Estimation for 3D Object Detection](https://arxiv.org/abs/1909.07701) [[Notes](paper_notes/foresee_mono3dod.md)] <kbd>AAAI 2020 oral</kbd> [successor to pseudo-lidar, mono 3DOD SOTA]
 - [Obj-dist: Learning Object-specific Distance from a Monocular Image](https://arxiv.org/abs/1909.04182) [[Notes](paper_notes/obj_dist_iccv2019.md)] <kbd>ICCV 2019</kbd> (xmotors.ai + NYU) [monocular distance]
-- [DisNet: A novel method for distance estimation from monocular camera](https://project.inria.fr/ppniv18/files/2018/10/paper22.pdf) [[Notes](paper_notes/disnet.md)] <kbd>IROS 2018</kbd> [monocular distance]
+- [DisNet: A novel method for distance estimation from monocular camera](https://project.inria.fr/ppniv18/files/2018/10/paper22.pdf) [[Notes](paper_notes/disnet.md)] <kbd>IROS 2018</kbd> [monocular distance]
 - [BirdGAN: Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles](https://arxiv.org/abs/1904.08494) [[Notes](paper_notes/birdgan.md)] <kbd>IROS 2019</kbd>
 - [Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints](https://arxiv.org/abs/1905.09970) [[Notes](paper_notes/shift_rcnn.md)] <kbd>ICIP 2019</kbd>
 - [3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare](http://openaccess.thecvf.com/content_cvpr_2018/papers/Kundu_3D-RCNN_Instance-Level_3D_CVPR_2018_paper.pdf) [[Notes](paper_notes/3d_rcnn.md)] <kbd>CVPR 2018</kbd>

paper_notes/sdflabel.md (new file, +51)
# [SDFLabel: Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors](https://arxiv.org/abs/1911.11288)

_September 2020_

tl;dr: Using differentiable rendering for automatic labeling.

#### Overall impression
The network uses 2D regression to predict a NOCS map and a shape vector. The NOCS map can be combined with lidar to extract a sparse 3D model, and the shape vector can be decoded into a full 3D model with DeepSDF. An approximate pose is then computed with 3D matching, and 2D and 3D losses are backpropagated for refinement.

Previous works such as [3D RCNN](3d_rcnn.md) and [RoI10D](roi10d.md) use PCA or a CAE (conv auto-encoder) to predict the shape of cars, which is not end-to-end differentiable. DeepSDF enables backpropagation onto a smooth shape manifold and is more powerful.
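To make the "smooth shape manifold" concrete, below is a minimal sketch of a DeepSDF-style decoder. The architecture (layer count, widths, activations) is an illustrative assumption rather than the paper's exact network; only the 3-dim latent code follows the note below.

```python
# Minimal sketch of a DeepSDF-style decoder (illustrative, not the paper's
# exact architecture). A shared MLP maps a shape latent code z plus a 3D query
# point to a signed distance; gradients flow back into z, which is what makes
# the shape manifold smooth and optimizable.
import torch
import torch.nn as nn

class DeepSDFDecoder(nn.Module):
    def __init__(self, latent_dim: int = 3, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),  # clamped signed distance
        )

    def forward(self, z: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_dim) shape codes; xyz: (B, N, 3) query points.
        z = z.unsqueeze(1).expand(-1, xyz.shape[1], -1)
        return self.net(torch.cat([z, xyz], dim=-1)).squeeze(-1)  # (B, N) SDF
```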
The autolabels are still not as good as lidar labels, but very close. The gap is smaller in BEV than in 3D, and BEV alone should be good enough for autonomous driving. The 3D drop may not be real, though, as the autolabels are **tight** 3D bboxes compared to the KITTI3D lidar labels.
#### Key ideas
- CSS (coordinate shape space): the combination of NOCS and DeepSDF.
	- NOCS (normalized object coordinate system) encodes pose and shape information, i.e., surface coordinates. With **dense** depth information, the 3D pose can be recovered from NOCS (see the alignment sketch after this list).
	- NOCS is a correspondence map.
	- NOCS encodes surface information (normals).
![](https://cdn-images-1.medium.com/max/1600/1*ZbN913AmRDsblCMJhtzC3g.png)
- DeepSDF embeds watertight models into a joint and compact shape space representation.
	- 11 CAD models are embedded into a 3D latent space (3-dim latent code) with DeepSDF.
	- DeepSDF is combined with differentiable rendering so that surface points are differentiable w.r.t. scale, pose, and latent code.
- Overall workflow (see the refinement sketch after this list):
	- The predicted latent code z is decoded, and the NOCS coordinates are calculated.
	- Lidar is projected onto the predicted NOCS map.
	- The initial pose and scale are estimated by 3D matching.
- Loss
	- 2D loss: render the decoded model with the SDF renderer to get a rendered NOCS map, and compare it with the predicted NOCS map.
	- 3D loss: correspondence with lidar points --> there is a drastic drop in the 3D metric when optimizing only in 2D.
- Verification: criteria similar to the 2D and 3D losses are used to vet each autolabel (see the check sketch after this list).
	- projective: mask IoU > 0.7
	- geometric: > 60% of lidar points within a 0.2 m band of the surface
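The alignment sketch referenced in the NOCS bullet above: with dense depth (or lidar returns) at pixels whose NOCS coordinates are known, pose and scale follow from a standard similarity alignment. A minimal NumPy version of the Umeyama solver, assuming the correspondences are already filtered for outliers (the paper's exact 3D matching procedure may differ):

```python
# Umeyama similarity alignment: find (s, R, t) with dst ≈ s * R @ src + t.
# src: (N, 3) NOCS coordinates; dst: (N, 3) back-projected depth/lidar points.
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (xs ** 2).sum(axis=1).mean()
    t = mu_d - s * R @ mu_s
    return s, R, t                             # scale, rotation, translation
```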
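Putting the workflow and the losses together, a hedged sketch of the render-and-compare refinement loop. `render_nocs` and `to_object` stand in for the paper's differentiable SDF renderer and the camera-to-object-frame transform; the optimizer, step count, and equal loss weighting are assumptions:

```python
# Refinement sketch: optimize shape code z, pose and scale so that (a) the
# rendered NOCS map matches the predicted NOCS map (2D loss), and (b) lidar
# points lie on the decoded SDF's zero level set (3D loss).
import torch

def refine(decoder, render_nocs, to_object, z0, pose0, scale0,
           nocs_pred, lidar_pts, steps=100, lr=1e-2):
    z = z0.detach().clone().requires_grad_(True)
    pose = pose0.detach().clone().requires_grad_(True)
    scale = scale0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z, pose, scale], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # 2D term: rendered NOCS map vs. the network's predicted NOCS map.
        nocs_render, mask = render_nocs(decoder, z, pose, scale)
        loss_2d = ((nocs_render - nocs_pred)[mask] ** 2).mean()
        # 3D term: lidar points, mapped into the normalized object frame,
        # should have zero signed distance. Dropping this term is what causes
        # the drastic drop in the 3D metric noted above.
        pts_obj = to_object(pose, scale, lidar_pts)        # (N, 3)
        loss_3d = decoder(z[None], pts_obj[None]).abs().mean()
        (loss_2d + loss_3d).backward()
        opt.step()
    return z.detach(), pose.detach(), scale.detach()
```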
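And the check sketch for the two verification criteria; the thresholds are the ones quoted above, while the function names and inputs are illustrative:

```python
# Sanity checks deciding whether an autolabel is accepted.
import numpy as np

def projective_check(rendered_mask: np.ndarray, detected_mask: np.ndarray) -> bool:
    # Accept if the rendered silhouette matches the 2D detection (mask IoU > 0.7).
    inter = np.logical_and(rendered_mask, detected_mask).sum()
    union = np.logical_or(rendered_mask, detected_mask).sum()
    return union > 0 and inter / union > 0.7

def geometric_check(sdf_at_lidar_m: np.ndarray) -> bool:
    # Accept if > 60% of the object's lidar points lie within a 0.2 m band of
    # the decoded surface (signed distances queried in metric scale).
    return (np.abs(sdf_at_lidar_m) < 0.2).mean() > 0.6
```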
#### Technical details
- **KITTI3D cuboids have a varying amount of spatial padding** and are not tight. Deep learning models trained on these data will learn the padding too.
- Ways to scale up the annotation pipeline include better tooling, active learning, or a combination thereof.
- Curriculum learning pipeline to bridge the synthetic-to-real gap: iteratively add real samples that pass the sanity checks.
	- This yields rather fast **diffusion** into the target domain.
- Synthetic data comes from Parallel Domain (acquired by Toyota), but we should be able to use CARLA or vKITTI as well.
- The level of difficulty of a label is measured by pixel size, the amount of intersection with other 2D labels, and whether the label is cropped (see the sketch after this list).
	- Easy: h > 40 pix
	- Moderate: h > 25 pix, and no IoU > 0.3 with other labels
- CSS is trained on about 8k patches.
- It takes ~6 sec to autolabel one instance.
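A small sketch of the difficulty binning above. The note only defines easy and moderate; the fall-through "hard" bucket and the field names are assumptions:

```python
def label_difficulty(height_px: float, max_iou_with_others: float,
                     is_cropped: bool) -> str:
    # Easy: tall enough in the image (h > 40 pix).
    if height_px > 40 and not is_cropped:
        return "easy"
    # Moderate: h > 25 pix and no strong overlap with other 2D labels.
    if height_px > 25 and max_iou_with_others <= 0.3 and not is_cropped:
        return "moderate"
    return "hard"  # assumption: everything else, incl. cropped labels
```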
#### Notes
- [Code on github](https://github.com/TRI-ML/sdflabel)
- The idea seems to be closely related to [DensePose](densepose.md), which densely maps canonical 2D coordinates (a correspondence map) onto human bodies, but only allows for projective scene analysis up to scale. --> we can fix that with IPM!?
- [NOCS](nocs.md) extended the dense coordinates to 3D space.
