The goal of the project is to find the best depth map prediction model for inference on a mobile device. One of the main constraints was the model's inference time, so the aim was to use a convolutional neural network rather than a vision transformer. The RT Mono Depth model was chosen as the baseline and adapted for the project.
Compared to the baseline, an additional depth map is extracted from the encoder bottleneck and used as a depth distribution over the image.
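A minimal sketch of this idea, assuming a generic encoder-decoder; the class name, layer sizes, and the shape of the auxiliary output are all illustrative, not the actual RT Mono Depth architecture:

```python
import torch
import torch.nn as nn

class DepthNetWithAuxHead(nn.Module):
    """Toy encoder-decoder with an extra head on the bottleneck that
    predicts a coarse depth map (all layer sizes are illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Auxiliary head: a coarse depth map ("depth distribution")
        # taken directly from the encoder bottleneck.
        self.aux_head = nn.Conv2d(32, 1, 3, padding=1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.aux_head(z)

model = DepthNetWithAuxHead()
# Full-resolution depth plus the coarse bottleneck depth.
depth, coarse = model(torch.randn(1, 3, 64, 64))
```

The auxiliary output is supervised separately (see the loss description below), which gives the encoder a more direct training signal.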
Two datasets were used for training:
- NYU Depth V2 (25% of the dataset) as the base dataset
- ARKit Scenes Dataset as an additional dataset
The model was trained in Google Colab (on an A100 GPU) using a standard PyTorch training loop.
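A sketch of such a training loop; the model, optimizer, learning rate, and criterion here are stand-ins (the real project uses the RT Mono Depth network and the combined losses described below):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model and loss; hyperparameters are illustrative.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

def train_epoch(loader):
    """One pass over the loader; returns the mean batch loss."""
    model.train()
    total = 0.0
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)

# Dummy batch standing in for NYU Depth V2 / ARKit Scenes samples.
loader = [(torch.randn(2, 3, 32, 32), torch.rand(2, 1, 32, 32))]
avg_loss = train_epoch(loader)
```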
The training setup and results are described below.
A combination of Structural Similarity Index (SSIM), L1, and smoothness losses was used for the depth map.
The loss is anchored by the L1 term, while the SSIM and smoothness terms help the model produce sharper details in the depth map.
For the depth distribution, RMSE loss was used.
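A sketch of how such a combined loss can be assembled; the SSIM here is a common simplified windowed variant, the smoothness term is the usual edge-aware formulation, and the loss weights are assumptions, not the project's actual values:

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y):
    """Simplified windowed SSIM (3x3 average-pooling windows)."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
    return torch.clamp((1 - ssim) / 2, 0, 1).mean()

def smoothness_loss(depth, image):
    """Edge-aware smoothness: penalize depth gradients away from image edges."""
    dd_x = torch.abs(depth[..., :, 1:] - depth[..., :, :-1])
    dd_y = torch.abs(depth[..., 1:, :] - depth[..., :-1, :])
    di_x = torch.mean(torch.abs(image[..., :, 1:] - image[..., :, :-1]), 1, keepdim=True)
    di_y = torch.mean(torch.abs(image[..., 1:, :] - image[..., :-1, :]), 1, keepdim=True)
    return (dd_x * torch.exp(-di_x)).mean() + (dd_y * torch.exp(-di_y)).mean()

def depth_loss(pred, target, image, w_ssim=0.85, w_l1=0.1, w_smooth=0.05):
    # Weights are illustrative assumptions, not the project's values.
    return (w_ssim * ssim_loss(pred, target)
            + w_l1 * F.l1_loss(pred, target)
            + w_smooth * smoothness_loss(pred, image))

def distribution_loss(pred, target):
    # RMSE for the auxiliary depth-distribution output.
    return torch.sqrt(F.mse_loss(pred, target))

# Sanity check: identical prediction and target give (near-)zero loss.
zero = depth_loss(torch.ones(1, 1, 8, 8), torch.ones(1, 1, 8, 8),
                  torch.ones(1, 3, 8, 8))
```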
The following metrics were used to assess model quality:
- AbsRel
- SqRel
- RMSE
- RMSE Logarithmic
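These are the standard monocular-depth error metrics; a minimal implementation (the `eps` clamp for valid pixels is an assumption):

```python
import torch

def depth_metrics(pred, gt, eps=1e-6):
    """AbsRel, SqRel, RMSE, and RMSE log between predicted and
    ground-truth depth maps (depths assumed positive)."""
    pred = pred.clamp(min=eps)
    gt = gt.clamp(min=eps)
    diff = pred - gt
    return {
        "AbsRel": (diff.abs() / gt).mean().item(),
        "SqRel": (diff ** 2 / gt).mean().item(),
        "RMSE": torch.sqrt((diff ** 2).mean()).item(),
        "RMSE_log": torch.sqrt(((pred.log() - gt.log()) ** 2).mean()).item(),
    }

# A perfect prediction yields zero for every metric.
m = depth_metrics(torch.full((2, 1, 4, 4), 2.0), torch.full((2, 1, 4, 4), 2.0))
```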
As a result, the trained model meets the required inference time on a mobile device and outperforms the baseline in quality. In addition, smoothing the predicted depth maps with a moving average, as shown in the notebook, produced higher-quality depth maps, which makes the model suitable for augmented reality applications.
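The moving-average smoothing can be sketched as follows; the window size and the class name are assumptions, since the notebook's exact parameters are not stated here:

```python
from collections import deque
import torch

class DepthSmoother:
    """Averages the last `window` predicted depth frames to suppress
    frame-to-frame flicker (window size is an illustrative choice)."""
    def __init__(self, window=5):
        self.frames = deque(maxlen=window)

    def __call__(self, depth):
        self.frames.append(depth)
        return torch.stack(list(self.frames)).mean(dim=0)

# Feeding identical frames leaves the depth map unchanged.
smoother = DepthSmoother(window=3)
out = smoother(torch.ones(1, 4, 4))
out = smoother(torch.ones(1, 4, 4))
```

For a live AR feed, an exponential moving average would avoid storing past frames; the windowed average above matches the "moving average" described in the text.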