Transformer-based Models for Supervised Monocular Depth Estimation

Gupta A.; Prince A.A.; Fredo A.R.J.; Robert F.

doi:https://doi.org/10.1109/ICICCSP53532.2022.9862348

Abstract

Existing solutions for monocular depth estimation usually use convolutional networks as the backbone of their model architecture. This work presents an encoder-decoder network based on a transformer architecture that performs monocular depth estimation from a single RGB image. Environment perception and autonomous navigation systems, where depth estimation runs on edge devices, need lightweight and efficient models. It is shown that transformer-based architectures provide results comparable to those of currently used convolutional networks with significantly fewer parameters. Unlike convolutional networks, transformers do not downsample the input progressively at each layer. Maintaining a similar resolution throughout the encoding process allows for global awareness at each stage. Two different decoder models are implemented on top of a transformer encoder, and their usability for depth estimation is evaluated. When evaluated against a comparable convolutional network on the KITTI outdoor dataset, the lighter transformer model performs better in terms of robustness and accuracy. © 2022 IEEE.
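
The contrast between progressive downsampling and constant encoder resolution can be illustrated with a small arithmetic sketch. It is not the authors' implementation. The function names, the 384x384 input size, the five stages, the stride of 2, and the patch size of 16 are all assumptions chosen for illustration, following the common ViT-style design in which the image is split into patches once and the token grid then stays fixed.

```python
def conv_feature_sizes(height, width, num_stages, stride=2):
    """Spatial size after each stage of a backbone that downsamples by `stride` per stage.

    Hypothetical helper: models a generic convolutional encoder, not a specific network.
    """
    sizes = []
    h, w = height, width
    for _ in range(num_stages):
        h, w = h // stride, w // stride  # resolution shrinks at every stage
        sizes.append((h, w))
    return sizes


def transformer_token_grid_sizes(height, width, num_layers, patch_size=16):
    """Token grid after each layer of a ViT-style encoder.

    Hypothetical helper: the grid is set once by the patch embedding
    and is unchanged by the attention layers that follow.
    """
    grid = (height // patch_size, width // patch_size)
    return [grid for _ in range(num_layers)]


# Assumed 384x384 input, five stages/layers in both encoders.
conv_sizes = conv_feature_sizes(384, 384, num_stages=5)
token_grids = transformer_token_grid_sizes(384, 384, num_layers=5)

for stage, (c, t) in enumerate(zip(conv_sizes, token_grids), start=1):
    print(f"stage {stage}: conv {c[0]}x{c[1]}  transformer {t[0]}x{t[1]}")
```

Under these assumptions the convolutional feature map shrinks from 192x192 to 12x12, while the transformer keeps a 24x24 token grid at every layer, so each stage attends over the same set of image locations.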