Transformer-based Models for Supervised Monocular Depth Estimation
| dc.contributor.author | Gupta A.; Prince A.A.; Fredo A.R.J.; Robert F. | |
| dc.date.accessioned | 2025-05-23T11:24:27Z | |
| dc.description.abstract | Existing solutions for monocular depth estimation typically use convolutional networks as the backbone of their model architecture. This work presents an encoder-decoder network using a transformer architecture that performs monocular depth estimation on a single RGB image. For environment perception and autonomous navigation systems, where depth estimation runs on edge devices, lightweight and efficient models are needed. It is shown that transformer-based architectures provide results comparable to the currently used convolutional networks with significantly fewer parameters. Unlike convolutional networks, transformers do not progressively downsample the input at each layer; maintaining a similar resolution throughout the encoding process allows for global awareness at each stage. Two different decoder models are implemented on top of a transformer encoder and evaluated for depth estimation. On the KITTI outdoor dataset, the lighter transformer model outperforms a comparable convolutional network in both robustness and accuracy. © 2022 IEEE. | |
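The abstract's key architectural point is that a transformer encoder keeps its token-grid resolution fixed across layers, whereas a convolutional encoder downsamples progressively. The sketch below (not the paper's code; stage counts, patch size, and input dimensions are illustrative assumptions) traces the per-stage spatial resolution under each scheme:

```python
# Illustrative sketch, not the authors' implementation: compare how
# spatial resolution evolves per stage in a typical convolutional
# encoder versus a ViT-style transformer encoder.

def cnn_stage_resolutions(h, w, num_stages):
    """A typical CNN encoder halves the feature-map resolution at
    each stage (e.g. via strided convolution or pooling)."""
    res = [(h, w)]
    for _ in range(num_stages):
        h, w = h // 2, w // 2
        res.append((h, w))
    return res

def transformer_stage_resolutions(h, w, patch, num_layers):
    """A ViT-style encoder patchifies the image once, then every
    layer attends over the same (h // patch, w // patch) token
    grid, so resolution stays constant through encoding."""
    th, tw = h // patch, w // patch
    return [(th, tw)] * (num_layers + 1)

# Example with a KITTI-like crop size (assumed, for illustration):
print(cnn_stage_resolutions(352, 1216, 4))
# progressive downsampling: (352, 1216) -> ... -> (22, 76)
print(transformer_stage_resolutions(352, 1216, 16, 4))
# constant token grid: (22, 76) at every layer
```

Because every transformer layer sees the full token grid, each stage has global receptive field; a CNN only approaches that at its deepest, coarsest stages.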
| dc.identifier.doi | https://doi.org/10.1109/ICICCSP53532.2022.9862348 | |
| dc.identifier.uri | http://172.23.0.11:4000/handle/123456789/10107 | |
| dc.relation.ispartofseries | 2022 International Conference on Intelligent Controller and Computing for Smart Power, ICICCSP 2022 | |
| dc.title | Transformer-based Models for Supervised Monocular Depth Estimation |