TY - JOUR
T1 - TinyDepth
T2 - Lightweight self-supervised monocular depth estimation based on transformer
AU - Cheng, Zeyu
AU - Zhang, Yi
AU - Yu, Yang
AU - Song, Zhe
AU - Tang, Chengkai
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/12
Y1 - 2024/12
N2 - Monocular depth estimation plays an important role in autonomous driving, virtual reality, augmented reality, and other fields. Self-supervised monocular depth estimation has received much attention because it does not require hard-to-obtain depth labels during training. The previously used convolutional neural network (CNN) has shown limitations in modeling large-scale spatial dependencies. A new idea for monocular depth estimation is replacing the CNN architecture or merging it with a Vision Transformer (ViT) architecture that can model large-scale spatial dependencies in images. However, there are still problems with too many parameters and calculations, making deployment difficult on mobile platforms. In response to these problems, we propose TinyDepth, a lightweight self-supervised monocular depth estimation method based on Transformer that employs hierarchical representation learning suitable for dense prediction, uses mobile convolution to reduce parameters and computational overhead, and includes a novel decoder based on multi-scale fusion attention that improves the local and global inference capability of the network through scale-wise attention processing and layer-wise fusion sampling for more accurate depth prediction. In experiments, TinyDepth achieved state-of-the-art results with few parameters on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset, and exhibited good generalization ability on the challenging indoor New York University (NYU) dataset. Source code is available at https://github.com/ZYCheng777/TinyDepth.
AB - Monocular depth estimation plays an important role in autonomous driving, virtual reality, augmented reality, and other fields. Self-supervised monocular depth estimation has received much attention because it does not require hard-to-obtain depth labels during training. The previously used convolutional neural network (CNN) has shown limitations in modeling large-scale spatial dependencies. A new idea for monocular depth estimation is replacing the CNN architecture or merging it with a Vision Transformer (ViT) architecture that can model large-scale spatial dependencies in images. However, there are still problems with too many parameters and calculations, making deployment difficult on mobile platforms. In response to these problems, we propose TinyDepth, a lightweight self-supervised monocular depth estimation method based on Transformer that employs hierarchical representation learning suitable for dense prediction, uses mobile convolution to reduce parameters and computational overhead, and includes a novel decoder based on multi-scale fusion attention that improves the local and global inference capability of the network through scale-wise attention processing and layer-wise fusion sampling for more accurate depth prediction. In experiments, TinyDepth achieved state-of-the-art results with few parameters on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset, and exhibited good generalization ability on the challenging indoor New York University (NYU) dataset. Source code is available at https://github.com/ZYCheng777/TinyDepth.
KW - Lightweight monocular depth estimation
KW - Multi-scale fusion attention
KW - Self-supervised learning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85203868219&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2024.109313
DO - 10.1016/j.engappai.2024.109313
M3 - Article
AN - SCOPUS:85203868219
SN - 0952-1976
VL - 138
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 109313
ER -