TY - JOUR
T1 - TinyDepth
T2 - Lightweight self-supervised monocular depth estimation based on transformer
AU - Cheng, Zeyu
AU - Zhang, Yi
AU - Yu, Yang
AU - Song, Zhe
AU - Tang, Chengkai
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/12
Y1 - 2024/12
N2 - Monocular depth estimation plays an important role in autonomous driving, virtual reality, augmented reality, and other fields. Self-supervised monocular depth estimation has received much attention because it does not require hard-to-obtain depth labels during training. The previously used convolutional neural network (CNN) has shown limitations in modeling large-scale spatial dependencies. A new idea for monocular depth estimation is replacing the CNN architecture or merging it with a Vision Transformer (ViT) architecture that can model large-scale spatial dependencies in images. However, there are still problems with too many parameters and calculations, making deployment difficult on mobile platforms. In response to these problems, we propose TinyDepth, a lightweight self-supervised monocular depth estimation method based on Transformer that employs hierarchical representation learning suitable for dense prediction, uses mobile convolution to reduce parameters and computational overhead, and includes a novel decoder based on multi-scale fusion attention that improves the local and global inference capability of the network through scale-wise attention processing and layer-wise fusion sampling for more accurate depth prediction. In experiments, TinyDepth achieved state-of-the-art results with few parameters on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset, and exhibited good generalization ability on the challenging indoor New York University (NYU) dataset. Source code is available at https://github.com/ZYCheng777/TinyDepth.
AB - Monocular depth estimation plays an important role in autonomous driving, virtual reality, augmented reality, and other fields. Self-supervised monocular depth estimation has received much attention because it does not require hard-to-obtain depth labels during training. The previously used convolutional neural network (CNN) has shown limitations in modeling large-scale spatial dependencies. A new idea for monocular depth estimation is replacing the CNN architecture or merging it with a Vision Transformer (ViT) architecture that can model large-scale spatial dependencies in images. However, there are still problems with too many parameters and calculations, making deployment difficult on mobile platforms. In response to these problems, we propose TinyDepth, a lightweight self-supervised monocular depth estimation method based on Transformer that employs hierarchical representation learning suitable for dense prediction, uses mobile convolution to reduce parameters and computational overhead, and includes a novel decoder based on multi-scale fusion attention that improves the local and global inference capability of the network through scale-wise attention processing and layer-wise fusion sampling for more accurate depth prediction. In experiments, TinyDepth achieved state-of-the-art results with few parameters on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset, and exhibited good generalization ability on the challenging indoor New York University (NYU) dataset. Source code is available at https://github.com/ZYCheng777/TinyDepth.
KW - Lightweight monocular depth estimation
KW - Multi-scale fusion attention
KW - Self-supervised learning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85203868219&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2024.109313
DO - 10.1016/j.engappai.2024.109313
M3 - Article
AN - SCOPUS:85203868219
SN - 0952-1976
VL - 138
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 109313
ER -