Swin-Depth: Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation

Zeyu Cheng, Yi Zhang, Chengkai Tang

Research output: Contribution to journal › Article › peer-review

38 Citations (Scopus)

Abstract

Depth estimation from a monocular sensor is a fundamental task in computer vision with wide applications in robot navigation, autonomous driving, and related areas, and it has received extensive attention from researchers in recent years. Monocular depth estimation has long relied on convolutional neural networks, but the inherent locality of the convolution operation limits their ability to model long-range dependencies. Replacing convolutional neural networks with Transformers is a promising direction, but standard Transformers suffer from excessive computational complexity and parameter counts. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that performs hierarchical representation learning on images with computational complexity linear in image size. In addition, Swin-Depth includes an attention module based on multi-scale fusion that strengthens the network's ability to capture global information. Our method substantially reduces the parameter overhead of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art performance on challenging indoor and outdoor datasets.
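
As a rough illustration of the two ideas named in the abstract, the sketch below pairs window-restricted self-attention (the mechanism that gives Swin-style encoders complexity linear in image size) with a simple multi-scale fusion head that merges features from several encoder stages into one depth map. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation: the class names, the gated fusion, and the stage channel widths (taken from Swin-T) are illustrative, and shifted windows and relative position biases are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows.

    For a fixed window size, cost grows with the number of windows,
    i.e., linearly with image area -- the complexity argument above.
    """
    def __init__(self, dim, window=7, heads=4):
        super().__init__()
        self.window, self.heads = window, heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # Partition the feature map into (B * num_windows, w*w, C) token groups.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)
        qkv = self.qkv(x).reshape(x.shape[0], w * w, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B*nW, heads, w*w, C/heads)
        attn = (q @ k.transpose(-2, -1)) * (C // self.heads) ** -0.5
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

class MultiScaleFusionHead(nn.Module):
    """Upsamples per-stage features to a common resolution and fuses them
    with learned per-pixel attention weights before predicting depth."""
    def __init__(self, chans=(96, 192, 384, 768), fused=64):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, fused, 1) for c in chans])
        self.gate = nn.Sequential(
            nn.Conv2d(fused * len(chans), len(chans), 1), nn.Softmax(dim=1))
        self.head = nn.Conv2d(fused, 1, 3, padding=1)

    def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i), coarse to fine
        size = feats[0].shape[-2:]
        maps = [F.interpolate(r(f), size=size, mode="bilinear", align_corners=False)
                for r, f in zip(self.reduce, feats)]
        w = self.gate(torch.cat(maps, dim=1))  # (B, num_scales, H, W) weights
        fused = sum(w[:, i:i + 1] * m for i, m in enumerate(maps))
        return torch.sigmoid(self.head(fused))  # normalized depth in (0, 1)

Because attention is computed only inside fixed-size windows, the cost scales with the number of windows rather than quadratically with the total token count, which is what makes hierarchical encoders of this kind tractable at depth-estimation resolutions.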

Original language: English
Pages (from-to): 26912-26920
Number of pages: 9
Journal: IEEE Sensors Journal
Volume: 21
Issue number: 23
DOI
Publication status: Published - 1 Dec 2021

