Swin-Depth: Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation

Zeyu Cheng, Yi Zhang, Chengkai Tang

Research output: Contribution to journal › Article › peer-review

38 Scopus citations

Abstract

Depth estimation from monocular sensors is an important and fundamental task in computer vision. It has a wide range of applications in robot navigation, autonomous driving, etc., and has received extensive attention from researchers in recent years. Monocular depth estimation has long been based on convolutional neural networks, but the inherent locality of the convolution operation limits their ability to model long-range dependencies. Replacing convolutional neural networks with Transformers for monocular depth estimation is a promising idea, but it suffers from high computational complexity and an excessive number of parameters. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that uses hierarchical representation learning with linear computational complexity in the image size. In addition, Swin-Depth contains an attention module based on multi-scale fusion that strengthens the network's ability to capture global information. Our method effectively reduces the excessive parameter count of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art performance on challenging indoor and outdoor datasets.
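The abstract does not give implementation details, but a minimal sketch can illustrate the kind of multi-scale fusion attention it describes: features from several hierarchical (Swin-style) encoder stages are projected to a common width, resized to one resolution, and reweighted by a channel-attention gate before fusion. This is not the authors' code; the module name, the squeeze-and-excitation-style gate, and all shapes and hyperparameters below are illustrative assumptions.

```python
# Hypothetical sketch of a multi-scale fusion attention block (PyTorch).
# Assumed design: 1x1 projections per scale, bilinear upsampling to the
# finest resolution, and a channel gate over the concatenated scales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusionAttention(nn.Module):
    """Fuse feature maps from several scales with channel attention (sketch)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 convs project every scale to a shared channel width.
        self.projs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # Squeeze-and-excitation-style gate over the concatenated scales.
        fused = out_channels * len(in_channels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // 4, fused, kernel_size=1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(fused, out_channels, kernel_size=1)

    def forward(self, feats):
        # Upsample every scale to the resolution of the finest feature map.
        target = feats[0].shape[-2:]
        resized = [
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.projs, feats)
        ]
        x = torch.cat(resized, dim=1)
        x = x * self.gate(x)  # channel-wise reweighting across all scales
        return self.merge(x)  # fused single-scale feature map


if __name__ == "__main__":
    # Feature pyramid shapes typical of a hierarchical Transformer backbone.
    feats = [
        torch.randn(1, 96, 56, 56),
        torch.randn(1, 192, 28, 28),
        torch.randn(1, 384, 14, 14),
        torch.randn(1, 768, 7, 7),
    ]
    fuse = MultiScaleFusionAttention([96, 192, 384, 768], out_channels=128)
    print(fuse(feats).shape)  # torch.Size([1, 128, 56, 56])
```

The gate lets the network learn how much each scale contributes per channel, which is one plausible way to strengthen global context while keeping the fusion cost linear in the number of pixels.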

Original language: English
Pages (from-to): 26912-26920
Number of pages: 9
Journal: IEEE Sensors Journal
Volume: 21
Issue number: 23
DOIs
State: Published - 1 Dec 2021

Keywords

  • Depth estimation
  • monocular sensors
  • multi-scale fusion attention
  • transformer
