TY - JOUR
T1 - Swin-Depth
T2 - Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation
AU - Cheng, Zeyu
AU - Zhang, Yi
AU - Tang, Chengkai
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/12/1
Y1 - 2021/12/1
N2 - Depth estimation from monocular sensors is a fundamental task in computer vision. It has a wide range of applications in robot navigation, autonomous driving, and related fields, and has received extensive attention from researchers in recent years. Monocular depth estimation has long relied on convolutional neural networks, but their inherent convolution operation limits their ability to model long-range dependencies. Replacing convolutional neural networks with Transformers is a promising direction for monocular depth estimation, but it suffers from high computational complexity and an excessive number of parameters. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that uses hierarchical representation learning with computational complexity linear in image size. In addition, Swin-Depth includes an attention module based on multi-scale fusion that strengthens the network's ability to capture global information. Our proposed method effectively reduces the excessive parameter count of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art performance on challenging indoor and outdoor datasets.
AB - Depth estimation from monocular sensors is a fundamental task in computer vision. It has a wide range of applications in robot navigation, autonomous driving, and related fields, and has received extensive attention from researchers in recent years. Monocular depth estimation has long relied on convolutional neural networks, but their inherent convolution operation limits their ability to model long-range dependencies. Replacing convolutional neural networks with Transformers is a promising direction for monocular depth estimation, but it suffers from high computational complexity and an excessive number of parameters. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that uses hierarchical representation learning with computational complexity linear in image size. In addition, Swin-Depth includes an attention module based on multi-scale fusion that strengthens the network's ability to capture global information. Our proposed method effectively reduces the excessive parameter count of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art performance on challenging indoor and outdoor datasets.
KW - Depth estimation
KW - monocular sensors
KW - multi-scale fusion attention
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85118258180&partnerID=8YFLogxK
U2 - 10.1109/JSEN.2021.3120753
DO - 10.1109/JSEN.2021.3120753
M3 - Article
AN - SCOPUS:85118258180
SN - 1530-437X
VL - 21
SP - 26912
EP - 26920
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 23
ER -