Swin-Depth: Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation

Zeyu Cheng, Yi Zhang, Chengkai Tang

Research output: Contribution to journal › Article › peer-review

38 Scopus citations

Abstract

Depth estimation from monocular sensors is an important and fundamental task in computer vision. It has a wide range of applications in robot navigation, autonomous driving, etc., and has received extensive attention from researchers in recent years. Monocular depth estimation has long been based on convolutional neural networks, but the inherent locality of the convolution operation limits their ability to model long-range dependencies. Replacing convolutional neural networks with Transformers for monocular depth estimation is a promising idea, but it suffers from high computational complexity and an excessive number of parameters. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that uses hierarchical representation learning with linear computational complexity in the image size. In addition, Swin-Depth contains an attention module based on multi-scale fusion that strengthens the network's ability to capture global information. Our method effectively reduces the excessive parameter count of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art performance on challenging indoor and outdoor datasets.
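The abstract does not give implementation details, but a minimal sketch can illustrate the kind of multi-scale fusion attention it describes: features from several hierarchical (Swin-style) encoder stages are projected to a common width, resized to one resolution, and reweighted by a channel-attention gate before fusion. This is not the authors' code; the module name, the squeeze-and-excitation-style gate, and all shapes and hyperparameters below are illustrative assumptions.

```python
# Hypothetical sketch of a multi-scale fusion attention block (PyTorch).
# Assumed design: 1x1 projections per scale, bilinear upsampling to the
# finest resolution, and a channel gate over the concatenated scales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusionAttention(nn.Module):
    """Fuse feature maps from several scales with channel attention (sketch)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 convs project every scale to a shared channel width.
        self.projs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # Squeeze-and-excitation-style gate over the concatenated scales.
        fused = out_channels * len(in_channels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // 4, fused, kernel_size=1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(fused, out_channels, kernel_size=1)

    def forward(self, feats):
        # Upsample every scale to the resolution of the finest feature map.
        target = feats[0].shape[-2:]
        resized = [
            F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
            for p, f in zip(self.projs, feats)
        ]
        x = torch.cat(resized, dim=1)
        x = x * self.gate(x)  # channel-wise reweighting across all scales
        return self.merge(x)  # fused single-scale feature map


if __name__ == "__main__":
    # Feature pyramid shapes typical of a hierarchical Transformer backbone.
    feats = [
        torch.randn(1, 96, 56, 56),
        torch.randn(1, 192, 28, 28),
        torch.randn(1, 384, 14, 14),
        torch.randn(1, 768, 7, 7),
    ]
    fuse = MultiScaleFusionAttention([96, 192, 384, 768], out_channels=128)
    print(fuse(feats).shape)  # torch.Size([1, 128, 56, 56])
```

The gate lets the network learn how much each scale contributes per channel, which is one plausible way to strengthen global context while keeping the fusion cost linear in the number of pixels.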

Original language: English
Pages (from-to): 26912-26920
Number of pages: 9
Journal: IEEE Sensors Journal
Volume: 21
Issue number: 23
DOIs
State: Published - 1 Dec 2021

Keywords

  • Depth estimation
  • monocular sensors
  • multi-scale fusion attention
  • transformer
