MLDA-Net: Multi-Level Dual Attention-Based Network for Self-Supervised Monocular Depth Estimation

Xibin Song; Wei Li; Dingfu Zhou; Yuchao Dai; Jin Fang; Hongdong Li; Liangjun Zhang

doi:10.1109/TIP.2021.3074306

School of Electronics and Information

Research output: Contribution to journal › Article › peer-review

46 Scopus citations

Abstract

The success of supervised learning-based single-image depth estimation methods depends critically on the availability of large-scale dense per-pixel depth annotations, which require a laborious and expensive annotation process. Self-supervised methods are therefore highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that learns rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to provide effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.

Original language	English
Article number	9416235
Pages (from-to)	4691-4705
Number of pages	15
Journal	IEEE Transactions on Image Processing
Volume	30
DOIs	https://doi.org/10.1109/TIP.2021.3074306
State	Published - 2021

Keywords

Depth estimation
dual-attention
feature fusion
self-supervised

Access to Document

10.1109/TIP.2021.3074306

Cite this

@article{83ae21037ac0469b94f54f7a40ef1c67,

title = "MLDA-Net: Multi-Level Dual Attention-Based Network for Self-Supervised Monocular Depth Estimation",

abstract = "The success of supervised learning-based single-image depth estimation methods depends critically on the availability of large-scale dense per-pixel depth annotations, which require a laborious and expensive annotation process. Self-supervised methods are therefore highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that learns rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to provide effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.",

keywords = "Depth estimation, dual-attention, feature fusion, self-supervised",

author = "Xibin Song and Wei Li and Dingfu Zhou and Yuchao Dai and Jin Fang and Hongdong Li and Liangjun Zhang",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2021",

doi = "10.1109/TIP.2021.3074306",

language = "English",

volume = "30",

pages = "4691--4705",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - MLDA-Net

T2 - Multi-Level Dual Attention-Based Network for Self-Supervised Monocular Depth Estimation

AU - Song, Xibin

AU - Li, Wei

AU - Zhou, Dingfu

AU - Dai, Yuchao

AU - Fang, Jin

AU - Li, Hongdong

AU - Zhang, Liangjun

PY - 2021

Y1 - 2021

N2 - The success of supervised learning-based single-image depth estimation methods depends critically on the availability of large-scale dense per-pixel depth annotations, which require a laborious and expensive annotation process. Self-supervised methods are therefore highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that learns rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to provide effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.

AB - The success of supervised learning-based single-image depth estimation methods depends critically on the availability of large-scale dense per-pixel depth annotations, which require a laborious and expensive annotation process. Self-supervised methods are therefore highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that learns rich hierarchical representations. Then, a dual-attention strategy, combining global attention and structure attention, is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to provide effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation with different input modes and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.

KW - Depth estimation

KW - dual-attention

KW - feature fusion

KW - self-supervised

UR - http://www.scopus.com/inward/record.url?scp=85105102959&partnerID=8YFLogxK

U2 - 10.1109/TIP.2021.3074306

DO - 10.1109/TIP.2021.3074306

M3 - Article

C2 - 33900917

AN - SCOPUS:85105102959

SN - 1057-7149

VL - 30

SP - 4691

EP - 4705

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

M1 - 9416235

ER -
