TY - GEN
T1 - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes
AU - Li, Rui
AU - Gong, Dong
AU - Yin, Wei
AU - Chen, Hao
AU - Zhu, Yu
AU - Wang, Kaixuan
AU - Chen, Xiaozhi
AU - Sun, Jinqiu
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues represented as local monocular depth or features. The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of fusing the two types of cues. In this paper, we propose a novel method that learns to fuse the multi-view and monocular cues encoded as volumes, without needing heuristically crafted masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, while the monocular cues capture more useful contexts in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas, and to let monocular cues enhance the representation of the multi-view cost volume, we propose a cross-cue fusion (CCF) module. It includes a cross-cue attention (CCA) mechanism that encodes spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets demonstrate the significant effectiveness and generalization ability of the proposed method.
KW - 3D from multi-view and sensors
UR - http://www.scopus.com/inward/record.url?scp=85173956270&partnerID=8YFLogxK
U2 - 10.1109/CVPR52729.2023.02063
DO - 10.1109/CVPR52729.2023.02063
M3 - Conference contribution
AN - SCOPUS:85173956270
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 21539
EP - 21548
BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
PB - IEEE Computer Society
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Y2 - 18 June 2023 through 22 June 2023
ER -