TY - JOUR
T1 - Semantic-Guided Multiview Stereo Reconstruction for Aerial Image
AU - Zhang, Wei
AU - Yang, Zhigang
AU - Li, Qiang
AU - Wang, Qi
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - The application of learning-based multiview stereo (MVS) depth estimation methods has achieved significant results in large-scale 3-D reconstruction benchmarks. However, adjacent terrain in aerial images interferes with depth estimation along building edges during the matching process, leading to inaccurate results. To address these challenges, we propose a new end-to-end MVS network, named FuS-MVSNet, which fuses monocular depth probability as semantic guidance into the multiview geometry-based MVS framework. By combining the strengths of geometric consistency and local semantics, FuS-MVSNet achieves notable enhancements in both accuracy and robustness. Specifically, we first construct a monocular branch based on the pretrained Depth Anything model to perform monocular metric depth estimation. The nonshared parameters ensure that the depth estimation process is independent of the multiview branch, focusing exclusively on semantic depth inference. Subsequently, to incorporate monocular features into the multiview network, we introduce a volume adaptive fusion module, which adaptively integrates monocular feature volumes into the standard cost volume via an attention mechanism and guides the cost volume regularization. Finally, confidence-based dynamic selection between the two prediction branches ensures that the more robust branch result is selected under challenging conditions. Qualitative and quantitative results indicate that we achieve competitive performance on multiple benchmarks, including the WHU and LuoJia-MVS datasets.
AB - The application of learning-based multiview stereo (MVS) depth estimation methods has achieved significant results in large-scale 3-D reconstruction benchmarks. However, adjacent terrain in aerial images interferes with depth estimation along building edges during the matching process, leading to inaccurate results. To address these challenges, we propose a new end-to-end MVS network, named FuS-MVSNet, which fuses monocular depth probability as semantic guidance into the multiview geometry-based MVS framework. By combining the strengths of geometric consistency and local semantics, FuS-MVSNet achieves notable enhancements in both accuracy and robustness. Specifically, we first construct a monocular branch based on the pretrained Depth Anything model to perform monocular metric depth estimation. The nonshared parameters ensure that the depth estimation process is independent of the multiview branch, focusing exclusively on semantic depth inference. Subsequently, to incorporate monocular features into the multiview network, we introduce a volume adaptive fusion module, which adaptively integrates monocular feature volumes into the standard cost volume via an attention mechanism and guides the cost volume regularization. Finally, confidence-based dynamic selection between the two prediction branches ensures that the more robust branch result is selected under challenging conditions. Qualitative and quantitative results indicate that we achieve competitive performance on multiple benchmarks, including the WHU and LuoJia-MVS datasets.
KW - 3-D reconstruction
KW - dense image matching
KW - monocular depth estimation (MDE)
KW - multiview stereo (MVS)
UR - https://www.scopus.com/pages/publications/105009968795
U2 - 10.1109/TGRS.2025.3585623
DO - 10.1109/TGRS.2025.3585623
M3 - Article
AN - SCOPUS:105009968795
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5630611
ER -