Cascaded recurrent networks with masked representation learning for stereo matching of high-resolution satellite images

Zhibo Rao; Xing Li; Bangshu Xiong; Yuchao Dai; Zhelun Shen; Hangbiao Li; Yue Lou

doi:10.1016/j.isprsjprs.2024.10.017

School of Electronics and Information

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Stereo matching of satellite images presents challenges due to missing data, domain differences, and imperfect rectification. To address these issues, we propose cascaded recurrent networks with masked representation learning for high-resolution satellite stereo images, consisting of feature extraction and cascaded recurrent modules. First, we develop the correlation computation in the cascaded recurrent module to search for results on the epipolar line and adjacent areas, mitigating the impacts of erroneous rectification. Second, we use a training strategy based on masked representation learning to handle missing data and different domain attributes, enhancing data utilization and feature representation. Our training strategy includes two stages: (1) image reconstruction stage. We feed masked left or right images to the feature extraction module and adopt a reconstruction decoder to reconstruct the original images as a pre-training process, obtaining a pre-trained feature extraction module; (2) the stereo matching stage. We lock the parameters of the feature extraction module and employ stereo image pairs to train the cascaded recurrent module to get the final model. We implement the cascaded recurrent networks with two well-known feature extraction modules (CNN-based Restormer or Transformer-based ViT) to prove the effectiveness of our approach. 
Experimental results on the US3D and WHU-Stereo datasets show that: (1) Our training strategy can be used for CNN-based and Transformer-based methods on the remote sensing datasets with limited data to improve performance, outperforming the second-best network HMSM-Net by approximately 0.54% and 1.95% in terms of the percentage of the 3-px error on the WHU-Stereo and US3D datasets, respectively; (2) Our correlation manner can handle imperfect rectification, reducing the error rate by 8.9% on the random shift test; (3) Our method can predict high-quality disparity maps and achieve state-of-the-art performance, reducing the percentage of the 3-px error to 12.87% and 7.01% on the WHU-Stereo and US3D datasets, respectively. The source codes are released at https://github.com/Archaic-Atom/MaskCRNet.
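
Two ideas from the abstract can be illustrated with a small sketch: searching for matches on the epipolar line and in adjacent rows, and scoring a disparity map by the percentage of 3-px errors. This is not the authors' implementation (their released repository is the reference for that). The function names `adjacent_correlation` and `three_px_error`, the nested-list feature layout, the dot-product similarity, and the use of `None` to mark missing ground truth are all assumptions made for illustration, and the paper may define valid pixels differently.

```python
def adjacent_correlation(left, right, y, x, max_disp, radius):
    """Score right-image candidates for the left pixel (y, x).

    left, right: feature maps as nested lists indexed [row][col][channel]
    (a hypothetical layout chosen for this sketch).
    Returns a dict mapping (dy, d) to a dot-product score, where d is the
    horizontal disparity and dy the vertical offset from the epipolar line.
    radius=0 searches the epipolar line only; radius>0 also searches the
    rows above and below it. Out-of-image candidates are skipped.
    """
    height, width = len(right), len(right[0])
    query = left[y][x]
    scores = {}
    for dy in range(-radius, radius + 1):
        for d in range(max_disp + 1):
            yy, xx = y + dy, x - d
            if 0 <= yy < height and 0 <= xx < width:
                candidate = right[yy][xx]
                scores[(dy, d)] = sum(a * b for a, b in zip(query, candidate))
    return scores


def three_px_error(pred, gt, threshold=3.0):
    """Percentage of valid pixels with absolute disparity error > threshold.

    pred, gt: disparity maps as nested lists indexed [row][col].
    Ground-truth entries equal to None are treated as missing and ignored
    (an assumption of this sketch). Returns 0.0 if no pixel is valid.
    """
    bad = 0
    valid = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g is None:
                continue
            valid += 1
            if abs(p - g) > threshold:
                bad += 1
    return 100.0 * bad / valid if valid else 0.0
```

With `radius=0` the search is the usual one-dimensional scan along the epipolar line. A positive `radius` lets a match be found even when rectification error has moved the true correspondence up or down by a few rows, which is the situation the random shift test in the abstract probes.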

Original language	English
Pages (from-to)	151-165
Number of pages	15
Journal	ISPRS Journal of Photogrammetry and Remote Sensing
Volume	218
DOI	https://doi.org/10.1016/j.isprsjprs.2024.10.017
Publication status	Published - Dec 2024

Cite this

@article{eedfa22b5a4d40b1b1db8a66209b3d86,

title = "Cascaded recurrent networks with masked representation learning for stereo matching of high-resolution satellite images",

abstract = "Stereo matching of satellite images presents challenges due to missing data, domain differences, and imperfect rectification. To address these issues, we propose cascaded recurrent networks with masked representation learning for high-resolution satellite stereo images, consisting of feature extraction and cascaded recurrent modules. First, we develop the correlation computation in the cascaded recurrent module to search for results on the epipolar line and adjacent areas, mitigating the impacts of erroneous rectification. Second, we use a training strategy based on masked representation learning to handle missing data and different domain attributes, enhancing data utilization and feature representation. Our training strategy includes two stages: (1) image reconstruction stage. We feed masked left or right images to the feature extraction module and adopt a reconstruction decoder to reconstruct the original images as a pre-training process, obtaining a pre-trained feature extraction module; (2) the stereo matching stage. We lock the parameters of the feature extraction module and employ stereo image pairs to train the cascaded recurrent module to get the final model. We implement the cascaded recurrent networks with two well-known feature extraction modules (CNN-based Restormer or Transformer-based ViT) to prove the effectiveness of our approach. 
Experimental results on the US3D and WHU-Stereo datasets show that: (1) Our training strategy can be used for CNN-based and Transformer-based methods on the remote sensing datasets with limited data to improve performance, outperforming the second-best network HMSM-Net by approximately 0.54% and 1.95% in terms of the percentage of the 3-px error on the WHU-Stereo and US3D datasets, respectively; (2) Our correlation manner can handle imperfect rectification, reducing the error rate by 8.9% on the random shift test; (3) Our method can predict high-quality disparity maps and achieve state-of-the-art performance, reducing the percentage of the 3-px error to 12.87% and 7.01% on the WHU-Stereo and US3D datasets, respectively. The source codes are released at https://github.com/Archaic-Atom/MaskCRNet.",

keywords = "Adjacent correlation computation, Cascaded recurrent networks, Disparity estimation, High-resolution satellite stereo images, Masked representation pre-training",

author = "Zhibo Rao and Xing Li and Bangshu Xiong and Yuchao Dai and Zhelun Shen and Hangbiao Li and Yue Lou",

note = "Publisher Copyright: {\textcopyright} 2024",

year = "2024",

month = dec,

doi = "10.1016/j.isprsjprs.2024.10.017",

language = "English",

volume = "218",

pages = "151--165",

journal = "ISPRS Journal of Photogrammetry and Remote Sensing",

issn = "0924-2716",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Cascaded recurrent networks with masked representation learning for stereo matching of high-resolution satellite images

AU - Rao, Zhibo

AU - Li, Xing

AU - Xiong, Bangshu

AU - Dai, Yuchao

AU - Shen, Zhelun

AU - Li, Hangbiao

AU - Lou, Yue

PY - 2024/12

Y1 - 2024/12

AB - Stereo matching of satellite images presents challenges due to missing data, domain differences, and imperfect rectification. To address these issues, we propose cascaded recurrent networks with masked representation learning for high-resolution satellite stereo images, consisting of feature extraction and cascaded recurrent modules. First, we develop the correlation computation in the cascaded recurrent module to search for results on the epipolar line and adjacent areas, mitigating the impacts of erroneous rectification. Second, we use a training strategy based on masked representation learning to handle missing data and different domain attributes, enhancing data utilization and feature representation. Our training strategy includes two stages: (1) image reconstruction stage. We feed masked left or right images to the feature extraction module and adopt a reconstruction decoder to reconstruct the original images as a pre-training process, obtaining a pre-trained feature extraction module; (2) the stereo matching stage. We lock the parameters of the feature extraction module and employ stereo image pairs to train the cascaded recurrent module to get the final model. We implement the cascaded recurrent networks with two well-known feature extraction modules (CNN-based Restormer or Transformer-based ViT) to prove the effectiveness of our approach. 
Experimental results on the US3D and WHU-Stereo datasets show that: (1) Our training strategy can be used for CNN-based and Transformer-based methods on the remote sensing datasets with limited data to improve performance, outperforming the second-best network HMSM-Net by approximately 0.54% and 1.95% in terms of the percentage of the 3-px error on the WHU-Stereo and US3D datasets, respectively; (2) Our correlation manner can handle imperfect rectification, reducing the error rate by 8.9% on the random shift test; (3) Our method can predict high-quality disparity maps and achieve state-of-the-art performance, reducing the percentage of the 3-px error to 12.87% and 7.01% on the WHU-Stereo and US3D datasets, respectively. The source codes are released at https://github.com/Archaic-Atom/MaskCRNet.

KW - Adjacent correlation computation

KW - Cascaded recurrent networks

KW - Disparity estimation

KW - High-resolution satellite stereo images

KW - Masked representation pre-training

UR - http://www.scopus.com/inward/record.url?scp=85207538359&partnerID=8YFLogxK

U2 - 10.1016/j.isprsjprs.2024.10.017

DO - 10.1016/j.isprsjprs.2024.10.017

M3 - Article

AN - SCOPUS:85207538359

SN - 0924-2716

VL - 218

SP - 151

EP - 165

JO - ISPRS Journal of Photogrammetry and Remote Sensing

JF - ISPRS Journal of Photogrammetry and Remote Sensing

ER -
