Cascaded recurrent networks with masked representation learning for stereo matching of high-resolution satellite images

  • Zhibo Rao
  • Xing Li
  • Bangshu Xiong
  • Yuchao Dai
  • Zhelun Shen
  • Hangbiao Li
  • Yue Lou

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Stereo matching of satellite images is challenging because of missing data, domain differences, and imperfect rectification. To address these issues, we propose cascaded recurrent networks with masked representation learning for high-resolution satellite stereo images, consisting of feature extraction and cascaded recurrent modules. First, we design the correlation computation in the cascaded recurrent module to search for matches on the epipolar line and in adjacent areas, mitigating the impact of erroneous rectification. Second, we use a training strategy based on masked representation learning to handle missing data and different domain attributes, improving data utilization and feature representation. Our training strategy has two stages: (1) the image reconstruction stage, in which we feed masked left or right images to the feature extraction module and use a reconstruction decoder to reconstruct the original images as a pre-training process, obtaining a pre-trained feature extraction module; and (2) the stereo matching stage, in which we freeze the parameters of the feature extraction module and train the cascaded recurrent module on stereo image pairs to obtain the final model. We implement the cascaded recurrent networks with two well-known feature extraction modules (CNN-based Restormer or Transformer-based ViT) to demonstrate the effectiveness of our approach.
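As a rough illustration of searching "on the epipolar line and adjacent areas", the NumPy sketch below correlates each left-image feature with right-image features on the same row and on a few rows above and below it. This is not the released MaskCRNet code: the function name `adjacent_correlation`, the `radius` parameter, and the choice of a channel-mean dot product as the similarity are all assumptions made for the example.

```python
import numpy as np


def adjacent_correlation(feat_left, feat_right, max_disp, radius=1):
    """Hypothetical helper: correlation over disparities and nearby rows.

    feat_left, feat_right: feature maps of shape (C, H, W).
    Returns a volume of shape (max_disp, 2 * radius + 1, H, W), where
    index [d, k] holds the similarity between left pixel (y, x) and
    right pixel (y + k - radius, x - d). Requires max_disp <= W.
    """
    _, h, w = feat_left.shape
    volume = np.zeros((max_disp, 2 * radius + 1, h, w), dtype=float)
    # Zero-pad the rows so vertical offsets never index out of bounds.
    padded = np.pad(feat_right, ((0, 0), (radius, radius), (0, 0)))
    for d in range(max_disp):
        for k, dy in enumerate(range(-radius, radius + 1)):
            # rows[:, y, :] is the right-image row y + dy (zeros if outside).
            rows = padded[:, radius + dy : radius + dy + h, :]
            # Left column x is compared with right column x - d.
            product = feat_left[:, :, d:] * rows[:, :, : w - d]
            volume[d, k, :, d:] = product.mean(axis=0)
    return volume
```

With `radius=0` the volume reduces to a search along the epipolar line only. A larger `radius` lets a match that has drifted vertically by a row or two still produce a strong response, which is the intuition behind tolerating imperfect rectification.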
Experimental results on the US3D and WHU-Stereo datasets show that: (1) our training strategy improves the performance of both CNN-based and Transformer-based methods on remote sensing datasets with limited data, outperforming the second-best network, HMSM-Net, by approximately 0.54% and 1.95% in the percentage of 3-px error on the WHU-Stereo and US3D datasets, respectively; (2) our correlation scheme handles imperfect rectification, reducing the error rate by 8.9% on the random shift test; and (3) our method predicts high-quality disparity maps and achieves state-of-the-art performance, reducing the percentage of 3-px error to 12.87% and 7.01% on the WHU-Stereo and US3D datasets, respectively. The source code is released at https://github.com/Archaic-Atom/MaskCRNet.
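For readers unfamiliar with the metric, the sketch below shows one common way to compute a percentage of 3-px error: the share of valid pixels whose absolute disparity error exceeds 3 pixels. The abstract does not spell out the exact definition used in the paper, so the function name `three_px_error`, the strict `>` comparison, and the convention that missing ground truth is stored as NaN are assumptions.

```python
import numpy as np


def three_px_error(pred, gt, threshold=3.0):
    """Hypothetical helper: percentage of valid pixels with error > threshold.

    pred, gt: disparity maps of the same shape. Pixels whose ground
    truth is NaN or infinite are treated as missing and are ignored.
    """
    valid = np.isfinite(gt)
    if not valid.any():
        # No ground truth available, so the metric is undefined.
        return float("nan")
    err = np.abs(pred[valid] - gt[valid])
    return 100.0 * float((err > threshold).mean())
```

Masking out invalid pixels matters for satellite data, where ground-truth disparity is often missing over parts of the scene.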

Original language: English
Pages (from-to): 151-165
Number of pages: 15
Journal: ISPRS Journal of Photogrammetry and Remote Sensing
Volume: 218
State: Published - Dec 2024

Keywords

  • Adjacent correlation computation
  • Cascaded recurrent networks
  • Disparity estimation
  • High-resolution satellite stereo images
  • Masked representation pre-training
