
MSFFNet: Multimodal Spatial–Frequency Fusion Network for RGB-DSM Remote Sensing Image Segmentation

  • Yuanjie Zhi, Yuhang Wang, Fan Zhang, Mingyang Ma, Shaohui Mei
  • Northwestern Polytechnical University, Xi'an
  • China Aerospace Science and Technology Corporation

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Highlights:

What are the main findings?
  • The integration of the wavelet transform in multimodal feature fusion significantly enhances the model's ability to preserve edge information.
  • The fusion strategy enables early interaction of complementary information and effectively improves feature discriminability through feature enhancement.

What is the implication of the main findings?
  • The study provides a solution for improving edge clarity in remote sensing image segmentation.
  • The study provides a key solution for enhancing segmentation capability under multimodal fusion.

Remote sensing image segmentation is essential for resource planning and disaster monitoring. Although RGB-based methods are widely adopted, they often exhibit suboptimal performance in distinguishing objects with similar color and texture characteristics. Fusing height information from Digital Surface Models (DSM) aids in the discrimination of these challenging objects. However, existing CNN- and pooling-based fusion methods tend to lose edge details as network depth increases, resulting in blurred segmentation boundaries. To address this issue, a Multimodal Spatial–Frequency Fusion Network (MSFFNet) is proposed to effectively enhance edge details by fusing high-level frequency and spatial features. Specifically, a Hybrid Branch Fusion Module (HBFM) is proposed, in which the wavelet transform branch decomposes features into sub-components, effectively isolating edge and structural information from other textures. This frequency-domain processing prevents edge details from being lost or diluted during fusion, thereby preserving boundary clarity in segmentation. Additionally, a Multi-Scale Contextual Attention Module (MSCAM) is proposed to capture multi-scale contextual information for enhancing spatial feature representation, while adjusting both spatial and channel-wise attention mechanisms to improve detail and accuracy. Experiments on the benchmark Vaihingen and Potsdam datasets demonstrate that the proposed approach clearly enhances edge delineation while improving segmentation accuracy.
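The abstract does not give the internals of HBFM, but the general idea of wavelet-domain multimodal fusion it describes can be sketched as follows: decompose each modality's feature map into low- and high-frequency sub-bands, average the low-frequency content, and keep the stronger high-frequency coefficient from either modality so that edge responses from RGB or DSM survive the fusion. This is a minimal illustrative sketch using a single-level 2D Haar transform in NumPy; the function names, the max-magnitude fusion rule, and the single-channel setting are assumptions, not the paper's actual module.

```python
import numpy as np

def haar2d(x):
    """Single-level 2D Haar decomposition of a 2D array with even
    dimensions. Returns (LL, LH, HL, HH) sub-bands at half resolution."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row-wise average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row-wise difference
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: coarse structure
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return LL, LH, HL, HH

def ihaar2d(LL, LH, HL, HH):
    """Exact inverse of haar2d (perfect reconstruction)."""
    a = np.zeros((LL.shape[0], LL.shape[1] * 2))
    d = np.zeros_like(a)
    a[:, 0::2] = LL + LH
    a[:, 1::2] = LL - LH
    d[:, 0::2] = HL + HH
    d[:, 1::2] = HL - HH
    x = np.zeros((a.shape[0] * 2, a.shape[1]))
    x[0::2, :] = a + d
    x[1::2, :] = a - d
    return x

def wavelet_fuse(rgb_feat, dsm_feat):
    """Fuse two single-channel feature maps in the wavelet domain:
    average the LL band, and per coefficient keep the larger-magnitude
    high-frequency response so edges from either modality are preserved
    (an assumed fusion rule for illustration)."""
    bands_rgb = haar2d(rgb_feat)
    bands_dsm = haar2d(dsm_feat)
    LL = (bands_rgb[0] + bands_dsm[0]) / 2.0
    highs = [np.where(np.abs(hr) >= np.abs(hd), hr, hd)
             for hr, hd in zip(bands_rgb[1:], bands_dsm[1:])]
    return ihaar2d(LL, *highs)
```

The max-magnitude rule on the detail bands is one common heuristic in classical wavelet image fusion; in a learned module such as HBFM, the sub-band combination would instead be produced by trainable layers.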

Original language: English
Article number: 3745
Journal: Remote Sensing
Volume: 17
Issue number: 22
State: Published - Nov 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • attention mechanism
  • multimodal fusion
  • remote sensing image segmentation
  • wavelet transform
