STMSF: Swin Transformer with Multi-Scale Fusion for Remote Sensing Scene Classification

Yingtao Duan, Chao Song, Yifan Zhang, Puyu Cheng, Shaohui Mei

Research output: Contribution to journal › Article › peer-review


Abstract

Emerging vision transformers (ViTs) are more powerful than conventional deep convolutional neural networks (CNNs) at modeling long-range dependencies among features, and thus outperform CNNs on several computer vision tasks. However, existing ViTs fail to account for the multi-scale characteristics of ground objects, which vary widely in spatial size, when applied to remote sensing (RS) scene images. Therefore, in this paper, a Swin transformer with multi-scale fusion (STMSF) is proposed to alleviate this issue. Specifically, a multi-scale feature fusion module is proposed so that ground objects at different scales in an RS scene are properly represented by merging features across scales. Moreover, a spatial attention pyramid network (SAPN) is designed to enhance the context of the coarse features extracted by the transformer and further improve the network's ability to represent multi-scale features. Experimental results on three benchmark RS scene datasets demonstrate that the proposed network clearly outperforms several state-of-the-art CNN-based and transformer-based approaches.
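
Since the paper's exact module designs are not reproduced here, the following PyTorch sketch only illustrates the general idea stated in the abstract: project the feature maps from the four Swin stages to a common channel width, re-weight each map with a simple spatial attention mask, upsample everything to the finest resolution, and fuse by summation before classification. The stage widths (96/192/384/768) follow Swin-T; the SpatialAttention block, the 256-channel fusion width, and the 45-class head are illustrative assumptions, not the published STMSF/SAPN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Illustrative spatial attention: re-weight each location by a learned saliency map."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        attn = torch.sigmoid(self.conv(x))  # (B, 1, H, W) saliency map
        return x * attn                     # spatially re-weighted features


class MultiScaleFusion(nn.Module):
    """Hypothetical fusion head (an assumption, not the paper's exact design):
    project each stage's features to a common width, apply spatial attention,
    upsample to the finest scale, and sum before classification."""

    def __init__(self, in_channels=(96, 192, 384, 768), fused_ch=256, num_classes=45):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, fused_ch, 1) for c in in_channels)
        self.attn = nn.ModuleList(SpatialAttention(fused_ch) for _ in in_channels)
        self.head = nn.Linear(fused_ch, num_classes)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from the four Swin stages
        target = feats[0].shape[-2:]  # spatial size of the finest stage
        fused = 0
        for f, proj, attn in zip(feats, self.proj, self.attn):
            f = attn(proj(f))
            fused = fused + F.interpolate(f, size=target, mode="bilinear",
                                          align_corners=False)
        pooled = fused.mean(dim=(2, 3))  # global average pooling
        return self.head(pooled)


if __name__ == "__main__":
    # Dummy Swin-T stage outputs for a 224x224 input (strides 4/8/16/32).
    feats = [torch.randn(2, c, s, s)
             for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
    logits = MultiScaleFusion()(feats)
    print(logits.shape)  # torch.Size([2, 45])
```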

Original language: English
Article number: 668
Journal: Remote Sensing
Volume: 17
Issue number: 4
DOIs
State: Published - Feb 2025

Keywords

  • multi-scale features
  • remote sensing scene classification
  • spatial attention
  • Swin transformer
