STMSF: Swin Transformer with Multi-Scale Fusion for Remote Sensing Scene Classification

Yingtao Duan, Chao Song, Yifan Zhang, Puyu Cheng, Shaohui Mei

Research output: Contribution to journal › Article › peer-review

1 citation (Scopus)

Abstract

Emerging vision transformers (ViTs) are more powerful at modeling long-range dependencies among features than conventional deep convolutional neural networks (CNNs). Thus, they outperform CNNs in several computer vision tasks. However, existing ViTs fail to account for the multi-scale characteristics of ground objects with various spatial sizes when applied to remote sensing (RS) scene images. Therefore, in this paper, a Swin transformer with multi-scale fusion (STMSF) is proposed to alleviate this issue. Specifically, a multi-scale feature fusion module is proposed so that features of ground objects at different scales in the RS scene are well accounted for by merging multi-scale features. Moreover, a spatial attention pyramid network (SAPN) is designed to enhance the context of coarse features extracted with the transformer and further improve the network's ability to represent multi-scale features. Experimental results on three benchmark RS scene datasets demonstrate that the proposed network clearly outperforms several state-of-the-art CNN-based and transformer-based approaches.
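To illustrate the general idea of multi-scale feature fusion described above, the following is a minimal NumPy sketch, not the paper's actual STMSF implementation: hierarchical feature maps (shaped like typical Swin transformer stages) are upsampled to a common resolution, weighted by a simple sigmoid spatial-attention map, and fused by channel-wise concatenation. All function names and shapes here are illustrative assumptions.

```python
# Hypothetical sketch of multi-scale fusion with spatial attention;
# shapes mimic typical Swin stages (resolution halves, channels double).
import numpy as np

def upsample_nearest(feat, factor):
    # feat: (C, H, W) -> (C, H*factor, W*factor) by nearest-neighbor repetition
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def spatial_attention(feat):
    # Simple attention map from the channel-wise mean, squashed to (0, 1)
    attn = 1.0 / (1.0 + np.exp(-feat.mean(axis=0, keepdims=True)))
    return feat * attn  # broadcast over channels

def fuse_multi_scale(feats):
    # feats: list of (C_i, H_i, W_i), finest resolution first;
    # upsample all stages to the finest grid, attend, then concatenate
    target_h = feats[0].shape[1]
    upsampled = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    attended = [spatial_attention(f) for f in upsampled]
    return np.concatenate(attended, axis=0)  # channel-wise fusion

# Example: three stages of a hierarchical backbone
f1 = np.random.rand(96, 56, 56)
f2 = np.random.rand(192, 28, 28)
f3 = np.random.rand(384, 14, 14)
fused = fuse_multi_scale([f1, f2, f3])
print(fused.shape)  # -> (672, 56, 56)
```

The fused tensor keeps the finest spatial resolution while stacking channels from every scale, so a downstream classifier can see both fine and coarse ground-object features at once.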

Original language: English
Article number: 668
Journal: Remote Sensing
Volume: 17
Issue: 4
DOI
Publication status: Published - Feb 2025

