Abstract
Emerging vision transformers (ViTs) are more powerful at modeling long-range dependencies among features than conventional deep convolutional neural networks (CNNs), and thus outperform CNNs in several computer vision tasks. However, existing ViTs fail to account for the multi-scale characteristics of ground objects with various spatial sizes when applied to remote sensing (RS) scene images. Therefore, in this paper, a Swin transformer with multi-scale fusion (STMSF) is proposed to alleviate this issue. Specifically, a multi-scale feature fusion module is proposed so that ground objects at different scales in an RS scene are properly accounted for by merging multi-scale features. Moreover, a spatial attention pyramid network (SAPN) is designed to enrich the context of the coarse features extracted by the transformer and to further improve the network’s representation of multi-scale features. Experimental results on three benchmark RS scene datasets demonstrate that the proposed network clearly outperforms several state-of-the-art CNN-based and transformer-based approaches.
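The abstract does not include code, but the fusion idea it describes can be illustrated with a minimal PyTorch sketch: features from the transformer's stages are projected to a common width, upsampled to the finest resolution, reweighted by spatial attention, and summed. The class names `MultiScaleFusion` and `SpatialAttention`, the channel widths (matching Swin-T's four stages), and the fuse-by-sum design are assumptions for illustration only, not the authors' STMSF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Hypothetical spatial attention: reweight each location using
    channel-pooled statistics (average and max)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)          # channel-wise average pool
        mx, _ = x.max(dim=1, keepdim=True)         # channel-wise max pool
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

class MultiScaleFusion(nn.Module):
    """Hypothetical multi-scale fusion: project per-stage feature maps to a
    common channel width, upsample to the finest resolution, apply spatial
    attention, and sum the results."""
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.attn = nn.ModuleList([SpatialAttention() for _ in in_channels])

    def forward(self, feats):
        target = feats[0].shape[-2:]  # finest spatial resolution
        fused = 0
        for f, proj, attn in zip(feats, self.proj, self.attn):
            f = proj(f)
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            fused = fused + attn(f)
        return fused

if __name__ == "__main__":
    # Feature maps shaped like the four Swin-T stages on a 224x224 input.
    feats = [torch.randn(1, c, s, s)
             for c, s in zip((96, 192, 384, 768), (56, 28, 14, 7))]
    out = MultiScaleFusion()(feats)
    print(out.shape)  # torch.Size([1, 256, 56, 56])
```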
| Original language | English |
| --- | --- |
| Article number | 668 |
| Journal | Remote Sensing |
| Volume | 17 |
| Issue number | 4 |
| DOIs | |
| State | Published - Feb 2025 |
Keywords
- multi-scale features
- remote sensing scene classification
- spatial attention
- Swin transformer