STMSF: Swin Transformer with Multi-Scale Fusion for Remote Sensing Scene Classification

Abstract
Emerging vision transformers (ViTs) are more powerful than conventional deep convolutional neural networks (CNNs) at modeling long-range dependencies among features, and thus outperform CNNs on several computer vision tasks. However, when applied to remote sensing (RS) scene images, existing ViTs fail to account for the multi-scale characteristics of ground objects with various spatial sizes. Therefore, this paper proposes a Swin transformer with multi-scale fusion (STMSF) to alleviate this issue. Specifically, a multi-scale feature fusion module is proposed so that ground objects at different scales in an RS scene are properly considered by merging their multi-scale features. Moreover, a spatial attention pyramid network (SAPN) is designed to enhance the context of the coarse features extracted by the transformer and to further improve the network's representation of multi-scale features. Experimental results on three benchmark RS scene datasets demonstrate that the proposed network clearly outperforms several state-of-the-art CNN-based and transformer-based approaches.
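The abstract does not detail the fusion module, but the general idea behind multi-scale fusion, upsampling coarse feature maps to a common resolution and concatenating them channel-wise, can be sketched as follows. This is a minimal illustration with NumPy on a toy Swin-like three-stage pyramid; the channel counts, scales, and nearest-neighbor upsampling are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def upsample_nearest(x, factor):
    # x: (C, H, W) feature map; repeat each pixel to enlarge it spatially
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_multiscale(features):
    # features: list of (C_i, H_i, W_i) maps at power-of-two scales,
    # ordered from finest (largest H, W) to coarsest.
    # All maps are brought to the finest resolution, then concatenated
    # along the channel axis so later layers see every scale at once.
    target_h = features[0].shape[1]
    upsampled = [upsample_nearest(f, target_h // f.shape[1]) for f in features]
    return np.concatenate(upsampled, axis=0)

# Toy pyramid: three stages with halving spatial size and doubling channels
f1 = np.random.rand(96, 8, 8)    # fine stage
f2 = np.random.rand(192, 4, 4)   # middle stage
f3 = np.random.rand(384, 2, 2)   # coarse stage
fused = fuse_multiscale([f1, f2, f3])
print(fused.shape)  # (672, 8, 8): 96 + 192 + 384 channels at 8x8
```

In practice such fusion is typically followed by a 1x1 convolution to mix the concatenated channels; the spatial attention in SAPN would then reweight locations in the fused map, which this sketch omits.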
| Original language | English |
|---|---|
| Article number | 668 |
| Journal | Remote Sensing |
| Volume | 17 |
| Issue number | 4 |
| DOIs | |
| State | Published - Feb 2025 |
Keywords
- Swin transformer
- multi-scale features
- remote sensing scene classification
- spatial attention