Abstract
Emerging vision transformers (ViTs) are more powerful at modeling long-range dependencies among features than conventional deep convolutional neural networks (CNNs), and thus outperform CNNs in several computer vision tasks. However, existing ViTs fail to account for the multi-scale characteristics of ground objects with various spatial sizes when applied to remote sensing (RS) scene images. Therefore, in this paper, a Swin transformer with multi-scale fusion (STMSF) is proposed to alleviate this issue. Specifically, a multi-scale feature fusion module is proposed so that ground objects at different scales in an RS scene are properly accounted for by merging multi-scale features. Moreover, a spatial attention pyramid network (SAPN) is designed to enrich the context of the coarse features extracted by the transformer and to further improve the network’s representation of multi-scale features. Experimental results on three benchmark RS scene datasets demonstrate that the proposed network clearly outperforms several state-of-the-art CNN-based and transformer-based approaches.
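The abstract does not include code, but the fusion idea it describes can be illustrated with a minimal PyTorch sketch: features from the transformer's stages are projected to a common width, upsampled to the finest resolution, reweighted by spatial attention, and summed. The class names `MultiScaleFusion` and `SpatialAttention`, the channel widths (matching Swin-T's four stages), and the fuse-by-sum design are assumptions for illustration only, not the authors' STMSF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Hypothetical spatial attention: reweight each location using
    channel-pooled statistics (average and max)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)          # channel-wise average pool
        mx, _ = x.max(dim=1, keepdim=True)         # channel-wise max pool
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn

class MultiScaleFusion(nn.Module):
    """Hypothetical multi-scale fusion: project per-stage feature maps to a
    common channel width, upsample to the finest resolution, apply spatial
    attention, and sum the results."""
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.attn = nn.ModuleList([SpatialAttention() for _ in in_channels])

    def forward(self, feats):
        target = feats[0].shape[-2:]  # finest spatial resolution
        fused = 0
        for f, proj, attn in zip(feats, self.proj, self.attn):
            f = proj(f)
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            fused = fused + attn(f)
        return fused

if __name__ == "__main__":
    # Feature maps shaped like the four Swin-T stages on a 224x224 input.
    feats = [torch.randn(1, c, s, s)
             for c, s in zip((96, 192, 384, 768), (56, 28, 14, 7))]
    out = MultiScaleFusion()(feats)
    print(out.shape)  # torch.Size([1, 256, 56, 56])
```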
| Original language | English |
| --- | --- |
| Article number | 668 |
| Journal | Remote Sensing |
| Volume | 17 |
| Issue number | 4 |
| DOIs | |
| State | Published - Feb 2025 |
Keywords
- multi-scale features
- remote sensing scene classification
- spatial attention
- Swin transformer