Abstract
Deep learning-based object detection algorithms have achieved significant success in computer vision. However, target sizes in remote sensing images vary widely, which makes it difficult for a single algorithm to detect objects of all sizes effectively. To address this issue, this article proposes an end-to-end object detection algorithm for remote sensing images based on a scale-adaptive and frequency-fusion DETR (SAFF-DETR), which uses a frequency feature enhancement and fusion mechanism to handle targets of varying sizes within a single framework. First, to improve the ability of Transformer-based detectors to detect small targets, a multibranch representation fusion (MRF) module is proposed to fuse shallow-layer frequency representations, strengthening the network’s perception of small targets. Furthermore, cross-layer spatial–frequency attention (CSFA) and cross-layer channel–frequency attention (CCFA) are designed to enable efficient frequency feature interaction across multiscale features, enhancing the representation capability for targets of different sizes. Moreover, by integrating these two attention mechanisms, the cross-layer channelwise–spatialwise frequency fusion (CCSFF) structure realizes global feature interaction in one step, without repeated upsampling and downsampling operations. Building on CCSFF, a patch division-based Transformer architecture is designed to enhance scale adaptability for object detection. Experimental results on several benchmark datasets demonstrate that the proposed SAFF-DETR can handle targets of extremely varying sizes and outperforms several state-of-the-art (SOTA) algorithms.
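The abstract does not say how the shallow-layer frequency representations are computed, so the following is only a rough sketch of the general idea of splitting a feature map into low- and high-frequency parts and re-fusing them with extra weight on fine detail. It assumes a one-level 2D Haar decomposition as a stand-in frequency transform; the function names `haar_split` and `boost_details` and the `gain` parameter are hypothetical and are not taken from SAFF-DETR.

```python
# Illustrative sketch only: a one-level 2D Haar decomposition used as a
# stand-in "frequency representation". This is NOT the paper's MRF module.

def haar_split(fmap):
    """Split an even-sized 2D map into one low- and three high-frequency bands."""
    h, w = len(fmap), len(fmap[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = fmap[i][j], fmap[i][j + 1]
            c, d = fmap[i + 1][j], fmap[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 4  # local average (low frequency)
            lh[r][s] = (a - b + c - d) / 4  # left-right difference
            hl[r][s] = (a + b - c - d) / 4  # top-bottom difference
            hh[r][s] = (a - b - c + d) / 4  # diagonal difference
    return ll, lh, hl, hh


def boost_details(fmap, gain=1.0):
    """Rebuild the map with its high-frequency bands scaled by `gain`.

    gain == 1.0 reproduces the input; gain > 1.0 sharpens small structures.
    """
    ll, lh, hl, hh = haar_split(fmap)
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h // 2):
        for s in range(w // 2):
            low = ll[r][s]
            x, y, z = gain * lh[r][s], gain * hl[r][s], gain * hh[r][s]
            out[2 * r][2 * s] = low + x + y + z
            out[2 * r][2 * s + 1] = low - x + y - z
            out[2 * r + 1][2 * s] = low + x - y - z
            out[2 * r + 1][2 * s + 1] = low - x - y + z
    return out


# A 4x4 map with a single bright pixel standing in for a "small target".
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
sharpened = boost_details(fmap, gain=2.0)
print(sharpened[1][1])  # 7.0: the isolated peak gains contrast against its block
```

With `gain=1.0` the input is reconstructed exactly, which makes the sketch easy to check. A learned model would replace the fixed `gain` with trainable weights or attention over the sub-bands.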
| Original language | English |
|---|---|
| Article number | 5646415 |
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| Volume | 63 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Frequency feature fusion
- Vision Transformer (ViT)
- Object detection
- Remote sensing image
Fingerprint
Dive into the research topics of 'SAFF-DETR: An End-to-End Object Detection Network for Remote Sensing Images With Targets of Varying Sizes Based on Scale Adaptation and Frequency Fusion'. Together they form a unique fingerprint.