TY - GEN
T1 - AFFT
T2 - 38th Australasian Joint Conference on Artificial Intelligence, AI 2025
AU - Zhang, Chenguang
AU - Guo, Yangming
AU - Tang, Qianying
AU - Zhao, Pengcheng
AU - Zhao, Sicong
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
Y1 - 2026
N2 - Remote Sensing Object Detection (RSOD) presents unique challenges due to arbitrary object orientations, complex backgrounds, and data scarcity. Existing methods often rely on complex network architectures or auxiliary tasks for multi-task optimization, which increases model complexity and requires additional data. In this work, we propose an Adapter-based Few-shot Fine-Tuning framework for RSOD, termed AFFT. Specifically, to effectively capture global context and extract high-quality features, we adopt a pre-trained hierarchical Swin Transformer based on a shifted-window attention mechanism as the backbone. This enables robust feature representation by efficiently modeling both local and global dependencies. Moreover, recognizing the limited availability of annotated remote sensing data, we design a Group Equivariant Convolutions Multi-cognitive Visual Adapter (GCMONA) for few-shot fine-tuning. This approach significantly enhances model adaptability to new classes with minimal labeled samples, making it particularly well suited to remote sensing scenarios. Notably, AFFT fine-tunes only 5% of the parameters with GCMONA, yet achieves competitive or superior performance compared to full fine-tuning methods. Full-parameter training experiments on both DOTA-v1.0 and DOTA-v1.5 demonstrate the superior detection performance of our method in RSOD. Additionally, starting from the model pre-trained on DOTA-v1.0, evaluation on our custom few-shot maritime dataset highlights the effectiveness of the proposed few-shot fine-tuning technique when only limited data is available.
AB - Remote Sensing Object Detection (RSOD) presents unique challenges due to arbitrary object orientations, complex backgrounds, and data scarcity. Existing methods often rely on complex network architectures or auxiliary tasks for multi-task optimization, which increases model complexity and requires additional data. In this work, we propose an Adapter-based Few-shot Fine-Tuning framework for RSOD, termed AFFT. Specifically, to effectively capture global context and extract high-quality features, we adopt a pre-trained hierarchical Swin Transformer based on a shifted-window attention mechanism as the backbone. This enables robust feature representation by efficiently modeling both local and global dependencies. Moreover, recognizing the limited availability of annotated remote sensing data, we design a Group Equivariant Convolutions Multi-cognitive Visual Adapter (GCMONA) for few-shot fine-tuning. This approach significantly enhances model adaptability to new classes with minimal labeled samples, making it particularly well suited to remote sensing scenarios. Notably, AFFT fine-tunes only 5% of the parameters with GCMONA, yet achieves competitive or superior performance compared to full fine-tuning methods. Full-parameter training experiments on both DOTA-v1.0 and DOTA-v1.5 demonstrate the superior detection performance of our method in RSOD. Additionally, starting from the model pre-trained on DOTA-v1.0, evaluation on our custom few-shot maritime dataset highlights the effectiveness of the proposed few-shot fine-tuning technique when only limited data is available.
KW - Fine-tuning technique
KW - Pre-train learning
KW - Remote sensing object detection
KW - Vision transformer
UR - https://www.scopus.com/pages/publications/105023820155
U2 - 10.1007/978-981-95-4972-6_14
DO - 10.1007/978-981-95-4972-6_14
M3 - Conference contribution
AN - SCOPUS:105023820155
SN - 9789819549719
T3 - Lecture Notes in Computer Science
SP - 174
EP - 186
BT - AI 2025
A2 - Liu, Miaomiao
A2 - Yu, Xin
A2 - Xu, Chang
A2 - Song, Yiliao
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 1 December 2025 through 5 December 2025
ER -