TY - JOUR
T1 - Structured Adversarial Self-Supervised Learning for Robust Object Detection in Remote Sensing Images
AU - Zhang, Cong
AU - Lam, Kin-Man
AU - Liu, Tianshan
AU - Chan, Yui-Lam
AU - Wang, Qi
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Object detection plays a crucial role in scene understanding and has extensive practical applications. In the field of remote sensing object detection, both detection accuracy and robustness are of significant concern. Existing methods rely heavily on sophisticated adversarial training strategies that tend to improve robustness at the expense of accuracy, yet detection robustness is not always indicative of improved accuracy. Therefore, in this article, we investigate how to enhance robustness while preserving high accuracy, or even improve both simultaneously, using simple vanilla adversarial training or none at all. In pursuit of a solution, we first conduct an exploratory investigation by shifting our attention from adversarial training, referred to here as adversarial fine-tuning, to adversarial pretraining. Specifically, we propose a novel pretraining paradigm, namely, structured adversarial self-supervised (SASS) pretraining, to strengthen both clean accuracy and adversarial robustness for object detection in remote sensing images. At a high level, SASS pretraining aims to unify adversarial learning and self-supervised learning into pretraining and to encode structured knowledge into pretrained representations for powerful transferability to downstream detection. Moreover, to fully exploit the inherent robustness of vision Transformers and to improve their pretraining efficiency, we leverage recent masked image modeling (MIM) as the pretext task and instantiate SASS pretraining as a concise end-to-end framework, named structured adversarial MIM (SA-MIM). SA-MIM consists of two pivotal components: structured adversarial attack and structured MIM (S-MIM). The former establishes structured adversaries for the context of adversarial pretraining, while the latter introduces a structured local-sampling global-masking strategy to adapt to hierarchical encoder architectures. Comprehensive experiments on three datasets demonstrate the significant superiority of the proposed pretraining paradigm over previous counterparts for remote sensing object detection. More importantly, both with and without adversarial fine-tuning, it enables simultaneous improvements in detection accuracy and robustness, promisingly alleviating the dependence on complicated adversarial fine-tuning.
KW - Adversarial learning
KW - remote sensing object detection
KW - self-supervised pretraining (SPT)
KW - structured knowledge
KW - vision Transformers (ViTs)
UR - http://www.scopus.com/inward/record.url?scp=85187978607&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2024.3375398
DO - 10.1109/TGRS.2024.3375398
M3 - Article
AN - SCOPUS:85187978607
SN - 0196-2892
VL - 62
SP - 1
EP - 20
JO - IEEE Trans. Geosci. Remote Sens.
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5613720
ER -