TY - JOUR
T1 - Efficient Inductive Vision Transformer for Oriented Object Detection in Remote Sensing Imagery
AU - Zhang, Cong
AU - Su, Jingran
AU - Ju, Yakun
AU - Lam, Kin Man
AU - Wang, Qi
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Object detection is a fundamental task in remote sensing image analysis and scene understanding. Previous remote sensing object detectors are typically based on convolutional neural networks (CNNs), whose performance is significantly limited by the intrinsic locality of convolution operations. The emergence of vision Transformers brings potential solutions to this problem, as they have the capability to serve as a solid alternative to CNNs. However, three crucial obstacles hinder the application and performance of Transformers in the task of remote sensing object detection, namely: 1) high computational complexity, especially for high-resolution remote sensing images; 2) training and sample inefficiency caused by the lack of inductive bias; and 3) difficulty in learning the arbitrary orientation knowledge of geospatial objects. To address these issues, in this article, a novel efficient inductive vision Transformer framework is proposed for oriented object detection in remote sensing imagery. This framework follows the hierarchical feature pyramid structure and makes threefold contributions as follows: 1) spatial redundancy in remote sensing images is fully explored, and an adaptive multigrained routing mechanism is proposed to facilitate token sparsity in Transformers, which can dramatically reduce the computational cost without compromising accuracy; 2) a compact dual-path encoding architecture, in which both global long-range dependencies and local semantic relations are jointly and complementarily captured, is proposed to enhance the inductive bias in Transformers; and 3) an angle tokenization technique is proposed to promote the encoding, embedding, and learning of direction knowledge for oriented objects in remote sensing scenarios. In this work, the above-mentioned three contributions are instantiated in an advanced Transformer-based object detector, namely, the EIA-pyramid vision Transformer (PVT). Comprehensive experiments on two publicly available datasets demonstrate its effectiveness and superiority for oriented object detection in remote sensing images.
AB - Object detection is a fundamental task in remote sensing image analysis and scene understanding. Previous remote sensing object detectors are typically based on convolutional neural networks (CNNs), whose performance is significantly limited by the intrinsic locality of convolution operations. The emergence of vision Transformers brings potential solutions to this problem, as they have the capability to serve as a solid alternative to CNNs. However, three crucial obstacles hinder the application and performance of Transformers in the task of remote sensing object detection, namely: 1) high computational complexity, especially for high-resolution remote sensing images; 2) training and sample inefficiency caused by the lack of inductive bias; and 3) difficulty in learning the arbitrary orientation knowledge of geospatial objects. To address these issues, in this article, a novel efficient inductive vision Transformer framework is proposed for oriented object detection in remote sensing imagery. This framework follows the hierarchical feature pyramid structure and makes threefold contributions as follows: 1) spatial redundancy in remote sensing images is fully explored, and an adaptive multigrained routing mechanism is proposed to facilitate token sparsity in Transformers, which can dramatically reduce the computational cost without compromising accuracy; 2) a compact dual-path encoding architecture, in which both global long-range dependencies and local semantic relations are jointly and complementarily captured, is proposed to enhance the inductive bias in Transformers; and 3) an angle tokenization technique is proposed to promote the encoding, embedding, and learning of direction knowledge for oriented objects in remote sensing scenarios. In this work, the above-mentioned three contributions are instantiated in an advanced Transformer-based object detector, namely, the EIA-pyramid vision Transformer (PVT). Comprehensive experiments on two publicly available datasets demonstrate its effectiveness and superiority for oriented object detection in remote sensing images.
KW - Adaptive tokens
KW - inductive biases
KW - object detection
KW - remote sensing imagery
KW - vision Transformers
UR - http://www.scopus.com/inward/record.url?scp=85164436751&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2023.3292418
DO - 10.1109/TGRS.2023.3292418
M3 - Article
AN - SCOPUS:85164436751
SN - 0196-2892
VL - 61
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5616320
ER -