Transformer-Based Person Detection in Paired RGB-T Aerial Images with VTSaR Dataset

Xiangqing Zhang, Yan Feng, Nan Wang, Guohua Lu, Shaohui Mei

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent co-rescuing. However, existing person detection models designed for aerial images rely heavily on numerous labeled instances and exhibit limited tolerance of the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios. This article presents the visible-thermal from SaR scenarios for person detection network (VTSaRNet) to address the challenge of detecting persons situated sparsely in SaR scenes marked by intricate illumination conditions and restricted accessibility. VTSaRNet integrates the instance segmentation for copy-paste mechanism (ISCP) using a Union Transformer Network that operates on both the visible (V) and thermal (T) modalities. Specifically, this study employs synthetic samples obtained through offline Mosaic augmentation by oversampling the local areas of bulk images. It then uses the ISCP module to extract accurate boundaries of person instances from complex backgrounds. VTSaRNet cross-integrates the global features and encodes the correlations between the two modalities through the multihead attention module. It also adaptively recalibrates the channel responses of partial feature maps for fusion operations with the transformer module in conjunction with anchor-based detectors. Moreover, the adaptation scheme is constructed with multiple strategies to effectively handle various scenarios involving persons, and the entire network is trained end-to-end. Extensive experiments conducted on the Heridal and VTSaR datasets demonstrate the effectiveness of the lightweight VTSaRNet, which achieves a precision of 98.3%, a recall of 96.78%, an mAP@0.5 of 98.73%, and an mAP@0.5:0.95 of 73.98% on the self-built VTSaR dataset. This performance sets a new benchmark in person detection from aerial imagery.
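
The copy-paste idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' ISCP implementation: it assumes the visible and thermal images are pixel-aligned, that an instance mask is already available (the abstract obtains boundaries through instance segmentation), and that images are single-channel grids stored as nested lists. The names `paste_instance` and `paste_pair` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Grid = List[List[int]]  # a single-channel image as rows of pixel values


def paste_instance(background: Grid, patch: Grid, mask: Grid, top: int, left: int) -> Grid:
    """Return a copy of `background` with the masked pixels of `patch` pasted at (top, left)."""
    out = [row[:] for row in background]  # leave the input image unmodified
    for r, (patch_row, mask_row) in enumerate(zip(patch, mask)):
        for c, (value, keep) in enumerate(zip(patch_row, mask_row)):
            y, x = top + r, left + c
            # Copy only pixels that lie inside the instance mask and inside the image.
            if keep and 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = value
    return out


def paste_pair(visible: Grid, thermal: Grid, vis_patch: Grid, thr_patch: Grid,
               mask: Grid, top: int, left: int) -> Tuple[Grid, Grid]:
    """Paste the same instance at the same location in both modalities.

    Assumption: the two images are pixel-aligned, so one mask and one
    offset keep the synthetic pair consistent.
    """
    return (
        paste_instance(visible, vis_patch, mask, top, left),
        paste_instance(thermal, thr_patch, mask, top, left),
    )


# Toy 4x4 "images" and a 2x2 person patch with an L-shaped mask.
visible = [[0] * 4 for _ in range(4)]
thermal = [[1] * 4 for _ in range(4)]
mask = [[1, 0], [1, 1]]
vis_patch = [[9, 9], [9, 9]]
thr_patch = [[7, 7], [7, 7]]

new_visible, new_thermal = paste_pair(visible, thermal, vis_patch, thr_patch, mask, top=1, left=2)
```

Because only masked pixels are copied, the background around the pasted person is preserved, which is the reason accurate instance boundaries matter for this kind of augmentation. A practical pipeline would typically use an array library and would also update the bounding-box labels for the pasted instance.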

Original language: English
Pages (from-to): 5082-5099
Number of pages: 18
Journal: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Volume: 18
State: Published - 2025

Keywords

  • Aerial-based person detection
  • VTSaR dataset
  • bimodality transformer
  • instance segmentation for copy-paste (ISCP)
