TY - JOUR
T1 - Transformer-Based Person Detection in Paired RGB-T Aerial Images with VTSaR Dataset
AU - Zhang, Xiangqing
AU - Feng, Yan
AU - Wang, Nan
AU - Lu, Guohua
AU - Mei, Shaohui
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2025
Y1 - 2025
AB - Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent co-rescuing. However, existing person detection models designed for aerial images rely heavily on large numbers of labeled instances and exhibit limited tolerance of the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios. This article presents the visible-thermal from SaR scenarios for person detection network (VTSaRNet) to address the challenge of detecting sparsely distributed persons in SaR scenes marked by intricate illumination conditions and restricted accessibility. VTSaRNet integrates the instance segmentation for copy-paste mechanism (ISCP) with a Union Transformer Network that operates on both the Visible (V) and Thermal (T) modalities. Specifically, this study employs synthetic samples obtained through offline Mosaic augmentation by oversampling the local area of bulk images. It then uses the ISCP module to extract accurate boundaries of person instances from complex backgrounds. VTSaRNet cross-integrates the global features and encodes the correlations between the two modalities through the multihead attention module. It also adaptively recalibrates the channel responses of partial feature maps for fusion operations with the transformer module in conjunction with anchor-based detectors. Moreover, the adaptation scheme is constructed with multiple strategies to effectively handle various scenarios involving persons, and the entire network is trained end-to-end. Extensive experiments conducted on the Heridal and VTSaR datasets demonstrate the effectiveness of the lightweight VTSaRNet, which achieves a precision of 98.3%, recall of 96.78%, mAP@0.5 of 98.73%, and mAP@0.5:0.95 of 73.98% on the self-built VTSaR dataset. This performance sets a new benchmark in person detection from aerial imagery.
KW - Aerial-based person detection
KW - VTSaR dataset
KW - bimodality transformer
KW - instance segmentation for copy-paste (ISCP)
UR - http://www.scopus.com/inward/record.url?scp=85214529563&partnerID=8YFLogxK
U2 - 10.1109/JSTARS.2025.3526995
DO - 10.1109/JSTARS.2025.3526995
M3 - Article
AN - SCOPUS:85214529563
SN - 1939-1404
VL - 18
SP - 5082
EP - 5099
JO - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
JF - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
ER -