TY - JOUR
T1 - MS-DETR
T2 - Multispectral Pedestrian Detection Transformer With Loosely Coupled Fusion and Modality-Balanced Optimization
AU - Xing, Yinghui
AU - Yang, Shuo
AU - Wang, Song
AU - Zhang, Shizhou
AU - Liang, Guoqiang
AU - Zhang, Xiuwei
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2000-2011 IEEE.
PY - 2024
Y1 - 2024
N2 - Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities can provide complementary information especially under low light conditions. Due to the presence of two modalities, misalignment and modality imbalance are the most significant issues in multispectral pedestrian detection. In this paper, we propose MultiSpectral pedestrian DEtection TRansformer (MS-DETR) to fix above issues. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder, and the visible and thermal features are fused in the multi-modal Transformer decoder. To well resist the misalignment between multi-modal images, we design a loosely coupled fusion strategy by sparsely sampling some keypoints from multi-modal features independently and fusing them with adaptively learned attention weights. Moreover, based on the insight that not only different modalities, but also different pedestrian instances tend to have different confidence scores to final detection, we further propose an instance-aware modality-balanced optimization strategy, which preserves visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code is available at https://github.com/YinghuiXing/MS-DETR.
AB - Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities can provide complementary information especially under low light conditions. Due to the presence of two modalities, misalignment and modality imbalance are the most significant issues in multispectral pedestrian detection. In this paper, we propose MultiSpectral pedestrian DEtection TRansformer (MS-DETR) to fix above issues. MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder, and the visible and thermal features are fused in the multi-modal Transformer decoder. To well resist the misalignment between multi-modal images, we design a loosely coupled fusion strategy by sparsely sampling some keypoints from multi-modal features independently and fusing them with adaptively learned attention weights. Moreover, based on the insight that not only different modalities, but also different pedestrian instances tend to have different confidence scores to final detection, we further propose an instance-aware modality-balanced optimization strategy, which preserves visible and thermal decoder branches and aligns their predicted slots through an instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code is available at https://github.com/YinghuiXing/MS-DETR.
KW - end-to-end detector
KW - loosely coupled fusion
KW - modality-balanced optimization
KW - Multispectral pedestrian detection
UR - http://www.scopus.com/inward/record.url?scp=85210918883&partnerID=8YFLogxK
U2 - 10.1109/TITS.2024.3450584
DO - 10.1109/TITS.2024.3450584
M3 - 文章
AN - SCOPUS:85210918883
SN - 1524-9050
VL - 25
SP - 20628
EP - 20642
JO - IEEE Transactions on Intelligent Transportation Systems
JF - IEEE Transactions on Intelligent Transportation Systems
IS - 12
ER -