TY - JOUR
T1 - DST-Net
T2 - A closed-loop dual-stream transformer with identity-guided video matting for visible–infrared person re-identification
AU - Zhu, Yanze
AU - Wang, Yumeng
AU - Fan, Rongbo
AU - Zhang, Jun
AU - Yang, Jianhua
N1 - Publisher Copyright: © 2026
PY - 2026/7/1
Y1 - 2026/7/1
N2 - Visible–infrared person re-identification in real-world surveillance video remains challenging due to spectrum-induced appearance gaps, cluttered backgrounds, and temporal perturbations. A dual-stream Transformer framework, DST-Net, is introduced to learn modality-specific and modality-shared representations for effective cross-modality alignment. Bidirectional cross-attention is employed to exchange complementary cues between visible and infrared streams, multi-factor graph optimization is used to enforce topology-consistent features, and a multi-mask triplet strategy is adopted to emphasize foreground-relevant supervision. Temporal Identity-Structured Matting is further incorporated to generate temporally consistent foreground alpha mattes, enabling a closed-loop detection–matting–recognition pipeline for robust retrieval. A large-scale surveillance-style benchmark, NPU-ReID, is also released, collected by an eight-camera synchronized RGB and infrared array. On SYSU-MM01, 84.16% Rank-1 and 79.63% mAP are achieved; on RegDB, 92.07% Rank-1 and 86.02% mAP are obtained under the visible-to-infrared setting; and on NPU-ReID, 94.41% Rank-1 and 84.92% mAP are reached. In real-world multi-camera tests, an average throughput of 32.95 fps is reported, together with 97% detection accuracy and 97% Rank-5 retrieval accuracy. The dataset and associated resources are available at https://github.com/YzZhu07/NPU-ReID.
KW - Dual-stream transformer
KW - Graph optimization
KW - Spatio-temporal matting
KW - Visible–infrared person re-identification
UR - https://www.scopus.com/pages/publications/105034973481
U2 - 10.1016/j.neucom.2026.133545
DO - 10.1016/j.neucom.2026.133545
M3 - Article
AN - SCOPUS:105034973481
SN - 0925-2312
VL - 684
JO - Neurocomputing
JF - Neurocomputing
M1 - 133545
ER -