Abstract
Visible–infrared person re-identification in real-world surveillance video remains challenging due to spectrum-induced appearance gaps, cluttered backgrounds, and temporal perturbations. A dual-stream Transformer framework, DST-Net, is introduced to learn modality-specific and modality-shared representations for effective cross-modality alignment. Bidirectional cross-attention is employed to exchange complementary cues between visible and infrared streams, multi-factor graph optimization is used to enforce topology-consistent features, and a multi-mask triplet strategy is adopted to emphasize foreground-relevant supervision. Temporal Identity-Structured Matting is further incorporated to generate temporally consistent foreground alpha mattes, enabling a closed-loop detection–matting–recognition pipeline for robust retrieval. A large-scale surveillance-style benchmark, NPU-ReID, is also released, collected by an eight-camera synchronized RGB and infrared array. On SYSU-MM01, 84.16% Rank-1 and 79.63% mAP are achieved; on RegDB, 92.07% Rank-1 and 86.02% mAP are obtained under the visible-to-infrared setting; and on NPU-ReID, 94.41% Rank-1 and 84.92% mAP are reached. In real-world multi-camera tests, an average throughput of 32.95 fps is reported, together with 97% detection accuracy and 97% Rank-5 retrieval accuracy. The dataset and associated resources are available at https://github.com/YzZhu07/NPU-ReID.
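The Rank-1, Rank-5 and mAP figures quoted above are standard retrieval metrics. The following is a minimal sketch of how such scores can be computed from a query-by-gallery distance matrix; it is not the authors' evaluation code. The function name `rank_k_and_map` and the toy data are hypothetical, and the sketch omits protocol details that benchmark evaluations typically add, such as camera-based gallery filtering and repeated gallery sampling.

```python
from typing import List, Sequence, Tuple


def rank_k_and_map(
    dist: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
    k: int = 1,
) -> Tuple[float, float]:
    """Return (Rank-k accuracy, mAP) for a query-by-gallery distance matrix.

    Hypothetical helper for illustration only: dist[q][g] is the distance
    between query q and gallery item g (smaller means more similar).
    """
    hits = 0
    average_precisions: List[float] = []
    for q, row in enumerate(dist):
        # Rank gallery items from most to least similar.
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue  # queries without any true match are skipped
        # Rank-k: is there a correct identity among the top k results?
        hits += int(any(matches[:k]))
        # Average precision: mean of the precision at each correct match.
        found = 0
        precisions: List[float] = []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    valid = len(average_precisions)
    if valid == 0:
        raise ValueError("no query has a matching gallery identity")
    return hits / valid, sum(average_precisions) / valid


# Toy example: two queries (e.g. infrared) against three gallery items (e.g. visible).
toy_dist = [
    [0.1, 0.9, 0.4],  # query 0, identity 1
    [0.2, 0.5, 0.3],  # query 1, identity 2
]
toy_query_ids = [1, 2]
toy_gallery_ids = [1, 2, 1]

rank1, mean_ap = rank_k_and_map(toy_dist, toy_query_ids, toy_gallery_ids, k=1)
print(f"Rank-1 = {rank1:.2%}, mAP = {mean_ap:.2%}")
```

In this toy case the first query retrieves both of its matches at the top of the ranking, while the second finds its only match at rank 3, giving a Rank-1 of 0.5 and an mAP of 2/3. Raising `k` to 5 gives the Rank-5 variant reported for the multi-camera tests.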
| Original language | English |
|---|---|
| Article number | 133545 |
| Journal | Neurocomputing |
| Volume | 684 |
| State | Published - 1 Jul 2026 |
Keywords
- Dual-stream transformer
- Graph optimization
- Spatio-temporal matting
- Visible–infrared person re-identification
Title
DST-Net: A closed-loop dual-stream transformer with identity-guided video matting for visible–infrared person re-identification