TY - JOUR
T1 - Mining representative tokens via transformer-based multi-modal interaction for RGB-T tracking
AU - Lai, Pujian
AU - Gao, Dong
AU - Wang, Shilei
AU - Cheng, Gong
N1 - Publisher Copyright: © 2025 Elsevier Ltd
PY - 2026/3
Y1 - 2026/3
N2 - RGB-T tracking leverages the complementarity of visible and thermal modalities for robust performance in challenging environments. However, previous RGB-T trackers are vulnerable to irrelevant backgrounds and ignore the modality gap. To address these issues, we propose MRTTrack, a Transformer-based RGB-T tracking framework consisting of a multi-modal separate-then-collaborative (MSC) module and a cross-modal discrepancy constraint (CDC). Specifically, the MSC module is designed to mitigate irrelevant background interference and operates in two stages: target-oriented token selection and multi-modal token interaction. By recursively aggregating attention maps across layers, the target-oriented token selection produces an index mask for representative tokens, which is then used to guide multi-modal token interaction via mask-based attention. Additionally, the CDC enforces consistency across modalities on non-representative tokens, thereby alleviating the performance degradation caused by the modality gap. Comprehensive evaluations on the LasHeR, RGBT210, RGBT234, and VTUAV benchmarks demonstrate the strong tracking performance and notable robustness improvements of our method. The code is available at https://github.com/gao5yy/MRTTrack.
KW - Cross-modal discrepancy constraint
KW - RGB-T tracking
KW - Representative token
KW - Target-oriented token selection
KW - Transformer-based tracker
UR - https://www.scopus.com/pages/publications/105011398187
U2 - 10.1016/j.patcog.2025.112162
DO - 10.1016/j.patcog.2025.112162
M3 - Article
AN - SCOPUS:105011398187
SN - 0031-3203
VL - 171
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 112162
ER -