Mining representative tokens via transformer-based multi-modal interaction for RGB-T tracking

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

RGB-T tracking leverages the complementarity of visible and thermal modalities for robust performance in challenging environments. However, previous RGB-T trackers are vulnerable to irrelevant backgrounds and ignore the modality gap. To address these issues, we propose MRTTrack, a Transformer-based RGB-T tracking framework consisting of a multi-modal separate-then-collaborative (MSC) module and a cross-modal discrepancy constraint (CDC). Specifically, the MSC module is designed to mitigate irrelevant background interference and operates in two stages: target-oriented token selection and multi-modal token interaction. By recursively aggregating attention maps across layers, the target-oriented token selection produces an index mask for representative tokens, which is then used to guide multi-modal token interaction via mask-based attention. Additionally, the CDC enforces consistency across modalities on non-representative tokens, thereby alleviating the performance degradation caused by the modality gap. Comprehensive evaluations on the LasHeR, RGBT210, RGBT234, and VTUAV benchmarks demonstrate the strong tracking performance and notable robustness improvements of our method. The code is available at https://github.com/gao5yy/MRTTrack.
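
The two MSC stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows one plausible reading of "recursively aggregating attention maps" and "mask-based attention". The function names, the 0.5 residual mixing weight, the scoring rule (mean attention from template tokens to search tokens), and the `keep` parameter are all assumptions made for illustration.

```python
import numpy as np

def aggregate_attention(attn_maps):
    """Recursively combine per-layer attention maps (each N x N, row-stochastic).

    Assumption: a rollout-style product with identity mixing stands in for
    the paper's aggregation rule, which the abstract does not spell out.
    """
    n = attn_maps[0].shape[0]
    rollout = np.eye(n)
    for attn in attn_maps:
        # Mix in the identity to mimic residual connections, then renormalise rows.
        mixed = 0.5 * attn + 0.5 * np.eye(n)
        mixed = mixed / mixed.sum(axis=-1, keepdims=True)
        rollout = mixed @ rollout
    return rollout

def select_representative(rollout, template_idx, search_idx, keep):
    """Boolean index mask: all template tokens plus the top-`keep` search tokens."""
    # Hypothetical score: mean aggregated attention from template to each search token.
    scores = rollout[np.ix_(template_idx, search_idx)].mean(axis=0)
    top = np.asarray(search_idx)[np.argsort(scores)[::-1][:keep]]
    mask = np.zeros(rollout.shape[0], dtype=bool)
    mask[template_idx] = True
    mask[top] = True
    return mask

def masked_cross_attention(q, k, v, key_mask):
    """Scaled dot-product attention that ignores keys where key_mask is False."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(key_mask[None, :], logits, -np.inf)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup: 2 template tokens, 4 search tokens, 3 layers, feature dimension 4.
rng = np.random.default_rng(0)
n, d = 6, 4
template_idx, search_idx = [0, 1], [2, 3, 4, 5]
layers = []
for _ in range(3):
    a = rng.random((n, n))
    layers.append(a / a.sum(axis=-1, keepdims=True))

rollout = aggregate_attention(layers)
mask = select_representative(rollout, template_idx, search_idx, keep=2)

# RGB queries attend only to the representative thermal tokens.
rgb = rng.standard_normal((n, d))
tir = rng.standard_normal((n, d))
fused, weights = masked_cross_attention(rgb, tir, tir, mask)
```

In this sketch the mask is computed once and reused for the cross-modal interaction, so tokens judged to be background contribute nothing to the fused features. Whether the actual tracker derives separate masks per modality, or how it applies the CDC to the remaining tokens, is not stated in the abstract and is left out here.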

Original language: English
Article number: 112162
Journal: Pattern Recognition
Volume: 171
State: Published - Mar 2026

Keywords

  • Cross-modal discrepancy constraint
  • RGB-T tracking
  • Representative token
  • Target-oriented token selection
  • Transformer-based tracker
