TY - JOUR
T1 - Visual Object Tracking with Multi-Frame Distractor Suppression
AU - Han, Yamin
AU - Cai, Mingyu
AU - Wu, Jie
AU - Bai, Zhixuan
AU - Zhuo, Tao
AU - Zhang, Hongming
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - With the rapid development of CNN or Transformer, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based trackers. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates' history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT on the LaSOT, LaSOText, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmark. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.
AB - With the rapid development of CNN or Transformer, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based trackers. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates' history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT on the LaSOT, LaSOText, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmark. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.
KW - Visual object tracking
KW - multi-frame distractor
KW - temporal and distractor-aware transformer
UR - http://www.scopus.com/inward/record.url?scp=86000435319&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2024.3489797
DO - 10.1109/TCSVT.2024.3489797
M3 - 文章
AN - SCOPUS:86000435319
SN - 1051-8215
VL - 35
SP - 2556
EP - 2569
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 3
ER -