Visual Object Tracking with Multi-Frame Distractor Suppression

Yamin Han; Mingyu Cai; Jie Wu; Zhixuan Bai; Tao Zhuo; Hongming Zhang; Yanning Zhang

doi:10.1109/TCSVT.2024.3489797

School of Computer Science

Northwest Agriculture and Forestry University
Shaanxi Agric. Information Intelligent Sensing and Analysis Engineering Technology Research Center
The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology

Research output: Contribution to journal › Article › peer-review

Abstract

With the rapid development of CNNs and Transformers, current mainstream approaches use an image patch as the reference of the target to perform tracking; these are known as template matching-based trackers. However, most existing template matching-based trackers consider only per-frame localization accuracy and neglect the dependencies among potential distractors (similar objects) across multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts the history of target candidates as input, while the decoder takes the current target candidate queries and the encoder output as inputs to associate the current queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT, on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on these seven tracking benchmarks.
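The association step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each candidate is a small feature vector, replaces the Transformer encoder with a plain average over each trajectory's history, and replaces the decoder with a single scaled dot-product attention step. All function names (`summarize_history`, `associate`, `pick_target`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_history(history: Sequence[Vector]) -> List[float]:
    """Encoder stand-in (assumption): average one trajectory's per-frame features."""
    frames = len(history)
    dims = len(history[0])
    return [sum(frame[i] for frame in history) / frames for i in range(dims)]


def associate(queries: Sequence[Vector],
              trajectories: Sequence[Sequence[Vector]]) -> List[List[float]]:
    """Decoder stand-in (assumption): attention from each current candidate
    query to each historical trajectory summary. Row i holds the association
    weights of candidate i over all trajectories and sums to 1."""
    memories = [summarize_history(t) for t in trajectories]
    scale = math.sqrt(len(memories[0]))
    return [softmax([dot(q, m) / scale for m in memories]) for q in queries]


def pick_target(weights: Sequence[Sequence[float]], target_index: int = 0) -> int:
    """Index of the current candidate most strongly tied to the target trajectory."""
    return max(range(len(weights)), key=lambda i: weights[i][target_index])


# Trajectory 0 is the target's history, trajectory 1 a distractor's history.
trajectories = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
# Two candidates detected in the current frame, in arbitrary order.
queries = [[0.1, 1.0], [1.0, 0.1]]

weights = associate(queries, trajectories)
print(pick_target(weights))  # candidate 1 matches the target trajectory
```

Because every candidate is scored against both the target history and the distractor histories, a look-alike in the current frame is explained by its own trajectory instead of competing with the target on appearance alone.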

Original language	English
Pages (from-to)	2556-2569
Number of pages	14
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Volume	35
Issue number	3
DOI	https://doi.org/10.1109/TCSVT.2024.3489797
Publication status	Published - 2025

Cite this

@article{9dc3c988b9884566abf9bba2da1b0629,
title = "Visual Object Tracking with Multi-Frame Distractor Suppression",
abstract = "With the rapid development of CNNs and Transformers, current mainstream approaches use an image patch as the reference of the target to perform tracking; these are known as template matching-based trackers. However, most existing template matching-based trackers consider only per-frame localization accuracy and neglect the dependencies among potential distractors (similar objects) across multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts the history of target candidates as input, while the decoder takes the current target candidate queries and the encoder output as inputs to associate the current queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT, on the LaSOT, LaSOT\_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on these seven tracking benchmarks.",
keywords = "Visual object tracking, multi-frame distractor, temporal and distractor-aware transformer",
author = "Yamin Han and Mingyu Cai and Jie Wu and Zhixuan Bai and Tao Zhuo and Hongming Zhang and Yanning Zhang",
note = "Publisher Copyright: {\textcopyright} 1991-2012 IEEE.",
year = "2025",
doi = "10.1109/TCSVT.2024.3489797",
language = "English",
volume = "35",
pages = "2556--2569",
journal = "IEEE Transactions on Circuits and Systems for Video Technology",
issn = "1051-8215",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "3",
}

TY  - JOUR
T1  - Visual Object Tracking with Multi-Frame Distractor Suppression
AU  - Han, Yamin
AU  - Cai, Mingyu
AU  - Wu, Jie
AU  - Bai, Zhixuan
AU  - Zhuo, Tao
AU  - Zhang, Hongming
AU  - Zhang, Yanning
PY  - 2025
Y1  - 2025
N2  - With the rapid development of CNNs and Transformers, current mainstream approaches use an image patch as the reference of the target to perform tracking; these are known as template matching-based trackers. However, most existing template matching-based trackers consider only per-frame localization accuracy and neglect the dependencies among potential distractors (similar objects) across multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts the history of target candidates as input, while the decoder takes the current target candidate queries and the encoder output as inputs to associate the current queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT, on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on these seven tracking benchmarks.
AB  - With the rapid development of CNNs and Transformers, current mainstream approaches use an image patch as the reference of the target to perform tracking; these are known as template matching-based trackers. However, most existing template matching-based trackers consider only per-frame localization accuracy and neglect the dependencies among potential distractors (similar objects) across multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts the history of target candidates as input, while the decoder takes the current target candidate queries and the encoder output as inputs to associate the current queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT, on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on these seven tracking benchmarks.
KW  - Visual object tracking
KW  - multi-frame distractor
KW  - temporal and distractor-aware transformer
UR  - http://www.scopus.com/inward/record.url?scp=86000435319&partnerID=8YFLogxK
U2  - 10.1109/TCSVT.2024.3489797
DO  - 10.1109/TCSVT.2024.3489797
M3  - Article
AN  - SCOPUS:86000435319
SN  - 1051-8215
VL  - 35
SP  - 2556
EP  - 2569
JO  - IEEE Transactions on Circuits and Systems for Video Technology
JF  - IEEE Transactions on Circuits and Systems for Video Technology
IS  - 3
ER  -
