Visual Object Tracking with Multi-Frame Distractor Suppression

Yamin Han; Mingyu Cai; Jie Wu; Zhixuan Bai; Tao Zhuo; Hongming Zhang; Yanning Zhang

doi:10.1109/TCSVT.2024.3489797

School of Computer Science

Northwest Agriculture and Forestry University
Shaanxi Agric. Information Intelligent Sensing and Analysis Engineering Technology Research Center
The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology

Research output: Contribution to journal › Article › peer-review

Abstract

With the rapid development of CNN or Transformer, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based trackers. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates' history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.
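
The association step described in the abstract (encode each candidate's history, then match current candidate queries against the encoded trajectories) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `encode_history` and `associate` are hypothetical, the "encoder" is replaced by a plain average over per-frame features, and the "decoder" is replaced by unlearned scaled dot-product attention, purely to show the data flow under those assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_history(trajectory):
    """Stand-in for the encoder (assumption): average the per-frame
    feature vectors of one candidate's history into a single vector."""
    dim = len(trajectory[0])
    count = len(trajectory)
    return [sum(frame[i] for frame in trajectory) / count for i in range(dim)]


def associate(queries, trajectories):
    """Stand-in for the decoder (assumption): scaled dot-product attention
    from each current candidate query to every encoded trajectory.

    Returns one (best_trajectory_index, attention_weights) pair per query.
    """
    memory = [encode_history(t) for t in trajectories]
    scale = math.sqrt(len(memory[0]))
    results = []
    for query in queries:
        scores = [sum(q * m for q, m in zip(query, mem)) / scale for mem in memory]
        weights = softmax(scores)
        best = max(range(len(weights)), key=weights.__getitem__)
        results.append((best, weights))
    return results


# Trajectory 0 plays the target, trajectory 1 a distractor (made-up features).
history = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
# Two candidates detected in the current frame (made-up features).
current = [[0.1, 1.0], [1.0, 0.05]]

for index, (best, weights) in enumerate(associate(current, history)):
    print(f"candidate {index} -> trajectory {best}")
```

In this toy setup the candidate that resembles the distractor's history is linked to the distractor trajectory rather than to the target, which is the intuition behind suppressing distractors across frames.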

Original language	English
Pages (from-to)	2556-2569
Number of pages	14
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Volume	35
Issue number	3
DOIs	https://doi.org/10.1109/TCSVT.2024.3489797
State	Published - 2025

Keywords

Visual object tracking
multi-frame distractor
temporal and distractor-aware transformer

Access to Document

10.1109/TCSVT.2024.3489797

Cite this

@article{9dc3c988b9884566abf9bba2da1b0629,

title = "Visual Object Tracking with Multi-Frame Distractor Suppression",

abstract = "With the rapid development of CNN or Transformer, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based trackers. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates' history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT on the LaSOT, LaSOT\_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.",

keywords = "Visual object tracking, multi-frame distractor, temporal and distractor-aware transformer",

author = "Yamin Han and Mingyu Cai and Jie Wu and Zhixuan Bai and Tao Zhuo and Hongming Zhang and Yanning Zhang",

note = "Publisher Copyright: {\textcopyright} 1991-2012 IEEE.",

year = "2025",

doi = "10.1109/TCSVT.2024.3489797",

language = "English",

volume = "35",

pages = "2556--2569",

journal = "IEEE Transactions on Circuits and Systems for Video Technology",

issn = "1051-8215",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "3",

}

TY - JOUR

T1 - Visual Object Tracking with Multi-Frame Distractor Suppression

AU - Han, Yamin

AU - Cai, Mingyu

AU - Wu, Jie

AU - Bai, Zhixuan

AU - Zhuo, Tao

AU - Zhang, Hongming

AU - Zhang, Yanning

PY - 2025

Y1 - 2025

N2 - With the rapid development of CNN or Transformer, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based trackers. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates' history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.

AB - With the rapid development of CNN or Transformer, the present mainstream approaches regard an image patch as the reference of the target to perform tracking, which is known as template matching-based trackers. However, most existing template matching-based trackers only consider the per-frame localization accuracy, neglecting the potential distractor (similar object) dependencies among multiple video frames, which poses a fundamental challenge in template matching-based tracking. In this work, we propose a novel comprehensive framework with multi-frame distractor suppression for visual object tracking (MFDSTrack), which explicitly models the temporal history of both the target object and potential distractors. Specifically, we utilize a universal target candidate generation module to detect target candidates (both target and distractors), providing a holistic view of the scene. In addition, a temporal and distractor-aware association module is designed to suppress multi-frame distractors by adopting a simple encoder-decoder Transformer architecture. The encoder accepts inputs of target candidates' history, while the decoder takes current target candidate queries and the output of the encoder as inputs to associate current target candidate queries with historical trajectories. We extensively evaluate our trackers, MFDSTrack-SD, MFDSTrack-OS, MFDSTrack-GRM, and MFDSTrack-LT on the LaSOT, LaSOT_ext, TrackingNet, GOT-10k, UAV123, NFS, and OTB100 benchmarks. Extensive experiments show that our methods outperform previous state-of-the-art trackers on seven tracking benchmarks.

KW - Visual object tracking

KW - multi-frame distractor

KW - temporal and distractor-aware transformer

UR - http://www.scopus.com/inward/record.url?scp=86000435319&partnerID=8YFLogxK

U2 - 10.1109/TCSVT.2024.3489797

DO - 10.1109/TCSVT.2024.3489797

M3 - Article

AN - SCOPUS:86000435319

SN - 1051-8215

VL - 35

SP - 2556

EP - 2569

JO - IEEE Transactions on Circuits and Systems for Video Technology

JF - IEEE Transactions on Circuits and Systems for Video Technology

IS - 3

ER -
