TY - JOUR
T1 - Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking
AU - Wang, Shilei
AU - Wang, Zhenhua
AU - Sun, Qianqian
AU - Cheng, Gong
AU - Ning, Jifeng
N1 - Publisher Copyright:
© 1992-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Recently, one-stream trackers have achieved parallel feature extraction and relation modeling by exploiting Transformer-based architectures. This design greatly improves tracking performance. However, because one-stream trackers often overlook crucial tracking cues beyond the template, they are prone to unsatisfactory results in complex tracking scenarios. To tackle these challenges, we propose a multi-cue single-stream tracker, dubbed MCTrack, which seamlessly integrates template information, historical trajectory, historical frames, and the search region for synchronized feature extraction and relation modeling. To achieve this, we employ two types of encoders to convert the template, historical frames, search region, and historical trajectory into tokens, which are then collectively fed into a Transformer architecture. To distill temporal and spatial cues, we introduce a novel adaptive update mechanism, which incorporates a thresholding component and a local multi-peak component to filter out less accurate and overly disturbed tracking cues. Empirically, MCTrack achieves leading performance on mainstream benchmark datasets, surpassing the state-of-the-art SeqTrack by 2.0% in terms of the AO metric on GOT-10k. The code is available at https://github.com/wsumel/MCTrack.
AB - Recently, one-stream trackers have achieved parallel feature extraction and relation modeling by exploiting Transformer-based architectures. This design greatly improves tracking performance. However, because one-stream trackers often overlook crucial tracking cues beyond the template, they are prone to unsatisfactory results in complex tracking scenarios. To tackle these challenges, we propose a multi-cue single-stream tracker, dubbed MCTrack, which seamlessly integrates template information, historical trajectory, historical frames, and the search region for synchronized feature extraction and relation modeling. To achieve this, we employ two types of encoders to convert the template, historical frames, search region, and historical trajectory into tokens, which are then collectively fed into a Transformer architecture. To distill temporal and spatial cues, we introduce a novel adaptive update mechanism, which incorporates a thresholding component and a local multi-peak component to filter out less accurate and overly disturbed tracking cues. Empirically, MCTrack achieves leading performance on mainstream benchmark datasets, surpassing the state-of-the-art SeqTrack by 2.0% in terms of the AO metric on GOT-10k. The code is available at https://github.com/wsumel/MCTrack.
KW - adaptive update
KW - spatial-temporal modeling
KW - transformer
KW - visual object tracking
UR - http://www.scopus.com/inward/record.url?scp=85204148266&partnerID=8YFLogxK
U2 - 10.1109/TIP.2024.3453028
DO - 10.1109/TIP.2024.3453028
M3 - Article
C2 - 39250370
AN - SCOPUS:85204148266
SN - 1057-7149
VL - 33
SP - 5073
EP - 5085
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -