Learning Cross-Attention Discriminators via Alternating Time-Space Transformers for Visual Tracking

Wuwei Wang, Ke Zhang, Yu Su, Jingyu Wang, Qi Wang

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

In the past few years, visual tracking methods with convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate the above issue by combining CNNs with Transformers to enhance the feature representation. In contrast to the methods mentioned above, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map solely leverage attention without convolution. Inspired by the recent vision transformers (ViTs), we propose the multistage alternating time-space Transformers (ATSTs) to learn robust feature representation. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. Subsequently, a cross-attention discriminator is proposed to directly generate response maps of the search region without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it achieves performance comparable to recent 'CNN + Transformer' trackers on various benchmarks while requiring significantly less training data.
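To make the two ideas in the abstract concrete, the sketch below shows (1) a stage that alternates temporal self-attention (tokens at the same spatial location attending across frames) with spatial self-attention (tokens within a frame attending to each other), and (2) a cross-attention discriminator that scores search-region tokens against template tokens to produce a response map. This is a minimal PyTorch illustration under assumed module names, token layouts, and dimensions; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class AlternatingTimeSpaceStage(nn.Module):
    """One stage applying temporal attention (across frames, per spatial
    location) followed by spatial attention (across locations, per frame),
    each with its own Transformer encoder layer."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) = batch, frames, spatial tokens, channels
        b, t, n, c = x.shape
        # Temporal attention: tokens at the same location attend across frames.
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        x = self.temporal(x)
        x = x.reshape(b, n, t, c).permute(0, 2, 1, 3)
        # Spatial attention: tokens within each frame attend to one another.
        x = x.reshape(b * t, n, c)
        x = self.spatial(x)
        return x.reshape(b, t, n, c)


class CrossAttentionDiscriminator(nn.Module):
    """Cross-attention between search-region tokens (queries) and template
    tokens (keys/values); a linear layer maps each attended search token to
    a scalar score, yielding a response map without a separate prediction
    head or correlation filter."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, search_tokens, template_tokens, map_size):
        # search_tokens: (B, Ns, C), template_tokens: (B, Nt, C)
        attended, _ = self.cross_attn(search_tokens, template_tokens, template_tokens)
        attended = self.norm(search_tokens + attended)
        response = self.score(attended).squeeze(-1)  # (B, Ns)
        h, w = map_size
        return response.view(-1, h, w)               # (B, H, W) response map


if __name__ == "__main__":
    stage = AlternatingTimeSpaceStage()
    disc = CrossAttentionDiscriminator()
    tokens = torch.randn(2, 3, 64, 256)              # 2 clips, 3 frames, 8x8 tokens
    feats = stage(tokens)
    search, template = feats[:, -1], feats[:, 0]     # last frame as search, first as template
    print(disc(search, template, (8, 8)).shape)      # torch.Size([2, 8, 8])
```

In the paper, several such stages are stacked into a multistage backbone; the toy usage above simply treats the first frame as the template and the last as the search region to show how the pieces fit together.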

Original language: English
Pages (from-to): 15156-15169
Number of pages: 14
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 35
Issue: 11
DOI
Publication status: Published - 2024
