TY - JOUR
T1 - Learning Cross-Attention Discriminators via Alternating Time-Space Transformers for Visual Tracking
AU - Wang, Wuwei
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jingyu
AU - Wang, Qi
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2024
Y1 - 2024
N2 - In the past few years, visual tracking methods based on convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate this issue by combining CNNs with Transformers to enhance the feature representation. In contrast to the methods mentioned above, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map leverage attention alone, without convolution. Inspired by recent vision Transformers (ViTs), we propose multistage alternating time-space Transformers (ATSTs) to learn robust feature representations. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. Subsequently, a cross-attention discriminator is proposed to directly generate response maps of the search region without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it achieves performance comparable to recent 'CNN + Transformer' trackers on various benchmarks while requiring significantly less training data.
KW - Cross-attention discriminator
KW - multistage Transformers
KW - spatiotemporal information
KW - visual tracking
UR - http://www.scopus.com/inward/record.url?scp=85162883105&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2023.3282905
DO - 10.1109/TNNLS.2023.3282905
M3 - Article
C2 - 37339028
AN - SCOPUS:85162883105
SN - 2162-237X
VL - 35
SP - 15156
EP - 15169
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 11
ER -