TY - JOUR
T1 - LINR
T2 - A Plug-and-Play Local Implicit Neural Representation Module for Visual Object Tracking
AU - Chen, Yao
AU - Jia, Guancheng
AU - Zha, Yufei
AU - Zhang, Peng
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Current one-stream trackers suffer from limitations in distinguishing targets from complex backgrounds owing to their uniform token division strategy. By treating all regions equally, these methods allocate inadequate attention to crucial target details while overemphasizing redundant background information. Consequently, their performance deteriorates significantly in scenarios involving similar distractors or background clutter. In this work, we propose a Local Implicit Neural Representation (LINR) module specifically designed for local fine-grained object modeling. It consists of two key modules: (1) Local Window Selection: Leveraging template-guided CNN-based cross-correlation, it accurately identifies crucial target-relevant regions, reducing background information redundancy and computational burden. (2) INR-based Window Refinement: Using implicit neural networks, it optimizes token density and spatial continuity to improve local fine-grained instance-level representations, enhancing the ability to discriminate between the target and the background. Moreover, the LINR module exhibits three remarkable advantages as a generalized enhancement for visual tracking. Firstly, it is plug-and-play, seamlessly integrating into existing one-stream trackers, both non-real-time and real-time, without architectural modifications, achieving significant performance improvements. Secondly, it is highly portable, since it does not introduce new loss functions, additional training strategies, or data. Thirdly, it is efficiency-friendly, having minimal impact on model parameters and tracking speed; e.g., AQATrack-LINR increases the parameter count by only 1.9% and reduces the tracking speed by only 6 fps. We incorporate the LINR module into two non-real-time trackers, OSTrack based on ViT-B and AQATrack based on HiViT-B, and one real-time tracker, FERMT based on ViT-tiny. The resultant OSTrack-LINR, AQATrack-LINR, and FERMT-LINR achieve state-of-the-art performance across seven widely used datasets, including TrackingNet, LaSOT, and NFS30.
AB - Current one-stream trackers suffer from limitations in distinguishing targets from complex backgrounds owing to their uniform token division strategy. By treating all regions equally, these methods allocate inadequate attention to crucial target details while overemphasizing redundant background information. Consequently, their performance deteriorates significantly in scenarios involving similar distractors or background clutter. In this work, we propose a Local Implicit Neural Representation (LINR) module specifically designed for local fine-grained object modeling. It consists of two key modules: (1) Local Window Selection: Leveraging template-guided CNN-based cross-correlation, it accurately identifies crucial target-relevant regions, reducing background information redundancy and computational burden. (2) INR-based Window Refinement: Using implicit neural networks, it optimizes token density and spatial continuity to improve local fine-grained instance-level representations, enhancing the ability to discriminate between the target and the background. Moreover, the LINR module exhibits three remarkable advantages as a generalized enhancement for visual tracking. Firstly, it is plug-and-play, seamlessly integrating into existing one-stream trackers, both non-real-time and real-time, without architectural modifications, achieving significant performance improvements. Secondly, it is highly portable, since it does not introduce new loss functions, additional training strategies, or data. Thirdly, it is efficiency-friendly, having minimal impact on model parameters and tracking speed; e.g., AQATrack-LINR increases the parameter count by only 1.9% and reduces the tracking speed by only 6 fps. We incorporate the LINR module into two non-real-time trackers, OSTrack based on ViT-B and AQATrack based on HiViT-B, and one real-time tracker, FERMT based on ViT-tiny. The resultant OSTrack-LINR, AQATrack-LINR, and FERMT-LINR achieve state-of-the-art performance across seven widely used datasets, including TrackingNet, LaSOT, and NFS30.
KW - Fine-grained object modeling
KW - Implicit neural representation
KW - Visual object tracking
UR - http://www.scopus.com/inward/record.url?scp=105007907179&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2025.3578667
DO - 10.1109/TCSVT.2025.3578667
M3 - Article
AN - SCOPUS:105007907179
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -