TY - JOUR
T1 - LINR
T2 - A Plug-and-Play Local Implicit Neural Representation Module for Visual Object Tracking
AU - Chen, Yao
AU - Jia, Guancheng
AU - Zha, Yufei
AU - Zhang, Peng
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Current one-stream trackers suffer from limitations in distinguishing targets from complex backgrounds owing to their uniform token division strategy. By treating all regions equally, these methods allocate inadequate attention to crucial target details while overemphasizing redundant background information. Consequently, their performance deteriorates significantly in scenarios involving similar distractors or background clutter. In this work, we propose a Local Implicit Neural Representation (LINR) module specifically designed for local fine-grained object modeling. It consists of two key modules: (1) Local Window Selection: Leveraging template-guided CNN-based cross-correlation, it accurately identifies crucial target-relevant regions, reducing background information redundancy and computational burden. (2) INR-based Window Refinement: Using implicit neural networks, it optimizes token density and spatial continuity to improve local fine-grained instance-level representations, enhancing the ability to discriminate between the target and the background. Moreover, the LINR module exhibits three remarkable advantages as a generalized enhancement for visual tracking. Firstly, it is plug-and-play, seamlessly integrating into existing one-stream trackers, both non-real-time and real-time, without architectural modifications, achieving significant performance improvements. Secondly, it is highly portable, since it does not introduce new loss functions, additional training strategies, or data. Thirdly, it is efficiency-friendly, having minimal impact on model parameters and tracking speed; e.g., AQATrack-LINR increases the parameter count by only 1.9% and reduces the tracking speed by only 6 fps. We incorporate the LINR module into two non-real-time trackers, OSTrack based on ViT-B and AQATrack based on HiViT-B, and one real-time tracker, FERMT based on ViT-tiny. The resultant OSTrack-LINR, AQATrack-LINR, and FERMT-LINR achieve state-of-the-art performance across seven widely used datasets, including TrackingNet, LaSOT, and NFS30.
AB - Current one-stream trackers suffer from limitations in distinguishing targets from complex backgrounds owing to their uniform token division strategy. By treating all regions equally, these methods allocate inadequate attention to crucial target details while overemphasizing redundant background information. Consequently, their performance deteriorates significantly in scenarios involving similar distractors or background clutter. In this work, we propose a Local Implicit Neural Representation (LINR) module specifically designed for local fine-grained object modeling. It consists of two key modules: (1) Local Window Selection: Leveraging template-guided CNN-based cross-correlation, it accurately identifies crucial target-relevant regions, reducing background information redundancy and computational burden. (2) INR-based Window Refinement: Using implicit neural networks, it optimizes token density and spatial continuity to improve local fine-grained instance-level representations, enhancing the ability to discriminate between the target and the background. Moreover, the LINR module exhibits three remarkable advantages as a generalized enhancement for visual tracking. Firstly, it is plug-and-play, seamlessly integrating into existing one-stream trackers, both non-real-time and real-time, without architectural modifications, achieving significant performance improvements. Secondly, it is highly portable, since it does not introduce new loss functions, additional training strategies, or data. Thirdly, it is efficiency-friendly, having minimal impact on model parameters and tracking speed; e.g., AQATrack-LINR increases the parameter count by only 1.9% and reduces the tracking speed by only 6 fps. We incorporate the LINR module into two non-real-time trackers, OSTrack based on ViT-B and AQATrack based on HiViT-B, and one real-time tracker, FERMT based on ViT-tiny. The resultant OSTrack-LINR, AQATrack-LINR, and FERMT-LINR achieve state-of-the-art performance across seven widely used datasets, including TrackingNet, LaSOT, and NFS30.
KW - Fine-grained object modeling
KW - Implicit neural representation
KW - Visual object tracking
UR - http://www.scopus.com/inward/record.url?scp=105007907179&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2025.3578667
DO - 10.1109/TCSVT.2025.3578667
M3 - Article
AN - SCOPUS:105007907179
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -