TY - GEN
T1 - Unidirectional Cross-Modal Fusion for RGB-T Tracking
AU - Guo, Xiao
AU - Li, Hangfei
AU - Zha, Yufei
AU - Zhang, Peng
N1 - Publisher Copyright:
© 2024 The Authors.
PY - 2024/10/16
Y1 - 2024/10/16
N2 - The key issue in RGB-T tracking is obtaining an effective multimodal representation of the target by exploiting complementary RGB and TIR information. Previous methods based on template fusion or bidirectional search-template interaction can degrade the target representation, because noise from both the templates and the search regions is propagated. Meanwhile, directly fusing the search features alone, without any interaction with the templates, cannot fully exploit target-relevant contextual information. To mitigate these issues, we present UCTrack, which fuses complementary multimodal search features conditioned on undisturbed RGB and TIR template features. Specifically, we design a Unidirectional Cross-modal Fusion (UCF) module that minimizes the influence of background noise on the templates by pruning the unnecessary template-to-search cross-modal interaction, and mutually enhances the RGB and TIR search features with target-relevant information through multimodal spatial fusion. This module is seamlessly integrated into different layers of a ViT backbone to jointly perform feature extraction and cross-modal fusion for RGB-T tracking. Benefiting from the UCF module, UCTrack represents multimodal target features effectively and accurately without template-to-search interaction flow or direct template fusion; to the best of our knowledge, this is the first unidirectional cross-modal fusion paradigm for RGB-T tracking. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves state-of-the-art performance.
UR - http://www.scopus.com/inward/record.url?scp=85213369966&partnerID=8YFLogxK
U2 - 10.3233/FAIA240525
DO - 10.3233/FAIA240525
M3 - Conference contribution
AN - SCOPUS:85213369966
T3 - Frontiers in Artificial Intelligence and Applications
SP - 490
EP - 497
BT - ECAI 2024 - 27th European Conference on Artificial Intelligence, Including 13th Conference on Prestigious Applications of Intelligent Systems, PAIS 2024, Proceedings
A2 - Endriss, Ulle
A2 - Melo, Francisco S.
A2 - Bach, Kerstin
A2 - Bugarin-Diz, Alberto
A2 - Alonso-Moral, Jose M.
A2 - Barro, Senen
A2 - Heintz, Fredrik
PB - IOS Press BV
T2 - 27th European Conference on Artificial Intelligence, ECAI 2024
Y2 - 19 October 2024 through 24 October 2024
ER -
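
Note: the following is a minimal, illustrative PyTorch sketch of the unidirectional cross-modal fusion pattern described in the abstract above: search features query the template features of their own modality (template-to-search flow is pruned, so templates stay undisturbed), and the RGB and TIR search streams then mutually enhance each other. It is not the authors' implementation; all names (UCFBlock, dim, num_heads) and design details are assumptions.

# Hedged sketch of unidirectional cross-modal fusion, assuming a
# token-based ViT setting. Not the UCTrack source code.
import torch
import torch.nn as nn


class UCFBlock(nn.Module):
    """Unidirectional fusion: search tokens attend to template tokens
    (search -> template only; templates are never updated from search),
    then RGB and TIR search features mutually enhance each other."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Search-to-template attention, one per modality.
        self.s2t_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.s2t_tir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-modal enhancement between the two search feature streams.
        self.rgb_from_tir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.tir_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_tir = nn.LayerNorm(dim)

    def forward(self, z_rgb, z_tir, x_rgb, x_tir):
        # z_*: template tokens (B, Nz, C); x_*: search tokens (B, Nx, C).
        # Templates serve only as keys/values, so background noise from
        # the search region never flows back into them.
        x_rgb = x_rgb + self.s2t_rgb(x_rgb, z_rgb, z_rgb, need_weights=False)[0]
        x_tir = x_tir + self.s2t_tir(x_tir, z_tir, z_tir, need_weights=False)[0]
        # Mutual enhancement of the RGB and TIR search streams.
        x_rgb2 = x_rgb + self.rgb_from_tir(x_rgb, x_tir, x_tir, need_weights=False)[0]
        x_tir2 = x_tir + self.tir_from_rgb(x_tir, x_rgb, x_rgb, need_weights=False)[0]
        # Templates are returned unchanged, matching the unidirectional design.
        return z_rgb, z_tir, self.norm_rgb(x_rgb2), self.norm_tir(x_tir2)


if __name__ == "__main__":
    block = UCFBlock()
    z = torch.randn(2, 64, 256)   # template tokens per modality
    x = torch.randn(2, 256, 256)  # search tokens per modality
    _, _, x_r, x_t = block(z, z.clone(), x, x.clone())
    print(x_r.shape, x_t.shape)   # torch.Size([2, 256, 256]) twice

In the paper this kind of block is interleaved with several ViT backbone layers, so feature extraction and cross-modal fusion happen jointly rather than as a separate post-hoc fusion stage.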