TY - JOUR
T1 - Addressing Information Inequality for Text-Based Person Search via Pedestrian-Centric Visual Denoising and Bias-Aware Alignments
AU - Gao, Liying
AU - Niu, Kai
AU - Jiao, Bingliang
AU - Wang, Peng
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
AB - Text-based person search is an important task in video surveillance that aims to retrieve pedestrian images corresponding to a given natural-language description. In this fine-grained retrieval task, accurate cross-modal information matching is essential yet challenging. However, existing methods usually ignore the information inequality between modalities, which can greatly hinder cross-modal matching. Specifically, images inevitably contain pedestrian-irrelevant noise such as background and occlusion, while descriptions may be biased toward only part of the pedestrian content in images. With this in mind, we propose a Text-Guided Denoising and Alignment (TGDA) model to alleviate the information inequality and realize effective cross-modal matching. In TGDA, we first design a prototype-based denoising module, which integrates pedestrian knowledge from textual features into a prototype vector and uses it as guidance to filter pedestrian-irrelevant noise out of the visual features. Thereafter, a bias-aware alignment module is introduced to guide our model to focus consistently on the description-biased pedestrian content in cross-modal features. Extensive experiments validate the effectiveness of both modules, and TGDA achieves state-of-the-art performance on multiple related benchmarks.
KW - Text-based person search
KW - information inequality
KW - text-guided denoising and alignment
UR - http://www.scopus.com/inward/record.url?scp=85159846411&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3273719
DO - 10.1109/TCSVT.2023.3273719
M3 - Article
AN - SCOPUS:85159846411
SN - 1051-8215
VL - 33
SP - 7884
EP - 7899
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 12
ER -
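
The abstract's prototype-based denoising module can be pictured with a minimal sketch. The PyTorch code below is not the authors' implementation; it is a hypothetical illustration of the general idea described above: attention-pool token-level textual features into a single pedestrian prototype vector, then use that prototype to gate pedestrian-irrelevant regions (background, occlusion) out of the visual features. The class name TextGuidedDenoising, the pooling scheme, and all dimensions are assumptions for illustration only.

# Hypothetical sketch of text-guided prototype denoising (not the authors' code).
# Idea from the abstract: distill textual pedestrian knowledge into one prototype
# vector, then use it to suppress pedestrian-irrelevant visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedDenoising(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim))  # learnable pooling query
        self.gate = nn.Linear(dim, 1)                   # relevance score per region

    def forward(self, text_tokens, visual_regions):
        # text_tokens:    (B, Lt, D) token-level textual features
        # visual_regions: (B, Lv, D) region-level visual features
        # 1) Attention-pool the text tokens into a pedestrian prototype (B, D).
        scores = text_tokens @ self.query.t() / text_tokens.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=1)             # (B, Lt, 1)
        prototype = (attn * text_tokens).sum(dim=1)     # (B, D)
        # 2) Score each visual region against the prototype and gate it, softly
        #    filtering out background/occlusion regions before alignment.
        sim = F.cosine_similarity(visual_regions, prototype.unsqueeze(1), dim=-1)
        gates = torch.sigmoid(self.gate(visual_regions)).squeeze(-1) * sim.clamp(min=0)
        denoised = visual_regions * gates.unsqueeze(-1)
        return denoised, prototype

A toy invocation, with batch size 2, 24 text tokens, and 48 visual regions:

m = TextGuidedDenoising(dim=512)
denoised, proto = m(torch.randn(2, 24, 512), torch.randn(2, 48, 512))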