Abstract
Text-based person search is an important task in video surveillance that aims to retrieve the pedestrian images matching a given textual description. In this fine-grained retrieval task, accurate cross-modal information matching is essential yet challenging. Existing methods usually ignore the information inequality between modalities, which can make cross-modal matching considerably harder. Specifically, the images inevitably contain pedestrian-irrelevant noise such as background and occlusion, while the descriptions may be biased toward only part of the pedestrian content in the images. To address this, we propose a Text-Guided Denoising and Alignment (TGDA) model that alleviates the information inequality and realizes effective cross-modal matching. In TGDA, we first design a prototype-based denoising module, which integrates pedestrian knowledge from textual features into a prototype vector and uses it as guidance to filter pedestrian-irrelevant noise out of the visual features. We then introduce a bias-aware alignment module, which guides the model to focus consistently on the description-biased pedestrian content in the cross-modal features. Extensive experiments validate the effectiveness of both modules. In addition, TGDA achieves state-of-the-art performance on various related benchmarks.
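The prototype-guided filtering idea described in the abstract can be illustrated with a small sketch. This is not the TGDA implementation: the abstract does not specify how the prototype is built or how it filters visual features, so the mean pooling, cosine similarity, softmax weighting, the `temperature` value, and all function names below are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_prototype(token_feats):
    """Mean-pool text token features into one prototype vector.

    Assumption: mean pooling stands in for whatever aggregation the paper uses.
    """
    n = len(token_feats)
    return [sum(col) / n for col in zip(*token_feats)]


def denoise(patch_feats, prototype, temperature=0.1):
    """Down-weight visual patches that are dissimilar to the text prototype.

    Assumption: softmax over cosine similarities is a hypothetical filter,
    chosen only to show the 'text guides visual filtering' idea.
    """
    sims = [cosine(p, prototype) / temperature for p in patch_feats]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prototype)
    pooled = [sum(w * p[d] for w, p in zip(weights, patch_feats)) for d in range(dim)]
    return weights, pooled


# Toy features: patches 0 and 2 resemble the text, patch 1 acts as background.
text_tokens = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0]]
patches = [[0.9, 0.0, 0.1], [0.0, 1.0, 0.0], [0.7, 0.2, 0.1]]

proto = text_prototype(text_tokens)
weights, pooled = denoise(patches, proto)
```

In this toy setup the background-like patch receives the smallest weight, so the pooled visual feature is dominated by the patches that agree with the description.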
| Original language | English |
|---|---|
| Pages (from-to) | 7884-7899 |
| Number of pages | 16 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 33 |
| Issue number | 12 |
| State | Published - 1 Dec 2023 |
Keywords
- Text-based person search
- information inequality
- text-guided denoising and alignment
Article title: Addressing Information Inequality for Text-Based Person Search via Pedestrian-Centric Visual Denoising and Bias-Aware Alignments