Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division

Zhimin Wei, Zhipeng Zhang, Peng Wu, Ji Wang, Peng Wang, Yanning Zhang

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Text-based person retrieval aims to find the target pedestrian image in video surveillance footage or a large image database given a text description. Previous works have recognized the importance of mining local information in images and descriptions and performing fine-grained alignment. These approaches rely on hard division or auxiliary networks to locate local visual regions. However, both strategies are not flexible enough to handle diverse images and may even introduce noise. Meanwhile, vision-language pre-training models such as CLIP exhibit strong generalization and zero-shot abilities, which offer a viable way to address this issue. In this paper, we propose a novel Fine-Granularity Alignment model with Semantics-Centric Visual Division (SCVD). Our method comprises a Semantics Deconstructor (SD), a Cross-modal Guided Interaction (CGI) module, and a Dynamic Focus Alignment (DFA) module. The SD extracts fine-grained semantic prompts from the raw description in a form that is easy for CLIP to understand. In CGI, we propose a Text-Guided Visual Localization (TVL) module, which generates local visual representations according to the semantic prompts, and a Vision-Guided Semantics Reconstruction (VSR) module, which integrates the prompts into the textual representation. Finally, the DFA aligns fine-grained visual and textual information. Extensive experiments demonstrate that the proposed framework significantly outperforms current state-of-the-art methods on three benchmarks, with absolute Rank@1 gains of 6.56%, 8.93%, and 11.53%, respectively. Our code is available at https://github.com/tujun233/SCVD.git.
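The abstract reports its results as absolute gains in Rank@1 but does not define the metric. The sketch below shows one common way Rank@k is computed for text-to-image person retrieval, assuming a precomputed text-to-image similarity matrix and integer identity labels. The function name `rank_at_k`, the toy scores, and the identity labels are illustrative assumptions and are not taken from the paper or its repository.

```python
from typing import Sequence


def rank_at_k(
    similarity: Sequence[Sequence[float]],
    query_ids: Sequence[int],
    gallery_ids: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of text queries whose top-k retrieved images contain the right identity."""
    hits = 0
    for scores, query_id in zip(similarity, query_ids):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # A query counts as a hit if any of its top-k images shares its identity.
        if any(gallery_ids[j] == query_id for j in ranked[:k]):
            hits += 1
    return hits / len(query_ids)


# Toy data (hypothetical): 3 text queries scored against 4 gallery images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: top image is index 0 (identity 7), a hit
    [0.2, 0.4, 0.8, 0.1],  # query 1: top image is index 2 (identity 9), a miss
    [0.1, 0.2, 0.3, 0.7],  # query 2: top image is index 3 (identity 9), a hit
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9, 9]

rank1 = rank_at_k(similarity, query_ids, gallery_ids, k=1)  # 2 of 3 queries hit
rank2 = rank_at_k(similarity, query_ids, gallery_ids, k=2)  # all 3 queries hit
```

Under this definition, an absolute gain is the difference between two Rank@1 values in percentage points, not a relative improvement over the baseline.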

Original language: English
Pages (from-to): 8242-8252
Number of pages: 11
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 9
State: Published - 2024

Keywords

  • Text-to-image retrieval
  • vision-language pre-training
