TY - JOUR
T1 - Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division
AU - Wei, Zhimin
AU - Zhang, Zhipeng
AU - Wu, Peng
AU - Wang, Ji
AU - Wang, Peng
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Text-based Person Retrieval aims to retrieve the target pedestrian image from video surveillance footage or a large image database given a text description. Previous works have recognized the importance of mining local information in images and descriptions and performing fine-grained alignment. These approaches locate local visual regions either by hard division or with auxiliary networks. However, neither strategy is flexible enough to handle diverse images, and both may even introduce noise. Meanwhile, vision-language pre-training models such as CLIP exhibit strong generalization and zero-shot abilities, offering a promising way to address this issue. In this paper, we propose a novel Fine-Granularity Alignment model with Semantics-Centric Visual Division (SCVD). Our method contains a Semantics Deconstructor (SD), a Cross-modal Guided Interaction (CGI) module, and a Dynamic Focus Alignment (DFA) module. The SD extracts fine-grained semantic prompts, which are easy for CLIP to understand, from the raw description. In CGI, we propose a Text-Guided Visual Localization (TVL) module to generate local visual representations according to the semantic prompts and a Vision-Guided Semantics Reconstruction (VSR) module to integrate the prompts into the textual representation. Finally, the DFA aligns fine-grained vision-text information. Extensive experiments demonstrate that our framework significantly outperforms current state-of-the-art methods, improving the Rank@1 metric on three benchmarks by absolute gains of 6.56%, 8.93%, and 11.53%, respectively. Our code is available at https://github.com/tujun233/SCVD.git.
AB - Text-based Person Retrieval aims to retrieve the target pedestrian image from video surveillance footage or a large image database given a text description. Previous works have recognized the importance of mining local information in images and descriptions and performing fine-grained alignment. These approaches locate local visual regions either by hard division or with auxiliary networks. However, neither strategy is flexible enough to handle diverse images, and both may even introduce noise. Meanwhile, vision-language pre-training models such as CLIP exhibit strong generalization and zero-shot abilities, offering a promising way to address this issue. In this paper, we propose a novel Fine-Granularity Alignment model with Semantics-Centric Visual Division (SCVD). Our method contains a Semantics Deconstructor (SD), a Cross-modal Guided Interaction (CGI) module, and a Dynamic Focus Alignment (DFA) module. The SD extracts fine-grained semantic prompts, which are easy for CLIP to understand, from the raw description. In CGI, we propose a Text-Guided Visual Localization (TVL) module to generate local visual representations according to the semantic prompts and a Vision-Guided Semantics Reconstruction (VSR) module to integrate the prompts into the textual representation. Finally, the DFA aligns fine-grained vision-text information. Extensive experiments demonstrate that our framework significantly outperforms current state-of-the-art methods, improving the Rank@1 metric on three benchmarks by absolute gains of 6.56%, 8.93%, and 11.53%, respectively. Our code is available at https://github.com/tujun233/SCVD.git.
KW - Text-to-image retrieval
KW - vision-language pre-training
UR - http://www.scopus.com/inward/record.url?scp=85191341636&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2024.3392831
DO - 10.1109/TCSVT.2024.3392831
M3 - Article
AN - SCOPUS:85191341636
SN - 1051-8215
VL - 34
SP - 8242
EP - 8252
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 9
ER -