Cross-modal Co-occurrence Attributes Alignments for Person Search by Language

  • Kai Niu
  • , Linjiang Huang
  • , Yan Huang
  • , Peng Wang
  • , Liang Wang
  • , Yanning Zhang

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

30 Scopus citations

Abstract

Person search by language refers to retrieving the interested pedestrian images based on a free-form natural language description, which has important applications in smart video surveillance. Although great efforts have been made to align images with sentences, the challenge of reporting bias, i.e., attributes are only partially matched across modalities, still incurs large noise and influences the accurate retrieval seriously. To address this challenge, we propose a novel cross-modal matching method named Cross-modal Co-occurrence Attributes Alignments (C2A2), which can better deal with noise and obtain significant improvements in retrieval performance for person search by language. First, we construct visual and textual attribute dictionaries relying on matrix decomposition, and carry out cross-modal alignments using denoising reconstruction features to address the noise from pedestrian-unrelated elements. Second, we re-gather pixels of image and words of sentence under the guidance of learned attribute dictionaries, to adaptively constitute more discriminative co-occurrence attributes in both modalities. And the re-gathered co-occurrence attributes are carefully captured by imposing explicit cross-modal one-to-one alignments which consider relations across modalities, better alleviating the noise from non-correspondence attributes. The whole C_2A_2 method can be trained end-to-end without any pre-processing, i.e., requiring negligible additional computation overheads. It significantly outperforms the existing solutions, and finally achieves the new state-of-the-art retrieval performance on two large-scale benchmarks, CUHK-PEDES and RSTPReid datasets.

Original languageEnglish
Title of host publicationMM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages4426-4434
Number of pages9
ISBN (Electronic)9781450392037
DOIs
StatePublished - 10 Oct 2022
Event30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal
Duration: 10 Oct 202214 Oct 2022

Publication series

NameMM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference30th ACM International Conference on Multimedia, MM 2022
Country/TerritoryPortugal
CityLisboa
Period10/10/2214/10/22

Keywords

  • cross-modal retrieval
  • matrix decomposition
  • person search by language

Fingerprint

Dive into the research topics of 'Cross-modal Co-occurrence Attributes Alignments for Person Search by Language'. Together they form a unique fingerprint.

Cite this