TY - GEN
T1 - Cross-modal Co-occurrence Attributes Alignments for Person Search by Language
AU - Niu, Kai
AU - Huang, Linjiang
AU - Huang, Yan
AU - Wang, Peng
AU - Wang, Liang
AU - Zhang, Yanning
N1 - Publisher Copyright: © 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
AB - Person search by language refers to retrieving the interested pedestrian images based on a free-form natural language description, which has important applications in smart video surveillance. Although great efforts have been made to align images with sentences, the challenge of reporting bias, i.e., attributes are only partially matched across modalities, still incurs large noise and influences the accurate retrieval seriously. To address this challenge, we propose a novel cross-modal matching method named Cross-modal Co-occurrence Attributes Alignments (C2A2), which can better deal with noise and obtain significant improvements in retrieval performance for person search by language. First, we construct visual and textual attribute dictionaries relying on matrix decomposition, and carry out cross-modal alignments using denoising reconstruction features to address the noise from pedestrian-unrelated elements. Second, we re-gather pixels of image and words of sentence under the guidance of learned attribute dictionaries, to adaptively constitute more discriminative co-occurrence attributes in both modalities. And the re-gathered co-occurrence attributes are carefully captured by imposing explicit cross-modal one-to-one alignments which consider relations across modalities, better alleviating the noise from non-correspondence attributes. The whole C2A2 method can be trained end-to-end without any pre-processing, i.e., requiring negligible additional computation overheads. It significantly outperforms the existing solutions, and finally achieves the new state-of-the-art retrieval performance on two large-scale benchmarks, CUHK-PEDES and RSTPReid datasets.
KW - cross-modal retrieval
KW - matrix decomposition
KW - person search by language
UR - https://www.scopus.com/pages/publications/85151142033
U2 - 10.1145/3503161.3547753
DO - 10.1145/3503161.3547753
M3 - Conference contribution
AN - SCOPUS:85151142033
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 4426
EP - 4434
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 30th ACM International Conference on Multimedia, MM 2022
Y2 - 10 October 2022 through 14 October 2022
ER -