TY - JOUR
T1 - Improving Inconspicuous Attributes Modeling for Person Search by Language
AU - Niu, Kai
AU - Huang, Tao
AU - Huang, Linjiang
AU - Wang, Liang
AU - Zhang, Yanning
N1 - Publisher Copyright: © 1992-2012 IEEE.
PY - 2023
Y1 - 2023
AB - Person search by language aims to retrieve pedestrian images of interest based on natural language sentences. Although great efforts have been made to address cross-modal heterogeneity, most current solutions capture only salient attributes while ignoring inconspicuous ones, and are therefore weak at distinguishing very similar pedestrians. In this work, we propose the Adaptive Salient Attribute Mask Network (ASAMN), which adaptively masks the salient attributes for cross-modal alignment and thereby induces the model to also focus on inconspicuous attributes. Specifically, we consider the uni-modal and cross-modal relations for masking salient attributes in the Uni-modal Salient Attribute Mask (USAM) and Cross-modal Salient Attribute Mask (CSAM) modules, respectively. The Attribute Modeling Balance (AMB) module then randomly selects a proportion of masked features for cross-modal alignment, balancing the modeling capacity for salient and inconspicuous attributes. Extensive experiments and analyses validate the effectiveness and generalization capacity of the proposed ASAMN method, which achieves state-of-the-art retrieval performance on the widely used CUHK-PEDES and ICFG-PEDES benchmarks.
KW - cross-modal retrieval
KW - Person search by language
KW - smart video surveillance
UR - http://www.scopus.com/inward/record.url?scp=85162635724&partnerID=8YFLogxK
U2 - 10.1109/TIP.2023.3285426
DO - 10.1109/TIP.2023.3285426
M3 - Article
C2 - 37310815
AN - SCOPUS:85162635724
SN - 1057-7149
VL - 32
SP - 3429
EP - 3441
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -