Improving Inconspicuous Attributes Modeling for Person Search by Language

Kai Niu; Tao Huang; Linjiang Huang; Liang Wang; Yanning Zhang

doi:10.1109/TIP.2023.3285426

School of Computer Science

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

Person search by language aims to retrieve the interested pedestrian images based on natural language sentences. Although great efforts have been made to address the cross-modal heterogeneity, most of the current solutions suffer from only capturing salient attributes while ignoring inconspicuous ones, being weak in distinguishing very similar pedestrians. In this work, we propose the Adaptive Salient Attribute Mask Network (ASAMN) to adaptively mask the salient attributes for cross-modal alignments, and therefore induce the model to simultaneously focus on inconspicuous attributes. Specifically, we consider the uni-modal and cross-modal relations for masking salient attributes in the Uni-modal Salient Attribute Mask (USAM) and Cross-modal Salient Attribute Mask (CSAM) modules, respectively. Then the Attribute Modeling Balance (AMB) module is presented to randomly select a proportion of masked features for cross-modal alignments, ensuring the balance of modeling capacity of both salient attributes and inconspicuous ones. Extensive experiments and analyses have been carried out to validate the effectiveness and generalization capacity of our proposed ASAMN method, and we have obtained the state-of-the-art retrieval performance on the widely-used CUHK-PEDES and ICFG-PEDES benchmarks.

Original language	English
Pages (from-to)	3429-3441
Number of pages	13
Journal	IEEE Transactions on Image Processing
Volume	32
DOI	https://doi.org/10.1109/TIP.2023.3285426
Publication status	Published - 2023

Cite this

@article{21ca01f19f514e4cae9fd8c88e7376ec,

title = "Improving Inconspicuous Attributes Modeling for Person Search by Language",

abstract = "Person search by language aims to retrieve the interested pedestrian images based on natural language sentences. Although great efforts have been made to address the cross-modal heterogeneity, most of the current solutions suffer from only capturing salient attributes while ignoring inconspicuous ones, being weak in distinguishing very similar pedestrians. In this work, we propose the Adaptive Salient Attribute Mask Network (ASAMN) to adaptively mask the salient attributes for cross-modal alignments, and therefore induce the model to simultaneously focus on inconspicuous attributes. Specifically, we consider the uni-modal and cross-modal relations for masking salient attributes in the Uni-modal Salient Attribute Mask (USAM) and Cross-modal Salient Attribute Mask (CSAM) modules, respectively. Then the Attribute Modeling Balance (AMB) module is presented to randomly select a proportion of masked features for cross-modal alignments, ensuring the balance of modeling capacity of both salient attributes and inconspicuous ones. Extensive experiments and analyses have been carried out to validate the effectiveness and generalization capacity of our proposed ASAMN method, and we have obtained the state-of-the-art retrieval performance on the widely-used CUHK-PEDES and ICFG-PEDES benchmarks.",

keywords = "cross-modal retrieval, Person search by language, smart video surveillance",

author = "Kai Niu and Tao Huang and Linjiang Huang and Liang Wang and Yanning Zhang",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2023",

doi = "10.1109/TIP.2023.3285426",

language = "English",

volume = "32",

pages = "3429--3441",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Improving Inconspicuous Attributes Modeling for Person Search by Language

AU - Niu, Kai

AU - Huang, Tao

AU - Huang, Linjiang

AU - Wang, Liang

AU - Zhang, Yanning

PY - 2023

Y1 - 2023

N2 - Person search by language aims to retrieve the interested pedestrian images based on natural language sentences. Although great efforts have been made to address the cross-modal heterogeneity, most of the current solutions suffer from only capturing salient attributes while ignoring inconspicuous ones, being weak in distinguishing very similar pedestrians. In this work, we propose the Adaptive Salient Attribute Mask Network (ASAMN) to adaptively mask the salient attributes for cross-modal alignments, and therefore induce the model to simultaneously focus on inconspicuous attributes. Specifically, we consider the uni-modal and cross-modal relations for masking salient attributes in the Uni-modal Salient Attribute Mask (USAM) and Cross-modal Salient Attribute Mask (CSAM) modules, respectively. Then the Attribute Modeling Balance (AMB) module is presented to randomly select a proportion of masked features for cross-modal alignments, ensuring the balance of modeling capacity of both salient attributes and inconspicuous ones. Extensive experiments and analyses have been carried out to validate the effectiveness and generalization capacity of our proposed ASAMN method, and we have obtained the state-of-the-art retrieval performance on the widely-used CUHK-PEDES and ICFG-PEDES benchmarks.

AB - Person search by language aims to retrieve the interested pedestrian images based on natural language sentences. Although great efforts have been made to address the cross-modal heterogeneity, most of the current solutions suffer from only capturing salient attributes while ignoring inconspicuous ones, being weak in distinguishing very similar pedestrians. In this work, we propose the Adaptive Salient Attribute Mask Network (ASAMN) to adaptively mask the salient attributes for cross-modal alignments, and therefore induce the model to simultaneously focus on inconspicuous attributes. Specifically, we consider the uni-modal and cross-modal relations for masking salient attributes in the Uni-modal Salient Attribute Mask (USAM) and Cross-modal Salient Attribute Mask (CSAM) modules, respectively. Then the Attribute Modeling Balance (AMB) module is presented to randomly select a proportion of masked features for cross-modal alignments, ensuring the balance of modeling capacity of both salient attributes and inconspicuous ones. Extensive experiments and analyses have been carried out to validate the effectiveness and generalization capacity of our proposed ASAMN method, and we have obtained the state-of-the-art retrieval performance on the widely-used CUHK-PEDES and ICFG-PEDES benchmarks.

KW - cross-modal retrieval

KW - Person search by language

KW - smart video surveillance

UR - http://www.scopus.com/inward/record.url?scp=85162635724&partnerID=8YFLogxK

U2 - 10.1109/TIP.2023.3285426

DO - 10.1109/TIP.2023.3285426

M3 - Article

C2 - 37310815

AN - SCOPUS:85162635724

SN - 1057-7149

VL - 32

SP - 3429

EP - 3441

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

ER -
