TY - JOUR
T1 - Disentangled Representation Learning for Cross-Modal Biometric Matching
AU - Ning, Hailong
AU - Zheng, Xiangtao
AU - Lu, Xiaoqiang
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2022
Y1 - 2022
AB - Cross-modal biometric matching (CMBM) aims to determine the corresponding voice from a face, or to identify the corresponding face from a voice. Recently, many CMBM methods have been proposed that force the distance between the two modal features to be narrowed. However, these methods ignore the alignability of the two modal features. Because each feature is extracted under the supervision of identity information from a single modality, it can only reflect the identity information of that modality. To address this problem, a disentangled representation learning method is proposed to disentangle the alignable latent identity factors from the nonalignable modality-dependent factors for CMBM. The proposed method consists of two main steps: 1) feature extraction and 2) disentangled representation learning. First, an image feature extraction network is adopted to obtain face features, and a voice feature extraction network is applied to learn voice features. Second, a disentangled latent variable is explored to separate the latent identity factors that are shared across the modalities from the modality-dependent factors. The modality-dependent factors are filtered out, while the latent identity factors from the two modalities are drawn together to align the same identity information. The disentangled latent identity factors are then treated as pure identity information to bridge the two modalities for cross-modal verification, 1:N matching, and retrieval. Note that the proposed method learns the identity information from the input face images and voice segments with only identity labels as supervision. Extensive experiments on the challenging VoxCeleb dataset demonstrate that the proposed method outperforms the state-of-the-art methods.
KW - Cross-modal biometric matching
KW - disentangled representation learning
KW - latent identity factors
KW - modality-dependent factors
UR - http://www.scopus.com/inward/record.url?scp=85104244361&partnerID=8YFLogxK
U2 - 10.1109/TMM.2021.3071243
DO - 10.1109/TMM.2021.3071243
M3 - Article
AN - SCOPUS:85104244361
SN - 1520-9210
VL - 24
SP - 1763
EP - 1774
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -