Audio-Visual Kinship Verification: A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach

Xiaoting Wu; Xueyi Zhang; Xiaoyi Feng; Miguel Bordallo Lopez; Li Liu

doi:10.1109/TCYB.2022.3220040

School of Electronics and Information

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications. Over the past decade, much effort has been devoted to improving verification performance from human faces alone, without other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, that consists of talking-face videos of family members under various scenarios. Based on the dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML). It consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. We also evaluate human verification ability on a subset of TALKIN-Family. The results indicate that humans achieve higher accuracy when they have access to both faces and voices. The machine-learning methods can outperform human ability effectively and efficiently. Finally, we discuss future work and research opportunities with the TALKIN-Family dataset.

Original language	English
Pages (from-to)	1523-1536
Number of pages	14
Journal	IEEE Transactions on Cybernetics
Volume	54
Issue number	3
DOIs	https://doi.org/10.1109/TCYB.2022.3220040
State	Published - 1 Mar 2024

Keywords

Adversarial learning
audio-visual
benchmark dataset
kinship verification
multimodal fusion

Cite this

@article{117c56c1d9424d3aa74f6d442e7d1d6f,
title = "Audio-Visual Kinship Verification: A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach",
abstract = "Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications. Over the past decade, much effort has been devoted to improving verification performance from human faces alone, without other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, that consists of talking-face videos of family members under various scenarios. Based on the dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML). It consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. We also evaluate human verification ability on a subset of TALKIN-Family. The results indicate that humans achieve higher accuracy when they have access to both faces and voices. The machine-learning methods can outperform human ability effectively and efficiently. Finally, we discuss future work and research opportunities with the TALKIN-Family dataset.",
keywords = "Adversarial learning, audio-visual, benchmark dataset, kinship verification, multimodal fusion",
author = "Xiaoting Wu and Xueyi Zhang and Xiaoyi Feng and {Bordallo Lopez}, Miguel and Li Liu",
note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
year = "2024",
month = mar,
day = "1",
doi = "10.1109/TCYB.2022.3220040",
language = "English",
volume = "54",
pages = "1523--1536",
journal = "IEEE Transactions on Cybernetics",
issn = "2168-2267",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "3",
}

TY  - JOUR
T1  - Audio-Visual Kinship Verification
T2  - A New Dataset and a Unified Adaptive Adversarial Multimodal Learning Approach
AU  - Wu, Xiaoting
AU  - Zhang, Xueyi
AU  - Feng, Xiaoyi
AU  - Bordallo Lopez, Miguel
AU  - Liu, Li
PY  - 2024/3/1
Y1  - 2024/3/1
AB  - Facial kinship verification refers to automatically determining whether two people have a kin relation from their faces. It has become a popular research topic due to potential practical applications. Over the past decade, much effort has been devoted to improving verification performance from human faces alone, without other biometric information such as the speaking voice. In this article, to interpret and benefit from multiple modalities, we propose for the first time to combine human faces and voices to verify kinship, which we refer to as audio-visual kinship verification. We first establish a comprehensive audio-visual kinship dataset, called TALKIN-Family, that consists of talking-face videos of family members under various scenarios. Based on the dataset, we present an extensive evaluation of kinship verification from faces and voices. In particular, we propose a deep-learning-based fusion method, called unified adaptive adversarial multimodal learning (UAAML). It consists of an adversarial network and an attention module built on unified multimodal features. Experiments show that audio (voice) information is complementary to facial features and useful for the kinship verification problem. Furthermore, the proposed fusion method outperforms baseline methods. We also evaluate human verification ability on a subset of TALKIN-Family. The results indicate that humans achieve higher accuracy when they have access to both faces and voices. The machine-learning methods can outperform human ability effectively and efficiently. Finally, we discuss future work and research opportunities with the TALKIN-Family dataset.
KW  - Adversarial learning
KW  - audio-visual
KW  - benchmark dataset
KW  - kinship verification
KW  - multimodal fusion
UR  - http://www.scopus.com/inward/record.url?scp=85144056155&partnerID=8YFLogxK
U2  - 10.1109/TCYB.2022.3220040
DO  - 10.1109/TCYB.2022.3220040
M3  - Article
C2  - 36417714
AN  - SCOPUS:85144056155
SN  - 2168-2267
VL  - 54
SP  - 1523
EP  - 1536
JO  - IEEE Transactions on Cybernetics
JF  - IEEE Transactions on Cybernetics
IS  - 3
ER  -
