Feature fusion for multimodal emotion recognition based on deep canonical correlation analysis

Ke Zhang; Yuanqing Li; Jingyu Wang; Zhen Wang; Xuelong Li

doi:10.1109/LSP.2021.3112314

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

32 Citations (Scopus)

Abstract

Fusing multimodal features is an important problem in video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal problems, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, four utterances are gathered as a new group, and each utterance contains text, audio, and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features. Deep canonical correlation analysis based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.
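As a rough illustration of the quantity that canonical correlation analysis maximizes, the following is a minimal sketch of classical linear CCA on two feature views. It is not the MERDCCA model: the abstract does not specify the network or loss, so the function name `canonical_correlations`, the regularization value `reg`, and the synthetic `text_feat` and `audio_feat` arrays are all assumptions made for illustration. In a deep variant, the two inputs would be the outputs of learned networks and the sum of these correlations would typically serve as the training objective.

```python
import numpy as np


def canonical_correlations(x, y, reg=1e-4):
    """Return the canonical correlations between two views.

    x, y: arrays of shape (n_samples, dim_x) and (n_samples, dim_y).
    reg:  small ridge term (an assumed value) that keeps the
          covariance matrices invertible.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)  # center each view
    yc = y - y.mean(axis=0)

    # Regularized within-view covariances and the cross-covariance.
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False)


# Synthetic example: two "modalities" driven by a shared 2-D latent signal.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 2))
text_feat = shared @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
audio_feat = shared @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(500, 3))

corrs = canonical_correlations(text_feat, audio_feat)
print(corrs)
```

Because both synthetic views depend on the same latent signal, the leading correlations should be close to 1, while unrelated features would give values near 0.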

Original language	English
Pages (from-to)	1898-1902
Number of pages	5
Journal	IEEE Signal Processing Letters
Volume	28
DOI	https://doi.org/10.1109/LSP.2021.3112314
Publication status	Published - 2021

Cite this

@article{0d11b451dd3c43a4bf32b220a21ae105,

title = "Feature fusion for multimodal emotion recognition based on deep canonical correlation analysis",

abstract = "Fusing multimodal features is an important problem in video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal problems, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, four utterances are gathered as a new group, and each utterance contains text, audio, and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features. Deep canonical correlation analysis based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.",

keywords = "Deep canonical correlation analysis, gated recurrent unit, multimodal emotion recognition",

author = "Ke Zhang and Yuanqing Li and Jingyu Wang and Zhen Wang and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 1994-2012 IEEE.",

year = "2021",

doi = "10.1109/LSP.2021.3112314",

language = "English",

volume = "28",

pages = "1898--1902",

journal = "IEEE Signal Processing Letters",

issn = "1070-9908",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Feature fusion for multimodal emotion recognition based on deep canonical correlation analysis

AU - Zhang, Ke

AU - Li, Yuanqing

AU - Wang, Jingyu

AU - Wang, Zhen

AU - Li, Xuelong

PY - 2021

Y1 - 2021

N2 - Fusing multimodal features is an important problem in video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal problems, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, four utterances are gathered as a new group, and each utterance contains text, audio, and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features. Deep canonical correlation analysis based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.

AB - Fusing multimodal features is an important problem in video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal problems, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, four utterances are gathered as a new group, and each utterance contains text, audio, and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features. Deep canonical correlation analysis based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.

KW - Deep canonical correlation analysis

KW - gated recurrent unit

KW - multimodal emotion recognition

UR - http://www.scopus.com/inward/record.url?scp=85116324714&partnerID=8YFLogxK

U2 - 10.1109/LSP.2021.3112314

DO - 10.1109/LSP.2021.3112314

M3 - Article

AN - SCOPUS:85116324714

SN - 1070-9908

VL - 28

SP - 1898

EP - 1902

JO - IEEE Signal Processing Letters

JF - IEEE Signal Processing Letters

ER -
