TY - JOUR
T1 - Feature fusion for multimodal emotion recognition based on deep canonical correlation analysis
AU - Zhang, Ke
AU - Li, Yuanqing
AU - Wang, Jingyu
AU - Wang, Zhen
AU - Li, Xuelong
N1 - Publisher Copyright:
© 1994-2012 IEEE.
PY - 2021
Y1 - 2021
N2 - Fusion of multimodal features is a key problem for video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal tasks, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, every four utterances are gathered into a new group, and each utterance contains text, audio and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features. A deep canonical correlation analysis module based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.
AB - Fusion of multimodal features is a key problem for video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal tasks, multimodal analysis requires finding the correlations between different modalities, which is as important as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, every four utterances are gathered into a new group, and each utterance contains text, audio and visual information as multimodal input. Gated recurrent unit layers are used to extract the unimodal features. A deep canonical correlation analysis module based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.
KW - Deep canonical correlation analysis
KW - gated recurrent unit
KW - multimodal emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85116324714&partnerID=8YFLogxK
U2 - 10.1109/LSP.2021.3112314
DO - 10.1109/LSP.2021.3112314
M3 - Article
AN - SCOPUS:85116324714
SN - 1070-9908
VL - 28
SP - 1898
EP - 1902
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
ER -