TY - GEN
T1 - Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment
AU - Zhao, Zhixian
AU - Chen, Haifeng
AU - Li, Xi
AU - Jiang, Dongmei
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
N2 - Multimodal Emotion Recognition (MER) aims to automatically identify and understand human emotional states by integrating information from various modalities. However, the scarcity of annotated multimodal data significantly hinders the advancement of this research field. This paper presents our solution for the MER-SEMI sub-challenge of MER 2024. First, to better adapt acoustic modality features for the MER task, we experimentally evaluate the contributions of different layers of the pre-trained speech model HuBERT in emotion recognition. Based on these observations, we perform Parameter-Efficient Fine-Tuning (PEFT) on the layers identified as most effective for emotion recognition tasks, thereby achieving optimal adaptation for emotion recognition with a minimal number of learnable parameters. Second, leveraging the strengths of the acoustic modality, we propose a feature alignment pre-training method. This approach uses large-scale unlabeled data to train a visual encoder, thereby promoting the semantic alignment of visual features within the acoustic feature space. Finally, using the adapted acoustic features, aligned visual features, and lexical features, we employ an attention mechanism for feature fusion. On the MER2024-SEMI test set, the proposed method achieves a weighted F1 score of 88.90%, ranking fourth among all participating teams, validating the effectiveness of our approach.
AB - Multimodal Emotion Recognition (MER) aims to automatically identify and understand human emotional states by integrating information from various modalities. However, the scarcity of annotated multimodal data significantly hinders the advancement of this research field. This paper presents our solution for the MER-SEMI sub-challenge of MER 2024. First, to better adapt acoustic modality features for the MER task, we experimentally evaluate the contributions of different layers of the pre-trained speech model HuBERT in emotion recognition. Based on these observations, we perform Parameter-Efficient Fine-Tuning (PEFT) on the layers identified as most effective for emotion recognition tasks, thereby achieving optimal adaptation for emotion recognition with a minimal number of learnable parameters. Second, leveraging the strengths of the acoustic modality, we propose a feature alignment pre-training method. This approach uses large-scale unlabeled data to train a visual encoder, thereby promoting the semantic alignment of visual features within the acoustic feature space. Finally, using the adapted acoustic features, aligned visual features, and lexical features, we employ an attention mechanism for feature fusion. On the MER2024-SEMI test set, the proposed method achieves a weighted F1 score of 88.90%, ranking fourth among all participating teams, validating the effectiveness of our approach.
KW - contrastive learning
KW - fine-tuning
KW - multimodal emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85210856347&partnerID=8YFLogxK
U2 - 10.1145/3689092.3689407
DO - 10.1145/3689092.3689407
M3 - Conference contribution
AN - SCOPUS:85210856347
T3 - MRAC 2024 - Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing
SP - 67
EP - 71
BT - MRAC 2024 - Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing
PB - Association for Computing Machinery, Inc
T2 - 2nd International Workshop on Multimodal and Responsible Affective Computing, MRAC 2024
Y2 - 28 October 2024 through 1 November 2024
ER -