TY - JOUR
T1 - MOCOLNet
T2 - A Momentum Contrastive Learning Network for Multimodal Aspect-Level Sentiment Analysis
AU - Mu, Jie
AU - Nie, Feiping
AU - Wang, Wei
AU - Xu, Jian
AU - Zhang, Jing
AU - Liu, Han
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Multimodal aspect-level sentiment analysis has attracted increasing attention in recent years. However, existing methods have two unaddressed limitations: (1) due to the lack of labelled pre-training data dedicated to sentiment analysis, methods that rely on a pre-training stage produce suboptimal prediction results; (2) most existing methods employ a self-attention encoder to fuse multimodal tokens, which not only ignores the alignment relationship between different modal tokens but also prevents the model from capturing the semantic links between images and texts. In this paper, we propose a momentum contrastive learning network (MOCOLNet) to overcome the above limitations. First, we merge the pre-training stage with the training stage to design an end-to-end training scheme that uses less labelled data dedicated to sentiment analysis to obtain better prediction results. Second, we propose a multimodal contrastive learning method to align the different modal representations before data fusion, and design a cross-modal matching strategy to provide semantic interactive information between texts and images. Moreover, we introduce an auxiliary momentum strategy to increase the robustness of the model. We also analyse the effectiveness of the proposed multimodal contrastive learning method using mutual information theory. Experiments verify that the proposed MOCOLNet is superior to other strong baselines.
AB - Multimodal aspect-level sentiment analysis has attracted increasing attention in recent years. However, existing methods have two unaddressed limitations: (1) due to the lack of labelled pre-training data dedicated to sentiment analysis, methods that rely on a pre-training stage produce suboptimal prediction results; (2) most existing methods employ a self-attention encoder to fuse multimodal tokens, which not only ignores the alignment relationship between different modal tokens but also prevents the model from capturing the semantic links between images and texts. In this paper, we propose a momentum contrastive learning network (MOCOLNet) to overcome the above limitations. First, we merge the pre-training stage with the training stage to design an end-to-end training scheme that uses less labelled data dedicated to sentiment analysis to obtain better prediction results. Second, we propose a multimodal contrastive learning method to align the different modal representations before data fusion, and design a cross-modal matching strategy to provide semantic interactive information between texts and images. Moreover, we introduce an auxiliary momentum strategy to increase the robustness of the model. We also analyse the effectiveness of the proposed multimodal contrastive learning method using mutual information theory. Experiments verify that the proposed MOCOLNet is superior to other strong baselines.
KW - Aspect-level sentiment analysis
KW - contrastive learning
KW - multimodal representation learning
UR - http://www.scopus.com/inward/record.url?scp=85182360347&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2023.3345022
DO - 10.1109/TKDE.2023.3345022
M3 - Article
AN - SCOPUS:85182360347
SN - 1041-4347
VL - 36
SP - 8787
EP - 8800
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 12
ER -