Citation: WEN Yuhua, LI Qifei, ZHOU Yingying, et al. Multimodal emotion recognition based on dual alignment and contrastive learning[J]. Journal of Signal Processing, 2025, 41(3): 533-543. DOI: 10.12466/xhcl.2025.03.011.
Emotion recognition plays a crucial role in modern human-computer interaction, affective computing, and immersive virtual reality by enabling computers to automatically identify and classify human emotional states. With advances in multimodal learning, emotion recognition has progressed from traditional unimodal methods to more complex multimodal approaches. Multimodal emotion recognition processes data from multiple sources, such as text, speech, and visual information; because these modalities capture distinct emotional features, combining them allows models to better understand human emotional states. However, significant challenges remain, including temporal misalignment and modality heterogeneity. To address these challenges, this paper presents a multimodal emotion recognition model based on dual alignment and contrastive learning. The proposed model integrates the text, speech, and visual modalities and achieves comprehensive alignment through a dual alignment module: feature-level alignment employs a cross-attention mechanism to establish dynamic temporal alignment, while sample-level alignment uses contrastive learning to align the modalities in feature space. In addition, the model introduces supervised contrastive learning to exploit label information, extracting fine-grained emotional cues and enhancing robustness. Self-attention is further applied to model interactions among multimodal features, improving overall performance. Experimental results on three public datasets show that the proposed model outperforms existing models on most metrics, demonstrating that dual alignment combined with contrastive learning effectively addresses these key challenges in multimodal emotion recognition.
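To make the two alignment ideas in the abstract more concrete, the short PyTorch sketch below shows one plausible form they could take: a cross-attention module in which one modality's sequence attends over another modality's frames (feature-level, temporal alignment), and a supervised contrastive loss that pulls together utterance-level embeddings sharing an emotion label (sample-level alignment). This is an illustrative sketch under assumed shapes and hyperparameters, not the authors' implementation; the class name, function name, embedding dimension, and temperature are all hypothetical.

```python
# Illustrative sketch only: names, shapes, and hyperparameters are assumptions,
# not taken from the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Feature-level alignment: queries from one modality attend over another
    modality's frames, yielding a dynamic temporal alignment instead of a fixed
    word-to-frame mapping."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_seq: torch.Tensor, key_value_seq: torch.Tensor) -> torch.Tensor:
        # query_seq: (B, Lq, dim), e.g. text; key_value_seq: (B, Lkv, dim), e.g. speech
        aligned, _ = self.attn(query_seq, key_value_seq, key_value_seq)
        return aligned  # (B, Lq, dim): each query step summarizes the relevant frames


def supervised_contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Sample-level alignment with label supervision: embeddings sharing an
    emotion label are positives; all other samples in the batch are negatives."""
    z = F.normalize(features, dim=1)                   # (N, dim) unit vectors
    sim = z @ z.t() / temperature                      # pairwise cosine similarities
    self_mask = torch.eye(len(labels), device=features.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float() - self_mask
    exp_sim = torch.exp(sim) * (1.0 - self_mask)       # denominator excludes self-pairs
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)       # avoid division by zero
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()


if __name__ == "__main__":
    # Toy usage with random tensors, just to show the expected shapes.
    align = CrossModalAlignment(dim=128, heads=4)
    text, speech = torch.randn(8, 20, 128), torch.randn(8, 50, 128)
    text_aligned = align(text, speech)                 # (8, 20, 128)
    loss = supervised_contrastive_loss(text_aligned.mean(dim=1),
                                       torch.randint(0, 4, (8,)))
    print(text_aligned.shape, loss.item())
```

In a full model of the kind the abstract describes, the aligned sequences would typically be pooled, fused (for example with self-attention over the modality features), and trained with a contrastive term of this kind added to the emotion classification loss.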