Multimodal Feature Distillation Emotion Recognition Method with Global Convolution and Affinity Fusion
Abstract: To enhance the user experience of human-computer interaction and meet the needs of diverse applications, interactive devices are increasingly incorporating emotional intelligence. Integrating this technology into industry requires the correct recognition of human emotional states, which remains a challenging problem. With the rapid development of the multimedia era, more and more modal information has become available to emotion recognition systems. This paper therefore proposes a multimodal emotion recognition model based on feature distillation. Since emotional expression is often closely related to the global information of the audio signal, an Adaptive Global Convolution (AGC) is proposed to enlarge the effective receptive field, and a Feature Map Importance Analysis (FMIA) module further strengthens emotion-relevant features. An Audio Affinity Fusion (AAF) module models affinity fusion weights from the intrinsic correlation between the audio and text modalities, so that the emotional information of the two modalities is fused effectively. In addition, Feature Distillation (FD) is applied to the multimodal model to improve generalization: the hidden knowledge carried by the probability distribution of the teacher model is exploited to help the student model acquire higher-level semantic features. Finally, the method is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) emotion dataset, achieving a weighted accuracy (WA) of 75.2% and an unweighted accuracy (UA) of 75.8%, demonstrating the effectiveness of the model in improving emotion recognition performance.
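
As a rough illustration of the affinity-fusion idea summarized above, the sketch below (PyTorch) weights audio and text features by a learned affinity score derived from their correlation. The module name AffinityFusion, the projections proj_a/proj_t, and the sigmoid-based affinity score are illustrative assumptions, not the paper's exact AAF formulation.

    import torch
    import torch.nn as nn

    class AffinityFusion(nn.Module):
        """Illustrative affinity-weighted fusion of audio and text features."""
        def __init__(self, d):
            super().__init__()
            self.proj_a = nn.Linear(d, d)  # audio projection (assumed)
            self.proj_t = nn.Linear(d, d)  # text projection (assumed)

        def forward(self, audio, text):
            # audio, text: (batch, d) utterance-level features.
            a, t = self.proj_a(audio), self.proj_t(text)
            # Affinity score from the correlation (dot product) of the two modalities.
            affinity = torch.sigmoid((a * t).sum(dim=-1, keepdim=True))
            # Affinity-weighted combination of audio and text information.
            return affinity * a + (1.0 - affinity) * t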
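
Likewise, the feature-distillation objective described in the abstract can be pictured as a combination of a hard-label loss, a softened teacher-student KL term that transfers the teacher's probability distribution, and a feature-matching term. The temperature T, the weights alpha and beta, and the plain L2 feature matching below are assumptions for illustration; the paper's actual loss may differ.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits,
                          student_feat, teacher_feat,
                          labels, T=2.0, alpha=0.5, beta=0.1):
        # Hard-label cross-entropy on the student's predictions.
        ce = F.cross_entropy(student_logits, labels)
        # Softened KL divergence: transfers the hidden knowledge in the
        # teacher's probability distribution to the student.
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * (T * T)
        # L2 matching of intermediate features (the "feature" part of
        # feature distillation), pushing the student toward the teacher's
        # higher-level semantic representations.
        fd = F.mse_loss(student_feat, teacher_feat)
        return ce + alpha * kd + beta * fd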