ZHAO Ziping, GAO Tian, WANG Huan. Multimodal Feature Distillation Emotion Recognition Method with Global Convolution and Affinity Fusion[J]. JOURNAL OF SIGNAL PROCESSING, 2023, 39(4): 667-677. DOI: 10.16798/j.issn.1003-0530.2023.04.008

Multimodal Feature Distillation Emotion Recognition Method with Global Convolution and Affinity Fusion

To enhance the user experience of human-computer interaction and meet the needs of a wide range of applications, interactive devices are increasingly incorporating emotional intelligence technologies. Effective integration of these technologies into industry presupposes the ability to correctly recognize human emotional states, which remains a challenging problem. With the rapid development of multimedia, an increasing amount of modal information has become available to emotion recognition systems. This paper therefore proposes a multimodal emotion recognition model based on feature distillation. Since emotional expression is often closely related to the global information of the audio signal, Adaptive Global Convolution (AGC) is proposed to enlarge the effective receptive field, and a Feature Map Importance Analysis (FMIA) module is proposed to further strengthen emotion-relevant key features. An Audio Affinity Fusion (AAF) module models affinity fusion weights from the intrinsic correlation between the audio and text modalities, so that the emotional information of the two modalities can be fused effectively. In addition, Feature Distillation (FD) is applied to the multimodal model to promote generalization, exploit the hidden knowledge carried by the probability distribution of the teacher model, and help the student model acquire higher-level semantic features. The method was evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, achieving a weighted accuracy (WA) of 75.2% and an unweighted accuracy (UA) of 75.8%, which demonstrates the effectiveness of the model for emotion recognition.
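The affinity fusion and feature distillation steps summarized above can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under my own assumptions, not the paper's implementation: AffinityFusion derives fusion weights from the cross-modal correlation between projected audio and text features, and distillation_loss combines a temperature-softened KL term (teacher to student) with the usual cross-entropy on hard labels. All module, function, and parameter names here (AffinityFusion, distillation_loss, T, alpha) are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AffinityFusion(nn.Module):
    """Fuse audio and text features with weights driven by their affinity (illustrative sketch)."""

    def __init__(self, audio_dim: int, text_dim: int, fused_dim: int):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, fused_dim)
        self.proj_t = nn.Linear(text_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, 2)  # one mixing weight per modality

    def forward(self, audio_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.proj_a(audio_feat))   # (B, fused_dim)
        t = torch.tanh(self.proj_t(text_feat))    # (B, fused_dim)
        # Affinity: cosine similarity between the projected modalities,
        # used here to scale how strongly the fused representation responds.
        affinity = F.cosine_similarity(a, t, dim=-1, eps=1e-8).unsqueeze(-1)  # (B, 1)
        weights = torch.softmax(self.gate(torch.cat([a, t], dim=-1)), dim=-1)  # (B, 2)
        fused = weights[:, :1] * a + weights[:, 1:] * t
        return fused * (1.0 + affinity)           # affinity-modulated fusion


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term (teacher -> student) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

In a typical use, per-utterance audio and text embeddings would be fused with AffinityFusion, the fused vector fed to an emotion classifier, and the student trained with distillation_loss against a stronger teacher model; the temperature T and weight alpha trade off soft and hard supervision.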
