A Model-Level Fusion Method for Dimensional Emotion Recognition Based on Multi-Head Attention

  • Abstract: In recent years, emotion recognition has become a hot research topic in human-computer interaction, and multimodal dimensional emotion recognition, which can detect subtle emotional changes, has attracted increasing attention. Multimodal dimensional emotion recognition must consider how to fuse the emotional information of different modalities effectively. To address the difficulties of effective feature extraction and modal synchronization in feature-level fusion, and the problem of correlating feature information across modalities in decision-level fusion, this paper adopts a model-level fusion strategy and proposes a multimodal dimensional emotion recognition method based on the multi-head attention mechanism. Audio, video, and multimodal fusion models are constructed to learn deep features from the information streams, and the fused representation is then fed into a bidirectional long short-term memory (BiLSTM) network to obtain the final emotion predictions. Compared with various baseline methods, the proposed method achieves the best performance on both arousal and valence, effectively capturing high-level emotional information and thereby fusing audio and video information more effectively.
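The model-level fusion the abstract describes can be illustrated with a minimal sketch: audio and video feature streams attend to each other via multi-head attention, and the cross-modal outputs are concatenated as input to a BiLSTM regressor. This is not the authors' implementation; the dimensions, head count, and the omission of learned query/key/value projections are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key, value, num_heads):
    """Scaled dot-product attention split across heads.

    Learned projection matrices are omitted for brevity; each head simply
    operates on its own slice of the feature dimension.
    """
    t, d = query.shape
    assert d % num_heads == 0
    dh = d // num_heads
    out = np.empty_like(query)
    for h in range(num_heads):
        q = query[:, h * dh:(h + 1) * dh]
        k = key[:, h * dh:(h + 1) * dh]
        v = value[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)          # (t, t) attention weights
        out[:, h * dh:(h + 1) * dh] = softmax(scores) @ v
    return out

# Hypothetical shapes: 50 aligned time steps, 64-dim deep features per modality.
T, D = 50, 64
rng = np.random.default_rng(0)
audio = rng.standard_normal((T, D))  # stands in for learned audio features
video = rng.standard_normal((T, D))  # stands in for learned video features

# Cross-modal fusion: each modality attends to the other, then concatenate.
a2v = multi_head_attention(audio, video, video, num_heads=4)
v2a = multi_head_attention(video, audio, audio, num_heads=4)
fused = np.concatenate([a2v, v2a], axis=-1)  # (T, 2*D), fed to a BiLSTM
print(fused.shape)  # (50, 128)
```

In an actual model each head would carry its own learned projections and the fused sequence would pass through a BiLSTM whose final states regress the arousal and valence values.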

     

