Dialogue Speech Emotion Recognition Based on Wav2vec2.0 and Contextual Emotional Information Compensation
Abstract: Emotion plays an important role in human interaction. In daily dialogue, utterances often carry weak emotional color, complex emotional categories, and high ambiguity, which makes dialogue speech emotion recognition a challenging task. To address this problem, many existing works retrieve emotional information from the entire dialogue and use this global emotional information for prediction. However, when the emotion changes sharply between preceding and subsequent utterances, indiscriminately introducing preceding emotional information can interfere with the current prediction. This paper proposes a method based on Wav2vec2.0 and contextual emotional information compensation, which aims to select from the preceding context the emotional information most relevant to the current utterance as compensation. First, a contextual information compensation module selects, from the dialogue history, the prosodic information of the utterances most likely to influence the emotion of the current utterance, and a long short-term memory (LSTM) network builds this prosodic information into a contextual emotional information compensation representation. Then, the pretrained Wav2vec2.0 model extracts an embedding representation of the current utterance, which is fused with the contextual representation for emotion recognition. The method achieves a recognition performance of 69.0% weighted accuracy (WA) on the IEMOCAP dataset, significantly outperforming the baseline model.
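The fusion described in the abstract — an LSTM-encoded contextual prosody representation concatenated with the current utterance's Wav2vec2.0 embedding before classification — can be sketched as follows. This is a minimal illustration under assumed dimensions: the Wav2vec2.0 encoder is stood in for by a pooled 768-dimensional embedding, and the class and parameter names (`ContextCompensationSER`, `prosody_dim`, etc.) are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class ContextCompensationSER(nn.Module):
    """Sketch: fuse a current-utterance embedding (e.g. pooled Wav2vec2.0
    output) with an LSTM-built contextual emotional compensation vector."""

    def __init__(self, embed_dim=768, prosody_dim=32, ctx_hidden=64, n_classes=4):
        super().__init__()
        # LSTM turns the prosody features of the selected preceding
        # utterances into the contextual compensation representation.
        self.ctx_lstm = nn.LSTM(prosody_dim, ctx_hidden, batch_first=True)
        self.classifier = nn.Linear(embed_dim + ctx_hidden, n_classes)

    def forward(self, utt_embed, ctx_prosody):
        # utt_embed:   (B, embed_dim)        current-utterance embedding
        # ctx_prosody: (B, T, prosody_dim)   prosody of selected context utterances
        _, (h, _) = self.ctx_lstm(ctx_prosody)
        fused = torch.cat([utt_embed, h[-1]], dim=-1)  # fuse current + context
        return self.classifier(fused)

model = ContextCompensationSER()
logits = model(torch.randn(2, 768), torch.randn(2, 5, 32))
print(logits.shape)  # torch.Size([2, 4]) -> 4 emotion classes
```

In practice the utterance embedding would come from a pretrained Wav2vec2.0 checkpoint (e.g. via mean-pooling its frame-level outputs), and the context selection step decides which preceding utterances' prosody enters `ctx_prosody`.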