Emotion Recognition from Raw Speech Based on a Sinc-Transformer Model

Abstract: Considering the tedium of manually extracting acoustic features in traditional speech emotion recognition tasks, this paper proposes a Sinc-Transformer (SincNet Transformer) model that performs emotion recognition directly on raw speech. The model combines the advantages of a SincNet layer and Transformer encoders. The SincNet filters capture important narrow-band emotional features from the raw speech waveform, giving the whole network a guided first stage that performs shallow feature extraction on the raw signal; two Transformer encoder layers then process these features further to extract deep feature vectors containing global context information. On four-class emotion classification with the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, experimental results show that the proposed Sinc-Transformer model achieves an accuracy of 64.14% and an unweighted average recall of 65.28%. Compared with baseline models, the proposed model effectively improves speech emotion recognition performance.
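The SincNet layer mentioned in the abstract learns band-pass filters that are parameterized only by their two cutoff frequencies, rather than learning every filter tap freely. A minimal sketch of how one such windowed-sinc band-pass kernel is constructed from a pair of cutoffs is shown below; the function name, kernel length, and sample rate are illustrative assumptions, not details taken from the paper.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x) / x, with the limit 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass_kernel(f1_hz, f2_hz, kernel_len=101, sample_rate=16000):
    """Build a SincNet-style band-pass kernel from two cutoff frequencies.

    The kernel is the difference of two low-pass sinc filters (pass band
    f1_hz..f2_hz), multiplied by a Hamming window to reduce spectral
    leakage. In SincNet, f1 and f2 are the learnable parameters; here
    they are fixed inputs for illustration.
    """
    # Normalize cutoffs by the sampling rate.
    f1 = f1_hz / sample_rate
    f2 = f2_hz / sample_rate
    half = kernel_len // 2
    kernel = []
    for i in range(kernel_len):
        n = i - half  # sample index centered on zero
        # Difference of two sinc low-pass filters -> band-pass response.
        band = 2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
        # Hamming window.
        w = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_len - 1))
        kernel.append(band * w)
    return kernel
```

In a full model such a kernel would be convolved with the raw waveform as the first layer, with the cutoff pairs updated by backpropagation; only two parameters per filter need to be learned, which is what makes the feature extraction stage compact and interpretable.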

     
