Speech Emotion Recognition Based on Glottal Source and Vocal Tract Features
Abstract: Speech emotion recognition is an indispensable part of natural human-computer interaction and an important component of artificial intelligence. The regulation of the speech production organs causes differences in the acoustic features of emotional speech, through which different emotions are perceived. Traditional speech emotion recognition methods classify emotions using only acoustic or auditory features, ignoring the important role that production-related cues, such as the glottal source waveform and the vocal tract shape, play in emotion perception. In our previous work, the contributions of glottal source and vocal tract cues to emotion perception in speech were analyzed theoretically, but these features were not applied to speech emotion recognition. In this paper, we therefore revisit glottal source and vocal tract cues for speech emotion recognition from the perspective of speech production and, motivated by the source-filter model of speech production, propose a new recognition method based on glottal source and vocal tract features. First, the glottal source and vocal tract features are estimated simultaneously from emotional speech signals using an analysis-by-synthesis approach built on a source-filter model that combines an Auto-Regressive eXogenous (ARX) model with the Liljencrants-Fant (LF) model. The estimated features are then fed into a Bidirectional Gated Recurrent Unit (BiGRU) network for emotion classification. Recognition experiments were conducted on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset; the results show that the glottal source and vocal tract features can effectively distinguish emotions, and that their recognition accuracy surpasses that of several traditional emotion features. By studying speech emotion recognition through the production-related glottal source and vocal tract, this paper provides a new perspective for speech emotion recognition technology.
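For concreteness, the ARX-LF source-filter model referenced in the abstract can be stated in its standard textbook form, sketched below. The notation is ours, not necessarily the paper's: s(n) is the speech signal, u the LF glottal excitation, a_i(n) the time-varying all-pole vocal tract coefficients, and e(n) the residual.

```latex
% Standard ARX-LF formulation (our notation; an assumption, not the paper's exact equations).
\begin{align}
  % ARX model: speech generated by driving a time-varying all-pole
  % vocal tract filter with the glottal source u(n), plus residual e(n)
  s(n) + \sum_{i=1}^{p} a_i(n)\, s(n-i) &= b(n)\, u(n) + e(n) \\
  % LF model of the glottal flow derivative over one pitch period,
  % with \omega_g = \pi / T_p:
  u(t) &=
  \begin{cases}
    E_0\, e^{\alpha t} \sin(\omega_g t), & 0 \le t \le T_e \\[4pt]
    -\dfrac{E_e}{\varepsilon T_a}\!\left( e^{-\varepsilon (t - T_e)} - e^{-\varepsilon (T_c - T_e)} \right), & T_e < t \le T_c
  \end{cases}
\end{align}
```

Here the timing parameters T_p, T_e, T_a, T_c and the excitation amplitude E_e are the kind of LF parameters that, together with the estimated filter coefficients, constitute the glottal source and vocal tract feature set.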
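As a minimal illustration of the classification stage, the following PyTorch sketch shows a BiGRU over frame-level feature sequences. The layer sizes, the 13-dimensional feature vector, and the four emotion classes (a common IEMOCAP setup) are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed hyperparameters): a BiGRU classifier over
# per-frame glottal-source / vocal-tract feature sequences.
import torch
import torch.nn as nn

class BiGRUEmotionClassifier(nn.Module):
    def __init__(self, feat_dim=13, hidden=128, num_classes=4):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # 2x: forward + backward states

    def forward(self, x):
        # x: (batch, frames, feat_dim) sequence of per-frame features
        out, _ = self.bigru(x)
        # use the last time step's bidirectional state for the utterance-level label
        return self.fc(out[:, -1, :])

# Usage: a batch of 8 utterances, 200 frames each, 13-dim features
logits = BiGRUEmotionClassifier()(torch.randn(8, 200, 13))  # (8, 4)
```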