LI Yongwei, TAO Jianhua, LI Kai. Speech Emotion Recognition Based on Glottal Source and Vocal Tract Features[J]. JOURNAL OF SIGNAL PROCESSING, 2023, 39(4): 632-638. DOI: 10.16798/j.issn.1003-0530.2023.04.004

Speech Emotion Recognition Based on Glottal Source and Vocal Tract Features

Speech emotion recognition is an indispensable part of natural human-computer interaction and an important component of artificial intelligence. The regulation of the speech production organs causes differences in the acoustic features of emotional speech, through which different emotions are perceived. Traditional speech emotion recognition methods focus only on classifying emotions from acoustic or auditory features, ignoring the important role that features directly related to speech production, such as glottal source waveforms and vocal tract shape cues, play in emotion perception. In our previous study, the contributions of glottal source and vocal tract cues to emotion perception in speech were analyzed theoretically; however, these features have not yet been used for speech emotion recognition. In this paper, we therefore revisit glottal source and vocal tract cues for speech emotion recognition from the viewpoint of speech production. Motivated by the source-filter model of speech production, we propose a new speech emotion recognition method based on glottal source and vocal tract features. First, the glottal source and vocal tract features are estimated simultaneously from emotional speech signals using an analysis-by-synthesis approach with a source-filter model composed of an Auto-Regressive eXogenous (ARX) model and the Liljencrants-Fant (LF) model. Then, the estimated glottal source and vocal tract features are fed into a Bidirectional Gated Recurrent Unit (BiGRU) network for the speech emotion recognition task.
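As a rough illustration of the source model involved, the sketch below generates one pitch period of the LF glottal flow derivative. All parameter values (pitch period, timing parameters, sampling rate) are illustrative assumptions for this example, not the paper's fitted values, and the zero-area constraint on the pulse is solved here by simple bisection rather than by the paper's analysis-by-synthesis procedure.

```python
import numpy as np

def lf_pulse(fs=16000, T0=0.008, Te=0.006, Tp=0.0045, Ta=0.0003, Ee=1.0):
    """One period of the LF glottal flow derivative (illustrative sketch).

    fs : sampling rate (Hz); T0 : pitch period (s)
    Tp : instant of peak glottal flow; Te : instant of main excitation
    Ta : return-phase time constant; Ee : excitation strength
    (All values are example assumptions, not fitted from real speech.)
    """
    t = np.arange(0.0, T0, 1.0 / fs)
    wg = np.pi / Tp  # open-phase angular frequency
    # Return-phase decay: solve eps*Ta = 1 - exp(-eps*(T0 - Te)) by
    # fixed-point iteration (strongly contracting for these values).
    eps = 1.0 / Ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (T0 - Te))) / Ta

    def pulse(alpha):
        # Open phase: E0*exp(alpha*t)*sin(wg*t), scaled so E(Te) = -Ee;
        # return phase: exponential recovery reaching 0 at t = T0.
        E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
        open_ph = E0 * np.exp(alpha * t) * np.sin(wg * t)
        ret_ph = -(Ee / (eps * Ta)) * (
            np.exp(-eps * (t - Te)) - np.exp(-eps * (T0 - Te)))
        return np.where(t <= Te, open_ph, ret_ph)

    # Choose alpha so the pulse integrates to ~0 over one period
    # (no net change in glottal flow): bisection on the discrete area.
    lo, hi = 0.0, 5000.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if pulse(mid).sum() / fs > 0.0:  # net positive area: raise alpha
            lo = mid
        else:
            hi = mid
    return t, pulse(0.5 * (lo + hi))
```

The resulting waveform rises during the open phase, reaches its negative peak -Ee at the main excitation instant Te, and then decays back toward zero in the return phase.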
Emotion recognition experiments were conducted on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. The experimental results show that glottal source and vocal tract features can effectively distinguish emotions, and that their emotion recognition accuracy is superior to that of traditional emotion features. This paper focuses on the advantages of using glottal source and vocal tract features directly for speech emotion recognition, providing new insight into speech emotion recognition technology.
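A minimal sketch of the classification stage described above, using PyTorch. The feature dimension, hidden size, number of layers, and the four-class setup are illustrative assumptions, not the paper's configuration; mean pooling over frames is likewise one common choice, not necessarily the authors'.

```python
import torch
import torch.nn as nn

class BiGRUEmotionClassifier(nn.Module):
    """Hypothetical BiGRU classifier over frame-level glottal source and
    vocal tract feature sequences (all sizes are assumptions)."""

    def __init__(self, feat_dim=13, hidden=128, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)  # 2x for both directions

    def forward(self, x):          # x: (batch, frames, feat_dim)
        out, _ = self.gru(x)       # out: (batch, frames, 2*hidden)
        pooled = out.mean(dim=1)   # average over time frames
        return self.fc(pooled)     # (batch, n_classes) emotion logits

# Dummy batch: 8 utterances, 120 frames of 13-dim source/tract features.
model = BiGRUEmotionClassifier()
feats = torch.randn(8, 120, 13)
logits = model(feats)
```

In a real setup, each utterance's estimated glottal source parameters (e.g., LF timing values) and vocal tract features (e.g., ARX filter coefficients) would be stacked frame by frame to form the input sequence, and the logits trained with cross-entropy against the IEMOCAP emotion labels.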
