Abstract:
To address the loss of prosodic information caused by the utterance-level global statistical features widely used in traditional speech emotion recognition, a novel multi-granularity feature extraction method is proposed in this paper. The method extracts features over different time units: short-term frame features, mid-term segment features, and long-term windowed features. To fuse these multi-granularity features, we propose a cognitive-inspired recurrent neural network (CIRNN). CIRNN combines features at different time scales to emulate the step-by-step way humans process audio signals, and achieves multi-level information fusion by emphasizing both the temporal sequence of emotion and the role of content information. The proposed method is further evaluated on the VAM database for estimating continuous emotion primitives in a three-dimensional space spanned by activation, valence, and dominance, achieving an average correlation coefficient of 0.66. Experimental results show that the proposed system yields a significant improvement in speech emotion estimation compared with the commonly used ANN and SVR approaches.
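To make the multi-granularity idea concrete, the following is a minimal sketch (not the paper's implementation) of extracting features at three time scales from a waveform. The frame length, hop size, segment and window sizes, and the simple energy/zero-crossing descriptors are illustrative assumptions only.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (short-term granularity)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_term_features(frames):
    """Per-frame descriptors: log-energy and zero-crossing rate (illustrative choices)."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)            # shape: (n_frames, 2)

def aggregate(feats, group):
    """Mean/std statistics over groups of consecutive frames (mid-/long-term granularity)."""
    n = (len(feats) // group) * group
    blocks = feats[:n].reshape(-1, group, feats.shape[1])
    return np.concatenate([blocks.mean(axis=1), blocks.std(axis=1)], axis=1)

# Example: 16 kHz signal, 25 ms frames with 10 ms hop (assumed values)
sr = 16000
x = np.random.randn(sr * 3)                           # 3 s of dummy audio
frames = frame_signal(x, frame_len=400, hop=160)
f_short = short_term_features(frames)                 # short-term: per frame
f_mid = aggregate(f_short, group=10)                  # mid-term: ~100 ms segments
f_long = aggregate(f_short, group=100)                # long-term: ~1 s windows
print(f_short.shape, f_mid.shape, f_long.shape)
```

In the paper's setting, feature sequences at these three granularities would then be fed to the recurrent fusion model; the descriptors and time scales above are placeholders for the actual acoustic features used.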