Speech Emotion Recognition Based on CM-OMEMD and Wavelet Scattering Network
Abstract: Speech emotion recognition (SER) is an essential part of human-computer interaction and has broad research and application value. Current SER still suffers from low recognition accuracy, caused by the lack of large-scale speech emotion datasets and the low robustness of speech emotion features. To address these problems, a SER method based on an improved empirical mode decomposition (EMD) and a wavelet scattering network (WSN) was proposed. First, a novel optimized masking EMD based on the constant-Q transform (CQT) and the marine predators algorithm (MPA), named CM-OMEMD, was proposed to address the mode-mixing and noise-residual problems of EMD and its improved variants in the time-frequency analysis of emotional speech signals. CM-OMEMD was used to decompose the emotional speech signal into intrinsic mode functions (IMFs), from which time-frequency features that characterize emotion were extracted as the first feature set. Then scattering coefficients with translation invariance and stability to deformations were extracted by the WSN as the second feature set. Finally, the two feature sets were fused, and a support vector machine (SVM) classifier was used for classification.
The effectiveness of the proposed method was demonstrated by comparison experiments on the TESS dataset, which contains seven emotional states. CM-OMEMD reduced mode mixing and improved the accuracy of time-frequency analysis of emotional speech signals, and the proposed method significantly improved SER performance.
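The final stage summarized above (feature-level fusion of the two feature sets, followed by SVM classification) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the CM-OMEMD time-frequency features and WSN scattering coefficients are replaced by synthetic placeholder arrays, and the feature dimensions, kernel, and hyperparameters are assumptions chosen only for the sketch.

```python
# Sketch of the fusion + SVM step described in the abstract.
# The two feature extractors (CM-OMEMD, WSN) are stubbed with synthetic
# placeholder features; only the concatenation and classification follow
# the paper's description.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_utterances = 140          # illustrative: 20 utterances per emotion
n_classes = 7               # TESS covers seven emotional states
labels = np.repeat(np.arange(n_classes), n_utterances // n_classes)

# Placeholder feature sets (dimensions are illustrative assumptions):
# feats_emd - time-frequency features from the CM-OMEMD IMFs
# feats_wsn - scattering coefficients from the wavelet scattering network
feats_emd = rng.normal(size=(n_utterances, 24)) + labels[:, None] * 0.5
feats_wsn = rng.normal(size=(n_utterances, 64)) + labels[:, None] * 0.5

# Feature-level fusion: concatenate the two sets per utterance.
fused = np.hstack([feats_emd, feats_wsn])

X_tr, X_te, y_tr, y_te = train_test_split(
    fused, labels, test_size=0.3, stratify=labels, random_state=0)

# RBF-kernel SVM; the abstract does not specify the kernel, so this is a
# common default choice rather than the authors' exact configuration.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"accuracy on synthetic features: {acc:.2f}")
```

In practice the two placeholder arrays would be replaced by the per-utterance features actually extracted from the CM-OMEMD IMFs and the WSN scattering transform.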