Single-channel Speech Separation Based on CNN-SVM Gender Combination Classification
Abstract: In practical speech separation, the speaker gender combination of a mixed utterance is usually unknown, and separating the mixture directly with a universal model yields unsatisfactory results. To improve separation, this paper proposes a gender-combination discrimination model based on a convolutional neural network (CNN) and a support vector machine (SVM), which determines whether the two speakers in the mixture form a male-male, male-female, or female-female combination, so that the separation model trained for that gender combination can be selected. To compensate for the limited gender-combination information carried by a traditional single feature, a strategy of mining deep fusion features is also proposed, so that the classification feature contains more information about the gender-combination categories. The proposed single-channel speech separation method based on CNN-SVM gender combination classification first uses a CNN to mine deep features from Mel-frequency cepstral coefficients (MFCC) and filter-bank features, and fuses these two deep features into the classification feature for the gender combination. An SVM then recognizes the gender combination of the mixed speech, and finally the deep neural network (DNN) or CNN separation model corresponding to that gender combination is selected to separate the speech. Experimental results show that, compared with traditional single features, the proposed deep fusion feature effectively improves the recognition rate of the gender combination of mixed speech, and that the proposed separation method outperforms the universal separation model in perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR).
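As a rough illustration of the classification-then-dispatch pipeline described above, the following Python sketch mines deep features from MFCC and filter-bank maps with a small CNN, fuses them, classifies the gender combination with an SVM, and selects a separation model accordingly. This is not the authors' implementation: the network structure (DeepFeatureCNN), the feature dimensions (13 MFCCs, 40 filter-bank channels), the SVM kernel, and the model-selection dictionary are all hypothetical placeholders used only to show the data flow on synthetic inputs.

```python
# Minimal sketch of the CNN-SVM gender-combination pipeline (assumptions, not the paper's code).
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC

class DeepFeatureCNN(nn.Module):
    """Small CNN mapping a (1, n_bands, n_frames) feature map to a fixed-length
    deep feature vector (hypothetical architecture)."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (32, 1, 1)
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):                     # x: (batch, 1, bands, frames)
        return self.fc(self.conv(x).flatten(1))

def fused_deep_feature(mfcc, fbank, cnn_mfcc, cnn_fbank):
    """Mine deep features from the MFCC and filter-bank maps and concatenate them."""
    with torch.no_grad():
        f1 = cnn_mfcc(torch.from_numpy(mfcc).float()[None, None])
        f2 = cnn_fbank(torch.from_numpy(fbank).float()[None, None])
    return torch.cat([f1, f2], dim=1).squeeze(0).numpy()

# Toy run on synthetic feature maps; labels: 0 = male-male, 1 = male-female, 2 = female-female.
rng = np.random.default_rng(0)
cnn_mfcc, cnn_fbank = DeepFeatureCNN(), DeepFeatureCNN()
X = np.stack([fused_deep_feature(rng.standard_normal((13, 100)),
                                 rng.standard_normal((40, 100)),
                                 cnn_mfcc, cnn_fbank) for _ in range(30)])
y = rng.integers(0, 3, size=30)
svm = SVC(kernel="rbf").fit(X, y)             # SVM recognizes the gender combination

# Dispatch the mixture to the separation model trained for the recognized combination.
separation_models = {0: "model_MM", 1: "model_MF", 2: "model_FF"}   # placeholders
test = fused_deep_feature(rng.standard_normal((13, 100)),
                          rng.standard_normal((40, 100)), cnn_mfcc, cnn_fbank)
print("selected separation model:", separation_models[int(svm.predict(test[None])[0])])
```

In practice the synthetic arrays would be replaced by MFCC and filter-bank features extracted from the mixed utterance, and the placeholder strings by the gender-specific DNN/CNN separation models described in the abstract.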