SUN Linhui, ZHANG Meng, LIANG Wenqing. CNN-SVM Gender Combination Classification Based Single-channel Speech Separation[J]. JOURNAL OF SIGNAL PROCESSING, 2022, 38(12): 2519-2531. DOI: 10.16798/j.issn.1003-0530.2022.12.007

CNN-SVM Gender Combination Classification Based Single-channel Speech Separation

In practical speech separation, the speaker gender combination of the mixed speech is usually unknown, and separating the mixture directly with a universal model gives unsatisfactory performance. To improve separation, this paper proposes a gender combination discrimination model based on a convolutional neural network (CNN) and a support vector machine (SVM). The model determines whether the gender combination of the mixed speech is male-male, male-female, or female-female, so that the corresponding gender-specific separation model can be selected. Because a traditional single feature carries too little gender combination information, a strategy for mining deep fusion features is also proposed so that the classification features contain more information about the gender combination category. The proposed single-channel speech separation method first uses a CNN to mine deep features from Mel-frequency cepstral coefficients (MFCC) and filter bank features, and fuses these two deep features into the gender combination classification feature. An SVM then recognizes the gender combination of the mixed speech. Finally, the deep neural network (DNN) or CNN separation model corresponding to that gender combination is selected for speech separation. Experimental results show that, compared with a traditional single feature, the proposed deep fusion feature effectively improves the recognition rate of the gender combination of mixed speech. In terms of signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), the proposed method outperforms the universal speech separation model.
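The three-stage flow described in the abstract (fuse the two deep features, classify the gender combination, dispatch to the matching separation model) can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: the abstract does not state the fusion rule, so concatenation is an assumption, and `fuse_features`, `separate_by_gender_combination`, `toy_classifier`, and the stub separators are hypothetical names. The toy classifier is a simple threshold standing in for a trained SVM, and the stubs stand in for trained DNN or CNN separation models.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Separator = Callable[[Sequence[float]], Tuple[Vector, Vector]]

GENDER_COMBINATIONS = ("male-male", "male-female", "female-female")


def fuse_features(deep_mfcc: Sequence[float], deep_fbank: Sequence[float]) -> Vector:
    """Fuse two deep feature vectors by concatenation (an assumed fusion rule)."""
    return list(deep_mfcc) + list(deep_fbank)


def separate_by_gender_combination(
    mixture: Sequence[float],
    deep_mfcc: Sequence[float],
    deep_fbank: Sequence[float],
    classify: Callable[[Vector], str],
    separators: Dict[str, Separator],
) -> Tuple[str, Tuple[Vector, Vector]]:
    """Classify the gender combination, then run the matching separation model."""
    fused = fuse_features(deep_mfcc, deep_fbank)
    label = classify(fused)  # in the paper this role is played by an SVM
    if label not in separators:
        raise KeyError(f"no separation model registered for {label!r}")
    return label, separators[label](mixture)


def toy_classifier(fused: Vector) -> str:
    """Hypothetical stand-in for a trained SVM: thresholds the mean feature value."""
    mean = sum(fused) / len(fused)
    if mean < -0.5:
        return "male-male"
    if mean > 0.5:
        return "female-female"
    return "male-female"


def stub_separator(mixture: Sequence[float]) -> Tuple[Vector, Vector]:
    """Placeholder for a trained DNN/CNN separator: splits the mixture in half."""
    half = [0.5 * x for x in mixture]
    return half, list(half)


# One (placeholder) separation model per gender combination.
SEPARATORS: Dict[str, Separator] = {name: stub_separator for name in GENDER_COMBINATIONS}

if __name__ == "__main__":
    label, (est1, est2) = separate_by_gender_combination(
        mixture=[1.0, -1.0],
        deep_mfcc=[1.0, 2.0],
        deep_fbank=[3.0],
        classify=toy_classifier,
        separators=SEPARATORS,
    )
    print(label, est1, est2)
```

Keeping the classifier and the separators as injected callables reflects the structure the abstract describes: the gender combination decision is made once per mixture, and only the selected gender-specific model is run.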
