Time-domain Speech Separation Based on a Fully Convolutional Neural Network with Multitask Learning
Graphical Abstract
Abstract
When speech separation is performed with a time-frequency mask estimated by a deep neural network, the phase spectrum of the mixed signal is commonly reused as the phase of the target signal, and no special processing is applied to different gender combinations, which degrades the quality of the separated speech. To address these problems, this study introduces a time-domain speech separation approach based on a fully convolutional network with gender combination detection (FCN-GCD) and multitask learning. The network consists of a speech separation module and a module that detects the gender combination of the mixed speech. In the speech separation module, an FCN takes the time-domain mixture of two speakers as input and outputs the clean speech signal of the target speaker. The FCN compresses features through the convolutional layers of its encoder and reconstructs them through the deconvolutional layers of its decoder, achieving end-to-end speech separation. In addition, multitask learning integrates the gender combination detection (GCD) task into the separation network, so that auxiliary information and separation features are learned jointly under the constraints of both tasks; combining these deep features strengthens the model's ability to separate mixtures of different gender combinations. Treating GCD as a secondary task also enables parameter sharing between the primary and secondary tasks, further improving the separation capability of the primary task. Compared with frequency-domain methods, the proposed time-domain FCN-GCD method requires neither phase recovery nor frequency-to-time reconstruction, which simplifies processing and improves computational efficiency.
Furthermore, it extracts effective auxiliary features from the GCD task on the mixed speech, enabling more effective separation. Experimental results demonstrate that, compared with single-task speech separation methods, the proposed method improves separated speech quality for all three gender combinations (male-male, female-female, and male-female) and achieves better scores on evaluation metrics including Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), Signal-to-Interference Ratio (SIR), Signal-to-Distortion Ratio (SDR), and Signal-to-Artifact Ratio (SAR).
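The multitask architecture described above can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the paper's actual configuration: the layer counts, channel widths, kernel sizes, and the three-class GCD head (male-male, female-female, male-female) are assumptions chosen to show how a 1-D convolutional encoder-decoder can share its bottleneck features with a secondary classification task.

```python
# Hypothetical sketch of an FCN-GCD-style multitask network: a 1-D fully
# convolutional encoder-decoder maps the time-domain mixture to the target
# speaker's waveform, while a classification head on the shared bottleneck
# predicts the gender combination of the mixture. All sizes are illustrative.
import torch
import torch.nn as nn

class FCNGCD(nn.Module):
    def __init__(self, channels=(1, 16, 32, 64), n_combinations=3):
        super().__init__()
        # Encoder: strided 1-D convolutions compress the time-domain mixture.
        self.encoder = nn.ModuleList(
            nn.Conv1d(c_in, c_out, kernel_size=8, stride=2, padding=3)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        # Decoder: transposed convolutions reconstruct the target waveform.
        rev = channels[::-1]
        self.decoder = nn.ModuleList(
            nn.ConvTranspose1d(c_in, c_out, kernel_size=8, stride=2, padding=3)
            for c_in, c_out in zip(rev[:-1], rev[1:])
        )
        # Secondary task: gender-combination detection from the bottleneck,
        # so both tasks share (and jointly constrain) the encoder parameters.
        self.gcd_head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(channels[-1], n_combinations),
        )

    def forward(self, mix):                  # mix: (batch, 1, samples)
        h = mix
        for conv in self.encoder:
            h = torch.relu(conv(h))
        gcd_logits = self.gcd_head(h)        # secondary-task prediction
        for deconv in self.decoder[:-1]:
            h = torch.relu(deconv(h))
        separated = self.decoder[-1](h)      # estimated target waveform
        return separated, gcd_logits

model = FCNGCD()
mix = torch.randn(2, 1, 1024)                # two mixtures, 1024 samples each
separated, gcd_logits = model(mix)
print(separated.shape)                       # torch.Size([2, 1, 1024])
print(gcd_logits.shape)                      # torch.Size([2, 3])
```

In training, the two tasks would be optimized jointly, e.g. with a loss of the form L = L_separation + λ·L_gcd (a waveform reconstruction loss plus a weighted cross-entropy term); the weighting λ is an assumption here, as the abstract does not specify the loss.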