Time-domain Speech Separation Based on a Fully Convolutional Neural Network with Multitask Learning
Abstract: When speech separation is performed with a deep neural network using a time-frequency mask, the phase spectrum of the mixed signal is commonly reused as the phase of the target signal, and no dedicated processing is applied to different gender combinations, both of which degrade the quality of the separated speech. To address this problem, this paper proposes a time-domain speech separation method based on multitask learning with a fully convolutional neural network and gender combination detection (FCN-GCD). The network consists of a speech separation module and a mixed-speech gender combination detection module. In the separation module, a fully convolutional network takes the time-domain mixture of two speakers as input and outputs the clean speech of the target speaker: convolutional encoder layers compress the features and deconvolutional decoder layers reconstruct them, yielding end-to-end speech separation. Through multitask learning, the GCD task for the mixed speech is integrated into the separation network as a secondary task; under the joint constraint of the two tasks, auxiliary information features and speech separation features are obtained simultaneously and combined to strengthen the separation of mixtures with different gender combinations, while parameter sharing between the main and secondary tasks further improves the primary separation capability. Because FCN-GCD operates in the time domain, it requires neither phase recovery nor frequency-to-time reconstruction; compared with frequency-domain methods, this simplifies processing and improves computational efficiency. Moreover, the auxiliary features extracted from the GCD task enable more effective separation. Experimental results show that, compared with single-task separation methods, the proposed FCN-GCD method improves speech quality for all three gender combinations (male-male, female-female, and male-female) and achieves better scores on Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), Signal-to-Interference Ratio (SIR), Signal-to-Distortion Ratio (SDR), and Signal-to-Artifact Ratio (SAR).
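To make the architecture and the joint objective concrete, the following is a minimal sketch in PyTorch. The paper publishes no code, so everything here is an assumption for illustration: the layer counts, channel widths, kernel sizes, the MSE separation loss, the feature-combination strategy (tiled concatenation at the bottleneck), and the loss weight `alpha` are illustrative choices, not the authors' values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNGCD(nn.Module):
    """Sketch of the FCN-GCD idea: a convolutional encoder-decoder that maps a
    two-speaker time-domain mixture to the target speaker's waveform, with a
    gender combination detection (GCD) head attached to the shared encoder."""

    def __init__(self, c1=16, c2=32, c3=64, num_combinations=3):
        super().__init__()
        # Convolutional encoder: compresses the time-domain mixture (stride-2
        # layers halve the length each time).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, c1, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(c1, c2, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(c2, c3, kernel_size=8, stride=2, padding=3), nn.ReLU(),
        )
        # GCD head: classifies the mixture as male-male, female-female, or
        # male-female from pooled bottleneck features.
        self.gcd_pool = nn.AdaptiveAvgPool1d(1)
        self.gcd_fc = nn.Linear(c3, num_combinations)
        # Deconvolutional decoder: reconstructs the target waveform from the
        # combined separation + auxiliary features (hence 2 * c3 channels).
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(2 * c3, c2, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(c2, c1, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(c1, 1, kernel_size=8, stride=2, padding=3), nn.Tanh(),
        )

    def forward(self, mixture):
        # mixture: (batch, 1, samples) time-domain two-speaker mixture.
        z = self.encoder(mixture)                 # shared deep features
        g = self.gcd_pool(z)                      # pooled auxiliary feature
        gcd_logits = self.gcd_fc(g.squeeze(-1))   # gender-combination logits
        # Combine auxiliary and separation features; tiling + concatenation is
        # one plausible reading of "combining the deep features".
        z = torch.cat([z, g.expand_as(z)], dim=1)
        return self.decoder(z), gcd_logits        # end-to-end separated waveform

def multitask_loss(est, target, gcd_logits, gcd_label, alpha=0.1):
    # Joint constraint of the two tasks: waveform reconstruction (main task)
    # plus gender-combination classification (secondary task).
    return F.mse_loss(est, target) + alpha * F.cross_entropy(gcd_logits, gcd_label)

# Usage with dummy data: 1 s of 16 kHz audio per example.
model = FCNGCD()
mix = torch.randn(4, 1, 16000)        # two-speaker mixtures
clean = torch.randn(4, 1, 16000)      # target-speaker references
labels = torch.randint(0, 3, (4,))    # 0 = MM, 1 = FF, 2 = MF
est, logits = model(mix)
loss = multitask_loss(est, clean, logits, labels)
loss.backward()
```

Because both tasks back-propagate through the same encoder, the shared parameters are shaped by the GCD constraint as well as the reconstruction error, which is the parameter-sharing effect the abstract describes.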
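The five reported metrics all have common open-source implementations. The sketch below scores one separated utterance with the `mir_eval` (BSS-Eval SDR/SIR/SAR), `pesq` (ITU-T P.862), and `pystoi` packages; the paper does not name its evaluation tooling, so these packages, and the use of the residual `mixture - estimate` as a stand-in for the second source, are assumptions.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources  # SDR / SIR / SAR
from pesq import pesq                              # ITU-T P.862 PESQ
from pystoi import stoi                            # STOI

def score(target_ref, interferer_ref, target_est, mixture, fs=8000):
    """Score one utterance. All inputs are 1-D float arrays of equal length:
    clean target, clean interferer, network output, and the input mixture."""
    refs = np.stack([target_ref, interferer_ref])
    # FCN-GCD outputs only the target speaker, so the residual of the mixture
    # stands in for the second estimated source required by BSS-Eval.
    ests = np.stack([target_est, mixture - target_est])
    sdr, sir, sar, _ = bss_eval_sources(refs, ests)
    return {
        "PESQ": pesq(fs, target_ref, target_est, "nb"),  # 'nb' mode for 8 kHz
        "STOI": stoi(target_ref, target_est, fs),
        "SDR": sdr[0], "SIR": sir[0], "SAR": sar[0],     # target-speaker slot
    }
```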