Two-stage Speech Enhancement Algorithm Based on Temporal Convolutional Networks
Abstract: During transmission, speech signals are commonly corrupted by noise, echo, and other interference, which degrades both signal quality and intelligibility. Speech enhancement algorithms aim to remove noise and interference from the signal and improve speech quality, and deep-learning-based methods have achieved better results than traditional algorithms. However, existing algorithms suffer from two limitations: they are generally designed around only the speech component of noisy speech and fail to adequately model the noise component, and most of them perform enhancement with a single network, which places high demands on that network's capacity. To address these problems, this paper proposes the Spectral Masking Two-Stage Time-Frequency Processing Network (SM-TSTFN) for speech enhancement. The network decomposes speech enhancement into a magnitude-spectrum prediction stage and a complex-spectrum prediction stage, progressively estimating the clean speech by starting from a simpler sub-task and building toward the harder one. In the first stage, both noise and speech are taken as learning targets: the magnitude spectrum of the noisy speech is used as input to produce preliminary estimates of the noise and speech magnitude spectra. In the second stage, the complex spectrum of the noisy speech is used as input and, aided by the first-stage predictions, the complex spectrum of the clean speech is estimated. The noise estimate is also incorporated into the loss function, and the two stages are trained jointly. For the second stage, a Time-Frequency Processing Module (TFPM) is designed that combines a Long Short-Term Memory network (LSTM) and a Temporal Convolutional Network (TCN) to extract features along the time and frequency dimensions, respectively. Experimental results on a public dataset show that the proposed SM-TSTFN scores higher than the baseline models, demonstrating that the two-stage design, the TFPM, and the joint training together improve both the quality and the intelligibility of the enhanced speech more effectively and accurately.
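To make the abstract's description of the TFPM more concrete, the following is a minimal PyTorch sketch of a block that combines an LSTM path and a dilated-convolution (TCN-style) path over a time-frequency feature map. The class name TFPMBlock, the assignment of the LSTM to the frequency axis and the convolution to the time axis, the residual wiring, and all layer sizes are illustrative assumptions, not the paper's reference implementation.

```python
# Sketch of a TFPM-style block: an LSTM scans the frequency bins of each frame
# and a dilated 1-D convolution (TCN-style) scans the frames of each bin.
# All sizes and the exact wiring are assumptions made for illustration only.
import torch
import torch.nn as nn


class TFPMBlock(nn.Module):
    def __init__(self, channels: int, hidden: int = 64, dilation: int = 1):
        super().__init__()
        # Frequency path: bidirectional LSTM over the frequency axis of each frame.
        self.freq_lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        # Time path: dilated 1-D convolution over frames (TCN-style branch).
        self.time_conv = nn.Conv1d(channels, channels, kernel_size=3,
                                   dilation=dilation, padding=dilation)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        # --- frequency modelling ---
        xf = x.permute(0, 2, 3, 1).reshape(b * t, f, c)        # (B*T, F, C)
        xf, _ = self.freq_lstm(xf)
        xf = self.freq_proj(xf).reshape(b, t, f, c).permute(0, 3, 1, 2)
        x = x + xf                                             # residual connection
        # --- temporal modelling ---
        xt = x.permute(0, 3, 1, 2).reshape(b * f, c, t)        # (B*F, C, T)
        xt = torch.relu(self.time_conv(xt))
        xt = xt.reshape(b, f, c, t).permute(0, 2, 3, 1)
        x = x + xt                                             # residual connection
        # Normalize over the channel dimension.
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)


if __name__ == "__main__":
    block = TFPMBlock(channels=32)
    spec = torch.randn(2, 32, 100, 161)    # (batch, channels, frames, freq bins)
    print(block(spec).shape)               # torch.Size([2, 32, 100, 161])
```

In this sketch the LSTM and convolutional paths are stacked with residual connections so that each block refines the same time-frequency feature map; the paper's stage-two network may stack several such blocks or connect the two paths differently.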