Two-stage Speech Enhancement Algorithm Based on Temporal Convolutional Networks
-
Abstract
During transmission, speech signals are commonly degraded by interference such as noise and echoes, which reduces their quality and intelligibility. Speech enhancement algorithms aim to remove this noise and interference and thereby improve signal quality. Deep learning-based speech enhancement algorithms have demonstrated better enhancement performance than conventional methods. However, existing algorithms have two limitations. First, they focus predominantly on the speech component of noisy speech and neglect the noise component. Second, most rely on a single network to perform the entire enhancement task, which places high demands on that network's performance. To address these problems, this paper proposes a two-stage speech enhancement network, SM-TSTFN, which decomposes the speech enhancement process into an amplitude spectrum prediction stage and a complex spectrum prediction stage. The two-stage design starts with a simpler sub-task and progresses to a harder one, so that solving the sub-tasks in sequence completes the more challenging overall task. In the first stage, the amplitude spectrum of noisy speech is used as input, and both noise and speech serve as learning targets, to produce preliminary estimates of the amplitude spectra of noise and speech. In the second stage, the complex spectrum of noisy speech is used as input and, aided by the first-stage predictions, the complex spectrum of clean speech is estimated. For this second stage, a Time-Frequency Processing Module (TFPM), which integrates Long Short-Term Memory (LSTM) and a Temporal Convolutional Network (TCN), is designed to perform the complex spectrum estimation, enabling feature extraction in both the time and frequency domains. The algorithm also introduces noise into the loss function and jointly trains the first- and second-stage networks.
Experimental results on a public dataset show that the proposed network outperforms the baseline models, demonstrating the effectiveness of the two-stage algorithm. Two-stage speech enhancement based on temporal convolutional networks is therefore a promising approach to improving speech quality in noisy environments. The time-frequency processing module and the joint training of the two stages together provide a robust and effective solution to the speech enhancement problem.
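The data flow of the two stages and the joint loss can be illustrated with a minimal NumPy sketch. It is not the SM-TSTFN implementation: `stage1` and `stage2` are hypothetical stand-ins for the trained networks (a fixed mask and a noisy-phase reconstruction), and the STFT settings, the mean-squared-error terms, and the weight `alpha` are assumptions made only for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Complex spectrogram of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def stage1(noisy_mag):
    """Stand-in for the first-stage network (assumption): maps the noisy
    amplitude spectrum to amplitude estimates of speech and noise."""
    mask = np.full_like(noisy_mag, 0.5)  # hypothetical fixed mask
    return mask * noisy_mag, (1.0 - mask) * noisy_mag

def stage2(noisy_spec, speech_mag_hat, noise_mag_hat):
    """Stand-in for the second-stage network (assumption): receives the noisy
    complex spectrum plus the stage-1 estimates and returns a complex
    spectrum estimate. Here it reuses the noisy phase and ignores
    noise_mag_hat; a trained network would use both auxiliary inputs."""
    return speech_mag_hat * np.exp(1j * np.angle(noisy_spec))

def joint_loss(speech_mag_hat, noise_mag_hat, spec_hat,
               clean_spec, noise_spec, alpha=0.5):
    """Joint objective with a noise term; alpha is a hypothetical weight."""
    stage1_loss = (np.mean((speech_mag_hat - np.abs(clean_spec)) ** 2)
                   + np.mean((noise_mag_hat - np.abs(noise_spec)) ** 2))
    stage2_loss = np.mean(np.abs(spec_hat - clean_spec) ** 2)
    return alpha * stage1_loss + (1.0 - alpha) * stage2_loss

# Usage on synthetic data: a sine tone as "speech" plus white noise.
rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noise = 0.1 * rng.standard_normal(t.shape)
noisy_spec = stft(clean + noise)

speech_mag_hat, noise_mag_hat = stage1(np.abs(noisy_spec))
spec_hat = stage2(noisy_spec, speech_mag_hat, noise_mag_hat)
loss = joint_loss(speech_mag_hat, noise_mag_hat, spec_hat,
                  stft(clean), stft(noise))
```

Because both stages contribute to a single scalar loss, gradients from the complex-spectrum term would also reach the first-stage parameters in a trainable version, which is the point of joint training.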
-
-