Abstract:
Most speech separation systems enhance only the magnitude spectrum of the mixture, leaving the phase spectrum, inherent in the short-time Fourier transform (STFT) coefficients of the input signal, unchanged. However, recent studies have suggested that phase is important for perceptual quality. To make full use of both magnitude and phase simultaneously, this work develops a novel end-to-end method for two-talker speech separation, based on an encoder-decoder fully-convolutional structure. Unlike traditional speech separation systems, in this paper the deep neural network outputs one speaker's signal exclusively. We evaluate the proposed model on the TIMIT dataset. The experimental results show that the proposed method significantly outperforms the permutation invariant training (PIT) baseline, with a relative improvement of 16.06\% in signal-to-distortion ratio (SDR).