An End-to-End Speech Separation Method Based on Convolutional Neural Network

  • Abstract: Most speech separation systems enhance only the magnitude spectrum of the mixture (the short-time Fourier transform coefficients) and leave the phase spectrum unprocessed. However, recent studies have shown that phase information plays an important role in the quality of separated speech. To exploit both magnitude and phase, this paper proposes an effective end-to-end separation method that takes raw waveform samples directly as input features and uses an encoder-decoder convolutional neural network structure. Unlike other speaker-independent speech separation systems, the proposed network outputs only one speaker's signal; the other speaker's signal is obtained as the difference between the mixture and the network output. We evaluate the proposed method on the TIMIT dataset. Experimental results show that it significantly outperforms the utterance-level permutation invariant training (uPIT) baseline, with a relative improvement of 16.06% in signal-to-distortion ratio (SDR).
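The single-output scheme described above can be illustrated with a minimal sketch: for an additive two-talker mixture, once the network has estimated one speaker's waveform, the other follows by subtraction. The signals and the perfect-estimate assumption below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical two-talker mixture; in the paper the estimate comes from an
# encoder-decoder convolutional network, here we just assume it is exact.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)   # speaker 1 waveform (1 s at 16 kHz)
s2 = rng.standard_normal(16000)   # speaker 2 waveform
mixture = s1 + s2                 # additive mixture

est_s1 = s1                       # stand-in for the network output
est_s2 = mixture - est_s1         # the other speaker, by subtraction

print(np.allclose(est_s2, s2))    # True when est_s1 is exact
```

This also shows why only one output head is needed: any estimation error in the first speaker's signal appears, with opposite sign, in the second speaker's estimate.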

     

    Abstract: Most speech separation systems enhance only the magnitude spectrum of the mixture, leaving the phase spectrum, which is inherent in the short-time Fourier transform (STFT) coefficients of the input signal, unchanged. However, recent studies have suggested that phase is important for perceptual quality. In order to make full use of both magnitude and phase, this work develops a novel end-to-end method for two-talker speech separation, based on an encoder-decoder fully-convolutional structure operating directly on the raw waveform. Unlike traditional speech separation systems, the deep neural network outputs only one speaker's signal; the other speaker's signal is obtained by subtracting the network output from the mixture. We evaluate the proposed model on the TIMIT dataset. The experimental results show that the proposed method significantly outperforms the utterance-level permutation invariant training (uPIT) baseline, with a relative improvement of 16.06% in signal-to-distortion ratio (SDR).
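The SDR metric used in the evaluation can be sketched as follows. This is a simplified energy-ratio form, assuming the estimate is already time-aligned with the reference; the full BSS Eval definition additionally projects the estimate onto the reference subspace.

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB:
    10 * log10(||reference||^2 / ||estimate - reference||^2)."""
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Illustrative check with a hypothetical signal: additive noise at 10% of
# the signal amplitude gives roughly 20 dB SDR.
rng = np.random.default_rng(1)
s = rng.standard_normal(16000)
noisy = s + 0.1 * rng.standard_normal(16000)
print(sdr(s, noisy))   # approximately 20 dB
```

Higher SDR means less distortion in the separated signal; the paper's 16.06% figure is a relative improvement of this quantity over the uPIT baseline.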

     
