Abstract:
Conventional deep neural networks do not make full use of the time-frequency correlation of speech. To address this, a single-channel speech enhancement method based on a deep encoder-decoder neural network is proposed. At the encoder, convolution and pooling operations progressively extract features from the time-frequency representation of the noisy speech, yielding a high-level feature representation of the target speech while suppressing the background noise. The decoder is structurally symmetric to the encoder and reconstructs the target speech features from this high-level representation through deconvolution and up-sampling operations. Skip connections are employed to alleviate the vanishing-gradient problem in very deep neural networks. They also pass low-level feature maps, which carry the detailed information of the speech, from the encoder to the corresponding feature maps in the decoder, which helps the decoder recover the detailed features of the target speech. The network is trained with two loss functions, the L1 loss and the L2 loss, and two forms of skip connection, feature fusion and feature concatenation, are evaluated in the experiments. The results demonstrate the effectiveness of the proposed method.
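As a rough illustration of the two skip-connection forms and the two training losses named above, the following pure-Python sketch may help. It is not the authors' implementation. It assumes that "feature fusion" means element-wise addition of same-shaped feature maps, that concatenation stacks maps along the channel axis, and that both losses are averaged over all elements; the abstract specifies none of these details. All function names are hypothetical.

```python
from typing import List

# A feature map is modelled as channels x flattened time-frequency bins.
FeatureMap = List[List[float]]


def fuse_add(encoder_map: FeatureMap, decoder_map: FeatureMap) -> FeatureMap:
    """Skip connection by feature fusion (ASSUMED: element-wise addition)."""
    if len(encoder_map) != len(decoder_map):
        raise ValueError("addition needs the same number of channels")
    fused = []
    for enc_ch, dec_ch in zip(encoder_map, decoder_map):
        if len(enc_ch) != len(dec_ch):
            raise ValueError("addition needs the same number of bins")
        fused.append([e + d for e, d in zip(enc_ch, dec_ch)])
    return fused


def fuse_concat(encoder_map: FeatureMap, decoder_map: FeatureMap) -> FeatureMap:
    """Skip connection by concatenation along the channel axis."""
    # Decoder channels first, encoder channels appended (order is an assumption).
    return [list(ch) for ch in decoder_map] + [list(ch) for ch in encoder_map]


def l1_loss(pred: List[float], target: List[float]) -> float:
    """Mean absolute error between predicted and clean features."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and clean features."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    enc = [[1.0, 2.0]]   # one low-level encoder channel with two bins
    dec = [[0.5, 0.5]]   # matching decoder channel
    print(fuse_add(enc, dec))
    print(len(fuse_concat(enc, dec)), "channels after concatenation")
    print(l1_loss([1.0, 3.0], [0.0, 1.0]), l2_loss([1.0, 3.0], [0.0, 1.0]))
```

Under these assumptions, addition keeps the channel count unchanged, whereas concatenation doubles it, so the next decoder layer must accept the wider input. The L2 loss penalizes large errors more heavily than the L1 loss, which is one reason the two can lead to different enhancement results.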