Abstract:
Speech enhancement aims to extract target speech from noisy speech. Recently, neural networks (NNs) have been applied effectively to speech enhancement. In particular, network training with multi-objective joint optimization, which exploits the complementarity between different features, can significantly improve the quality and intelligibility of the target speech. However, in existing multi-objective learning methods for speech enhancement, the loss function is usually computed separately for each output target: the targets are treated in parallel, and possible associations among them are not fully exploited. This paper presents a speech enhancement framework using long short-term memory (LSTM) networks with a dual-target output architecture. A multi-objective loss function is proposed for network training so that a balance between the global and local optima can be achieved. The framework estimates the target speech and the noise, obtains an estimate of the noisy speech from these two estimates, and then optimizes the three parts jointly. Experimental results demonstrate that the proposed method effectively improves the noise suppression ability of the NNs, yielding enhanced speech with higher quality and less residual noise.
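
To illustrate the joint optimization of the three parts described above, one possible form of such a loss is sketched below. This is an assumption for illustration only, not the formulation used in the paper: the abstract does not specify the error measure, the weights, or how the noisy-speech estimate is formed. The sketch assumes squared-error terms, an additive mixture in the chosen feature domain, and hypothetical weights \alpha, \beta, \gamma.

\[
\mathcal{L} = \alpha \,\lVert \hat{S} - S \rVert_2^2 + \beta \,\lVert \hat{N} - N \rVert_2^2 + \gamma \,\lVert (\hat{S} + \hat{N}) - Y \rVert_2^2
\]

Here S, N, and Y denote the clean speech, noise, and noisy speech features, and \hat{S} and \hat{N} are the two network outputs. Under these assumptions, the first two terms act on each output target individually, while the third term couples the two outputs through their reconstruction of the noisy input.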