NN-Supported a Priori Speech Presence Probability Estimation for Multichannel Noise Reduction
Abstract: Estimating the noise power spectral density (PSD) matrix is crucial in beamforming-based multichannel noise reduction. Methods based on the multichannel speech presence probability (MCSPP) use the speech presence probability to update the noise PSD matrix frame by frame, so the accuracy of the speech presence probability directly determines the accuracy of the noise PSD matrix estimate. Traditional speech presence probability estimators rely on the assumption that the noise is stationary. With rapidly varying non-stationary noise, the estimated speech presence probability exhibits a trailing effect, which degrades noise reduction. In this study, we first explain theoretically why trailing occurs in traditional methods. In these methods, the speech presence probability is obtained by a linear mapping of the long-term signal-to-noise ratio (SNR), and we show that, when speech is present, the long-term SNR of the current frame is only a slightly attenuated version of the long-term SNR of the previous frame. Consequently, when the noise changes rapidly, the long-term SNR changes slowly, and the estimated speech presence probability trails.
To address this problem, we propose a multichannel noise reduction method in which a neural network supports the estimation of the a priori speech presence probability. A temporal convolutional network (TCN) estimates the a priori speech presence probability from the single-channel observed signal, and the spatial information in the multichannel observations is then used to refine this estimate. Integrating the estimated a priori speech presence probability into the MCSPP framework yields a more accurate posterior speech presence probability. Because the TCN estimates the speech presence probability directly, without relying on the noise stationarity assumption, the trailing problem is effectively avoided and the noise PSD matrix is estimated more accurately under non-stationary noise. The methods were evaluated on the CHiME-3 dataset. Experimental results show that the proposed method outperforms the other methods in noise reduction and speech quality in non-stationary noise environments. Specifically, on the test set at an SNR of 5 dB, the proposed method improves PESQ by 0.09, fwSegSNR by 0.78, and COVL by 0.08 over the traditional method.
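The frame-by-frame update of the noise PSD matrix under control of the speech presence probability can be illustrated with a minimal NumPy sketch. It uses a commonly used recursive-averaging form in which the smoothing factor moves toward 1 as speech becomes more likely; this is an assumption for illustration, not necessarily the exact update rule of this paper. The function name `update_noise_psd`, the forgetting factor `alpha`, and its default value are hypothetical.

```python
import numpy as np

def update_noise_psd(phi_v, y, spp, alpha=0.92):
    """Update the noise PSD matrix for one frame and one frequency bin.

    phi_v : (M, M) complex array, previous noise PSD matrix estimate
    y     : (M,) complex array, multichannel STFT observation
    spp   : float in [0, 1], posterior speech presence probability
    alpha : float in [0, 1), forgetting factor (assumed value)
    """
    # Effective smoothing factor: equals alpha when speech is absent
    # and 1 (estimate frozen) when speech is certainly present.
    alpha_eff = alpha + (1.0 - alpha) * spp
    # Rank-one outer product y y^H of the current observation.
    yy_h = np.outer(y, y.conj())
    return alpha_eff * phi_v + (1.0 - alpha_eff) * yy_h

# Example with M = 2 microphones.
phi_prev = np.eye(2, dtype=complex)
y_frame = np.array([1 + 1j, 2 - 1j])
phi_speech = update_noise_psd(phi_prev, y_frame, spp=1.0)  # unchanged
phi_noise = update_noise_psd(phi_prev, y_frame, spp=0.0)   # plain recursive average
```

In this form, an overestimated speech presence probability freezes the noise estimate, so any trailing in the probability delays the tracking of a changing noise field, which is why its accuracy matters for the PSD matrix estimate.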