Acoustic Echo Cancellation Algorithm Incorporating Dynamic Scene Perception and Attention Mechanisms

Abstract: Removing acoustic echo to obtain clear speech is one of the most closely watched problems in real-time audio and video calling systems. Acoustic echo cancellation (AEC) aims to eliminate acoustic echo in such systems, improving speech quality during calls and giving users a good calling experience. Conventional echo cancellation systems, however, suffer from weak echo suppression, residual nonlinear echo, and an inability to process echo in real time. To address these problems, an AEC algorithm is proposed that combines a dynamic scene perception module (DSPM) with a global attention mechanism (GAM). The algorithm uses a convolutional recurrent network (CRN) as the baseline model to extract sequential features from the speech signal. First, the DSPM replaces the original causal convolutions in the encoder and dynamically allocates the number of convolution kernels according to the scene, which strengthens the model's adaptability. Second, a GAM module is introduced into each of the last two encoder layers to amplify spatial-channel relationships and coordinate global interactions, improving both speech feature extraction and echo cancellation performance. Finally, the MSE loss and the Huber loss (HuberLoss) are added linearly to form a new loss function, MSE-HuberLoss, which further improves the robustness of the model. Experimental results show that the proposed GAM-DSPM-CRN model cancels echo effectively and reconstructs a clearer speech signal than the baseline model. In double-talk conditions, it improves considerably on the other algorithms compared. On the Microsoft AEC Challenges dataset, its MOS, ERLE, and STOI scores reach 4.09, 57.43, and 0.78, respectively.
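The abstract describes MSE-HuberLoss only as a linear sum of the MSE and Huber losses. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The weights `alpha` and `beta`, the Huber threshold `delta`, and all function names are assumptions introduced for illustration; the abstract does not state them.

```python
def mse_loss(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


def mse_huber_loss(pred, target, alpha=1.0, beta=1.0, delta=1.0):
    """Linear combination of MSE and Huber loss.

    alpha, beta and delta are assumed values (hypothetical defaults);
    the abstract only says the two losses are added linearly.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return alpha * mse_loss(pred, target) + beta * huber_loss(pred, target, delta)
```

With these assumed defaults, small errors are penalized quadratically by both terms, while for large errors the Huber term grows only linearly. That is one plausible reading of how the combined loss could temper the MSE term's sensitivity to outliers.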

     
