Robust Bone-Conducted Speech Enhancement Based on a Time-Frequency Attention Mechanism and U-Net
Abstract: In recent years, neural-network-based methods have been widely applied to bone-conducted (BC) speech enhancement. However, BC datasets contain few samples, the high-frequency components of BC speech are missing, and the degree of high-frequency distortion varies across speakers, so neural networks struggle to learn the spectral characteristics of BC speech effectively. As a result, existing BC speech enhancement models perform poorly and lack robustness on BC speech from unseen speakers. To make full use of the time-frequency information in BC speech and guide the model to attend to its low-frequency features, this paper proposes a BC speech enhancement method based on a time-frequency attention mechanism and U-Net. The method introduces the time-frequency attention mechanism into the U-Net structure. First, weights are assigned automatically to the BC speech features according to their importance along the time and frequency directions. Next, the U-Net is trained with the weighted BC speech spectrum as input and the corresponding air-conducted (AC) speech spectrum as the target. Finally, the trained enhancement model is used to reconstruct full-band speech from BC speech. Simulation experiments and visualization analysis show that, compared with the baseline U-Net and with other attention mechanisms, the proposed method achieves higher PESQ and STOI objective scores on unseen-speaker BC speech datasets, and the enhanced speech is clearer.
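The abstract does not specify the exact form of the attention module, so the following is only a minimal NumPy sketch of the weighting step under stated assumptions. Each frequency bin and each time frame is summarized by its mean magnitude and passed through a sigmoid to obtain a weight in (0, 1), and the two weight vectors are then applied to the BC spectrum. The function name `time_frequency_attention` and the scalar parameters `freq_scale`, `freq_bias`, `time_scale`, and `time_bias` are hypothetical stand-ins for weights that would be learned jointly with the U-Net in an actual model.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def time_frequency_attention(spec, freq_scale=1.0, freq_bias=0.0,
                             time_scale=1.0, time_bias=0.0):
    """Weight a magnitude spectrogram along frequency and time.

    spec: array of shape (n_freq, n_frames), non-negative magnitudes.
    The scale/bias arguments are hypothetical placeholders for
    learned parameters; the paper's actual module may differ.
    """
    spec = np.asarray(spec, dtype=float)

    # Summarize each frequency bin over time, and each frame over frequency.
    freq_summary = spec.mean(axis=1)   # shape (n_freq,)
    time_summary = spec.mean(axis=0)   # shape (n_frames,)

    # Turn the summaries into attention weights in (0, 1).
    freq_weights = sigmoid(freq_scale * freq_summary + freq_bias)
    time_weights = sigmoid(time_scale * time_summary + time_bias)

    # Broadcast both weight vectors over the spectrogram.
    weighted = spec * freq_weights[:, None] * time_weights[None, :]
    return weighted, freq_weights, time_weights


# Toy BC-like spectrogram: energy only in the three lowest bins,
# mimicking the missing high-frequency content of BC speech.
rng = np.random.default_rng(0)
bc_spec = np.zeros((8, 5))
bc_spec[:3, :] = rng.uniform(1.0, 2.0, size=(3, 5))

weighted, freq_weights, time_weights = time_frequency_attention(bc_spec)
```

In this toy example the low-frequency bins receive larger weights than the empty high-frequency bins, which illustrates how such a module could steer the model toward the low-frequency features that BC speech actually carries. The weighted spectrum would then serve as the U-Net input, with the AC spectrum as the training target.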