
Speech Enhancement Method Based on Complex Spectrum Mapping with Efficient Transformer


     

    Abstract: Convolutional neural networks (CNNs) perform well in speech enhancement but capture global features poorly, while Transformers have recently shown strong modeling of long-range dependencies at the cost of losing local detail features and requiring large numbers of parameters. To exploit the strengths of CNNs and Transformers while compensating for their respective shortcomings, this study proposes a single-channel speech enhancement network that fuses novel convolutional modules with an efficient Transformer under complex spectrum mapping. The network consists of an encoding layer, a transmission layer, and a dual-branch decoding layer. In the codec part, a Collaborative Learning Block (CLB) is designed to supervise interactive information, reducing the parameter count while improving the backbone network's ability to extract complex-spectrum features. In the transmission layer, a time-frequency spatial attention Transformer module is proposed to model speech sub-band and full-band information separately, making full use of acoustic characteristics to model local spectral patterns and capture inter-harmonic dependencies. Combining this module with a channel attention branch, a learnable Dual-branch Attention Fusion (DAF) mechanism is designed to extract contextual features from the spatial-channel perspective and strengthen multi-dimensional information transmission. Finally, a Gaussian-weighted progressive network is built as the intermediate transmission layer: stacked DAF modules are combined by weighted summation so that deep features are fully exploited and the decoding process becomes more robust. Ablation and comprehensive comparison experiments were conducted on the English VoiceBank-DEMAND dataset, the Chinese THCHS30 corpus, and 115 kinds of environmental noise. The results show that, with a minimum parameter count of only 0.68 × 10⁶, the proposed method achieves better subjective and objective metrics than most recent related network models, demonstrating outstanding enhancement performance and generalization ability.
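The Gaussian-weighted progressive transmission layer described above can be illustrated with a minimal numpy sketch. This is a hedged illustration only: the branch forms (`channel_attention`, `spatial_attention`), the scalar blend `alpha`, and the Gaussian centered on the deepest block are all hypothetical stand-ins for the paper's learned modules, chosen to show the fusion-and-weighted-sum structure rather than reproduce the actual architecture.

```python
import numpy as np

def channel_attention(x):
    # x: (C, T, F). Pool over time-frequency, softmax over channels;
    # a simplified stand-in for the channel-attention branch.
    s = x.mean(axis=(1, 2))
    w = np.exp(s - s.max())
    w /= w.sum()
    return x * w[:, None, None]

def spatial_attention(x):
    # Sigmoid map over the time-frequency plane, shared across channels;
    # a simplified stand-in for the spatial-attention branch.
    m = 1.0 / (1.0 + np.exp(-x.mean(axis=0)))
    return x * m[None, :, :]

def daf(x, alpha=0.5):
    # Learnable scalar alpha blends the two branches (hypothetical form
    # of the dual-branch attention fusion).
    return alpha * channel_attention(x) + (1.0 - alpha) * spatial_attention(x)

def gaussian_progressive(x, n_blocks=4, sigma=1.0):
    # Stack DAF blocks and output a Gaussian-weighted sum of all
    # intermediate outputs, so deep features dominate but shallow
    # features still contribute.
    idx = np.arange(n_blocks)
    w = np.exp(-0.5 * ((idx - (n_blocks - 1)) / sigma) ** 2)
    w /= w.sum()
    outs, h = [], x
    for _ in range(n_blocks):
        h = daf(h)
        outs.append(h)
    return sum(wi * o for wi, o in zip(w, outs))
```

The weighted sum keeps the output the same shape as the input feature map, so the decoder can consume it directly regardless of stack depth.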

     

