A Multichannel Multi-Talker Speech Separation Method for Ad-hoc Microphone Networks
Abstract: For ad-hoc microphone networks, how to fully exploit multichannel audio data to achieve better multi-talker speech separation remains a challenge. This paper introduces a new multichannel speech separation method built on a Squeeze-Excitation-Spinal (SES) module, which explicitly learns latent channel-wise relationships and adaptively updates the weight of each channel's features even when microphone positions are unknown, improving separation at a small additional computational cost. The SES module squeezes multichannel feature information into the channel dimension to obtain a representation of global inter-channel dependencies, then applies activation functions to this representation in a bottleneck unit to select valuable feature information. The bottleneck unit consists of spinal (SpinalNet-style) sub-networks that receive their input step by step to generate global information and redistribute channel weights, processing the data more effectively. Experiments on a simulated multichannel version of the LibriSpeech corpus show clear improvements in the evaluation metrics SDR and SI-SDR over a single-channel baseline, with results comparable to the state-of-the-art (SOTA) multichannel approach for ad-hoc microphones.
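To make the squeeze-excitation-spinal idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: the channel descriptor is obtained by average-pooling each channel's time-frequency features (squeeze), fed in successive segments to small spinal layers that each also see the previous layer's output (the step-by-step input), and projected to sigmoid gates that reweight the input channels (excitation). All layer sizes, the two-segment split, and the random initialization are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(0.0, x)


class SESBlock:
    """Illustrative Squeeze-Excitation-Spinal (SES) block.

    Squeeze: average-pool each channel's time-frequency map to one scalar.
    Excitation: a SpinalNet-style bottleneck consumes the squeezed
    descriptor in two segments, each layer also receiving the previous
    layer's output; a final projection yields one sigmoid gate per
    channel, which rescales the input channels.
    """

    def __init__(self, channels, hidden=8, seed=0):
        assert channels % 2 == 0, "sketch splits the descriptor in two"
        rng = np.random.default_rng(seed)
        half = channels // 2
        # Spinal layer 1 sees the first half of the descriptor plus a
        # zero "previous output"; layer 2 sees the second half plus
        # layer 1's output (the step-by-step input scheme).
        self.w1 = rng.standard_normal((half + hidden, hidden)) * 0.1
        self.w2 = rng.standard_normal((half + hidden, hidden)) * 0.1
        # Project the concatenated spinal outputs to one gate per channel.
        self.w_out = rng.standard_normal((2 * hidden, channels)) * 0.1
        self.hidden = hidden

    def __call__(self, x):
        # x: (channels, time, freq) multichannel feature map.
        c = x.shape[0]
        z = x.mean(axis=(1, 2))                    # squeeze -> (channels,)
        half = c // 2
        prev = np.zeros(self.hidden)
        h1 = relu(np.concatenate([z[:half], prev]) @ self.w1)
        h2 = relu(np.concatenate([z[half:], h1]) @ self.w2)
        gates = sigmoid(np.concatenate([h1, h2]) @ self.w_out)
        return x * gates[:, None, None]            # channel-wise reweighting
```

Because the gates lie in (0, 1) and multiply the input element-wise, the block can only attenuate uninformative channels relative to informative ones; the extra cost is a handful of small matrix products per frame, consistent with the "small additional computational cost" claim above.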