Unknown Speaker Speech Separation Based on Multiscale Deformable Attention Encoding and Multi-Path Fusion

• Abstract: To address the separation of speech from an unknown number of speakers in complex environments containing noise and reverberation, this paper proposes an unknown speaker speech separation model based on multiscale deformable attention encoding and multi-path fusion. Existing models for unknown speaker separation are typically evaluated under clean laboratory conditions, which does not reflect the complex acoustic backgrounds of real applications. To let the model cope flexibly with the variability and non-stationarity of mixed speech signals under realistic conditions, a multiscale deformable attention mechanism is combined with a Transformer encoder to form the Transformer Encoder Multi-Scale Deformable Attention (TEMDA) module. The offset layers of the deformable attention mechanism perform dynamic computation at different positions, expanding the receptive field of the model, allowing it to focus more effectively on important time points, and reducing the influence of noise and reverberation. To capture contextual information more effectively, the multi-path fusion strategy extends the dual-path module with an inter-channel Conformer to form a three-path module that extracts feature information among multiple speakers; this design fuses single-speaker and multi-speaker information more effectively and improves separation performance. Experiments show that the proposed model achieves significant separation results on both the clean and noisy Libri2Mix and Libri3Mix datasets, and on the LRS2-2Mix dataset it better suppresses the effects of noise and reverberation, reaching a Scale-Invariant Signal-to-Noise Ratio Improvement (SI-SNRi) of 14.7 dB and a Signal-to-Distortion Ratio Improvement (SDRi) of 15.1 dB; the speaker-count estimation accuracy across the tested speaker counts reaches 98.89%, an improvement of 0.12%. These results indicate that the model is well suited to real-world applications with complex auditory environments.
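The core idea behind the TEMDA module described above is deformable attention over the encoded feature sequence: each query position predicts a small set of sampling offsets and attention weights, gathers features at the offset positions by interpolation, and combines them. The sketch below illustrates this single-scale core for a 1-D feature sequence in PyTorch; a multi-scale variant would repeat the sampling at several temporal resolutions. Class and parameter names (`DeformableAttention1D`, `n_points`, etc.) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention1D(nn.Module):
    """Single-head deformable attention over a 1-D feature sequence (sketch).

    Each query position predicts `n_points` sampling offsets and attention
    weights; features are gathered at the offset positions by linear
    interpolation and combined with the softmax-normalized weights.
    Names and defaults are illustrative, not the paper's implementation.
    """

    def __init__(self, d_model=64, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(d_model, n_points)   # sampling offsets
        self.weight_proj = nn.Linear(d_model, n_points)   # attention weights
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, time, d_model)
        b, t, d = x.shape
        value = self.value_proj(x)                         # (b, t, d)
        offsets = self.offset_proj(x)                      # (b, t, p)
        weights = F.softmax(self.weight_proj(x), dim=-1)   # (b, t, p)

        # Reference positions plus learned offsets, clamped to the sequence.
        ref = torch.arange(t, device=x.device, dtype=x.dtype).view(1, t, 1)
        loc = (ref + offsets).clamp(0, t - 1)              # (b, t, p)

        # Linear interpolation between the two neighbouring frames.
        lo = loc.floor().long()
        hi = (lo + 1).clamp(max=t - 1)
        frac = (loc - lo.to(loc.dtype)).unsqueeze(-1)      # (b, t, p, 1)

        idx_lo = lo.reshape(b, -1, 1).expand(-1, -1, d)    # (b, t*p, d)
        idx_hi = hi.reshape(b, -1, 1).expand(-1, -1, d)
        v_lo = torch.gather(value, 1, idx_lo).view(b, t, self.n_points, d)
        v_hi = torch.gather(value, 1, idx_hi).view(b, t, self.n_points, d)
        sampled = (1 - frac) * v_lo + frac * v_hi          # (b, t, p, d)

        out = (weights.unsqueeze(-1) * sampled).sum(dim=2) # (b, t, d)
        return self.out_proj(out)

# Example: batch of 2 sequences, 100 frames, 64-dim features.
y = DeformableAttention1D()(torch.randn(2, 100, 64))       # (2, 100, 64)
```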

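The multi-path fusion described in the abstract builds on the familiar dual-path pattern: the chunked feature tensor is processed along the intra-chunk axis, then the inter-chunk axis, and additionally along the speaker/channel axis. The sketch below is a minimal illustration of such a three-path pass; plain `nn.TransformerEncoderLayer` blocks stand in for the Conformer layers named in the paper, and the tensor layout and dimension sizes are assumptions made only for this example.

```python
import torch
import torch.nn as nn

def _layer(d_model, n_heads):
    # A plain Transformer layer stands in here for the paper's Conformer.
    return nn.TransformerEncoderLayer(d_model, n_heads,
                                      dim_feedforward=4 * d_model,
                                      batch_first=True)

class TriPathBlock(nn.Module):
    """Three-path processing sketch: intra-chunk, inter-chunk, inter-speaker.

    The input layout (batch, speakers, n_chunks, chunk_len, d_model) is an
    assumption for this sketch, not the paper's exact tensor layout.
    """

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.intra = _layer(d_model, n_heads)   # within each chunk (local time)
        self.inter = _layer(d_model, n_heads)   # across chunks (global time)
        self.cross = _layer(d_model, n_heads)   # across speaker channels

    def forward(self, x):
        b, s, k, l, d = x.shape
        # 1) intra-chunk path: sequences of length chunk_len.
        y = self.intra(x.reshape(b * s * k, l, d)).reshape(b, s, k, l, d)
        # 2) inter-chunk path: sequences of length n_chunks.
        y = y.permute(0, 1, 3, 2, 4).reshape(b * s * l, k, d)
        y = self.inter(y).reshape(b, s, l, k, d).permute(0, 1, 3, 2, 4)
        # 3) inter-speaker path: sequences of length `speakers`.
        y = y.permute(0, 2, 3, 1, 4).reshape(b * k * l, s, d)
        y = self.cross(y).reshape(b, k, l, s, d).permute(0, 3, 1, 2, 4)
        return y

# Example: 2 mixtures, 3 speaker channels, 10 chunks of 50 frames, 64-dim features.
out = TriPathBlock()(torch.randn(2, 3, 10, 50, 64))   # same shape as input
```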
     

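The reported numbers use SI-SNRi and SDRi. SI-SNR is the standard scale-invariant signal-to-noise ratio, and the improvement is measured relative to the unprocessed mixture; a minimal reference computation is sketched below (PyTorch, time along the last dimension assumed).

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal.

    Both signals are zero-meaned, the reference is scaled to the optimal
    projection of the estimate, and the residual is treated as noise.
    """
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(dim=-1, keepdim=True) * ref \
           / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(dim=-1)
                            / (noise.pow(2).sum(dim=-1) + eps))

def si_snr_improvement(est, ref, mix):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)

# Example on random tensors (batch of 2 one-second clips at 16 kHz).
est, ref, mix = (torch.randn(2, 16000) for _ in range(3))
print(si_snr_improvement(est, ref, mix))
```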