Unknown Speaker Speech Separation Based on Multiscale Deformable Attention Encoding and Multi-Path Fusion
Abstract
This study proposes a novel model for unknown-speaker speech separation, designed to operate effectively in complex environments characterized by noise and reverberation. Existing models for unknown-speaker separation are typically evaluated under clean experimental conditions that do not reflect the demands of real-world settings. To improve the model's adaptability to the variability and non-stationarity of mixed speech signals encountered in practice, we integrate a multi-scale deformable attention mechanism with a Transformer encoder to form the Transformer encoder multi-scale deformable attention module. The offset layers of the multi-scale deformable attention mechanism enable dynamic computation at different temporal positions, which expands the model's receptive field and allows it to focus on crucial time points while mitigating the adverse effects of noise and reverberation. In addition, to improve the capture of contextual information, we adopt a multi-path fusion strategy that augments the dual-path module with a Conformer layer, yielding a three-path module. This design facilitates the extraction of feature information across multiple speakers and strengthens the model's ability to fuse single-speaker and multi-speaker information, which is critical for separation performance. Experimental results demonstrate that the proposed model achieves strong separation performance on both the clean and noisy Libri2Mix and Libri3Mix datasets. Notably, on the LRS2-2Mix dataset, the model exhibits improved resilience to noise and reverberation, achieving Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) and Signal-to-Distortion Ratio improvement (SDRi) scores of 14.7 dB and 15.1 dB, respectively. Furthermore, the model attains an estimation accuracy of 98.89% across varying speaker counts, an improvement of 0.12%. These findings indicate that the proposed model is well suited to real-world applications, as it effectively addresses the challenges posed by complex auditory environments.
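To make the deformable-attention mechanism concrete, the following is a minimal sketch of 1-D multi-scale deformable attention for speech features, assuming PyTorch. The class name MSDeformAttn1D, the scale and point counts, and the grid_sample-based linear interpolation are illustrative assumptions for exposition, not the authors' implementation; the sketch only shows the idea named in the abstract: each query predicts sampling offsets at several temporal scales (the offset layers), gathers values at those dynamic positions, and mixes them with learned attention weights, which is what widens the receptive field.

# Minimal sketch of 1-D multi-scale deformable attention for speech
# features, assuming PyTorch. MSDeformAttn1D, n_scales, n_points, and the
# interpolation scheme are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSDeformAttn1D(nn.Module):
    def __init__(self, d_model=64, n_scales=3, n_points=4):
        super().__init__()
        self.n_scales, self.n_points = n_scales, n_points
        # Offset layer: each query predicts where to sample at every scale.
        self.offsets = nn.Linear(d_model, n_scales * n_points)
        # Per-sample mixing weights, normalized with a softmax.
        self.weights = nn.Linear(d_model, n_scales * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, feats):
        # query: (B, T, C); feats: list of n_scales tensors, each (B, T_s, C).
        B, T, _ = query.shape
        off = self.offsets(query).view(B, T, self.n_scales, self.n_points)
        w = self.weights(query).view(B, T, self.n_scales * self.n_points)
        w = w.softmax(-1).view(B, T, self.n_scales, self.n_points)
        out = 0.0
        for s, f in enumerate(feats):
            v = self.value_proj(f).transpose(1, 2).unsqueeze(2)  # (B, C, 1, T_s)
            T_s = v.shape[-1]
            # Evenly spaced reference points for the T queries on this scale,
            # shifted by the predicted offsets (dynamic sampling positions).
            ref = torch.linspace(0, T_s - 1, T, device=v.device).view(1, T, 1)
            pos = ref + off[:, :, s]                              # (B, T, n_points)
            x = 2 * pos / max(T_s - 1, 1) - 1                     # normalize to [-1, 1]
            grid = torch.stack([x, torch.zeros_like(x)], dim=-1)  # (B, T, n_points, 2)
            # 1-D linear interpolation via grid_sample on a height-1 "image".
            sampled = F.grid_sample(v, grid, align_corners=True)  # (B, C, T, n_points)
            out = out + (sampled * w[:, :, s].unsqueeze(1)).sum(-1)
        return self.out_proj(out.transpose(1, 2))                 # (B, T, C)

# Example: one query sequence attends over three temporal resolutions.
attn = MSDeformAttn1D()
x = torch.randn(2, 100, 64)
feats = [torch.randn(2, 100, 64), torch.randn(2, 50, 64), torch.randn(2, 25, 64)]
print(attn(x, feats).shape)  # torch.Size([2, 100, 64])

Because the sampling positions are learned per query rather than fixed, the module can skip over noisy or reverberant frames and concentrate on informative time points across resolutions, which is the behavior the abstract attributes to the offset layers.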