A Lightweight Single-Channel Speech Separation Method Based on a Graph Attention Network and a Gated Network
Graphical Abstract
Abstract
Speech separation aims to extract the individual source signals from a mixture containing multiple speakers and is an essential front end for speech processing in multi-speaker scenarios. Deep learning-based separation models, especially Transformer-based ones, have advanced remarkably in recent years; however, as their performance has improved, their parameter counts and inference times have grown to the point where deploying such models on resource-limited mobile devices is extremely challenging. Minimizing model complexity while achieving high performance has therefore become a major challenge in speech separation. To address this problem, considering both model efficiency and separation performance, this study proposes a lightweight speech separation model based on a graph attention network (GAT) and a gated network (GN), referred to as GGN-Papez. The method builds on Papez, a lightweight and computationally efficient model that introduced the auditory working memory (AWM) Transformer architecture: it enhances the intra-block Transformers with a small set of short-term memory tokens, replaces the traditional inter-block Transformers of dual-path architectures, and employs parameter sharing to reduce the parameter count significantly. GGN-Papez extends Papez by introducing a GAT to process the global information stored in the auditory memory tokens. Initially, all memory tokens are assumed to be fully connected, and the GAT computes attention scores between every pair of tokens. A threshold filtering strategy then prunes the edges with low scores, producing a new adjacency matrix that is used to aggregate the global information stored in the memory tokens. This extracts more useful contextual information and strengthens the model's ability to capture global features. In addition, because the mask generation module in Papez is a two-layer fully connected feedforward network with limited expressive power, this study replaces it with a GN that generates masks more closely matched to the characteristics of the source speech. With a similar number of parameters, the GN exhibits stronger feature selection capability, further improving separation performance. The proposed model was evaluated on the WSJ0-2Mix and Libri2Mix benchmark datasets; the results show that it improves system performance with only a slight increase in parameter count, yielding clear gains in metrics such as perceptual evaluation of speech quality (PESQ) and scale-invariant signal-to-noise ratio (SI-SNR). Ablation experiments were also conducted to verify the impact of the GAT and the GN on overall system performance, and the results indicate that both modules play an important role in improving separation quality. Finally, the feasibility of deploying GGN-Papez in practical applications was assessed in terms of inference time and system complexity. The results demonstrate that the model keeps inference time under control while maintaining high-quality separation, making it suitable for resource-constrained embedded devices and real-time applications.
Overall, the proposed GGN-Papez model strikes a good balance between efficiency and performance, providing an effective solution for lightweight speech separation tasks.
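To make the threshold-pruned graph attention step concrete, the sketch below shows one way the fully connected memory-token graph could be scored, pruned, and aggregated. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the class name ThresholdGAT, the single-head additive scoring, the threshold tau, and all shapes (batch, num_tokens, d_model) are illustrative choices.

```python
# Minimal sketch of graph attention with threshold-based edge pruning over
# memory tokens. Names and shapes are assumptions, not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdGAT(nn.Module):
    """Single-head graph attention over memory tokens with edge pruning."""
    def __init__(self, d_model: int, tau: float = 0.01):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # Additive attention parameters in the style of the original GAT.
        self.attn_src = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.attn_dst = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.tau = tau  # score threshold used to prune weak edges (assumed value)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: (batch, num_tokens, d_model) -- tokens start fully connected
        h = self.proj(mem)
        # Pairwise additive scores e_ij = LeakyReLU(a_src . h_i + a_dst . h_j)
        src = (h * self.attn_src).sum(-1, keepdim=True)   # (B, N, 1)
        dst = (h * self.attn_dst).sum(-1, keepdim=True)   # (B, N, 1)
        e = F.leaky_relu(src + dst.transpose(1, 2))       # (B, N, N)
        alpha = torch.softmax(e, dim=-1)                  # dense attention weights
        # Prune edges whose weight falls below the threshold, then renormalize
        # over the surviving neighbours -- this defines the new adjacency matrix.
        keep = alpha >= self.tau
        alpha = alpha * keep
        alpha = alpha / alpha.sum(-1, keepdim=True).clamp_min(1e-8)
        # Aggregate global information from the retained neighbours only.
        return torch.bmm(alpha, h)                        # (B, N, d_model)
```

A module like this would slot in wherever the memory tokens are updated; pruning before aggregation means each token attends only to its most relevant peers rather than to every token equally.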
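Similarly, the gated mask head described in the abstract could take a GLU-style form: one branch proposes features, a second branch gates them elementwise, and a sigmoid bounds the per-source masks. The sketch below is a plausible reading under that assumption; the class name GatedMaskNet, the tanh/sigmoid gate pairing, and the output layout are hypothetical and may differ from the paper's exact GN design.

```python
# Hypothetical GLU-style gated mask head replacing a two-layer feedforward
# mask generator. Structure is an assumption, not the authors' exact module.
import torch
import torch.nn as nn

class GatedMaskNet(nn.Module):
    """Gated mask head: value branch proposes features, gate branch
    selects them elementwise, sigmoid bounds the masks in [0, 1]."""
    def __init__(self, d_model: int, n_src: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, n_src * d_model)
        self.n_src = n_src

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) separator features
        h = torch.tanh(self.value(x)) * torch.sigmoid(self.gate(x))  # gating
        masks = torch.sigmoid(self.out(h))                           # bounded masks
        B, T, _ = x.shape
        return masks.view(B, T, self.n_src, -1)  # one mask per source speaker
```

The appeal of such a gate over a plain two-layer feedforward network is that the multiplicative interaction lets the module suppress or pass individual feature dimensions per frame, which is the stronger feature selection behaviour the abstract attributes to the GN at a comparable parameter count.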