A Lightweight Single-Channel Speech Separation Method Based on Graph Attention Networks and a Gated Network

  • Abstract: Speech separation aims to recover the individual source signals from a mixture containing multiple speakers and is an essential front end for speech processing tasks in multi-speaker scenarios. Deep-learning-based speech separation has made remarkable progress, but as model performance has improved, parameter counts and inference times have also grown substantially. To address this problem, this paper balances model efficiency against separation performance and proposes a lightweight speech separation model based on a graph attention network (GAT) and a gated network (GN), termed GGN-Papez. The method builds on the lightweight, efficient baseline model Papez, introducing a GAT to process the global information stored in the auditory memory blocks and a GN to generate separation masks, thereby improving the baseline's performance. Specifically, all memory tokens are initially assumed to be interconnected; the GAT computes attention scores between tokens, and a threshold-filtering strategy prunes low-scoring edges to produce a new adjacency matrix. This adjacency matrix is then used to aggregate the global information stored in the memory tokens, extracting more effective contextual information and strengthening the model's understanding of global features. Furthermore, because the mask-generation module in Papez is a two-layer fully connected feed-forward network with limited expressive power, this paper replaces it with a GN that has stronger feature-selection capability, yielding masks that better match the characteristics of the source speech. The proposed GGN-Papez model was evaluated on the benchmark datasets WSJ0-2Mix and Libri2Mix; the results show that the method markedly improves the scale-invariant signal-to-noise ratio (SI-SNR) of the separated speech while adding very few parameters. In addition, ablation experiments were designed to verify the contributions of the GAT and the GN to overall model performance, and the proposed model was further analyzed in terms of inference time and perceptual evaluation of speech quality (PESQ) scores.
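The GAT step described above (fully connected memory tokens, attention scoring, threshold pruning, then aggregation) can be sketched as follows. This is a minimal NumPy illustration of the standard GAT attention mechanism with an added threshold filter, not the paper's actual implementation; the function name, shapes, and the threshold value are assumptions.

```python
import numpy as np

def gat_prune_aggregate(tokens, W, a, threshold=0.05, alpha=0.2):
    """Illustrative GAT step over memory tokens (shapes are assumptions).

    tokens: (N, d) memory tokens, assumed fully connected.
    W: (d, d') linear projection; a: (2*d',) attention vector.
    Edges whose normalized attention weight falls below `threshold` are
    pruned, yielding a new adjacency matrix used for aggregation.
    """
    h = tokens @ W                                   # (N, d') projected tokens
    dp = h.shape[1]
    # raw attention logits e_ij = LeakyReLU(a^T [h_i || h_j])
    src = h @ a[:dp]                                 # (N,) contribution of token i
    dst = h @ a[dp:]                                 # (N,) contribution of token j
    e = src[:, None] + dst[None, :]                  # (N, N) pairwise logits
    e = np.where(e > 0, e, alpha * e)                # LeakyReLU
    # softmax over the fully connected neighborhood of each token
    e = e - e.max(axis=1, keepdims=True)
    att = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    # threshold filtering: prune low-scoring edges -> new adjacency matrix
    adj = (att >= threshold).astype(att.dtype)
    att = att * adj
    att = att / np.clip(att.sum(axis=1, keepdims=True), 1e-9, None)
    # aggregate global information over the surviving edges
    return att @ h, adj
```

Because each softmax row sums to one, at least one edge per token always survives a small threshold, so every token still aggregates some global context after pruning.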

     

    Abstract: Speech separation aims to extract the individual source signals from a mixture containing multiple speakers and is an essential front end for speech processing tasks in multi-speaker scenarios. Recent years have seen remarkable advances in deep-learning-based speech separation, especially in Transformer-based models; however, as the performance of these models has improved, their parameter counts and inference times have grown to the point that deploying them on resource-limited mobile devices is extremely challenging. Minimizing model complexity while maintaining high performance has therefore become a major challenge in speech separation. To address this problem, considering both model efficiency and separation performance, this study proposes a lightweight speech separation model based on a graph attention network (GAT) and a gated network (GN), referred to as GGN-Papez. The method builds on Papez, a lightweight and computationally efficient baseline. Papez introduced the auditory working memory (AWM) Transformer architecture, which enhances the intra-block Transformers with a small set of short-term memory tokens and replaces the inter-block Transformers of traditional dual-path architectures; Papez also employs parameter sharing to reduce the parameter count significantly. GGN-Papez extends Papez by introducing a GAT to process the global information stored in the auditory memory blocks. Initially, all memory tokens are assumed to be interconnected; the GAT computes attention scores between every pair of tokens, and a threshold-filtering strategy then prunes the low-scoring edges, generating a new adjacency matrix. This adjacency matrix is used to aggregate the global information stored in the memory tokens, extracting more useful contextual information and enhancing the model's ability to understand global features. In addition, because the mask-generation module in Papez is a two-layer fully connected feed-forward network with limited expressive power, this study replaces it with a GN that, with a similar number of parameters, exhibits stronger feature-selection capability and generates masks that match the characteristics of the source speech more accurately. The proposed model was evaluated on the benchmark datasets WSJ0-2Mix and Libri2Mix. The experimental results demonstrate that the method significantly improves separation performance with only a slight increase in parameter count, achieving notable gains in metrics such as perceptual evaluation of speech quality (PESQ) and scale-invariant signal-to-noise ratio (SI-SNR). Ablation experiments verify that both the GAT and the GN play an important role in the model's separation performance. Finally, the feasibility of deploying GGN-Papez in practical applications was evaluated in terms of inference time and system complexity; the results show that the proposed model keeps inference time under control while ensuring high-quality separation, making it suitable for resource-constrained embedded devices and real-time applications. Overall, GGN-Papez achieves a good balance between efficiency and performance, providing an effective solution for lightweight speech separation tasks.
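The gated mask head described above can be illustrated with one common gating formulation: a content path modulated elementwise by a sigmoid gate, in place of a plain two-layer feed-forward module. This NumPy sketch is an assumption about the general mechanism, not Papez's or GGN-Papez's actual weights or shapes; all names here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_mask_head(features, W_content, W_gate):
    """Illustrative gated mask head (weight names and shapes are assumptions).

    features: (T, d) separator output per time frame; W_content, W_gate: (d, d).
    The sigmoid gate performs per-element feature selection on the tanh
    content path, which is the stronger selectivity the GN is meant to add
    over a plain feed-forward mask module.
    """
    content = np.tanh(features @ W_content)   # candidate mask values in (-1, 1)
    gate = sigmoid(features @ W_gate)         # per-element selection weights in (0, 1)
    return content * gate                     # gated mask, same shape as features
```

Note that the two weight matrices keep the parameter count comparable to a two-layer feed-forward head of the same width, consistent with the "similar number of parameters" claim above.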

     
