Speech Anti-Spoofing Method Based on Graph Attention Mechanism and Adversarial Training

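  • Abstract: Speech anti-spoofing aims to distinguish genuine speech from fake speech through the design of network architectures and learning algorithms, thereby improving the security of speech systems. This paper proposes a speech anti-spoofing method that combines a graph attention mechanism with adversarial training to address the challenges of this task. Specifically, building on the speaker attractor multi-center one-class (SAMO) learning algorithm and drawing on graph signal processing (GSP) theory, the paper proposes using a graph attention network (GAT) to extract speaker attractor centers. An attention mechanism aggregates the speaker feature representations to compute more representative speaker attractor centers, which improves the system's ability to distinguish genuine speech from fake speech. In addition, if the network learns only the specific fake artifacts of the fake types known in the training set, the classification network may fail to handle unknown types of spoofing attacks effectively. The paper therefore introduces an adversarial fake type classification network into the anti-spoofing architecture: adversarial training between the feature representation learning module and the auxiliary fake type classification network encourages the network to learn the fake artifact features common to different types of fake speech, which improves detection of unknown types of fake speech encountered in practical testing. Experiments on the ASVspoof 2019 LA, CFAD, and ASVspoof 2021 LA datasets show that the proposed method outperforms the baseline system and the other systems compared. The paper also uses t-distributed stochastic neighbor embedding (t-SNE) and similarity matrix heat maps as visualizations, which illustrate the method's advantage in accurately distinguishing genuine speech from fake speech and verify the effectiveness of adversarial training in learning common fake artifact features.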
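The attention-based aggregation described above can be illustrated with a minimal sketch. The abstract does not specify the graph topology, the number of attention heads, or how node outputs are pooled into a center, so the fully connected graph, the single head, the mean readout, and the L2 normalization below are all assumptions made for illustration, as are the names `gat_layer` and `attractor_center`. This is not the authors' implementation.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_layer(nodes, W, a):
    """Single-head graph attention over a fully connected graph with self-loops.

    nodes: list of embeddings (one per utterance of a speaker; assumed input)
    W:     projection matrix, out_dim x in_dim
    a:     attention vector of length 2 * out_dim
    """
    projected = [matvec(W, h) for h in nodes]
    updated = []
    for hi in projected:
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every neighbor j
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, hi + hj)))
                  for hj in projected]
        alpha = softmax(scores)  # attention weights over neighbors
        updated.append([sum(al * hj[k] for al, hj in zip(alpha, projected))
                        for k in range(len(hi))])
    return updated

def attractor_center(nodes, W, a):
    """Mean readout of the attended node features, L2-normalized (assumed)."""
    updated = gat_layer(nodes, W, a)
    dim = len(updated[0])
    center = [sum(h[k] for h in updated) / len(updated) for k in range(dim)]
    norm = math.sqrt(sum(c * c for c in center)) or 1.0
    return [c / norm for c in center]

# Hypothetical 2-D embeddings for three utterances of one speaker.
utterances = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 0.5, -0.5]
center = attractor_center(utterances, W, a)
```

With an all-zero attention vector the weights become uniform and the center reduces to a normalized plain average, which makes the role of the learned attention weights easy to see by comparison.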

     

    Abstract: Speech anti-spoofing seeks to improve the security of speech systems by designing network architectures and learning algorithms that distinguish genuine speech from fake speech. This paper presents a speech anti-spoofing method that combines a graph attention mechanism with adversarial training. Specifically, the method builds on the speaker attractor multi-center one-class (SAMO) learning algorithm and draws on graph signal processing (GSP) theory. First, a speaker feature representation graph is constructed using GSP theory, in which each node corresponds to a feature representation of the speaker. A graph attention network (GAT) is then introduced to extract speaker attractor centers: an attention mechanism aggregates the speaker feature representations into a more representative speaker attractor center, which improves the system's capacity to discriminate between genuine and fake speech. Furthermore, a network that learns only the artifact features specific to the fake types seen in training may be less effective in practical scenarios involving unknown types of spoofing attacks. To address this, the anti-spoofing network is extended with an adversarial fake type classification network, so that the network learns feature representations under both the speech authenticity classification task and the fake type classification task. A gradient reversal layer (GRL) is used for adversarial training between the auxiliary fake type classification network and the feature representation learning module, which prevents the network from accurately distinguishing between different types of fake speech. This prompts the network to learn the fake artifact features shared across different types of fake speech. Consequently, the speech authenticity classification task becomes more adaptable to unknown inputs: the network can recognize the artifact features of unknown fake speech types, which improves the system's ability to detect unknown fake speech in real tests. To evaluate the proposed combination of GAT and adversarial training, experiments were conducted on the ASVspoof 2019 LA, CFAD, and ASVspoof 2021 LA datasets. The results, evaluated using common anti-spoofing metrics, show that the proposed method outperforms both the SAMO baseline system and other advanced comparative systems. Additionally, t-distributed stochastic neighbor embedding (t-SNE) and similarity matrix heat maps are used to visualize system performance. The t-SNE visualization shows distinct clustering of genuine and fake speech samples, illustrating the advantage of the proposed method in accurately distinguishing genuine speech from fake speech. The similarity matrix heat map uses color shades to represent the degree of similarity between the feature representations of different types of fake speech. A comparison of the results from different systems indicates that the proposed system is better at learning common fake artifact features, which supports the effectiveness of adversarial training for this purpose.
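The gradient reversal idea can be made concrete with a small sketch. It uses hand-written gradients on a toy scalar model rather than an autograd framework, and it omits the speech authenticity loss that would also update the feature extractor in the full method. The class `GradientReversal`, the function `adversarial_step`, and the parameters `w`, `v`, `lam`, and `lr` are hypothetical names chosen for illustration, not taken from the paper.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


def adversarial_step(w, v, x, target, lr=0.1, lam=1.0):
    """One manual gradient step on a toy model.

    feature f = w * x   (stand-in for the feature representation module)
    score   s = v * f   (stand-in for the fake type classification head)
    loss      = (s - target) ** 2
    """
    grl = GradientReversal(lam)
    f = w * x
    (f_rev,) = grl.forward([f])          # forward pass: unchanged
    s = v * f_rev
    loss = (s - target) ** 2

    dloss_ds = 2 * (s - target)
    grad_v = dloss_ds * f_rev            # head gradient: ordinary descent direction
    grad_f_rev = dloss_ds * v
    (grad_f,) = grl.backward([grad_f_rev])  # reversed before reaching the extractor
    grad_w = grad_f * x

    # The head moves to reduce the type loss; the extractor moves to increase it.
    return w - lr * grad_w, v - lr * grad_v, loss


new_w, new_v, loss = adversarial_step(w=1.0, v=1.0, x=1.0, target=0.0)
```

Because the extractor receives the negated gradient, its update pushes the features toward being less informative about the fake type, while the head keeps trying to classify it; this opposition is what encourages type-independent artifact features.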

     
