Deepfake Speech Detection Method Based on Wav2Vec2.0 Feature Merging and Joint Loss

Abstract: The pre-trained speech model Wav2Vec2.0 can extract rich multi-layer embedding features through its multiple hidden layers and has exhibited excellent performance in deepfake speech detection. Merging the features from each layer of Wav2Vec2.0 is an effective way to further exploit deep speech representations, and improving how these layer-wise features are merged promises to raise detection performance further. Accordingly, building on a Wav2Vec2.0-based deepfake speech detection architecture, this paper introduces the Convolutional Block Attention Module (CBAM) to merge the embedding features from each layer: by combining channel attention and spatial attention in a weighted fusion, CBAM adaptively enhances key features and effectively improves the model's feature extraction capability. Furthermore, spoofed speech is complex and diverse, and different spoofing types can differ markedly in how difficult they are to detect; to avoid bias toward hard-to-discriminate samples while keeping intra-class feature distributions compact and inter-class distributions well separated, we construct the overall loss function by jointly combining cross-entropy loss, center loss, and focal loss, exploiting the strengths of each to improve the model's discriminative ability and generalization across diverse spoofing scenarios. Experimental results on the ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and CFAD datasets show that the proposed method performs well on the standard metrics of equal error rate (EER) and minimum tandem detection cost function (min t-DCF). In particular, on the ASVspoof 2021 LA dataset, the proposed method significantly outperforms AASIST, ECAPA-TDNN, ResNet, and several other comparison schemes that use Wav2Vec2.0 for front-end feature extraction.
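One way to picture the layer-merging step described above is the sketch below: the per-layer embeddings of Wav2Vec2.0 are stacked along a channel axis, re-weighted by a standard CBAM (channel attention followed by spatial attention), and then averaged over the layer axis. The module names, tensor shapes, reduction ratio, and the final averaging step are illustrative assumptions rather than the authors' exact design.

```python
# Illustrative sketch (not the authors' exact design): layer-wise embeddings
# from Wav2Vec2.0 are stacked along a channel axis and re-weighted by a
# standard CBAM (channel attention followed by spatial attention), then
# averaged over the layer axis to obtain a single merged representation.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel-attention branch of CBAM (Woo et al., 2018)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                          # x: (B, C, T, D)
        avg = self.mlp(x.mean(dim=(2, 3)))         # pooled over time/feature dims
        mx = self.mlp(x.amax(dim=(2, 3)))
        weights = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * weights


class SpatialAttention(nn.Module):
    """Spatial-attention branch of CBAM."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                          # x: (B, C, T, D)
        avg = x.mean(dim=1, keepdim=True)          # pooled over the channel axis
        mx = x.amax(dim=1, keepdim=True)
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights


class CBAMLayerFusion(nn.Module):
    """Merge per-layer Wav2Vec2.0 embeddings into one (B, T, D) tensor."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.channel_att = ChannelAttention(num_layers)
        self.spatial_att = SpatialAttention()

    def forward(self, hidden_states):
        # hidden_states: sequence of (B, T, D) tensors, one per hidden layer,
        # e.g. the `hidden_states` returned by transformers' Wav2Vec2Model
        # when called with output_hidden_states=True.
        x = torch.stack(list(hidden_states), dim=1)  # (B, L, T, D), layers as channels
        x = self.spatial_att(self.channel_att(x))    # CBAM re-weighting
        return x.mean(dim=1)                         # average over layers
```

Treating the layer axis as the CBAM channel axis lets the channel branch learn which hidden layers matter most, while the spatial branch re-weights individual time-feature positions; other arrangements, such as a learnable weighted sum after CBAM, are equally plausible readings of the abstract.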

     

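The joint objective (cross-entropy plus center loss plus focal loss) described in the abstract can be sketched as a generic PyTorch formulation, shown below. The weighting coefficients, the focal-loss gamma, and the embedding dimension are placeholder assumptions, since the abstract does not specify them.

```python
# Minimal sketch of a joint objective combining cross-entropy, center loss and
# focal loss. The weighting coefficients, the focal-loss gamma and the
# embedding dimension below are placeholder assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CenterLoss(nn.Module):
    """Pulls each utterance embedding towards a learnable per-class center,
    encouraging compact intra-class distributions."""

    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()


def focal_loss(logits, labels, gamma: float = 2.0):
    """Focal loss: down-weights easy examples so hard-to-detect spoof types
    contribute more to the gradient."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-((1.0 - pt) ** gamma) * log_pt).mean()


class JointLoss(nn.Module):
    """Overall objective: cross-entropy + weighted center loss + weighted focal loss."""

    def __init__(self, num_classes: int = 2, feat_dim: int = 256,
                 lambda_center: float = 0.05, lambda_focal: float = 1.0):
        super().__init__()
        self.center = CenterLoss(num_classes, feat_dim)
        self.lambda_center = lambda_center
        self.lambda_focal = lambda_focal

    def forward(self, logits, feats, labels):
        ce = F.cross_entropy(logits, labels)
        return (ce
                + self.lambda_center * self.center(feats, labels)
                + self.lambda_focal * focal_loss(logits, labels))
```

In this reading, the center term tightens intra-class clusters and keeps classes apart, while the focal term reduces the bias toward easy samples that the abstract highlights for hard-to-discriminate spoofing types.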
