CHEN Feifei, GUO Haiyan, GUO Yanmin, et al. Deepfake speech detection method based on Wav2Vec2.0 feature merging and joint loss[J]. Journal of Signal Processing, 2025, 41(9): 1547-1557. DOI: 10.12466/xhcl.2025.09.008.

Deepfake Speech Detection Method Based on Wav2Vec2.0 Feature Merging and Joint Loss

  • The pre-trained speech model Wav2Vec2.0 extracts rich multi-layer embedding features through its hidden layers and has exhibited excellent performance in deepfake speech detection. Merging the features from each Wav2Vec2.0 layer is an effective way to further exploit deep speech representations and is expected to enhance detection performance. In this context, this paper proposes incorporating the Convolutional Block Attention Module (CBAM) into a Wav2Vec2.0-based deepfake speech detection architecture to merge the embedding features from each layer. By combining channel and spatial attention in a weighted fusion, CBAM adaptively enhances key features and effectively improves the model's feature extraction capability. Given the complexity and diversity of deepfake speech types, and the fact that different types can vary substantially in detection difficulty, this paper also seeks to reduce model bias and improve discriminative power. To keep intra-class feature distributions compact and inter-class distributions well separated, we propose a composite loss function that jointly employs cross-entropy loss, center loss, and focal loss, leveraging the strengths of each to improve both the discriminative ability and the generalization of the model across diverse deepfake speech scenarios. Experimental results on the ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and CFAD datasets demonstrate that the proposed method performs well on standard evaluation metrics, including Equal Error Rate (EER) and minimum tandem Detection Cost Function (min t-DCF). Notably, on the ASVspoof 2021 LA dataset, it significantly outperforms baseline systems such as AASIST, ECAPA-TDNN, ResNet, and various Wav2Vec2.0-based front-end feature extraction schemes.
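
As a concrete illustration of the layer-merging idea, the following PyTorch sketch treats the stacked Wav2Vec2.0 hidden states as a multi-channel map, with one channel per transformer layer, and re-weights them with CBAM's channel and spatial attention before collapsing the layer axis by averaging. This is one plausible reading of the abstract, not the paper's exact configuration: the tensor shapes, the reduction ratio, and the final mean over layers are illustrative assumptions, and names such as CBAMLayerMerge are hypothetical.

# Sketch of CBAM-style merging of per-layer Wav2Vec2.0 embeddings.
# Assumption (not from the paper): stacked hidden states of shape
# [batch, n_layers, time, dim] are treated as an n_layers-channel "image",
# CBAM re-weights them, and the layers are then averaged into one feature map.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, n_channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(n_channels // reduction, n_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T, D]; pool over the (T, D) "spatial" axes
        avg = self.mlp(x.mean(dim=(2, 3)))   # [B, C]
        mx = self.mlp(x.amax(dim=(2, 3)))    # [B, C]
        w = torch.sigmoid(avg + mx)          # per-layer (channel) weights
        return x * w[:, :, None, None]


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate channel-wise average and max maps, then a 7x7 conv
        avg = x.mean(dim=1, keepdim=True)    # [B, 1, T, D]
        mx = x.amax(dim=1, keepdim=True)     # [B, 1, T, D]
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w


class CBAMLayerMerge(nn.Module):
    """Re-weight and merge the hidden states of every Wav2Vec2.0 layer."""
    def __init__(self, n_layers: int):
        super().__init__()
        self.channel_att = ChannelAttention(n_layers)
        self.spatial_att = SpatialAttention()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [B, n_layers, T, D] (stacked transformer outputs)
        x = self.spatial_att(self.channel_att(hidden_states))
        return x.mean(dim=1)                 # merged feature: [B, T, D]


if __name__ == "__main__":
    # Stand-in for the 13 hidden states of a wav2vec 2.0 Base model
    # (in practice: Wav2Vec2Model(..., output_hidden_states=True)).
    states = torch.randn(2, 13, 201, 768)
    merged = CBAMLayerMerge(n_layers=13)(states)
    print(merged.shape)  # torch.Size([2, 201, 768])

A minimal sketch of the joint loss follows, combining cross-entropy, center loss, and focal loss as the abstract describes. The loss weights lambda_c and lambda_f, the embedding dimension, and the focusing parameter gamma are placeholder values; the paper's actual settings are not stated in the abstract.

# Sketch of the composite loss: cross-entropy + center loss + focal loss.
# All hyperparameter values below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CenterLoss(nn.Module):
    """Pull each embedding toward a learnable per-class center."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()


def focal_loss(logits: torch.Tensor, labels: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """Down-weight easy examples so hard spoofing types dominate training."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()


class JointLoss(nn.Module):
    def __init__(self, num_classes: int = 2, feat_dim: int = 160,
                 lambda_c: float = 0.05, lambda_f: float = 1.0):
        super().__init__()
        self.center = CenterLoss(num_classes, feat_dim)
        self.lambda_c, self.lambda_f = lambda_c, lambda_f

    def forward(self, logits, feats, labels):
        return (F.cross_entropy(logits, labels)
                + self.lambda_c * self.center(feats, labels)
                + self.lambda_f * focal_loss(logits, labels))


if __name__ == "__main__":
    logits = torch.randn(8, 2)     # bona fide vs. spoof scores
    feats = torch.randn(8, 160)    # pre-classifier embeddings
    labels = torch.randint(0, 2, (8,))
    print(JointLoss()(logits, feats, labels))

In this combination, center loss keeps intra-class embeddings compact around their class centers, while focal loss up-weights the harder spoofing types, which is how the composite objective addresses the variation in detection difficulty noted in the abstract.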