End-to-End Synthetic Speech Detection Based on Attention Mechanism
Abstract: In recent years, the rapid development of deepfake technology has significantly improved the naturalness and human-likeness of synthetic speech, which poses a greater challenge to research on synthetic speech detection. In this paper, the mechanisms of five lightweight attention modules are adapted into a channel attention mechanism and a one-dimensional spatial attention mechanism suited to speech sequences, and the modules are then each embedded into Inc-TSSDNet, yielding an end-to-end synthetic speech detection system based on attention mechanisms. The results show that the improved system raises detection performance by focusing on the channels or regions that are more critical for distinguishing genuine from synthetic speech. Compared with the baseline model, all ten attention-based models achieve a lower equal error rate (EER) and a lower minimum tandem detection cost function (min t-DCF) on the ASVspoof2019 evaluation set, with only a slight increase in the number of model parameters. Among them, the model with CBAM (Convolutional Block Attention Module) embedded before the pooling layer attains the lowest evaluation-set EER and shows strong generalization capability, while the model with the ECA (Efficient Channel Attention) module embedded before the pooling layer attains the lowest evaluation-set min t-DCF, and its statistical performance is significantly better than that of the baseline model.
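To make the two attention types named in the abstract concrete, the following is a minimal NumPy sketch of how channel attention (in the style of ECA) and one-dimensional spatial attention (in the style of the CBAM spatial branch) might be applied to a speech feature map of shape (channels, time). It is an illustration under stated assumptions, not the implementation used in this paper: the function names, kernel sizes, and randomly drawn kernel weights are all hypothetical, and the exact modifications made to the five modules are not specified in the abstract.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention_1d(x, kernel):
    """ECA-style channel attention for a feature map x of shape (C, T).

    `kernel` is a hypothetical odd-length 1-D filter that mixes each
    channel descriptor with its neighbouring channels.
    """
    # Squeeze: average over time gives one descriptor per channel.
    desc = x.mean(axis=1)                                    # (C,)
    # Local cross-channel interaction: zero-padded cross-correlation
    # along the channel axis, so the output keeps length C.
    pad = len(kernel) // 2
    mixed = np.correlate(np.pad(desc, pad), kernel, mode="valid")
    weights = sigmoid(mixed)                                 # (C,)
    # Excite: rescale every channel by its attention weight.
    return x * weights[:, None], weights


def spatial_attention_1d(x, kernel_avg, kernel_max):
    """CBAM-style spatial attention along the time axis of x (C, T).

    The two hypothetical kernels together act as one convolution with
    two input channels (average map, max map) and one output channel.
    """
    avg_map = x.mean(axis=0)                                 # (T,)
    max_map = x.max(axis=0)                                  # (T,)
    pad = len(kernel_avg) // 2
    conv = (np.correlate(np.pad(avg_map, pad), kernel_avg, mode="valid")
            + np.correlate(np.pad(max_map, pad), kernel_max, mode="valid"))
    weights = sigmoid(conv)                                  # (T,)
    # Rescale every time step by its attention weight.
    return x * weights[None, :], weights


# Illustrative usage with random features and random (untrained) kernels.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 200))                   # 16 channels, 200 frames
refined, ch_w = channel_attention_1d(features, rng.standard_normal(3))
refined, t_w = spatial_attention_1d(refined,
                                    rng.standard_normal(7),
                                    rng.standard_normal(7))
```

Applying the channel step before the spatial step mirrors the sequential ordering of CBAM. In a trained network the kernels would be learned parameters, and both steps add only a handful of weights, which is consistent with the small parameter increase reported above.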