Speaker Feature Extraction Based on Timbre Consistency for Voice Cloning
Abstract: Current voice cloning methods based on pre-trained speaker encoders can synthesize speech with high timbre similarity for speakers seen during training, but for speakers unseen during training, the timbre of the cloned speech still differs noticeably from that of the real speaker. To address this problem, this paper proposes a speaker feature extraction method based on timbre consistency. The method adopts the state-of-the-art speaker recognition model TitaNet as the backbone of the speaker encoder and, drawing on the prior knowledge that a speaker's timbre remains unchanged within a speech segment, introduces a timbre-consistency constraint loss for speaker encoder training. This yields more accurate speaker timbre features and improves the robustness and generalization of the speaker representation. The extracted features are then fed into the end-to-end speech synthesis model VITS for voice cloning. Experimental results show that the proposed method outperforms the baseline system on two public speech datasets, improving the timbre similarity of cloned speech for unseen speakers.
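To make the timbre-consistency idea concrete, the sketch below shows one plausible form of such a constraint: since a speaker's timbre is assumed constant within an utterance, embeddings extracted from two random segments of the same utterance should agree. The encoder interface, segment length, loss form (one minus cosine similarity), and the weighting variable `lambda_tc` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a timbre-consistency constraint loss (assumed form:
# cosine agreement between speaker embeddings of two segments cut from
# the same utterance). All names and hyperparameters are hypothetical.
import torch
import torch.nn.functional as F


def random_segment(mel: torch.Tensor, seg_frames: int) -> torch.Tensor:
    """Randomly crop a fixed-length segment from a mel spectrogram (B, F, T)."""
    max_start = max(mel.size(-1) - seg_frames, 0)
    start = torch.randint(0, max_start + 1, (1,)).item()
    return mel[..., start:start + seg_frames]


def timbre_consistency_loss(encoder: torch.nn.Module,
                            mel: torch.Tensor,
                            seg_frames: int = 100) -> torch.Tensor:
    """Penalize disagreement between embeddings of two segments of the
    same utterance, whose timbre is assumed identical."""
    emb_a = encoder(random_segment(mel, seg_frames))  # (B, D)
    emb_b = encoder(random_segment(mel, seg_frames))  # (B, D)
    # 1 - cosine similarity: zero when the two segment embeddings coincide.
    return (1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)).mean()


# Hypothetical usage during speaker-encoder training: the consistency term
# is added to the usual speaker-identification objective.
# loss = speaker_id_loss + lambda_tc * timbre_consistency_loss(encoder, mel)
```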