LI Jiaxin, ZHANG Lianhai, LI Yiting. Speaker Feature Extraction Based on Timbre Consistency for Voice Cloning[J]. JOURNAL OF SIGNAL PROCESSING, 2023, 39(4): 719-729. DOI: 10.16798/j.issn.1003-0530.2023.04.013

Speaker Feature Extraction Based on Timbre Consistency for Voice Cloning

Current speech cloning methods based on pre-trained speaker encoders can synthesize speech with high timbre similarity for speakers seen during training, but for speakers not seen during training, the timbre of the cloned speech still differs significantly from that of the real speaker. To address this problem, this paper proposes a speaker feature extraction method based on timbre consistency. The method uses the advanced speaker recognition model TitaNet as the basic architecture of the speaker encoder. Based on the prior knowledge that a speaker's timbre remains unchanged within a speech segment, a timbre consistency constraint loss is introduced into speaker encoder training to extract more accurate speaker timbre features and to improve the robustness and generalization of the speaker representation. Finally, the extracted features are applied to the end-to-end speech synthesis model VITS for speech cloning. Experimental results show that the proposed method outperforms the baseline system on two public speech datasets and improves the timbre similarity of cloned speech for unseen speakers.
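The abstract does not give the exact form of the timbre consistency constraint loss, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes that one utterance has been split into several segments, that each segment has already been mapped to an embedding vector by some speaker encoder, and that consistency is measured as the mean pairwise cosine distance between those segment embeddings. The function names `cosine_similarity` and `timbre_consistency_loss` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def timbre_consistency_loss(segment_embeddings):
    """Mean pairwise cosine distance between segment embeddings.

    Assumption (not from the paper): all embeddings come from segments of
    the same utterance, so they should describe the same timbre. The loss
    is 0 when every pair points in the same direction and grows as the
    segment embeddings disagree.
    """
    pairs = list(combinations(segment_embeddings, 2))
    if not pairs:
        # A single segment has nothing to be inconsistent with.
        return 0.0
    total = sum(1.0 - cosine_similarity(a, b) for a, b in pairs)
    return total / len(pairs)


# Hypothetical embeddings for three segments of one utterance.
segments = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1], [0.8, 0.2, 0.0]]
loss = timbre_consistency_loss(segments)
```

In a training setup of this kind, such a term would typically be added to the speaker encoder's main classification or verification loss with a weighting factor; how the paper combines the terms is not stated in the abstract.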
