SE Attention Mechanism and Mutual Information-Based Representation Disentanglement for Cross-Lingual Voice Conversion

Abstract: In cross-lingual voice conversion (CLVC), preserving the content of the converted speech while improving its speaker similarity and naturalness remains an open research problem. When a conventional encoder-decoder model is applied to CLVC, it typically encodes content and speaker identity independently, which leaves some information leakage between the content representation and the speaker representation and results in unsatisfactory speaker similarity in the converted speech. To address this problem, this paper proposes a CLVC method based on the Squeeze-and-Excitation (SE) attention mechanism and mutual information (MI) that achieves effective representation disentanglement and high-quality cross-lingual voice conversion in the open-set case. First, the SE attention mechanism is introduced into the content encoder so that, by exploiting its ability to capture global information, the encoder extracts content representations containing global contextual information. In addition, mutual information between the representations is introduced and minimized, which substantially reduces the information leakage among them and yields effective representation disentanglement. Experimental results on the VCTK English corpus and the AISHELL-3 Chinese corpus show that the proposed model, named SEMI (Squeeze-and-Excitation Attention Mechanism and Mutual Information), has stronger representation extraction ability. Compared with the baseline model, SEMI reduces the MCD by 10.89% in objective evaluation and increases the MOS and ABX scores by 10.94% and 12.06%, respectively, in subjective evaluation. These results indicate that SEMI improves both the quality of the converted speech and its speaker similarity, achieving high-quality cross-lingual voice conversion in the open-set case.
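To make the squeeze-and-excitation idea in the abstract concrete, the following is a minimal sketch of a generic SE block applied to a channels-by-frames feature map. It is not the SEMI content encoder: the abstract does not give the architecture, so the function name `se_block`, the bottleneck size, and the weight layout are all assumptions chosen for illustration. The block averages each channel over time (squeeze), passes the result through a small two-layer gate (excitation), and rescales every channel by its gate value, which is how global context can influence each frame.

```python
import math

def se_block(x, w1, b1, w2, b2):
    """Generic squeeze-and-excitation over a feature map (illustrative only).

    x  : list of C channels, each a list of T frame values
    w1 : (C_reduced x C) weights of the bottleneck layer, b1 its biases
    w2 : (C x C_reduced) weights of the gating layer, b2 its biases
    Returns a feature map of the same shape, rescaled per channel.
    """
    # Squeeze: global average over time gives one summary value per channel.
    z = [sum(ch) / len(ch) for ch in x]
    # Excitation, step 1: bottleneck layer with ReLU.
    h = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(w1, b1)]
    # Excitation, step 2: sigmoid gate in (0, 1) for every channel.
    s = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, h)) + b)))
         for row, b in zip(w2, b2)]
    # Scale: every frame of a channel is multiplied by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(s, x)]

# Toy input: 2 channels, 2 frames, bottleneck of size 1 (hypothetical values).
x = [[1.0, 3.0], [2.0, 2.0]]
w1, b1 = [[0.0, 0.0]], [0.0]
w2, b2 = [[0.0], [0.0]], [0.0, 0.0]
y = se_block(x, w1, b1, w2, b2)
```

With all weights set to zero, as in the toy call, every gate equals sigmoid(0) = 0.5, so each channel is simply halved. Trained weights would instead emphasize some channels and suppress others according to the utterance-level summary.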

     
