Research on Voice Conversion Based on a Latent Variable Model

Voice Conversion Using Latent Variable Model

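  • Abstract (Chinese version): Conventional voice conversion methods work by mapping speaker acoustic features, which tends to cause over-smoothing and over-fitting. This paper proposes a new voice conversion method that uses a latent variable model to separate the content of a speech signal from its style. First, a generative model of the speech signal is built using a Latent Variable Model (LVM) with two latent factors. Next, maximum likelihood estimation is used to decompose the speech signal into content information, which represents the semantics, and style information, which reflects speaker characteristics, and to estimate the model parameters. Finally, voice conversion is performed on the basis of the LVM generative model by substituting the speaker style. Subjective and objective test results show that, with the same training set, the proposed method clearly outperforms the GMM method. Compared with the conventional Bilinear Model, the latent variable model describes the interaction between content and style with a nonlinear relationship, and therefore achieves better separation and higher voice conversion quality.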

     

    Abstract: Conventional voice conversion seeks a mapping from the source speaker's acoustic features to those of the target, an approach that is prone to over-smoothing and over-fitting. This paper proposes a novel strategy for voice conversion based on style and content separation, implemented with a two-factor Latent Variable Model (LVM). First, a generative model in terms of style and content is developed using an LVM with two low-dimensional latent factors; the interactions between the two factors are captured by a set of basis mapping functions that relate the low-dimensional latent spaces to a high-dimensional observation space. Second, through model fitting, the observed speech spectra are decomposed into style and content factors that represent the speaker identity and the phonetic information respectively, and the model parameters are estimated at the same time. Finally, the converted speech is reconstructed from the target speaker's style and the source phonetic content, using the learned model as a prior. Objective and subjective tests showed that, compared with the traditional GMM mapping method, the proposed system performs better when training data are limited. Further experiments showed that the LVM with nonlinear basis mapping functions is preferable to the Bilinear Model for the voice conversion task.
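
    The style-substitution idea can be illustrated with the Bilinear Model that the abstract uses as a baseline. The following is only a minimal sketch under stated assumptions, not the paper's LVM: it uses a linear asymmetric bilinear model fitted by SVD, synthetic data in place of speech spectra, and parallel training data (the same content observed for every speaker). All function and variable names are hypothetical.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, n_speakers, dim_obs, dim_content):
    """Fit speaker-specific matrices A[s] and content vectors B from stacked data.

    Y has shape (n_speakers * dim_obs, n_contents): column c stacks the
    observation of content c as produced by each speaker in turn.
    This is an assumed baseline (linear), not the paper's nonlinear LVM.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :dim_content] * s[:dim_content]   # style part, (n_speakers*dim_obs, J)
    B = Vt[:dim_content]                       # content part, (J, n_contents)
    return A.reshape(n_speakers, dim_obs, dim_content), B

def convert(frame, A_src, A_tgt):
    """Estimate the content of a source frame, then render it in the target style."""
    content, *_ = np.linalg.lstsq(A_src, frame, rcond=None)
    return A_tgt @ content

# Synthetic stand-in for spectral features: 2 speakers, 6-dim observations,
# 3-dim content factor, 20 training contents (all values are assumptions).
rng = np.random.default_rng(0)
S, K, J, C = 2, 6, 3, 20
true_A = rng.normal(size=(S, K, J))
true_B = rng.normal(size=(J, C))
Y = np.concatenate([true_A[s] @ true_B for s in range(S)], axis=0)

A, B = fit_asymmetric_bilinear(Y, S, K, J)

# Convert an unseen frame from speaker 0 to speaker 1 and compare with the truth.
new_content = rng.normal(size=J)
src_frame = true_A[0] @ new_content
tgt_frame = true_A[1] @ new_content
converted = convert(src_frame, A[0], A[1])
max_error = float(np.max(np.abs(converted - tgt_frame)))
```

    Because the synthetic data are exactly bilinear and noise-free, the conversion error here is at floating-point level. Real speech would not fit a linear model this well, which is the limitation the abstract attributes to the Bilinear Model.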

     
