Abstract:
Conventional voice conversion seeks a mapping from the source speaker's acoustic features to those of the target speaker, an approach that is prone to over-smoothing and over-fitting. This paper proposes a novel voice conversion strategy based on style and content separation, which is solved with a two-factor Latent Variable Model (LVM). First, a generative model in terms of style and content is developed using an LVM with two low-dimensional latent factors; the interactions between the two factors are captured by a set of basis mapping functions that relate the low-dimensional latent spaces to the high-dimensional observation space. Second, through model fitting, the speech spectrum observations are decomposed into style and content factors that represent speaker identity and phonetic information, respectively, and the model parameters are also estimated. Finally, the converted speech is reconstructed from the target speaker's style and the source utterance's phonetic content, using the learned model as a prior. Objective and subjective tests show that, compared with the traditional GMM mapping method, the proposed system achieves better performance when training data are limited. Further experiments show that the LVM with nonlinear basis mapping functions is preferable to the Bilinear Model for the voice conversion task.
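As a rough illustration of style and content separation, the sketch below fits the linear Bilinear Model baseline mentioned above, not the proposed nonlinear LVM. It assumes the asymmetric bilinear form, time-aligned parallel frames from each speaker, and an SVD-based fit; the function names, the array layout, and the fitting procedure are illustrative assumptions and may differ from what the paper actually uses.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, n_styles, dim_obs, n_factors):
    """Fit y^{s,c} = A^s b^c by truncated SVD (an assumed, standard choice).

    Y has shape (n_styles * dim_obs, n_contents): the spectral frames of
    each speaker (style) are stacked vertically, and each column is one
    time-aligned frame (content).
    """
    U, sing, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :n_factors] * sing[:n_factors]   # stacked style matrices
    B = Vt[:n_factors, :]                     # content vectors, one per column
    styles = A.reshape(n_styles, dim_obs, n_factors)
    return styles, B

def estimate_content(style_src, y_src):
    """Least-squares content vectors for source frames, given the source style."""
    return np.linalg.pinv(style_src) @ y_src

def convert(style_tgt, content):
    """Recombine the target speaker's style with the source content."""
    return style_tgt @ content
```

In this hypothetical layout, `styles[0]` and `styles[1]` would be the source and target style matrices, and conversion amounts to `convert(styles[1], estimate_content(styles[0], y_src))`. The proposed LVM would replace this linear interaction with nonlinear basis mapping functions.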