Abstract:
Common voice conversion systems use only a few prosodic features, such as fundamental frequency, so the converted speech shows a weak tendency toward the target speaker and poor quality, especially when the speakers have strong speaking styles. This paper proposes a new conversion method based on the short-time spectrum and on prosodic features such as pitch contour, duration, pause, and stress. The pitch contour is first described by a pitch target model and then modeled with a Gaussian mixture model (GMM); the other prosodic features are modeled by single Gaussian distributions after statistical analysis. Experimental results show that, compared with a traditional system, the use of rich prosodic features clearly improves both the target tendency and the naturalness of the converted speech.