Abstract:
Voice conversion converts the voice of a source speaker to that of a target speaker while keeping the linguistic content unchanged. Mongolian voice conversion currently faces two problems: a shortage of corpus data and the rich prosodic variation in the pronunciation of Mongolian words. To address these problems, this paper presents a non-parallel Mongolian voice conversion method based on fine-grained prosody modeling and a conditional CycleGAN. The method uses the continuous wavelet transform to extract fine-grained prosodic features, then adds speaker identity vectors to the CycleGAN to build a conditional CycleGAN. Finally, the conditional CycleGAN is used to obtain a stable prosody conversion between the source and target speakers. Experimental results show that, compared with the traditional CycleGAN voice conversion method, the proposed method improves Mongolian voice conversion quality: the MOS scores for speech naturalness and speaker similarity improve by 0.1 and 0.2, respectively.
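To make the two feature-level steps named above more concrete, the following is a minimal sketch of a multi-scale wavelet decomposition of a prosodic contour, followed by attaching a speaker identity vector to every frame. Several details are assumptions and are not taken from the abstract: the prosodic contour is taken to be log-F0, the mother wavelet is a Mexican hat, ten dyadic scales are used, and the speaker identity vector is a one-hot code. The function names `cwt_prosody` and `add_speaker_code` are hypothetical, and the CycleGAN itself is not shown.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at points t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_prosody(contour, num_scales=10, base_scale=2.0):
    """Decompose a prosodic contour into several temporal scales.

    Assumption: the contour is a continuous (already interpolated) log-F0
    track, one value per frame. Returns an array of shape
    (num_frames, num_scales); small scales capture fast, fine-grained
    movements and large scales capture slow, phrase-level trends.
    """
    x = np.asarray(contour, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)  # zero mean, unit variance
    n = len(x)
    out = np.empty((n, num_scales))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** i  # dyadic scales, measured in frames
        # Truncate the kernel so it never exceeds the signal length.
        half = int(min(5 * scale, (n - 1) // 2))
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        out[:, i] = np.convolve(x, kernel, mode="same")
    return out


def add_speaker_code(features, speaker_id, num_speakers):
    """Append a one-hot speaker vector to every frame (assumed conditioning)."""
    one_hot = np.zeros(num_speakers)
    one_hot[speaker_id] = 1.0
    tiled = np.tile(one_hot, (features.shape[0], 1))
    return np.concatenate([features, tiled], axis=1)


# Synthetic log-F0 contour standing in for a real utterance.
frames = np.arange(200)
log_f0 = np.log(120.0 + 20.0 * np.sin(2.0 * np.pi * frames / 50.0))

prosody = cwt_prosody(log_f0)
conditioned = add_speaker_code(prosody, speaker_id=1, num_speakers=2)
print(prosody.shape, conditioned.shape)
```

In a conditional setup of this kind, the frame-level matrix with the appended speaker code would be the input to the generator, so that one network can be steered toward either speaker; how the paper actually injects the identity vector is not stated in the abstract.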