Abstract:
The vector quantized variational autoencoder (VQVAE) based voice conversion system is an active research topic in the voice conversion field, but the poor quality of the converted speech limits its wide use. To address this problem, this paper proposes an improved model called the vector quantization regularized variational autoencoder (VQ-REG-VAE). During training, vector quantization works as a regularization term. Through the regularization of vector quantization, the encoder learns to generate speaker-independent linguistic features while the decoder learns to fuse the speaker features into the linguistic features. During conversion, voice conversion is realized through the encoder and the decoder alone. Since vector quantization is not used during conversion, more linguistic information can be preserved. Objective and subjective experiments show that, compared with the VQVAE model, the VQ-REG-VAE model achieves a significant improvement in speech quality and comparable results in speaker similarity.
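To make the abstract's key idea concrete, the following is a minimal NumPy sketch (not the authors' implementation; codebook size, feature dimension, and the commitment weight `beta` are assumptions) of vector quantization used as a regularization term: the VQ loss is computed against the nearest code vectors during training, while at conversion time the decoder would consume the continuous encoder output directly, bypassing quantization.

```python
import numpy as np

def vq_regularizer(z_e, codebook, beta=0.25):
    """Return nearest code vectors and the VQ regularization loss.

    z_e:      (T, D) continuous encoder outputs (linguistic features)
    codebook: (K, D) code vectors
    beta:     commitment weight (assumed value, as in standard VQ-VAE)
    """
    # Nearest codebook entry for each frame (squared Euclidean distance).
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d.argmin(axis=1)
    z_q = codebook[idx]                                               # (T, D)
    # Codebook term pulls codes toward encoder outputs; commitment
    # term (weighted by beta) pulls encoder outputs toward codes.
    codebook_loss = ((z_q - z_e) ** 2).mean()
    commit_loss = beta * ((z_e - z_q) ** 2).mean()
    return z_q, codebook_loss + commit_loss

rng = np.random.default_rng(0)
z_e = rng.normal(size=(100, 64))        # 100 frames, 64-dim features
codebook = rng.normal(size=(128, 64))   # 128 code vectors
z_q, reg = vq_regularizer(z_e, codebook)
# During training, `reg` is added to the VAE objective as a regularizer.
# During conversion, the decoder uses z_e directly (VQ is bypassed),
# so no linguistic detail is lost to quantization.
```

Because the quantized vectors only appear in the loss, the decoder never depends on the discrete bottleneck at inference, which is the mechanism the abstract credits for the improved speech quality.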