LI Yanping, TAN Zhicheng, HU Chengyang, et al. SE attention mechanism and mutual information based representation disentanglement for cross-lingual voice conversion[J]. Journal of Signal Processing, 2025, 41(1):183-192. DOI: 10.12466/xhcl.2025.01.015.

SE Attention Mechanism and Mutual Information-Based Representation Disentanglement for Cross-Lingual Voice Conversion

In cross-lingual voice conversion (CLVC), preserving the content of the converted speech while improving its speaker similarity and naturalness remains a research challenge. When a traditional encoder-decoder model is applied to CLVC, it generally encodes content and speaker information separately, which leads to information leakage between the content representation and the speaker representation and leaves the speaker similarity of the converted speech unsatisfactory. To address this problem, this paper proposes a CLVC method based on the Squeeze-and-Excitation (SE) attention mechanism and mutual information (MI), the SEMI model, which achieves effective representation disentanglement and high-quality cross-lingual voice conversion in the open-set case. First, the SE attention mechanism is introduced into the content encoder to extract a content representation that contains global contextual information. At the same time, mutual information between the different representations is minimized to reduce information leakage and achieve effective representation disentanglement. Experimental results on the VCTK English corpus and the AISHELL-3 Chinese corpus show that the proposed model has a stronger representation extraction ability. Compared with the baseline model, the MCD value is reduced by 10.89% in objective evaluation, and the MOS and ABX scores are increased by 10.94% and 12.06%, respectively, in subjective evaluation. These results indicate that the SEMI model significantly improves the quality of the converted speech and its speaker similarity, thereby achieving high-quality cross-lingual voice conversion in the open-set case.
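To make the squeeze-and-excitation idea concrete, the following is a minimal sketch of a generic SE block applied to a channels-by-time feature map. It is not the paper's content encoder: the function name `se_block`, the weight matrices `w1` and `w2`, the feature layout, and the omission of bias terms are all assumptions made for illustration. The block follows the usual SE recipe: average each channel over time (squeeze), pass the result through two small fully connected layers with ReLU and sigmoid (excitation), then rescale each channel by its gate.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def se_block(features, w1, w2):
    """Generic squeeze-and-excitation over a [channels][time] feature map.

    features: list of C channels, each a list of T floats (assumed layout).
    w1: bottleneck weights, shape [C // r][C] (hypothetical, no bias).
    w2: expansion weights, shape [C][C // r] (hypothetical, no bias).
    Returns (rescaled features, per-channel gates).
    """
    # Squeeze: global average over time gives one summary value per channel.
    squeezed = [sum(channel) / len(channel) for channel in features]

    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w1
    ]

    # Excitation, step 2: expand back to C values and squash with a sigmoid.
    gates = [
        sigmoid(sum(w * h for w, h in zip(row, hidden)))
        for row in w2
    ]

    # Scale: multiply every frame of a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, features)]
    return scaled, gates


if __name__ == "__main__":
    # Toy input: 4 channels, 3 time frames, reduction ratio r = 2.
    feats = [
        [0.2, 0.4, 0.6],
        [1.0, 1.0, 1.0],
        [-0.5, 0.0, 0.5],
        [2.0, 1.0, 0.0],
    ]
    w1 = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
    w2 = [[0.7, -0.3], [-0.2, 0.5], [0.1, 0.1], [0.4, -0.6]]
    out, gates = se_block(feats, w1, w2)
    print("gates:", gates)
    print("output shape:", len(out), "x", len(out[0]))
```

Because each gate is computed from a time-averaged summary of all channels, every frame is rescaled using utterance-level context, which is one way to read the abstract's "global contextual information". In a trained model the weights would be learned jointly with the rest of the encoder.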