Synthesizing Mandarin-English Code-Switching Text Using an Encoder-Decoder Model

  • Abstract: To address the scarcity of Mandarin-English code-switching text data, this paper proposes a method for synthesizing code-switching text with an Encoder-Decoder model. The generator implicitly learns the linguistic rules of code-switching from a limited amount of code-switching text, and the linguistic rules internal to each language from a large amount of monolingual parallel corpora, and uses both to synthesize code-switching text. However, the text synthesized by this model has low naturalness. This paper therefore further proposes a method based on an Encoder-Decoder model with a copy mechanism: on top of the Encoder-Decoder model, a gate is added that decides whether the next word is produced from the decoder's prediction or copied from the source text fed to the encoder. With the proposed method, the perplexity of the language model on the SEAME test set decreases by 13.96 absolute. It can be concluded that the proposed method can synthesize code-switching text with high naturalness at scale and alleviate the scarcity of code-switching text data.
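The gate described in the abstract can be illustrated with a minimal sketch. It assumes a soft gate in the style of pointer-generator networks, where a scalar `p_gen` mixes the decoder's vocabulary distribution with a copy distribution given by attention over the source tokens. The abstract does not give the exact formulation, so the function name `copy_gate_distribution`, its parameters, and the toy numbers are all hypothetical.

```python
def copy_gate_distribution(p_vocab, attention, source_ids, p_gen):
    """Mix generation and copy distributions with a scalar gate.

    Assumed (hypothetical) inputs, not taken from the paper:
      p_vocab    : decoder softmax over the vocabulary (sums to 1)
      attention  : attention weights over source positions (sums to 1)
      source_ids : vocabulary id of the token at each source position
      p_gen      : gate value in [0, 1], the weight given to generation
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")
    if len(attention) != len(source_ids):
        raise ValueError("attention and source_ids must have equal length")

    # Generation path: scale the decoder's prediction by the gate.
    mixed = [p_gen * p for p in p_vocab]

    # Copy path: route the remaining mass to the ids of the source tokens,
    # summing over positions when a token occurs more than once.
    for weight, token_id in zip(attention, source_ids):
        mixed[token_id] += (1.0 - p_gen) * weight
    return mixed


# Toy example: a vocabulary of 5 ids and a source sentence of 3 tokens.
p_vocab = [0.1, 0.2, 0.3, 0.3, 0.1]
attention = [0.7, 0.2, 0.1]
source_ids = [4, 1, 4]  # token id 4 appears twice in the source
mixed = copy_gate_distribution(p_vocab, attention, source_ids, p_gen=0.4)
next_id = max(range(len(mixed)), key=mixed.__getitem__)
```

With `p_gen = 1.0` the sketch reduces to the plain Encoder-Decoder prediction, and with `p_gen = 0.0` every output word is copied from the source. In a trained model the gate value would be predicted at each decoding step rather than fixed by hand.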

     
