Abstract:
To address the scarcity of code-switching text data, a code-switching text synthesis method was proposed, in which a code-switching text generator was constructed based on an Encoder-Decoder model. The generator implicitly learned the linguistic constraints of code-switching from the limited code-switching text available, and the linguistic constraints of each individual language from a large amount of monolingual parallel data, in order to synthesize code-switching text. However, the naturalness of the generated text was low. To solve this problem, a method of synthesizing code-switching text based on an Encoder-Decoder model with a copy mechanism was proposed. Building on the Encoder-Decoder-based generator, a gating mechanism was added to decide whether the next word is generated from the decoder's prediction or copied from the source text fed to the encoder. With the proposed method, the perplexity of the language model decreased by 13.96 in absolute terms. It can be concluded that the proposed method can synthesize a large amount of code-switching text with high naturalness and thereby alleviate the scarcity of code-switching text data.
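
As an illustration of the gating idea described above, the following is a minimal sketch of one decoding step with a copy gate. It is not the authors' implementation: it assumes a soft, pointer-generator-style mixture in which a gate value g weights the decoder's vocabulary distribution and (1 - g) weights the attention distribution over source tokens. The function name `copy_gate_distribution`, the toy vocabulary, and all scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def copy_gate_distribution(vocab, vocab_logits, source_tokens,
                           attention_scores, gate_logit):
    """Mix 'generate' and 'copy' distributions for one decoding step.

    Assumption (not stated in the abstract): the gate is soft, so the final
    probability of a word w is
        g * P_vocab(w) + (1 - g) * sum of attention on source positions == w
    """
    p_vocab = softmax(vocab_logits)      # decoder's prediction over the vocabulary
    attn = softmax(attention_scores)     # attention weights over source positions
    g = sigmoid(gate_logit)              # probability of generating rather than copying

    final = {w: g * p for w, p in zip(vocab, p_vocab)}
    for tok, a in zip(source_tokens, attn):
        # Source tokens outside the vocabulary can still receive probability mass.
        final[tok] = final.get(tok, 0.0) + (1.0 - g) * a
    return final


# Hypothetical toy step: the decoder vocabulary holds one language,
# the source sentence holds the other.
vocab = ["wo", "xihuan", "de", "<unk>"]
source = ["I", "like", "deep", "learning"]
dist = copy_gate_distribution(
    vocab, [2.0, 1.0, 0.5, 0.0], source, [0.1, 0.2, 3.0, 0.3], gate_logit=-4.0
)
best = max(dist, key=dist.get)  # with a strongly negative gate logit, a source word is copied
```

With a strongly negative gate logit the most probable next word comes from the source sentence, which is how a copied word can introduce a switch point; with a strongly positive one the decoder's own prediction dominates. A hard (binary) gate would be an equally plausible reading of the abstract.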