Abstract:
Uyghur is an agglutinative language with a large vocabulary, which easily leads to out-of-vocabulary (OOV) words. It is also a low-resource language, which limits the performance of end-to-end speech recognition models. To address these problems, a multi-task learning based end-to-end Uyghur speech recognition model is proposed. A Conformer encoder is used and combined with connectionist temporal classification (CTC), and BPE-dropout is introduced to produce more robust sub-word modeling units. Multi-task training and joint decoding are then carried out simultaneously with sub-words and characters as modeling units. Experimental results show that using sub-words as modeling units effectively alleviates the OOV problem, and that the multi-task model makes fuller use of the data in a low-resource setting and learns richer temporal speech features, further improving recognition performance. On the public Uyghur speech dataset THUYG-20, the sub-word and character error rates are reduced by 7.3% and 3.8%, respectively, compared with the baseline.
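To make the multi-task objective concrete, the following is a minimal sketch (not the authors' code) of a shared encoder feeding two CTC heads, one over sub-word units and one over characters, trained with a weighted sum of the two CTC losses. The plain LSTM encoder, the vocabulary sizes, and the 0.7/0.3 loss weights are illustrative assumptions standing in for the Conformer setup described above.

```python
import torch
import torch.nn as nn

class MultiTaskCTCModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, subword_vocab=2000, char_vocab=40):
        super().__init__()
        # Stand-in for the Conformer encoder described in the abstract.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.subword_head = nn.Linear(2 * hidden, subword_vocab)  # sub-word CTC head
        self.char_head = nn.Linear(2 * hidden, char_vocab)        # character CTC head

    def forward(self, feats):
        enc, _ = self.encoder(feats)                      # (B, T, 2*hidden)
        sw_logp = self.subword_head(enc).log_softmax(-1)  # (B, T, subword_vocab)
        ch_logp = self.char_head(enc).log_softmax(-1)     # (B, T, char_vocab)
        return sw_logp, ch_logp

def multitask_ctc_loss(sw_logp, ch_logp, sw_tgt, ch_tgt,
                       in_lens, sw_lens, ch_lens, w_subword=0.7):
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    # nn.CTCLoss expects (T, B, C); transpose from batch-first.
    loss_sw = ctc(sw_logp.transpose(0, 1), sw_tgt, in_lens, sw_lens)
    loss_ch = ctc(ch_logp.transpose(0, 1), ch_tgt, in_lens, ch_lens)
    return w_subword * loss_sw + (1.0 - w_subword) * loss_ch

# Toy usage with random tensors (4 utterances, 120 frames of 80-dim features).
model = MultiTaskCTCModel()
feats = torch.randn(4, 120, 80)
sw_logp, ch_logp = model(feats)
in_lens = torch.full((4,), 120, dtype=torch.long)
sw_tgt = torch.randint(1, 2000, (4, 10)); sw_lens = torch.full((4,), 10, dtype=torch.long)
ch_tgt = torch.randint(1, 40, (4, 30));   ch_lens = torch.full((4,), 30, dtype=torch.long)
loss = multitask_ctc_loss(sw_logp, ch_logp, sw_tgt, ch_tgt, in_lens, sw_lens, ch_lens)
loss.backward()
```

The weighted-sum formulation lets the character head act as a regularizer for the sub-word head on limited data; the actual interpolation weight would be a tuning choice rather than a value taken from the paper.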