End-to-End Uyghur Speech Recognition Based on Multi-Task Learning

  • Abstract: Uyghur is an agglutinative language with a very large vocabulary, which makes it prone to out-of-vocabulary (OOV) words; it is also a low-resource language. Both factors limit the performance of end-to-end Uyghur speech recognition models. To address these problems, this paper proposes an end-to-end Uyghur speech recognition model based on multi-task learning. The encoder uses a Conformer connected to connectionist temporal classification (CTC), and BPE-dropout is used to produce more robust sub-words. Sub-words and characters both serve as modeling units, with multi-task training and decoding performed jointly. Analysis of the experimental results shows that sub-word modeling units effectively address the OOV problem, and that the multi-task model makes fuller use of the data in a low-resource setting and learns rich temporal speech features, further improving recognition performance. On the public Uyghur speech dataset THUYG-20, the model reduces the sub-word error rate and the character error rate by 7.3% and 3.8%, respectively, compared with the baseline.
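
As a rough illustration of the BPE-dropout idea mentioned in the abstract, the sketch below segments a word with a BPE merge table while randomly skipping each candidate merge with probability `dropout`. It is a minimal sketch of the general technique, not the paper's implementation: the function name `bpe_segment`, the toy merge table `MERGES`, and the Latin-script example word are all assumptions made for illustration.

```python
import random

# Hypothetical merge table in priority order (earlier entries merge first).
# A real table would be learned from training transcripts.
MERGES = [("l", "a"), ("la", "r"), ("k", "i"), ("t", "a"), ("ta", "b"), ("ki", "tab")]


def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` into sub-words using BPE merges.

    At every step, each applicable merge is dropped with probability
    `dropout`; segmentation stops when no merge survives.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        # Apply the highest-priority surviving merge (leftmost on ties).
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Without dropout the segmentation is deterministic.
deterministic = bpe_segment("kitablar", MERGES)

# With dropout the same word can be split differently on each call,
# which exposes the model to several segmentations of one word.
sampled = bpe_segment("kitablar", MERGES, dropout=0.5, rng=random.Random(0))
```

With `dropout=0.0` this reduces to ordinary BPE segmentation, and with `dropout=1.0` every merge is dropped and the word stays split into characters, so the two modeling units named in the abstract sit at the two ends of the same procedure.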

     
