An End-to-End Dialect Speech Recognition Model Based on Self-Attention

  • Abstract: Dialect speech recognition is a core step in dialect preservation. Traditional dialect speech recognition models do not account for the importance of dialect-specific phonemes in dialect pronunciation and lack the extraction and fusion of multiple kinds of speech features, which leads to poor recognition performance. The end-to-end dialect speech recognition model proposed in this paper takes full advantage of residual CNNs (Convolutional Neural Networks) and Bi-LSTMs (Bi-directional Long Short-Term Memory) for intra-frame and inter-frame feature extraction, respectively, and employs multi-head self-attention to effectively extract dialect-specific phoneme information from different dialects, forming low-level pronunciation features that are then used for dialect speech recognition. Experimental results on benchmark speech corpora of the Gan and Hakka Chinese dialects show that the proposed model significantly outperforms existing baseline models, and a visualization of the attention mechanism further reveals the fundamental cause of this performance improvement.
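The abstract describes the architecture only at a high level. The following is a minimal sketch of how such a residual-CNN + Bi-LSTM + multi-head self-attention acoustic model could be assembled, shown here in PyTorch with a CTC-style output layer; all layer sizes, head counts, and module names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualConvBlock(nn.Module):
    """2-D convolution over (time, feature) with a residual connection,
    capturing intra-frame structure of the spectrogram."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.BatchNorm2d(channels)
        self.norm2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.norm1(self.conv1(x)))
        y = self.norm2(self.conv2(y))
        return F.relu(x + y)  # residual connection


class DialectASR(nn.Module):
    """Residual CNN -> Bi-LSTM -> multi-head self-attention -> CTC logits.
    Hypothetical configuration; not the paper's reported hyperparameters."""
    def __init__(self, n_mels: int = 80, hidden: int = 256,
                 n_heads: int = 4, vocab_size: int = 1000):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.res_blocks = nn.Sequential(ResidualConvBlock(32),
                                        ResidualConvBlock(32))
        self.proj = nn.Linear(32 * n_mels, hidden)
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, feats):          # feats: (batch, time, n_mels)
        x = feats.unsqueeze(1)         # (batch, 1, time, n_mels)
        x = F.relu(self.stem(x))
        x = self.res_blocks(x)         # intra-frame features
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.proj(x)
        x, _ = self.bilstm(x)          # inter-frame features
        x, _ = self.attn(x, x, x)      # weight dialect-specific frames
        return F.log_softmax(self.out(x), dim=-1)  # per-frame CTC log-probs


if __name__ == "__main__":
    model = DialectASR()
    dummy = torch.randn(2, 120, 80)    # 2 utterances, 120 frames, 80 mel bins
    print(model(dummy).shape)          # torch.Size([2, 120, 1000])
```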

     
