Abstract:
Dialect speech recognition is a core step in dialect preservation. Traditional speech recognition models do not account for the importance of dialect-specific phonemes in dialect pronunciation and neglect the extraction and fusion of different kinds of speech features, which leads to poor dialect recognition performance. The end-to-end dialect speech recognition model proposed in this paper exploits the feature-extraction strengths of a residual CNN (Convolutional Neural Network) and a Bi-LSTM (Bi-directional Long Short-Term Memory) at the intra-frame and inter-frame levels respectively, adopts multi-head self-attention to effectively extract dialect-specific phoneme information and form pronunciation features, and performs dialect speech recognition using the extracted pronunciation features. Experimental results on the Gan and Hakka Chinese dialects show that our model outperforms the state-of-the-art by a large margin. Through visualization of the attention weights, we further analyze the underlying cause of the model's performance improvement.
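To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of an architecture with the components named in this abstract: a residual CNN for intra-frame features, a Bi-LSTM for inter-frame context, multi-head self-attention over the sequence, and a per-frame output layer. The layer sizes, the log-Mel input representation, and the CTC-style output are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """2-D convolution over (time, frequency) with a residual connection,
    capturing intra-frame (spectral) structure."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.norm(self.conv1(x))))

class DialectASRSketch(nn.Module):
    """Hypothetical stack: residual CNN -> Bi-LSTM -> multi-head
    self-attention -> per-frame logits (e.g. for a CTC loss)."""
    def __init__(self, n_mels=80, hidden=256, n_heads=4, vocab_size=4000):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.res_blocks = nn.Sequential(ResidualConvBlock(32),
                                        ResidualConvBlock(32))
        self.proj = nn.Linear(32 * n_mels, hidden)
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)  # assumed CTC blank in vocab

    def forward(self, feats):               # feats: (batch, time, n_mels)
        x = feats.unsqueeze(1)               # (B, 1, T, F)
        x = self.res_blocks(torch.relu(self.stem(x)))  # intra-frame features
        b, c, t, f = x.shape
        x = self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        x, _ = self.bilstm(x)                # inter-frame temporal context
        x, _ = self.attn(x, x, x)            # weights dialect-specific phoneme cues
        return self.out(x).log_softmax(-1)   # per-frame log-probabilities

# usage: log_probs = DialectASRSketch()(torch.randn(2, 120, 80))  # (2, 120, 4000)
```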