Abstract:
Acoustic models built on the attention-based encoder-decoder architecture suffer from large model size, slow convergence, and inaccurate attention distributions caused by noise. To address these problems, this paper proposes using the Minimal Gated Unit (MGU) to reduce both the number of model parameters and the training time. An adaptive window function is then applied, and a pooling layer is added to the convolutional neural network, to improve recognition accuracy and the accuracy of the alignment between phonemes and acoustic features. Experiments on English and Czech corpora show a reduction in both the number of parameters and the phone error rate, and the proposed model outperforms acoustic models based on hidden Markov models and on Connectionist Temporal Classification.
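To illustrate why a Minimal Gated Unit can shrink a recurrent model, the following is a minimal sketch of one MGU step in its commonly published form (a single forget gate plus a candidate state), together with a rough parameter count against GRU and LSTM layers. The function names, the dimensions (40 input features, 256 hidden units), and the assumption that the model here uses this standard formulation are illustrative only and are not taken from the experiments.

```python
import math


def gated_rnn_param_count(input_dim, hidden_dim, num_blocks):
    """Rough parameter count for one recurrent layer.

    Each block is one weight matrix over the concatenation [h_prev, x]
    plus one bias vector. LSTM uses 4 blocks, GRU 3, MGU 2.
    """
    per_block = hidden_dim * (hidden_dim + input_dim) + hidden_dim
    return num_blocks * per_block


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def mgu_step(x, h_prev, W_f, b_f, W_h, b_h):
    """One MGU time step (standard formulation, assumed here).

    f       = sigmoid(W_f [h_prev, x] + b_f)        # single forget gate
    h_tilde = tanh(W_h [f * h_prev, x] + b_h)       # candidate state
    h       = (1 - f) * h_prev + f * h_tilde        # interpolated update
    """
    f = [sigmoid(z) for z in affine(W_f, b_f, h_prev + x)]
    gated = [fi * hi for fi, hi in zip(f, h_prev)]
    h_tilde = [math.tanh(z) for z in affine(W_h, b_h, gated + x)]
    return [(1.0 - fi) * hi + fi * ci for fi, hi, ci in zip(f, h_prev, h_tilde)]


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for name, blocks in (("LSTM", 4), ("GRU", 3), ("MGU", 2)):
        print(name, gated_rnn_param_count(40, 256, blocks))
```

Under these assumed sizes, the MGU layer needs half the parameters of an LSTM layer and two thirds of a GRU layer, because it keeps only one gate.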