Speech Recognition Method Based on an Improved Linear Attention Mechanism
Abstract: The Conformer model has attracted growing attention for its superior performance and has gradually become the mainstream model in speech recognition. However, because it uses the attention mechanism to extract information from the input, every sample point in the input sequence must interact with every other, so the computational complexity of the network grows with the square of the input sequence length. Recognizing long speech therefore consumes more computing resources and is slow. To address this problem, this paper proposes a speech recognition method based on a linear attention mechanism. First, a novel gated linear attention structure is proposed that replaces multi-head attention with single-head attention and reduces the complexity of the attention computation to linear in the sequence length. Second, to compensate for the loss of modeling ability caused by linear attention, local attention and global attention are combined in the linear attention computation, together with linear attention positional encoding, to improve recognition accuracy. Finally, to further improve recognition performance, a guided attention loss and an intermediate connectionist temporal classification (CTC) loss are added to the objective function on top of the attention loss and the CTC loss. Experimental results on the Chinese Mandarin dataset AISHELL-1 and the English dataset LibriSpeech show that the improved model clearly outperforms the baseline model, while GPU memory consumption decreases and training and recognition speed improve substantially.
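The complexity argument in the abstract can be illustrated with a minimal sketch of generic kernelized linear attention. This is not the gated structure proposed in the paper: the feature map `elu(x) + 1` and the function names `quadratic_attention` and `linear_attention` are assumptions chosen for illustration. The sketch only shows why reassociating the product, so that keys and values are summarized once, turns a cost quadratic in the sequence length n into a linear one while giving the same output.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumption, not from the paper)."""
    return x + 1.0 if x > 0 else math.exp(x)


def quadratic_attention(Q, K, V):
    """Reference form: every query is scored against every key, O(n^2 * d)."""
    out = []
    for q in Q:
        phi_q = [feature_map(x) for x in q]
        # One similarity weight per key position: n weights for each of n queries.
        weights = [sum(a * feature_map(b) for a, b in zip(phi_q, k)) for k in K]
        norm = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / norm
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result, reassociated: summarize K and V once, O(n * d^2)."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum over positions of phi(k) v^T
    k_sum = [0.0] * d                     # sum over positions of phi(k)
    for k, v in zip(K, V):
        phi_k = [feature_map(x) for x in k]
        for i in range(d):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]
    out = []
    for q in Q:
        phi_q = [feature_map(x) for x in q]
        norm = sum(a * b for a, b in zip(phi_q, k_sum))
        out.append([sum(phi_q[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(d_v)])
    return out


if __name__ == "__main__":
    random.seed(0)
    n, d = 8, 4  # toy sequence length and feature dimension
    Q, K, V = ([[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
               for _ in range(3))
    slow, fast = quadratic_attention(Q, K, V), linear_attention(Q, K, V)
    print(max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb)))
```

In the second function the loop over key positions runs once, independent of the number of queries, so doubling the sequence length roughly doubles the work instead of quadrupling it.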