LI Yiting, QU Dan, YANG Xukui, ZHANG Hao, SHEN Xiaolong. Speech Recognition Model Based on Improved Linear Attention Mechanism[J]. JOURNAL OF SIGNAL PROCESSING, 2023, 39(3): 516-525. DOI: 10.16798/j.issn.1003-0530.2023.03.014

Speech Recognition Model Based on Improved Linear Attention Mechanism

  • The Conformer model has attracted increasing attention from researchers and has gradually become the mainstream model in the field of speech recognition because of its superior performance. However, because it uses the attention mechanism to extract information from the input, every pair of sample points in the input sequence must be compared interactively, so the computational complexity of the network grows with the square of the input sequence length. The model therefore consumes more computing resources when recognizing long speech sequences, and its recognition speed is slower. To address this problem, this paper proposed a speech recognition method based on a linear attention mechanism. First, a novel gated linear attention structure was proposed to effectively reduce the complexity of the attention computation: multi-head attention was simplified to single-head attention, and the complexity of the attention calculation was reduced to a linear function of the sequence length. Second, to compensate for the decline in modeling ability caused by using linear attention, local attention and global attention were combined with the help of linear attention positional coding. Finally, to further improve recognition performance, a guided attention loss and an intermediate connectionist temporal classification (CTC) loss were added to the objective function on top of the attention loss and CTC loss. Experimental results on the Chinese Mandarin AISHELL-1 dataset and the English LibriSpeech dataset showed that the improved model significantly outperformed the baseline model, reduced GPU memory consumption, and greatly increased training and recognition speed.
