End-to-End ASR-Free Keyword Search System Based on Attention Mechanism and Multitask Learning

  • Abstract: Conventional keyword search (KWS) systems rely on automatic speech recognition (ASR), which is often difficult to train when resources are limited. To avoid training a complete ASR system, ASR-free KWS systems are attracting growing interest. This paper proposes an end-to-end (E2E) KWS system consisting of two encoders, two decoders, an attention mechanism, and a discriminator. The attention mechanism merges the text and audio features output by the encoders, which helps locate the keyword in the audio. The system is tested on the Babel Assamese and Pashto datasets under different combinations of text and audio decoders. Experimental results show that the system achieves better detection performance than the baseline KWS system. Compared with ASR-based KWS systems, it achieves better results on out-of-vocabulary (OOV) keywords in terms of the Supremum Term Weighted Value (STWV) metric. When the amount of training data is limited, the system has an advantage over ASR-based KWS systems.
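The idea of merging text and audio encoder outputs through attention can be illustrated with a minimal sketch. The abstract does not state which form of attention is used, so the code below assumes scaled dot-product attention with a single keyword embedding as the query and the audio frame encodings as keys and values. The names `attend`, `keyword_vec`, and `audio_frames` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(keyword_vec, audio_frames):
    """Merge a text embedding with audio frame encodings.

    keyword_vec  -- hypothetical text-encoder output, a list of d floats
    audio_frames -- hypothetical audio-encoder output, T lists of d floats
    Returns (weights, context): one attention weight per frame and the
    weighted sum of the frames. Scaled dot-product scoring is an
    assumption, not the paper's stated method.
    """
    d = len(keyword_vec)
    scores = [
        sum(k * a for k, a in zip(keyword_vec, frame)) / math.sqrt(d)
        for frame in audio_frames
    ]
    weights = softmax(scores)
    context = [
        sum(w * frame[i] for w, frame in zip(weights, audio_frames))
        for i in range(d)
    ]
    return weights, context


# Toy input: the second frame is the most similar to the keyword embedding.
keyword = [1.0, 0.0]
frames = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
weights, context = attend(keyword, frames)

# The frame with the largest weight is a rough hint at the keyword position.
best_frame = max(range(len(weights)), key=weights.__getitem__)
```

In this toy case the largest weight falls on the frame that best matches the keyword embedding, which is the sense in which attention can help locate a keyword. How the merged representation feeds the decoders and the discriminator is not specified in the abstract and is left out of the sketch.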

     
