Abstract:
Conventional keyword search (KWS) systems rely on automatic speech recognition (ASR), which is difficult to train when resources are scarce. To avoid training a complete ASR system, ASR-free KWS systems are becoming increasingly popular. This paper proposes an end-to-end (E2E) KWS system consisting of two encoders, two decoders, an attention mechanism, and a discriminator. The attention mechanism merges the text and audio features output by the encoders to locate the keywords. The system was tested on the Babel Assamese and Pashto datasets under different combinations of text and audio decoders. Experimental results show that the system achieves better detection performance than the baseline KWS system. Compared with the ASR-based KWS system, it achieves better results on out-of-vocabulary (OOV) keywords in terms of the supremum term weighted value (STWV) metric. When the amount of training data is limited, the system is more competitive than the ASR-based KWS system.