Abstract:
Deep neural network based keyword spotting systems for resource-limited devices have made great progress in recent years, but these methods still require a large number of parameters to reach state-of-the-art performance. In this paper, we focus on the tradeoff between high detection accuracy and small model size. We propose applying a squeeze-and-excitation network and depthwise separable convolution to the keyword spotting task. Specifically, we first improve model performance by explicitly modelling the interdependencies between the channels of the convolutional features with a squeeze-and-excitation network. We then replace the standard convolution with depthwise separable convolution, which greatly reduces the number of parameters. We compare the proposed method with two convolutional neural network based models on the Google Speech Commands dataset. Experimental results show that the proposed method significantly outperforms the comparison methods in terms of both detection accuracy and model size. For example, it achieves a detection accuracy of 96.16% with only 75.5K parameters.
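
To make the two building blocks concrete, the following is a minimal pure-Python sketch, not the configuration used in this paper. It counts the weights of a standard convolution against a depthwise separable one, and runs the usual squeeze-and-excitation steps (global average pooling, two fully connected layers, sigmoid gating, channel-wise rescaling) on a toy feature map. The function names, the 3x3 kernel, the 64-channel layer, and the toy weights are all illustrative assumptions; biases are omitted for brevity.

```python
import math


def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a standard 2-D convolution (bias omitted)."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_params(k_h, k_w, c_in, c_out):
    """Weight count of a depthwise separable convolution (bias omitted).

    Depthwise step: one k_h x k_w filter per input channel.
    Pointwise step: a 1x1 convolution mixing c_in channels into c_out.
    """
    return k_h * k_w * c_in + c_in * c_out


def squeeze_excite(feature_map, w1, w2):
    """Apply squeeze-and-excitation gating to a feature map.

    feature_map: list of C channels, each an H x W list of lists.
    w1: (C // r) x C weight matrix of the first fully connected layer.
    w2: C x (C // r) weight matrix of the second fully connected layer.
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate in (0, 1).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * x for x in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates


# Hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels.
print(standard_conv_params(3, 3, 64, 64))        # 36864
print(depthwise_separable_params(3, 3, 64, 64))  # 4672
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which illustrates why swapping it in shrinks the model; the squeeze-and-excitation block adds only the two small fully connected layers on top.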