Abstract:
To avoid the complexity of manually extracting acoustic features in traditional speech emotion recognition tasks, this paper proposes the Sinc-Transformer (SincNet-Transformer) model for speech emotion recognition from raw speech. The model combines the advantages of SincNet and the Transformer encoder: SincNet filters capture important narrow-band emotional features directly from the raw speech waveform, keeping the feature extraction process of the whole network interpretable and completing the shallow feature extraction of the raw speech signal, and two Transformer encoder layers then perform a second stage of processing to extract deeper feature vectors containing global contextual information. On the four-class speech emotion recognition task of the IEMOCAP database, experimental results show that the accuracy and unweighted average recall of the proposed Sinc-Transformer model are 64.14% and 65.28%, respectively. Compared with the baseline models, the proposed model effectively improves speech emotion recognition performance.
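For illustration only, the snippet below is a minimal PyTorch-style sketch of the pipeline described above (waveform front-end, two Transformer encoder layers, four-class output). The layer sizes are hypothetical, and the learnable sinc band-pass front-end of SincNet is approximated here by a plain Conv1d placeholder rather than the authors' actual implementation.

```python
# Minimal structural sketch of a SincNet-style front-end + Transformer-encoder classifier.
# Layer sizes and the Conv1d stand-in for the learnable sinc band-pass filters are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class SincTransformerSketch(nn.Module):
    def __init__(self, n_filters=80, kernel_size=251, d_model=128,
                 n_heads=4, num_classes=4):
        super().__init__()
        # Shallow feature extraction from the raw waveform. SincNet would use
        # band-pass filters parameterized only by their cutoff frequencies;
        # a plain Conv1d serves as a placeholder here.
        self.front_end = nn.Sequential(
            nn.Conv1d(1, n_filters, kernel_size, stride=10, padding=kernel_size // 2),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
            nn.MaxPool1d(3),
        )
        self.proj = nn.Linear(n_filters, d_model)
        # Two Transformer encoder layers extract features with global context.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)  # 4 IEMOCAP emotion classes

    def forward(self, waveform):                      # waveform: (batch, samples)
        x = self.front_end(waveform.unsqueeze(1))     # (batch, n_filters, frames)
        x = self.proj(x.transpose(1, 2))              # (batch, frames, d_model)
        x = self.encoder(x)                           # contextualized frame features
        return self.classifier(x.mean(dim=1))         # average pooling over time


if __name__ == "__main__":
    model = SincTransformerSketch()
    logits = model(torch.randn(2, 16000))   # two 1-second clips at 16 kHz
    print(logits.shape)                     # torch.Size([2, 4])
```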