Abstract:
In recent years, affective computing has gradually become one of the keys to the development of human-computer interaction, and emotion recognition, as an important component of affective computing, has received extensive attention. The residual network is one of the most widely used network architectures, and HGFM offers good accuracy and robustness. This paper implemented a facial expression recognition system based on ResNet18 and a speech emotion recognition model based on HGFM; by tuning the parameters, models with better performance were trained. On this basis, we built a multimodal system covering video and audio using two multimodal fusion strategies, namely feature-level fusion and decision-level fusion, and demonstrated the performance advantage of the multimodal emotion recognition system. Feature-level fusion concatenated the visual and audio features into a single large feature vector, which was then fed into a classifier for recognition. In decision-level fusion, after the prediction probabilities of the visual and audio modalities were obtained from their respective classifiers, the weight of each modality and the fusion strategy were determined according to the reliability of each modality, and the classification result was obtained after fusion. Both audio-visual emotion recognition models, one for each fusion strategy, improved accuracy over the video-only and audio-only models, verifying that the multimodal model outperforms the best single-modality model. The accuracy of the fused audio-visual bimodal model reached 76.84%, 3.50% higher than that of the best existing model. The model achieved in this paper therefore performs better in emotion recognition and has a performance advantage over existing audio-visual emotion recognition models.
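
To make the two fusion strategies described above concrete, the following is a minimal sketch, assuming PyTorch. The feature dimensions, the seven-class output, and the 0.6 visual weight are illustrative placeholders, not values reported by the paper; in the decision-level case the paper determines the weights from modality reliability, which is simplified here to a fixed constant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions; the abstract does not specify them.
VISUAL_DIM, AUDIO_DIM, NUM_CLASSES = 512, 256, 7

class FeatureLevelFusion(nn.Module):
    """Feature-level fusion: concatenate the per-modality feature
    vectors into one large vector, then classify it."""
    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(VISUAL_DIM + AUDIO_DIM, NUM_CLASSES)

    def forward(self, visual_feat, audio_feat):
        fused = torch.cat([visual_feat, audio_feat], dim=-1)
        return self.classifier(fused)

def decision_level_fusion(visual_probs, audio_probs, visual_weight=0.6):
    """Decision-level fusion: combine each modality's class
    probabilities using a reliability-based weight (fixed here
    for illustration)."""
    audio_weight = 1.0 - visual_weight
    return visual_weight * visual_probs + audio_weight * audio_probs

# Usage with random stand-in features and probabilities.
v_feat, a_feat = torch.randn(1, VISUAL_DIM), torch.randn(1, AUDIO_DIM)
logits = FeatureLevelFusion()(v_feat, a_feat)

v_probs = F.softmax(torch.randn(1, NUM_CLASSES), dim=-1)
a_probs = F.softmax(torch.randn(1, NUM_CLASSES), dim=-1)
pred = decision_level_fusion(v_probs, a_probs).argmax(dim=-1)
```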