Abstract:
In recent years, emotion recognition has become a hot research topic in the field of human-computer interaction, and multi-modal dimensional emotion recognition, which can detect subtle emotional changes, has attracted increasing attention. Multi-modal emotion recognition requires effective integration of the emotional information carried by the different modalities. To address the problems of effective feature extraction and modal synchronization in feature-level fusion, and the correlation between feature information from different modalities in decision-level fusion, this paper adopts a model-level fusion strategy and proposes a Transformer-based multi-modal dimensional emotion recognition method. An audio model, a video model, and a multi-modal fusion model are constructed to learn deep features from the information streams, and the fused features are then fed into a bi-directional long short-term memory (BiLSTM) network to obtain the final emotion predictions. Compared with several baseline methods, the proposed method achieves the best performance on both arousal and valence, effectively capturing high-level dimensional emotional information and thus better integrating the audio and video modalities.
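To make the model-level fusion pipeline concrete, the following PyTorch sketch pairs per-modality Transformer encoders with a fusion Transformer and a BiLSTM regression head. All layer sizes, the concatenation-based fusion, and the module names are illustrative assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def encoder(d_model, n_heads, n_layers):
    # Helper building a small Transformer encoder stack.
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class TransformerBiLSTMFusion(nn.Module):
    """Sketch of model-level fusion: audio and video Transformer encoders,
    a fusion Transformer over the concatenated streams, and a BiLSTM head
    regressing per-frame arousal and valence (hypothetical dimensions)."""

    def __init__(self, audio_dim=40, video_dim=512, d_model=128,
                 n_heads=4, n_layers=2, lstm_hidden=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)  # project audio features
        self.video_proj = nn.Linear(video_dim, d_model)  # project video features
        self.audio_enc = encoder(d_model, n_heads, n_layers)
        self.video_enc = encoder(d_model, n_heads, n_layers)
        self.fusion_enc = encoder(2 * d_model, n_heads, n_layers)
        self.bilstm = nn.LSTM(2 * d_model, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden, 2)  # (arousal, valence)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim); video: (batch, time, video_dim);
        # the two streams are assumed pre-aligned to a common frame rate.
        a = self.audio_enc(self.audio_proj(audio))
        v = self.video_enc(self.video_proj(video))
        fused = self.fusion_enc(torch.cat([a, v], dim=-1))
        out, _ = self.bilstm(fused)
        return self.head(out)  # per-frame dimensional predictions
```

For example, calling the model on aligned inputs of shape (8, 100, 40) and (8, 100, 512) yields a (8, 100, 2) tensor of per-frame arousal and valence estimates.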