Speech Emotion Recognition Through Kernel Canonical Correlation Analysis
Abstract: Speech is the most important tool for humans to express thoughts and exchange emotions, and it is an essential part of human culture. Speech emotion recognition (SER), an important task in affective computing, has become an international research hotspot and is attracting growing attention. Neuroscience research has shown that the brain is the physical basis for producing and regulating emotions. Therefore, the study of speech emotion should not consider the speech signal alone; brain activity signals should also be integrated into speech emotion recognition to achieve higher recognition accuracy. Based on this idea, this paper proposes a speech feature extraction method based on Kernel Canonical Correlation Analysis (KCCA). The method maps speech features and electroencephalogram (EEG) features into a high-dimensional Hilbert space and computes the maximum correlation coefficient between them. KCCA projects the speech features, in that space, onto the direction most correlated with the EEG features, finally yielding speech features that carry EEG information. By incorporating emotion-related EEG information into speech emotion feature extraction, the resulting features represent emotion more accurately. The method is also transferable in principle: when the extracted EEG features are sufficiently accurate and representative, the projection vectors learned by KCCA generalize and can be applied directly to new speech emotion datasets without re-acquiring or re-computing the corresponding EEG signals. Experimental results on a self-built speech emotion database and the public MSP-IMPROV database show that classification with the projected speech features outperforms both the original speech features and other speech feature extraction methods.
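For reference, the following is a minimal sketch of the standard regularized KCCA objective in dual form (our own notation, not taken from the paper body: $K_s$ and $K_e$ denote the kernel Gram matrices of the speech and EEG features, $\alpha$ and $\beta$ the dual projection directions, and $\epsilon$ a small regularization constant; the paper's exact formulation may differ):

\[
\rho \;=\; \max_{\alpha,\beta}\;
\frac{\alpha^{\top} K_s K_e \,\beta}
     {\sqrt{\alpha^{\top}\left(K_s^{2}+\epsilon K_s\right)\alpha}\;
      \sqrt{\beta^{\top}\left(K_e^{2}+\epsilon K_e\right)\beta}}
\]

Once the directions are learned, each $\alpha^{(j)}$ yields one component of the projected feature vector for a new utterance $x$ via the speech kernel alone, $z_j(x)=\sum_i \alpha_i^{(j)}\, k_s(x_i, x)$; no EEG recording of $x$ is needed at test time, which is what makes the learned projection transferable to new, speech-only datasets as the abstract claims.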