Tibetan Speech Emotion Recognition Under Low-Resource Conditions
Abstract: In recent years, although significant progress has been made in speech emotion recognition (SER) for major languages, research on low-resource languages still faces numerous challenges in dataset construction, feature extraction, and recognition model design. To address Tibetan speech emotion recognition under low-resource conditions, this study first constructed the Tibetan Emotion Speech Dataset-2500 (TESD-2500) through video clipping, audio extraction and enhancement, and manual annotation and verification. The dataset covers four emotion types (anger, sadness, happiness, and neutral) and contains 2500 speech samples; its emotion categories and sample size are still being expanded. We then designed a multi-feature fusion speech emotion recognition model that combines cross-attention and co-attention mechanisms. A Bidirectional Long Short-Term Memory network (BiLSTM) models the temporal dynamics of Mel-Frequency Cepstral Coefficients (MFCCs) to extract dynamic temporal representations of the speech signal, while AlexNet extracts time-frequency features from spectrograms to capture the signal's joint time-frequency distribution patterns. A cross-attention mechanism then computes correlation weights between these two types of heterogeneous features. The large-scale pre-trained model WavLM is introduced to extract deep features from the speech signal, and, using the cross-attention results as weight vectors, a co-attention mechanism performs a weighted reconstruction of these deep features. The MFCC temporal features, the spectrogram time-frequency features, and the weighted pre-trained deep features are concatenated into a multi-level fused representation, which is mapped to the emotion category space by fully connected layers to perform Tibetan speech emotion classification. Experimental results show that the proposed model achieves a weighted accuracy (WA) of 76.56% and an unweighted accuracy (UA) of 75.42% on the TESD-2500 dataset, significantly outperforming the baseline models. The model's generalization ability was further evaluated on the IEMOCAP and EmoDB datasets, reaching 74.27% WA and 73.60% UA on IEMOCAP, and 92.61% WA and 91.68% UA on EmoDB. The methodology and results presented in this paper may also serve as a reference for speech emotion recognition research on other low-resource languages.
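To make the fusion pipeline described in the abstract more concrete, the following PyTorch sketch shows one possible reading of the architecture: a BiLSTM over MFCCs, an AlexNet-style convolutional encoder over spectrograms, cross-attention between the two feature streams, reuse of the resulting attention weights to pool WavLM features (a simplified stand-in for the co-attention step), and a fully connected classifier over the concatenated representation. All layer sizes (n_mfcc=40, hidden=128, wavlm_dim=768), the truncated AlexNet-style encoder, the length-alignment shortcut, and the class name MultiFeatureFusionSER are illustrative assumptions and are not taken from the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureFusionSER(nn.Module):
    """Minimal sketch of the multi-feature fusion SER model (assumed dimensions)."""

    def __init__(self, n_mfcc=40, hidden=128, wavlm_dim=768, n_classes=4):
        super().__init__()
        # Temporal modelling of the MFCC sequence
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        # AlexNet-style convolutional encoder for the spectrogram (heavily simplified)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.cnn_proj = nn.Linear(192, 2 * hidden)
        # Cross-attention between the spectrogram feature and the MFCC sequence
        self.cross_attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        # Projection of WavLM features into the shared feature space
        self.wavlm_proj = nn.Linear(wavlm_dim, 2 * hidden)
        self.classifier = nn.Linear(3 * 2 * hidden, n_classes)

    def forward(self, mfcc, spectrogram, wavlm_feats):
        # mfcc: (B, T, n_mfcc); spectrogram: (B, 1, F, T); wavlm_feats: (B, T', wavlm_dim)
        seq, _ = self.bilstm(mfcc)                       # (B, T, 2*hidden)
        spec = self.cnn(spectrogram).flatten(1)          # (B, 192)
        spec = self.cnn_proj(spec).unsqueeze(1)          # (B, 1, 2*hidden)
        # Cross-attention: the spectrogram feature attends to the MFCC sequence
        fused, attn_w = self.cross_attn(spec, seq, seq)  # (B, 1, 2*hidden), (B, 1, T)
        # Co-attention step (simplified): reuse the cross-attention weights to
        # pool the projected WavLM sequence
        wavlm = self.wavlm_proj(wavlm_feats)             # (B, T', 2*hidden)
        if wavlm.size(1) != attn_w.size(-1):
            # Illustrative shortcut: align sequence lengths by interpolation
            wavlm = F.interpolate(wavlm.transpose(1, 2),
                                  size=attn_w.size(-1)).transpose(1, 2)
        wavlm_pooled = torch.bmm(attn_w, wavlm)          # (B, 1, 2*hidden)
        # Concatenate the three feature streams and classify
        feats = torch.cat([seq.mean(dim=1),
                           fused.squeeze(1),
                           wavlm_pooled.squeeze(1)], dim=-1)
        return self.classifier(feats)

The key design idea mirrored here is that the attention weights computed between the MFCC and spectrogram streams are not discarded but reused to reweight the pre-trained WavLM features, so the deep representation is guided by the agreement between the two hand-crafted feature views; the exact form of the co-attention in the paper may differ from this single-pooling simplification.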