ZHANG Weizhao, LI Haoyuan, YANG Hongwu. Tibetan speech emotion recognition under low-resource conditions[J]. Journal of Signal Processing, 2025, 41(9): 1558-1569. DOI: 10.12466/xhcl.2025.09.009.

Tibetan Speech Emotion Recognition Under Low-Resource Conditions

  • In recent years, although significant progress has been made in speech emotion recognition (SER) research for major languages, studies on low-resource languages still face numerous challenges in dataset construction, feature extraction, and recognition model design. To address Tibetan speech emotion recognition under low-resource conditions, our study first constructed the Tibetan Emotion Speech Dataset-2500 (TESD-2500) through video clipping, audio extraction and enhancement, and manual annotation and verification. The dataset covers four emotion types (anger, sadness, happiness, and neutral) and contains 2500 speech samples; its emotion categories and sample size are still being expanded. We then designed a multi-feature fusion speech emotion recognition model incorporating cross-attention and co-attention mechanisms. A Bidirectional Long Short-Term Memory network (BiLSTM) was employed to model the temporal dynamics of Mel-Frequency Cepstral Coefficients (MFCCs) and extract dynamic temporal representations from the speech signal. AlexNet was used to extract time-frequency features from spectrograms and capture the joint time-frequency distribution patterns of the speech signal. A cross-attention mechanism computed the correlation weights between these two types of heterogeneous features. The large-scale pre-trained model WavLM was introduced to extract deep semantic features from the speech signal, and, using the cross-attention results as weight vectors, a co-attention mechanism performed a weighted reconstruction of these deep features. The MFCC temporal features, the spectrogram time-frequency features, and the weighted pre-trained deep features were concatenated to form a multi-level fused representation, which was then mapped to the emotion category space via fully connected layers to classify Tibetan speech emotions. Experimental results demonstrated that the proposed model achieved a Weighted Accuracy (WA) of 76.56% and an Unweighted Accuracy (UA) of 75.42% on the TESD-2500 dataset, significantly outperforming baseline models. The study also evaluated the model's generalization capability on the IEMOCAP and EmoDB datasets, achieving 74.27% WA and 73.60% UA on IEMOCAP, and 92.61% WA and 91.68% UA on EmoDB. The methodology and results presented in the paper may also serve as a reference for speech emotion recognition research in other low-resource languages.
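The abstract describes the fusion architecture only at a high level, so the following is a minimal PyTorch-style sketch of how the three branches (BiLSTM over MFCCs, AlexNet over spectrograms, and WavLM deep features) could be combined through cross-attention weights and co-attention reweighting. The feature dimensions, pooling choices, and exact attention formulation are assumptions made for illustration, not details taken from the paper; in particular, the cross-attention is reduced here to an element-wise scaled dot-product gate, and the WavLM features are assumed to be precomputed (e.g., with microsoft/wavlm-base via HuggingFace transformers).

```python
# Minimal sketch of the multi-feature fusion model described in the abstract.
# Assumptions (not specified in the abstract): feature dimensions, pooling,
# and the exact cross-/co-attention formulation.
import torch
import torch.nn as nn
from torchvision.models import alexnet


class FusionSERModel(nn.Module):
    def __init__(self, n_mfcc=39, wavlm_dim=768, hidden=256, n_classes=4):
        super().__init__()
        # Branch 1: BiLSTM over MFCC frames -> dynamic temporal representation.
        self.bilstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        # Branch 2: AlexNet convolutional stack over the spectrogram image.
        self.cnn = alexnet(weights=None).features
        self.cnn_pool = nn.AdaptiveAvgPool2d(1)
        self.cnn_proj = nn.Linear(256, 2 * hidden)
        # Branch 3: projection of (precomputed) WavLM frame embeddings.
        self.wavlm_proj = nn.Linear(wavlm_dim, 2 * hidden)
        # Classification head over the concatenated fused representation.
        self.classifier = nn.Sequential(
            nn.Linear(3 * 2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, n_classes))

    def forward(self, mfcc, spectrogram, wavlm_feats):
        # mfcc: (B, T, n_mfcc); spectrogram: (B, 1, H, W); wavlm_feats: (B, T', wavlm_dim)
        lstm_out, _ = self.bilstm(mfcc)                  # (B, T, 2*hidden)
        mfcc_vec = lstm_out.mean(dim=1)                  # utterance-level temporal feature

        spec = spectrogram.repeat(1, 3, 1, 1)            # AlexNet expects 3 channels
        spec_vec = self.cnn_pool(self.cnn(spec)).flatten(1)
        spec_vec = self.cnn_proj(spec_vec)               # (B, 2*hidden)

        # "Cross-attention": correlation weights between the two heterogeneous
        # features, reduced here to an element-wise scaled dot-product gate.
        scores = (mfcc_vec * spec_vec) / (2 * self.bilstm.hidden_size) ** 0.5
        cross_weights = torch.sigmoid(scores)            # (B, 2*hidden)

        # "Co-attention": reweight the deep WavLM representation with those weights.
        wavlm_vec = self.wavlm_proj(wavlm_feats.mean(dim=1))
        wavlm_weighted = cross_weights * wavlm_vec

        fused = torch.cat([mfcc_vec, spec_vec, wavlm_weighted], dim=-1)
        return self.classifier(fused)                    # logits over 4 emotions


# Dummy forward pass with batch size 2 (WavLM features would normally come from
# a pretrained model rather than random tensors).
model = FusionSERModel()
logits = model(torch.randn(2, 300, 39), torch.randn(2, 1, 224, 224), torch.randn(2, 149, 768))
print(logits.shape)  # torch.Size([2, 4])
```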
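For reference, the WA and UA figures quoted above follow the usual conventions in SER evaluation: WA is overall sample-level accuracy, while UA is the mean of per-class recalls, so minority emotion classes count equally. A small NumPy sketch with hypothetical toy labels follows.

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """Weighted Accuracy = overall accuracy; Unweighted Accuracy = mean per-class recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

# Toy example with the four TESD-2500 classes (0=anger, 1=sadness, 2=happiness, 3=neutral).
print(wa_ua([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 3, 0]))  # (0.666..., 0.75)
```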
