Abstract:
When the trained acoustic models and the test utterances are mismatched because of differences in noise conditions, speaking styles, and speaker traits, mismatched features appear in cross-corpus testing, and the performance of speech emotion recognition degrades drastically. In our work, an auditory attention model is found to be very effective for detecting emotion features that vary across such conditions. We therefore adopt Chirplets to obtain salient gist features, which are shown to relate to the expected performance in cross-corpus testing. In our experiments, a prototypical classifier using the proposed feature extraction approach delivers an accuracy gain of up to 9.6% in cross-corpus speech emotion recognition, and this gain is observed to be insensitive to the choice of database.
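To illustrate what a Chirplet adds over a fixed-frequency analysis atom, the following is a minimal sketch. It assumes the common Gaussian chirplet form (a Gaussian envelope with a linearly swept frequency). The abstract does not specify the parameterization, window sizes, or how gist features are derived, so the function names and all parameter values here are hypothetical. The sketch shows that an atom whose chirp rate matches a frequency glide in the signal responds more strongly than a plain Gabor atom (chirp rate 0) or an atom sweeping the opposite way.

```python
import numpy as np


def gaussian_chirplet(t, tc, fc, c, sigma):
    """Hypothetical Gaussian chirplet atom, normalized to unit L2 norm.

    tc: centre time (s), fc: centre frequency (Hz),
    c: chirp rate (Hz/s), sigma: time spread of the Gaussian envelope (s).
    """
    tau = t - tc
    envelope = np.exp(-0.5 * (tau / sigma) ** 2)
    # Instantaneous frequency is fc + c * tau, i.e. a linear sweep.
    phase = 2 * np.pi * (fc * tau + 0.5 * c * tau ** 2)
    atom = envelope * np.exp(1j * phase)
    return atom / np.linalg.norm(atom)


def chirplet_response(x, t, tc, fc, c, sigma):
    """Magnitude of the inner product <g, x>: how strongly x matches the atom."""
    g = gaussian_chirplet(t, tc, fc, c, sigma)
    return np.abs(np.vdot(g, x))  # np.vdot conjugates its first argument


# Toy signal (assumption): a linear glide starting at 300 Hz, rising 2000 Hz/s.
fs = 16000
t = np.arange(0, 0.1, 1 / fs)
x = np.cos(2 * np.pi * (300 * t + 0.5 * 2000 * t ** 2))

# At tc = 0.05 s the glide passes through 300 + 2000 * 0.05 = 400 Hz.
matched = chirplet_response(x, t, tc=0.05, fc=400, c=2000, sigma=0.01)
gabor = chirplet_response(x, t, tc=0.05, fc=400, c=0, sigma=0.01)
opposite = chirplet_response(x, t, tc=0.05, fc=400, c=-2000, sigma=0.01)
print(f"matched={matched:.3f} gabor={gabor:.3f} opposite={opposite:.3f}")
```

Because emotional speech often carries pitch and formant glides, a bank of such atoms over several chirp rates is one plausible way to capture time-varying spectral structure; how the authors turn chirplet responses into salient gist features is not described in this abstract.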