MA Like, LI Guanyu. A study on Tibetan dialect speech recognition based on Whisper[J]. Journal of Signal Processing, 2025, 41(12): 1980-1991. DOI: 10.12466/xhcl.2025.12.010.

A Study on Tibetan Dialect Speech Recognition Based on Whisper

Abstract: Although the Whisper large-scale speech model was trained on 680,000 hours of multilingual data, its original architecture does not support Tibetan speech recognition. Directly applying Whisper-based fine-tuning to Tibetan speech recognition therefore faces the following issues: (1) the shared representation space is dominated by high-resource languages such as English, leading to insufficient learning of Tibetan-specific characteristics; (2) the model's default byte-level tokenization indiscriminately splits Tibetan syllables, breaking character structures and losing semantic information; (3) the scarcity of Tibetan training data makes it difficult for the model to adequately capture Tibetan linguistic patterns; and (4) because different Tibetan dialects share the same writing system, training on mixed dialects makes it hard to distinguish pronunciation variants of the same syllable across dialects, causing severe cross-dialect confusion and increased recognition errors. To address these issues, this paper proposes an improved Tibetan dialect speech recognition method within the Whisper multilingual pre-training framework, aiming to help the model learn the commonalities and differences among dialects and thereby improve recognition robustness and accuracy across dialect scenarios. First, a Tibetan-specific byte pair encoding (BPE) model was constructed and the Whisper vocabulary was expanded with different modeling units (letters, BPE subwords, and phonemes) to systematically compare the impact of each encoding strategy on final recognition performance. Second, a dialect discrimination auxiliary mechanism was introduced alongside the original speech recognition task to enhance the model's ability to distinguish Tibetan dialects. Finally, based on an analysis of the recognition results, an external language model was incorporated via rescoring and shallow fusion to improve decoding outcomes and further enhance audio-text consistency. Experimental results show that fine-tuning with the proposed method using BPE-100 modeling units and incorporating a language model to optimize decoding reduces the character error rate (CER) from 45.80% for direct full-parameter fine-tuning to 9.56%. In addition, the model's ability to process long Tibetan text sequences improves, with the maximum processable sequence length increasing approximately threefold.
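The tokenization issue described in point (2) can be illustrated with plain Python, no ASR model required. The example below is a minimal sketch (the greeting string is an illustrative sample, not from the paper's corpus): every Tibetan codepoint occupies three bytes in UTF-8, so a byte-level BPE can split a single letter mid-sequence, whereas splitting on the tsheg delimiter (U+0F0B) keeps syllables intact.

```python
# Sketch: why byte-level tokenization fragments Tibetan script.
# Tibetan syllables are delimited by the tsheg mark "་" (U+0F0B); each
# Tibetan codepoint (U+0F00-U+0FFF) occupies 3 bytes in UTF-8, so a
# byte-level unit can land in the middle of a letter.
text = "བཀྲ་ཤིས་བདེ་ལེགས"  # "Tashi Delek", a common greeting (16 codepoints)

# Byte-level view: every codepoint becomes 3 opaque bytes.
byte_units = list(text.encode("utf-8"))

# Syllable-level view: split on the tsheg delimiter instead.
syllables = text.split("\u0f0b")

print(len(byte_units))  # 48 byte-level units
print(len(syllables))   # 4 syllables
```

A syllable- or subword-aware vocabulary (such as the paper's BPE-100 units) therefore never produces tokens that cut through a letter's byte sequence.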
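The dialect discrimination auxiliary mechanism amounts to a multi-task objective: the recognition loss is combined with a weighted dialect-classification loss. The sketch below shows the general shape of such an objective; the weight `alpha`, the function names, and the dialect labels are illustrative assumptions, not the paper's exact configuration.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class (toy helper)."""
    return -math.log(probs[target])

def joint_loss(asr_loss, dialect_probs, dialect_label, alpha=0.1):
    """Multi-task objective: ASR loss plus a weighted auxiliary
    dialect-discrimination loss. `alpha` is an illustrative weight,
    not a value taken from the paper."""
    return asr_loss + alpha * cross_entropy(dialect_probs, dialect_label)

# Toy step: ASR loss of 2.0; the dialect head assigns 0.7 to the
# true dialect, so the auxiliary term adds alpha * (-ln 0.7).
probs = {"u-tsang": 0.7, "kham": 0.2, "amdo": 0.1}
loss = joint_loss(2.0, probs, "u-tsang")
```

The auxiliary gradient nudges the shared encoder toward dialect-separable representations, which is what reduces the cross-dialect confusion described in point (4).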
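Shallow fusion, one of the two decoding improvements mentioned above, interpolates the acoustic model's log-probabilities with an external language model's at each decoding step. The following is a minimal single-step sketch; the variable names and the weight `lam` are assumptions for illustration, and in practice the weight is tuned on a development set.

```python
import math

def shallow_fusion_step(asr_logprobs, lm_logprobs, lam=0.3):
    """Pick the next token by score = log P_asr(token) + lam * log P_lm(token).
    Tokens unknown to the LM receive -inf LM score in this toy version."""
    fused = {
        tok: asr_logprobs[tok] + lam * lm_logprobs.get(tok, -math.inf)
        for tok in asr_logprobs
    }
    return max(fused, key=fused.get)

# Toy example: the acoustic model slightly prefers token "B", but the
# external LM strongly prefers "A", flipping the decision under fusion.
asr = {"A": math.log(0.45), "B": math.log(0.55)}
lm = {"A": math.log(0.9), "B": math.log(0.1)}
print(shallow_fusion_step(asr, lm))           # "A" (LM flips the choice)
print(shallow_fusion_step(asr, lm, lam=0.0))  # "B" (acoustic model alone)
```

Rescoring, the other technique named in the abstract, applies the same kind of LM score to complete n-best hypotheses after decoding rather than token by token.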