A Study on Tibetan Dialect Speech Recognition Based on Whisper

Abstract: Although the Whisper large speech model was trained on 680,000 hours of multilingual data, its original architecture does not cover the Tibetan speech recognition task. Directly applying the model's built-in fine-tuning method to Tibetan speech recognition still faces the following problems: (1) the shared representation space is dominated by high-resource languages such as English, so the model learns Tibetan-specific characteristics insufficiently; (2) the model's built-in byte-level encoding splits Tibetan syllables indiscriminately, breaking character structure and losing semantic information; (3) Tibetan training data is scarce, making it difficult for the model to adequately learn Tibetan linguistic patterns; (4) Tibetan dialects share a single writing system, so in mixed-dialect training the model struggles to distinguish how the same syllable is pronounced in different dialects, which causes severe cross-dialect confusion and raises the recognition error rate. To address these problems, this paper proposes an improved Tibetan dialect speech recognition method within the Whisper multilingual pre-training framework, aiming to help the model learn the commonalities and differences among dialects and thereby improve recognition robustness and accuracy across dialect scenarios. First, a Tibetan byte pair encoding (BPE) model is built, and the Whisper vocabulary is extended with different modeling units (such as letters, BPE subwords, and phonemes) to systematically compare how different encoding strategies affect final recognition performance. Second, a dialect discrimination auxiliary mechanism is added alongside the model's original speech recognition task to strengthen its ability to distinguish Tibetan dialects. Finally, based on an analysis of the recognition results, an external language model is incorporated through rescoring and shallow fusion to improve decoding results and further improve audio-text consistency. Experimental results show that, compared with direct full-parameter fine-tuning of the model, fine-tuning with the proposed method and BPE-100 modeling units, together with language-model-optimized decoding, reduces the character error rate (CER) from 45.80% to 9.56%. The model's ability to handle long Tibetan text sequences also improves: the maximum processable sequence length is three times that of the original.
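As a rough illustration of two ideas named in the abstract, the sketch below computes a character error rate and re-ranks an n-best list with an external language model score. It is not the paper's implementation. The function names, the `lm_weight` value, and the `lm_score` callable are hypothetical, and CER is counted here over Unicode code points, which may differ from the unit the paper uses for Tibetan. Only n-best rescoring is shown; shallow fusion would apply a similar weighted sum at each beam search step rather than to complete hypotheses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length.

    Assumption: a "character" is one Unicode code point of the string.
    """
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def rescore(nbest, lm_score, lm_weight=0.3):
    """Pick the hypothesis maximizing asr_log_prob + lm_weight * lm_log_prob.

    nbest:     list of (text, asr_log_prob) pairs from the recognizer
    lm_score:  hypothetical callable returning a log-probability for a text
    lm_weight: hypothetical interpolation weight (the paper's value is not given)
    """
    best_text, _ = max(nbest, key=lambda item: item[1] + lm_weight * lm_score(item[0]))
    return best_text
```

With `lm_weight=0` the function returns the recognizer's own top hypothesis, so the weight controls how strongly the language model can overturn the acoustic ranking.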

     
