An Improved Attention Based Acoustics Model with Minimal Gate Unit

LONG Xing-yan, QU Dan, ZHANG Wen-lin, XU Si-ying

Citation: LONG Xing-yan, QU Dan, ZHANG Wen-lin, XU Si-ying. An Improved Attention Based Acoustics Model with Minimal Gate Unit[J]. JOURNAL OF SIGNAL PROCESSING, 2018, 34(6): 739-748. DOI: 10.16798/j.issn.1003-0530.2018.06.013


Funding: National Natural Science Foundation of China (61673395, 61403415); Natural Science Foundation of Henan Province (162300410331)
Details
  • CLC number: TN912.3

An Improved Attention Based Acoustics Model with Minimal Gate Unit

  • Abstract: Attention-based acoustic models built on the "encoder-decoder" architecture suffer from a large parameter scale, slow convergence, and inaccurate alignment between phonemes and acoustic features in noisy conditions. To address these problems, this paper first introduces the Minimal Gated Unit to reduce the number of model parameters and shorten training time. It then applies an adaptive-width window function and adds a pooling layer to the convolutional neural network that computes the attention-coefficient features, further improving the accuracy of the alignment between phonemes and acoustic features and thereby the recognition accuracy. Experiments on English and Czech corpora show that the improved model reduces both the parameter scale and the phone error rate, and that its recognition performance surpasses acoustic models based on hidden Markov models and on Connectionist Temporal Classification.
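The Minimal Gated Unit named in the abstract replaces the LSTM's three gates with a single forget gate that both resets the previous hidden state and interpolates between it and the candidate state. A rough sketch of one recurrent step, following the MGU formulation of Zhou et al. (2016), is shown below; the NumPy code, toy dimensions, and weight initialization are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, Wf, bf, Wh, bh):
    """One step of a Minimal Gated Unit.

    A single forget gate f_t both gates the previous state inside the
    candidate computation and interpolates old state vs. candidate.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)                    # forget gate
    z_cand = np.concatenate([f_t * h_prev, x_t])
    h_cand = np.tanh(Wh @ z_cand + bh)            # candidate state
    return (1.0 - f_t) * h_prev + f_t * h_cand    # new hidden state

# toy dimensions: 4-dim acoustic features, 3-dim hidden state
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
Wf = rng.standard_normal((n_hid, n_hid + n_in)) * 0.1
bf = np.zeros(n_hid)
Wh = rng.standard_normal((n_hid, n_hid + n_in)) * 0.1
bh = np.zeros(n_hid)

h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):  # 5 feature frames
    h = mgu_step(x, h, Wf, bf, Wh, bh)
print(h.shape)  # (3,)
```

Because the MGU has only two weight matrices where the LSTM has four, its recurrent layers carry roughly half the parameters at the same hidden size, which is the source of the parameter and training-time reduction claimed in the abstract.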
  • Cited by journal articles (1)

    1. YU Jianqiang, YAN Yan, LIU Wei, SUN Yiming. Research on an acoustic model for speech recognition based on an improved gated-unit neural network. Journal of Changchun University of Science and Technology (Natural Science Edition), 2020(01): 104-111.

    Other citation types: 1

Publication history
  • Received:  2017-10-22
  • Revised:  2018-03-30
  • Published:  2018-06-24
