Query-by-example spoken term detection by applying the HDPHMM tokenizer

CAO Jian-kai, ZHANG Lian-hai

Citation: CAO Jian-kai, ZHANG Lian-hai. Query-by-example spoken term detection by applying the HDPHMM tokenizer[J]. Journal of Signal Processing, 2017, 33(5): 703-710. DOI: 10.16798/j.issn.1003-0530.2017.05.007

Funding: National Natural Science Foundation of China (61673395, 61403415, 61302107)
More information
    Corresponding author:

    ZHANG Lian-hai   E-mail: lianhaiz@sina.com

  • CLC number: TP391

  • Abstract: This paper proposes an unsupervised query-by-example spoken term detection (QbE-STD) method based on a hierarchical Dirichlet process hidden Markov model (HDPHMM) tokenizer. The method first applies a hidden Markov model with two layers of states, in which the top-layer states represent the discovered acoustic units and the bottom-layer states model the emission probabilities of the top-layer states. Imposing a hierarchical Dirichlet process prior on the top-layer states yields the nonparametric Bayesian model HDPHMM. The model is trained on unlabeled speech data and then outputs posteriorgram feature vectors for the test utterances and the query examples. The posteriorgrams are refined with a non-negative matrix factorization algorithm to obtain new features, on which a modified segmental dynamic time warping (SDTW) algorithm performs the search, forming the QbE-STD system. Experimental results show that the proposed method outperforms a baseline system based on a Gaussian mixture model tokenizer and significantly improves detection precision.
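The back end of the pipeline described in the abstract (posteriorgrams, non-negative matrix factorization, DTW search) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the HDPHMM tokenizer is not modeled and is replaced by hand-made toy posteriorgrams; the factorization uses Lee-Seung multiplicative updates on the stacked test and query frames, with the row-normalized activations taken as the new features; the frame distance is the negative log inner product; and the search is a plain subsequence DTW rather than the modified segmental DTW used in the paper. All function names are hypothetical.

```python
import numpy as np


def toy_posteriorgram(labels, n_units, peak=0.9):
    """Stand-in for tokenizer output: one row per frame, rows sum to 1."""
    P = np.full((len(labels), n_units), (1.0 - peak) / (n_units - 1))
    P[np.arange(len(labels)), labels] = peak
    return P


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate V (n x m, non-negative) as W @ H using Lee-Seung
    multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def nmf_features(test_post, query_post, rank):
    """Assumed feature scheme: factor test and query frames jointly so they
    share one basis, then use row-normalized activations as features."""
    V = np.vstack([test_post, query_post])
    W, _ = nmf(V, rank)
    W = W / (W.sum(axis=1, keepdims=True) + 1e-9)
    return W[: len(test_post)], W[len(test_post):]


def frame_distance(Q, T, eps=1e-9):
    """Negative log inner product between every query and test frame."""
    return -np.log(np.clip(Q @ T.T, eps, None))


def subsequence_dtw(D):
    """Align all query frames (rows of D) to any contiguous span of test
    frames (columns of D). Returns (cost / query length, end frame index)."""
    n, m = D.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = D[0, :]  # the match may start at any test frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + D[i, 0]
        for j in range(1, m):
            acc[i, j] = D[i, j] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    end = int(np.argmin(acc[-1]))
    return acc[-1, end] / n, end


if __name__ == "__main__":
    # Toy data: 5 "acoustic units"; the query units 1, 2, 3 occur in the test.
    test = toy_posteriorgram([0, 0, 1, 1, 2, 2, 3, 3, 4, 4], n_units=5)
    query = toy_posteriorgram([1, 2, 3], n_units=5)

    # Search directly on the raw posteriorgrams.
    print("raw:", subsequence_dtw(frame_distance(query, test)))

    # Search on NMF-derived features (rank chosen arbitrarily here).
    test_f, query_f = nmf_features(test, query, rank=4)
    print("nmf:", subsequence_dtw(frame_distance(query_f, test_f)))
```

A lower normalized cost indicates a better match, so in a detection setting the score would be compared against a threshold; the rank, the iteration count, and that threshold are free parameters of this sketch, not values taken from the paper.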
Publication history
  • Received:  2016-09-12
  • Revised:  2016-11-15
  • Published:  2017-05-24
