Text Categorization Combining Latent Topic Information and Convolutional Semantic Features

Abstract: Classical probabilistic topic models discover the latent topic information of documents from word co-occurrences and are widely used in text clustering and categorization. In recent years, with the successful application of word embeddings and neural network models to natural language processing, neural-network-based methods have become the mainstream of text categorization research. This paper compares a Convolutional Neural Network (CNN) with probabilistic topic models on topic categorization and shows that the CNN performs better on this task. On this basis, the paper uses the CNN to extract a feature vector for each document and names it the Convolutional Semantic Feature. To better capture the topic information of documents, the paper adds the latent topic distribution of each document to its Convolutional Semantic Feature, yielding a more effective document representation. Experimental results show that, compared with a probabilistic topic model or a CNN used alone, the new representation markedly improves the F1 score on topic categorization tasks.
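The combination described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two representations are joined, so plain concatenation is an assumption here, as are the L2 normalization step and the helper names `l2_normalize` and `combine_features`. The input vectors are toy values standing in for a CNN feature vector and a topic-model distribution; they are not results from the paper.

```python
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = sum(x * x for x in vec) ** 0.5
    if norm == 0.0:
        return [float(x) for x in vec]
    return [x / norm for x in vec]


def combine_features(conv_features: Sequence[float],
                     topic_dist: Sequence[float]) -> List[float]:
    """Join a CNN feature vector with a document-topic distribution.

    Assumption: the two parts are simply concatenated. The CNN part is
    L2-normalized and the topic part is renormalized to sum to 1, so that
    neither part dominates purely because of its scale.
    """
    if any(p < 0 for p in topic_dist):
        raise ValueError("topic probabilities must be non-negative")
    total = sum(topic_dist)
    if total <= 0:
        raise ValueError("topic distribution must have positive mass")
    topics = [p / total for p in topic_dist]
    return l2_normalize(conv_features) + topics


# Toy inputs (hypothetical): a 2-dimensional CNN feature and a 3-topic distribution.
conv = [3.0, 4.0]
topics = [0.7, 0.2, 0.1]
fused = combine_features(conv, topics)
print(len(fused))
```

The fused vector would then be passed to whatever classifier is used for topic categorization; a real system would take `conv` from a trained CNN's penultimate layer and `topics` from a fitted topic model.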

     
