Semi-supervised Microblog Text Sentiment Classification Based on a Global Feature Graph
Abstract: With the spread of online social networking, short texts such as microblog posts have developed forms of expression and ways of venting emotion that differ from those of traditional articles, which makes machine-learning-based sentiment analysis of short texts increasingly difficult. To address these new characteristics of microblog language, we crawl a large amount of microblog data without sentiment labels and build a microblog short-text corpus. From the whole corpus we construct a global relationship graph between words and short texts, use BERT (Bidirectional Encoder Representations from Transformers) document embeddings as the feature values of the graph nodes, and apply graph convolution for feature propagation and feature extraction between nodes. A sample of the unlabeled microblog data is manually annotated, and a semi-supervised machine learning method combined with the global relationship graph is used to improve the performance of the sentiment classifier. Experiments show that as the proportion of unlabeled data increases, the method captures global features better and the accuracy of sentiment classification improves. Comparative experiments on the self-built manually labeled data, the COAE2014 dataset and the NLP&CC2014 dataset show that the method performs well in both precision and recall.
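The graph construction and feature propagation summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the binary word-to-text edges, the random vectors standing in for BERT document embeddings, and the helper names (build_graph, normalize, gcn_layer) are all assumptions made for illustration. It applies one standard graph convolution step, H' = ReLU(Â H W) with Â = D^{-1/2}(A + I)D^{-1/2}, on a graph whose nodes are short texts and words.

```python
import math
import random


def build_graph(docs):
    """Build a text-word graph: text nodes first, then one node per word.

    Edges are binary (a word occurs in a text); this weighting is an
    assumption of the sketch, not something stated in the abstract.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    n_docs = len(docs)
    n = n_docs + len(vocab)
    index = {w: n_docs + i for i, w in enumerate(vocab)}
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loop, i.e. A + I
    for d, text in enumerate(docs):
        for w in text.split():
            adj[d][index[w]] = 1.0
            adj[index[w]][d] = 1.0
    return vocab, adj


def normalize(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_hat, feats, weight):
    """One graph convolution step: ReLU(a_hat @ feats @ weight)."""
    z = matmul(matmul(a_hat, feats), weight)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical toy corpus standing in for the microblog corpus.
docs = ["good movie", "bad movie", "good day"]
vocab, adj = build_graph(docs)
a_hat = normalize(adj)

# Random placeholders for node features; the paper uses BERT embeddings.
rng = random.Random(0)
dim_in, dim_out = 4, 2
feats = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in adj]
weight = [[rng.uniform(-1, 1) for _ in range(dim_out)] for _ in range(dim_in)]

hidden = gcn_layer(a_hat, feats, weight)  # one row per graph node
```

In a semi-supervised setting of this kind, every node (labeled or not) takes part in the propagation step, while the classification loss would typically be computed only on the manually labeled text nodes. That is how unlabeled texts can still shape the learned representations.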