Random forest algorithm using stratified subspaces and weighted trees based on Spark

Abstract: In high dimensional data, a large portion of the features are only weakly correlated with the class label, which lowers the classification accuracy of the original random forest algorithm. To address this, a random forest algorithm using stratified subspaces and weighted trees is proposed. Because the traditional single-machine mode cannot meet the computational demands of high dimensional data, the proposed algorithm is implemented on Spark, an open-source cluster-computing framework, to exploit its advantages in in-memory caching and iterative computation. The algorithm applies stratified sampling at the level of the individual decision tree to generate feature subspaces, which improves the performance of each tree while preserving diversity among the trees. It also adopts a weighted-tree ensemble strategy so that trees with stronger classification ability have more influence on the final decision. Experiments on the Mnist and Gisette datasets show that, compared with the original random forest algorithm, the TWRF algorithm, and the stratified subspace random forest algorithm, the proposed algorithm achieves higher accuracy, better generalization error performance, and good scalability, making it an effective method for classifying high dimensional data.
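The two ideas named in the abstract, per-tree stratified feature sampling and weighted-tree voting, can be illustrated with a minimal Python sketch. The abstract does not give the stratification rule or the weight definition, so everything here is an assumption made for illustration: the per-feature `scores`, the split into a strong and a weak stratum at the mean score, the proportional allocation, and the use of a per-tree accuracy estimate as the voting weight. The function names `stratified_subspace` and `weighted_vote` are hypothetical and do not come from the paper or from Spark.

```python
import random
from collections import defaultdict


def stratified_subspace(scores, size, rng):
    """Draw `size` feature indices for ONE tree from two strata.

    Assumption: `scores` holds one informativeness score per feature, and
    features scoring at or above the mean form the "strong" stratum.
    """
    mean = sum(scores) / len(scores)
    strong = [i for i, s in enumerate(scores) if s >= mean]
    weak = [i for i, s in enumerate(scores) if s < mean]
    # Allocate draws in proportion to stratum size, but always take at
    # least one strong feature so no tree is built on weak features only.
    n_strong = min(len(strong), max(1, round(size * len(strong) / len(scores))))
    n_weak = min(len(weak), size - n_strong)
    return sorted(rng.sample(strong, n_strong) + rng.sample(weak, n_weak))


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    Assumption: `weights[k]` is a quality estimate for tree k, for example
    its accuracy on held-out samples.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

Under these assumptions, calling `weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.4])` selects `"a"`, whereas an unweighted majority vote would select `"b"`; this is the sense in which stronger trees gain influence. In a Spark implementation the per-tree work would be distributed across the cluster, which this single-process sketch does not attempt to show.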

     
