Abstract:
In high-dimensional data, a large portion of the features are often uninformative about the class of the objects, which degrades the classification accuracy of the original random forest algorithm. To address this problem, a random forest algorithm using stratified subspaces and weighted trees is proposed. In addition, the traditional single-machine mode cannot meet the computational efficiency requirements of high-dimensional data. The proposed algorithm is therefore implemented on Spark, a new cluster-computing framework, to exploit its advantages in memory caching and iterative computation. Taking the decision tree as the unit, stratified sampling is used to generate the feature subspace of each tree, which can improve the performance of the individual decision trees in the forest while preserving their diversity. A weighted-tree integration strategy is then used so that trees with stronger classification ability have more influence in the integration process. Experiments on the MNIST and Gisette datasets show that the proposed algorithm performs better than the original random forest algorithm and two other algorithms, and that it has good scalability. The proposed algorithm could be an effective method for classifying high-dimensional data.
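
The two ideas named above, stratified feature subspaces and weighted trees, can be illustrated with a minimal single-machine sketch. The abstract does not specify how features are stratified or how tree weights are computed, so the following are assumptions made only for illustration: features are split into a "strong" and a "weak" stratum by comparing a hypothetical per-feature informativeness score with the mean score, each stratum contributes to the subspace in proportion to its size, and each tree's weight is a hypothetical accuracy estimate. The function names `stratified_subspace` and `weighted_vote` are likewise invented here, and the sketch does not use Spark.

```python
import random
from collections import defaultdict


def stratified_subspace(scores, subspace_size, rng):
    """Sample feature indices for one tree from a strong and a weak stratum.

    scores: hypothetical per-feature informativeness scores (higher is better).
    Assumption: features scoring above the mean form the strong stratum.
    """
    mean = sum(scores) / len(scores)
    strong = [i for i, s in enumerate(scores) if s > mean]
    weak = [i for i, s in enumerate(scores) if s <= mean]
    if not strong or subspace_size < 2:
        # Degenerate case: fall back to plain random feature sampling.
        k = min(subspace_size, len(scores))
        return sorted(rng.sample(range(len(scores)), k))
    # Proportional allocation, keeping at least one feature from each stratum.
    n_strong = round(subspace_size * len(strong) / len(scores))
    n_strong = max(1, min(n_strong, subspace_size - 1, len(strong)))
    n_weak = min(subspace_size - n_strong, len(weak))
    return sorted(rng.sample(strong, n_strong) + rng.sample(weak, n_weak))


def weighted_vote(predictions, weights):
    """Combine one predicted label per tree by summing the tree weights.

    weights: hypothetical per-tree weights, e.g. an accuracy estimate.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.9, 0.8, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01]
    print(stratified_subspace(scores, 4, rng))
    # One accurate tree outvotes two weaker trees that disagree with it.
    print(weighted_vote(["a", "b", "b"], [0.9, 0.3, 0.4]))
```

With plain random feature sampling, a tree built on this toy score vector could receive only weak features; the stratified draw guarantees that every subspace contains at least one strong feature, while random selection within each stratum keeps the trees different from one another.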