Abstract:
Aviation data are becoming increasingly high-dimensional and massive, while traditional models often lack the computing resources to process them. To address this problem, this paper proposes a Spark-based parallel flight delay prediction model that takes meteorological data into account. Spark DataFrames are used to fuse flight data with meteorological data, so that weather data from different hours are attached to each flight record. A parallelization method is then used to partition the features of the random forest and generate the trees, so that flight delay prediction can be carried out quickly. The experimental results show that both recall and accuracy improve after meteorological data are integrated. When predicting different delay times, prediction accuracy is higher for larger delay thresholds. In addition, the parallel model converges faster than the single-machine model and achieves a better speedup ratio.
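The fusion step can be illustrated with a minimal sketch. It uses plain Python dictionaries rather than Spark DataFrames, and every name in it (`attach_weather`, `lag_hours`, the `origin`, `scheduled_departure`, `wind_speed`, and `visibility` fields, and the sample values) is a hypothetical choice made for illustration, not taken from the paper. It assumes weather observations are keyed by airport and hour, and that "different hours" means the departure hour plus a few preceding hours.

```python
from datetime import datetime, timedelta


def attach_weather(flight, weather_by_key, lag_hours=(0, 1, 2)):
    """Return a copy of `flight` with weather fields from several hours attached.

    `weather_by_key` maps (airport, hour) to a dict of weather measurements.
    Missing observations are recorded as None instead of raising an error.
    """
    fused = dict(flight)  # copy so the input record is left unchanged
    # Truncate the scheduled departure time to the start of its hour.
    dep_hour = flight["scheduled_departure"].replace(
        minute=0, second=0, microsecond=0
    )
    for lag in lag_hours:
        key = (flight["origin"], dep_hour - timedelta(hours=lag))
        obs = weather_by_key.get(key, {})
        # Hypothetical weather fields; a real data set would define its own.
        for name in ("wind_speed", "visibility"):
            fused[f"{name}_lag{lag}h"] = obs.get(name)
    return fused


# Made-up sample data for illustration only.
weather_by_key = {
    ("PEK", datetime(2020, 1, 1, 8)): {"wind_speed": 4.0, "visibility": 9.0},
    ("PEK", datetime(2020, 1, 1, 7)): {"wind_speed": 6.5, "visibility": 3.0},
}
flight = {
    "flight_id": "XX100",
    "origin": "PEK",
    "scheduled_departure": datetime(2020, 1, 1, 8, 40),
}
fused = attach_weather(flight, weather_by_key)
```

On a cluster, the same idea would presumably be expressed as joins between a flight DataFrame and a weather DataFrame on airport and hour, with one join per lag; the dictionary lookup here only stands in for that join.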