Abstract:
Document region classification is a crucial task for understanding document images. With conventional machine learning algorithms, taking an image directly as input leads to a model with a large number of parameters, which is difficult to train. To overcome this difficulty, we design a group of effective features for document region images and propose a document region classification framework based on feature extraction and machine learning classifiers. To make the features discriminative, the 32-dimensional feature vector covers five aspects: geometry, grayscale, region, texture, and content. Based on these features, we conduct experiments with conventional machine learning algorithms, an AutoML method, and deep learning. Experimental results on a public dataset demonstrate that the proposed document region classification algorithm achieves higher classification accuracy while maintaining the same efficiency. In addition, we implement a simple stepwise page layout analysis algorithm to demonstrate the generalization ability of the proposed document region classification algorithm.
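
As an illustration only, the following minimal Python sketch shows what hand-crafted features for a document region might look like in four of the five aspects named above (geometry, grayscale, region, texture). It is not the 32-dimensional feature set of this work: the function name `region_features`, the fixed binarization threshold of 128, and the seven features chosen are assumptions made for the sketch, and content features are omitted.

```python
import numpy as np


def region_features(gray, page_shape):
    """Return a 7-dimensional hand-crafted feature vector for one region.

    gray:       2-D array of pixel intensities (0 = black, 255 = white),
                cropped to the region of interest.
    page_shape: (page_height, page_width) of the full page image.

    Hypothetical illustration; not the feature set used in the paper.
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    page_h, page_w = page_shape

    # Geometry: aspect ratio and area relative to the whole page.
    aspect = w / h
    rel_area = (h * w) / (page_h * page_w)

    # Grayscale: mean and standard deviation, scaled to [0, 1].
    mean = gray.mean() / 255.0
    std = gray.std() / 255.0

    # Region: fraction of dark (foreground) pixels under an assumed
    # fixed threshold of 128.
    fg_ratio = float((gray < 128).mean())

    # Texture: mean absolute intensity change between neighbouring
    # pixels, horizontally and vertically, scaled to [0, 1].
    grad_x = np.abs(np.diff(gray, axis=1)).mean() / 255.0 if w > 1 else 0.0
    grad_y = np.abs(np.diff(gray, axis=0)).mean() / 255.0 if h > 1 else 0.0

    return np.array([aspect, rel_area, mean, std, fg_ratio, grad_x, grad_y])
```

A fixed-length vector of this kind can be passed to any standard classifier in place of the raw pixels, which is what keeps the number of model parameters small regardless of the region's size.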