Document Region Image Classification via Feature Extraction and Machine Learning Algorithms

  • Abstract: Document region image classification is crucial for understanding and analyzing document layout images. In conventional machine learning classifiers, feeding an image directly as the input yields a model with too many parameters to be trained. To overcome this difficulty, we design a set of effective features for document region images and propose a document region classification algorithm based on these features and machine learning. For the features, we extract 32 dimensions covering five aspects (geometry, grayscale, region, texture, and content) to make them more discriminative across region classes. For the classifiers, we run experiments on the proposed features with conventional machine learning models, an AutoML method, and deep learning. Experimental results on a public dataset show that the proposed document region classification algorithm achieves high classification accuracy and is highly efficient. In addition, we implement a simple stepwise page layout analysis algorithm to demonstrate the generalization ability of the proposed region classification algorithm.
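The general idea of replacing raw pixels with a short hand-crafted feature vector can be sketched as follows. This is only a minimal illustration under stated assumptions: the seven features, the function names `region_features`, `fit_centroids`, and `predict`, the darkness threshold, and the nearest-centroid classifier are all hypothetical stand-ins chosen for brevity. They are not the 32 features or the classifiers evaluated in the paper, and content features are omitted entirely.

```python
from statistics import mean, pstdev


def region_features(img, threshold=128):
    """Return a small illustrative feature vector for a grayscale region.

    img is a non-empty list of equal-length rows of 0-255 intensities.
    The seven features are hypothetical examples, not the paper's 32.
    """
    height, width = len(img), len(img[0])
    pixels = [p for row in img for p in row]
    dark = sum(1 for p in pixels if p < threshold)
    # Texture proxy: mean absolute difference between horizontal neighbours.
    diffs = [abs(row[x + 1] - row[x]) for row in img for x in range(width - 1)]
    return [
        float(width),                   # geometry: width in pixels
        float(height),                  # geometry: height in pixels
        width / height,                 # geometry: aspect ratio
        mean(pixels),                   # grayscale: mean intensity
        pstdev(pixels),                 # grayscale: intensity spread
        dark / len(pixels),             # region: fraction of dark pixels
        mean(diffs) if diffs else 0.0,  # texture: horizontal variation
    ]


def fit_centroids(samples):
    """Average the feature vectors of each label; samples is [(vector, label)]."""
    groups = {}
    for vec, label in samples:
        groups.setdefault(label, []).append(vec)
    return {label: [mean(col) for col in zip(*vecs)]
            for label, vecs in groups.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], vec))


# Toy data: a striped "text-like" patch and a uniform "figure-like" patch.
text_patch = [[0, 255] * 4 for _ in range(4)]
flat_patch = [[200] * 8 for _ in range(4)]
centroids = fit_centroids([
    (region_features(text_patch), "text"),
    (region_features(flat_patch), "figure"),
])
print(predict(centroids, region_features(text_patch)))
```

The classifier here sees only seven numbers per region regardless of the region's pixel size, which is the property that keeps the model small. In a realistic setting the features would be normalized to comparable scales before any distance-based classifier is applied.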

     
