Ensemble One-class Classification Based on BLS-Autoencoder
Abstract: Anomaly detection is a classic research problem in pattern recognition. Under extreme class imbalance, however, anomalous samples are scarce and the training set contains only normal samples, so conventional anomaly detection methods are difficult to apply. One-class classification (OC) has therefore attracted increasing attention: it builds a decision boundary from target-class samples alone in order to identify non-target samples. Although many state-of-the-art OC algorithms have been proposed, they still have limitations: (1) they usually train in the original feature space, which is easily affected by noisy features; (2) most rely on a single OC model, which makes it difficult to learn comprehensive decision boundaries from multiple feature subspaces; (3) all training samples are treated equally, with no targeted learning of the samples that earlier models under-fit. To address these problems, this paper proposes an ensemble one-class classification algorithm based on a BLS-autoencoder (EOC-BLSAE). First, motivated by the strong generalization of the broad learning system (BLS) and the autoencoder, a one-class BLS-autoencoder (OC-BLSAE) is designed. OC-BLSAE suppresses noisy features by efficiently learning the nonlinear mapping between the original feature space and the reconstructed feature space; the reconstruction errors of the training samples establish the decision boundary for the target class, and test samples with large reconstruction errors are recognized as anomalies. Next, to build OC-BLSAE models from multiple perspectives, a one-class boosting strategy is proposed that iteratively trains OC-BLSAE models on subsets containing under-fitted samples: minimizing the overall reconstruction loss of the ensemble increases the sampling weights of previously under-fitted samples, and the reliability of each OC-BLSAE model is evaluated adaptively. Finally, the predictions of the trained OC-BLSAE models are combined by weighted voting, which improves the accuracy and robustness of the overall algorithm. Parameter, comparison, significance-test, and ablation experiments on 16 OC tasks of different scales show that the proposed algorithm is flexible in parameter selection, that its modules reinforce one another, and that it outperforms state-of-the-art OC approaches in overall accuracy and robustness.
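The pipeline summarized above — a BLS-style random feature mapping with a ridge-regression readout that reconstructs the input, a reconstruction-error threshold as the one-class boundary, boosted resampling of under-fitted samples, and weighted voting — can be sketched in a few dozen lines. This is an illustrative reconstruction under assumed choices (a tanh feature mapping, a 95th-percentile error threshold, inverse-mean-error model weights), not the authors' implementation; the names `OCBLSAESketch`, `boosted_ensemble`, and `ensemble_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class OCBLSAESketch:
    """Illustrative OC-BLSAE: random broad features + ridge reconstruction."""

    def __init__(self, n_hidden=64, ridge=1e-2, quantile=95):
        self.n_hidden, self.ridge, self.quantile = n_hidden, ridge, quantile

    def fit(self, X):
        d = X.shape[1]
        # Fixed random nonlinear feature mapping (BLS keeps these weights fixed).
        self.W = rng.standard_normal((d, self.n_hidden))
        self.b = rng.standard_normal(self.n_hidden)
        Z = np.tanh(X @ self.W + self.b)
        # Closed-form ridge readout mapping the broad features back to the input.
        A = Z.T @ Z + self.ridge * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, Z.T @ X)
        # Decision boundary from training reconstruction errors (assumed quantile).
        self.threshold = np.percentile(self.errors(X), self.quantile)
        return self

    def errors(self, X):
        Z = np.tanh(X @ self.W + self.b)
        return np.linalg.norm(Z @ self.beta - X, axis=1)

    def predict(self, X):
        # +1 = target class, -1 = anomaly (large reconstruction error).
        return np.where(self.errors(X) <= self.threshold, 1, -1)

def boosted_ensemble(X, rounds=5, subset=300):
    """One-class boosting sketch: resample under-fitted samples each round."""
    models, alphas = [], []
    w = np.full(len(X), 1.0 / len(X))               # uniform sampling weights
    for _ in range(rounds):
        idx = rng.choice(len(X), size=subset, p=w)  # draw a weighted subset
        model = OCBLSAESketch().fit(X[idx])
        err = model.errors(X)
        models.append(model)
        alphas.append(1.0 / (err.mean() + 1e-12))   # reliability ~ 1 / loss
        w = err / err.sum()                         # emphasize under-fitted samples
    return models, np.array(alphas)

def ensemble_predict(models, alphas, X):
    votes = np.array([m.predict(X) for m in models])  # shape (rounds, n)
    return np.where(alphas @ votes >= 0, 1, -1)       # weighted vote

# Toy demo: target class near the origin, anomalies far away.
X_train = rng.normal(0.0, 1.0, size=(500, 8))
models, alphas = boosted_ensemble(X_train)
pred_target = ensemble_predict(models, alphas, X_train)
pred_anomaly = ensemble_predict(models, alphas, rng.normal(8.0, 0.5, size=(20, 8)))
```

Because each round resamples the full training set with weights proportional to the previous model's reconstruction error, later models concentrate on the samples earlier models fit poorly, and the weighted vote lets more reliable models dominate the final decision.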