Adversarial Robustness Enhancement Technology for Deep Image Recognition Based on Multi-Model Orthogonalization
Abstract: In recent years, deep neural networks (DNNs) have been widely and successfully applied to computer-vision tasks such as image classification, object detection, and image segmentation. However, DNN models are inherently vulnerable and still face security risks from adversarial attacks: by maliciously adding small perturbations that are imperceptible to the human eye, an attacker can cause a model to produce an incorrect output with high confidence. Ensembling multiple DNN models has become an effective way to improve adversarial robustness against such attacks. However, adversarial examples transfer among the sub-models of an ensemble, which can greatly reduce the ensemble's defensive effectiveness, and an intuitive theoretical analysis of how to reduce this internal adversarial transferability is still lacking. This paper introduces the concept of a loss field to quantitatively describe the adversarial transferability between DNN models, derives an upper bound on the transferability expression, and shows that promoting orthogonality between model loss fields and reducing the strength of those fields (Promoting Orthogonality and Reducing Strength, PORS) limits this bound and hence the adversarial transferability between DNN models. The PORS term is introduced as a penalty added to the original loss function, so that the ensemble maintains its recognition performance on clean data while enhancing overall adversarial robustness by reducing the adversarial transferability among its sub-models. Experiments on ensembles trained with PORS on the CIFAR-10 and MNIST datasets, compared against other state-of-the-art ensemble defense methods under both white-box and black-box attacks, show that PORS significantly improves adversarial robustness, maintains very high recognition accuracy on clean data and under white-box attacks, is especially effective against black-box transfer attacks, and is the most stable of the ensemble defense methods evaluated.
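To make the PORS idea concrete, the following is a minimal sketch (not the authors' implementation) of how such a penalty could be attached to an ensemble training loss in PyTorch. It assumes the loss field of a sub-model is the gradient of that sub-model's loss with respect to the input; the function name pors_loss and the weights lambda_orth and lambda_str are illustrative assumptions.

```python
# Hypothetical sketch of a PORS-style penalty; names are illustrative,
# not taken from the paper's code. The "loss field" of sub-model i is
# interpreted as the gradient of its loss with respect to the input x.
# The penalty (a) pushes the loss fields of different sub-models toward
# orthogonality (small pairwise cosine similarity) and (b) reduces their
# strength (small gradient norms).
import torch
import torch.nn.functional as F

def pors_loss(models, x, y, lambda_orth=1.0, lambda_str=0.1):
    """Average ensemble cross-entropy plus a PORS-style penalty."""
    x = x.clone().detach().requires_grad_(True)
    ce_total = 0.0
    grads = []
    for model in models:
        loss_i = F.cross_entropy(model(x), y)
        ce_total = ce_total + loss_i
        # Loss field: gradient of the sub-model loss w.r.t. the input.
        # create_graph=True keeps the penalty itself differentiable.
        (g,) = torch.autograd.grad(loss_i, x, create_graph=True)
        grads.append(g.flatten(start_dim=1))  # shape: (batch, d)

    # Orthogonality term: mean squared cosine similarity over model pairs.
    orth = x.new_zeros(())
    n_pairs = 0
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            cos_ij = F.cosine_similarity(grads[i], grads[j], dim=1)
            orth = orth + (cos_ij ** 2).mean()
            n_pairs += 1
    orth = orth / max(n_pairs, 1)

    # Strength term: mean squared L2 norm of each loss field.
    strength = sum(g.pow(2).sum(dim=1).mean() for g in grads) / len(grads)

    return ce_total / len(models) + lambda_orth * orth + lambda_str * strength
```

Under these assumptions, the penalty is differentiable through the input gradients (a double-backward pass), so it can be minimized alongside the classification loss with a standard optimizer during ensemble training.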