Incomplete Multi-View Clustering Based on Fairness Perception
Abstract: Incomplete multi-view clustering is a technique for multi-source data that identifies consistent and complementary information across views and partitions the data into distinct clusters. Because it handles unsupervised multi-source data analysis in complex environments effectively, it has attracted wide attention. However, existing incomplete multi-view clustering algorithms have two shortcomings. First, they often overlook differences in the data that arise from the sensitive attributes of particular groups, which biases the algorithm against those groups and makes the clustering unfair. Second, missing samples lack sample-specific characteristics once they have been recovered. To address these problems, this paper proposes a fairness-aware incomplete multi-view clustering method that mitigates the unfair treatment of such groups in unsupervised clustering while also handling the consistent fusion of multi-view data and the recovery of missing data. First, an autoencoder is trained for each view, and the multi-view embedded features produced by the encoders are fused for consistency using information theory. At the same time, a generative network is trained to recover the missing view data. When the embedded features are used for clustering, the distribution of the sensitive groups within each cluster is constrained to stay close to their distribution in the whole dataset, so that the algorithm remains fair. Experiments on three commonly used multi-view datasets compare the method with five recent incomplete multi-view clustering methods. On the Bank dataset at a missing rate of 0.5, Normalized Mutual Information (NMI) improves by 0.82% and the fairness measure (Balance) improves by 3.03% over the second-best method; on the Credit Card dataset at a missing rate of 0, NMI improves by 3.53% and Balance improves by 5.62% over the second-best method. Visualization experiments on the Credit Card dataset further verify the performance and fairness of the clustering algorithm, and ablation studies demonstrate the effectiveness of the proposed multi-view consistency fusion and missing-view recovery mechanisms. The proposed method addresses the fairness of unsupervised clustering in the incomplete multi-view setting and improves fairness while preserving clustering performance.
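To make the fairness constraint concrete, the following is a minimal sketch and not the loss used in the paper, whose exact form the abstract does not state. It assumes that the gap between each cluster's sensitive-group distribution and the dataset-wide distribution is measured with a KL divergence, and that Balance is the smallest share any sensitive group holds in any cluster, which is one definition common in the fair clustering literature. The names `fairness_penalty`, `balance`, and `soft_assign` are hypothetical and introduced only for illustration.

```python
import numpy as np


def fairness_penalty(soft_assign, groups, n_groups, eps=1e-12):
    """Cluster-weighted KL divergence between each cluster's sensitive-group
    distribution and the dataset-wide one (assumed form, not the paper's loss).

    soft_assign: (n_samples, n_clusters) assignment probabilities.
    groups:      (n_samples,) integer sensitive-group ids in [0, n_groups).
    """
    onehot = np.eye(n_groups)[groups]              # (n_samples, n_groups)
    global_dist = onehot.mean(axis=0)              # group shares in the dataset
    mass = soft_assign.T @ onehot                  # (n_clusters, n_groups)
    cluster_dist = mass / (mass.sum(axis=1, keepdims=True) + eps)
    kl = np.sum(
        cluster_dist * (np.log(cluster_dist + eps) - np.log(global_dist + eps)),
        axis=1,
    )
    weights = soft_assign.sum(axis=0) / soft_assign.shape[0]
    return float(np.sum(weights * kl))


def balance(labels, groups, n_clusters, n_groups):
    """Smallest share of any sensitive group in any non-empty cluster
    (assumed definition of Balance)."""
    scores = []
    for k in range(n_clusters):
        members = groups[labels == k]
        if members.size == 0:
            continue  # empty clusters are skipped in this sketch
        counts = np.bincount(members, minlength=n_groups)
        scores.append(counts.min() / members.size)
    return float(min(scores)) if scores else 0.0


# Toy data: 8 samples, 2 clusters, 2 sensitive groups.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
soft = np.eye(2)[labels]                           # hard labels as one-hot rows
mixed_groups = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # every cluster mirrors the dataset
split_groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # clusters coincide with groups

print(fairness_penalty(soft, mixed_groups, 2), balance(labels, mixed_groups, 2, 2))
print(fairness_penalty(soft, split_groups, 2), balance(labels, split_groups, 2, 2))
```

In this toy case the mixed grouping gives a penalty close to zero and a Balance of 0.5, while the split grouping gives a penalty of about log 2 and a Balance of 0. In a training setting such a penalty would be added to the clustering objective with a weighting coefficient, which is again an assumption and not a detail taken from the abstract.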