Cross-Modality Contrastive Image Generation from Scene Graph
Abstract: Conditional image generation produces realistic and reasonable images from different forms of input, of which the scene graph is a representative form. A scene graph abstracts the objects in an image as vertices and the relationships between them as edges, and has become a widely used structured graph representation in computer vision and cross-modality research. Existing scene-graph-to-image methods lack relation-level and object-level constraints for the multiple objects and complex relationships in a scene graph, so they are prone to semantic inconsistencies between the generated results and the input, such as missing objects and incorrect relationships. To address these problems, we propose a cross-modality contrastive learning method for scene-graph-to-image generation. First, we propose a relation contrastive loss to keep the relationships between generated objects consistent with the input edges: we design union features to represent the relationships between objects in the image and pull them closer to the related edge features than to unrelated ones in the feature space. Second, we introduce an object contrastive loss to keep the generated object regions corresponding to the input vertices: we adopt an attention mechanism to obtain the object features of an image and pull them closer to the related vertex features than to unrelated ones. Finally, we propose a global contrastive loss to make the generated image as a whole consistent with the input scene graph; it pulls related image and scene-graph features closer together and pushes unrelated samples apart. Extensive experiments on the COCO-Stuff and VG benchmarks show the effectiveness of our method: we improve FID over the state-of-the-art performance by 8.33% and 8.87% on COCO-Stuff and VG, respectively. The ablation study shows that each contrastive loss module improves the quality of the generated results, and the visualizations demonstrate the effectiveness of our method in alleviating the problems above. The experimental results show that the proposed cross-modality method not only improves generation quality but also effectively alleviates semantic inconsistencies such as missing objects and incorrect relationships.
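
For concreteness, the sketch below shows one way the three contrastive objectives summarized above could be instantiated as InfoNCE-style losses in PyTorch. All tensor shapes, function names, loss weights, and the temperature are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch (assumption): each consistency objective written as an
# InfoNCE-style contrastive loss; names and hyper-parameters are hypothetical.
import torch
import torch.nn.functional as F

def info_nce(query, keys, pos_idx, temperature=0.1):
    """Pull each query toward its positive key and away from the other keys.

    query:   (B, D) features, e.g. union features of an object pair,
             attention-pooled object features, or a pooled image feature.
    keys:    (K, D) candidate features, e.g. edge embeddings, vertex
             embeddings, or pooled scene-graph embeddings.
    pos_idx: (B,) index of the positive key for each query.
    """
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature   # (B, K) cosine similarities
    return F.cross_entropy(logits, pos_idx)   # softmax over keys, positive wins

def total_contrastive_loss(union_feat, edge_feat, obj_feat, vert_feat,
                           img_feat, graph_feat,
                           edge_pos, vert_pos, graph_pos,
                           w_rel=1.0, w_obj=1.0, w_glb=1.0):
    # Relation consistency: union features of object pairs vs. edge features.
    l_rel = info_nce(union_feat, edge_feat, edge_pos)
    # Object consistency: attended object features vs. vertex features.
    l_obj = info_nce(obj_feat, vert_feat, vert_pos)
    # Global consistency: pooled image features vs. pooled scene-graph features.
    l_glb = info_nce(img_feat, graph_feat, graph_pos)
    return w_rel * l_rel + w_obj * l_obj + w_glb * l_glb

In such a formulation, the negatives for each query are simply the other candidates in the batch, which matches the idea of pulling related pairs closer than unrelated ones in the shared feature space.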