WANG Penghui, HU Bo, MAO Zhendong. Cross-Modality Contrastive Image Generation from Scene Graph[J]. JOURNAL OF SIGNAL PROCESSING, 2022, 38(6): 1222-1231. DOI: 10.16798/j.issn.1003-0530.2022.06.009

Cross-Modality Contrastive Image Generation from Scene Graph

Conditional image generation produces realistic and reasonable images from different forms of input, of which the scene graph is a representative form. A scene graph abstracts the objects in an image and their relationships as vertices and edges, and has become a widely used structured graph representation in computer vision and cross-modality fields. Existing scene-graph-to-image methods lack relation-level and object-level constraints on the multiple objects and complex inter-object relationships in the scene graph, so the generated results are prone to semantic inconsistencies with the input, such as missing objects and relationship errors. To address these problems, a method based on cross-modality contrastive learning is proposed for scene-graph-to-image generation. First, a relational contrastive loss is proposed to keep the relations between generated objects consistent with the input edges: union features are designed to represent the relationships between objects, and these features are pulled closer to their related edges than to unrelated edges in the feature space. Then, an object contrastive loss is introduced to keep the generated object regions consistent with the input vertices: an attention mechanism is adopted to extract object features from the image, which are pulled closer to their related vertices than to unrelated vertices. Finally, a global contrastive loss is proposed to make the generated image consistent with the input scene graph, bringing related image-graph pairs closer together and pushing unrelated samples apart. Extensive experiments on the COCO-Stuff and VG benchmark datasets demonstrate the effectiveness of the method, with FID improvements of 8.33% and 8.87% over the state-of-the-art performance on COCO-Stuff and VG, respectively. In addition, the ablation study shows that each contrastive loss module improves the quality of the generated results, and the visualizations demonstrate that the above problems are alleviated. The experimental results show that the proposed cross-modality method not only improves generation quality but also effectively alleviates semantic inconsistencies such as missing objects and relationship errors.
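As a minimal illustration of the kind of contrastive objective the abstract describes (closest in spirit to the global image-graph loss), the sketch below shows an InfoNCE-style loss that pulls matching image/graph feature pairs together and pushes unrelated pairs in the same batch apart. All names, dimensions, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a cross-modality contrastive (InfoNCE-style) loss:
# related image/graph pairs are pulled together, unrelated pairs pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats: torch.Tensor,
                     graph_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch: row i of image_feats matches row i of graph_feats."""
    # Normalize so that dot products equal cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    graph_feats = F.normalize(graph_feats, dim=-1)

    # Pairwise similarities between every image and every graph in the batch.
    logits = image_feats @ graph_feats.t() / temperature  # shape (B, B)

    # The related (matching) pair sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-graph and graph-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The relation-level and object-level losses described above would follow the same pattern, with union features paired against edge embeddings and attention-pooled object features paired against vertex embeddings, respectively.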