ZHANG Ruonan, AN Gaoyun. Multimodal Feature Translating Embedding Based Scene Graph Generation[J]. JOURNAL OF SIGNAL PROCESSING, 2023, 39(1): 51-60. DOI: 10.16798/j.issn.1003-0530.2023.01.006

Multimodal Feature Translating Embedding Based Scene Graph Generation

Scene graph generation is an active research direction in computer vision that bridges low-level and high-level visual tasks. A scene graph is composed of triplets of the form <subject-predicate-object>, and a model must encode comprehensive global visual information from the whole image to assist scene understanding. However, current models still struggle with special visual relationships such as one-to-many, many-to-one, and symmetric relations. Exploiting the similarity between knowledge graphs and scene graphs, we migrate translating embedding models from the knowledge graph field to scene graph generation. To better encode the visual relations mentioned above, we propose a multimodal feature translating embedding based scene graph generation framework, which reprojects the extracted multimodal features, such as visual and semantic features, and uses the reprojected features to predict predicate categories, thereby constructing better relational representations without significantly increasing model complexity. The framework encapsulates and complements almost all existing translating embedding models for scene graph generation: four translating embedding models (TransE, TransH, TransR, TransD) are applied to the scene graph generation task, and the types of visual relations each model is suited to are elaborated. The proposed framework also extends the traditional application approach: in addition to serving as a standalone scene graph generation model, it can be implemented as a plug-and-play sub-module inserted into other network models. Experiments were conducted on the large-scale semantic understanding dataset Visual Genome, and the results fully validate the effectiveness of our method. We also observe a richer distribution of predicted predicate categories, demonstrating that the proposed method helps alleviate the long-tail bias problem in the dataset.
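To illustrate the translating-embedding idea the abstract builds on, the sketch below shows TransE-style predicate scoring in a scene graph setting. This is a minimal toy example under our own assumptions, not the paper's actual architecture: all names, dimensions, and the random embeddings are hypothetical. TransE models a triple <subject-predicate-object> via the constraint e_s + e_p ≈ e_o, so a predicate can be scored by how close e_o − e_s lies to each candidate predicate embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_predicates = 8, 5

# Hypothetical learned embeddings: one vector per predicate class.
predicate_emb = rng.normal(size=(num_predicates, dim))

def score_predicates(subj_feat, obj_feat, pred_emb):
    """Return per-predicate scores: negative L2 distance between
    (obj - subj) and each predicate embedding (higher = better)."""
    diff = obj_feat - subj_feat                      # e_o - e_s
    dists = np.linalg.norm(pred_emb - diff, axis=1)  # ||e_p - (e_o - e_s)||
    return -dists

# Toy subject/object features (in practice these would come from a
# detector and the multimodal reprojection described in the abstract).
subj = rng.normal(size=dim)
obj = subj + predicate_emb[2]  # constructed to match predicate 2

scores = score_predicates(subj, obj, predicate_emb)
print(int(np.argmax(scores)))  # → 2, by construction
```

TransH, TransR, and TransD replace this single shared space with relation-specific hyperplanes or projection matrices, which is what lets them represent one-to-many, many-to-one, and symmetric relations that plain TransE handles poorly.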