Multimodal Feature Translating Embedding Based Scene Graph Generation

  • Abstract: Scene graph generation is a hot research direction in computer vision that bridges low-level and high-level visual tasks. A scene graph is composed of triplets of the form <subject-predicate-object>, and the model needs to encode the global visual information of the whole image to assist scene understanding. However, current models still struggle with special visual relations such as one-to-many, many-to-one, and symmetric relations. Based on the similarity between knowledge graphs and scene graphs, we migrate translating embedding models from knowledge graphs to the scene graph generation field. To better encode such visual relations, this paper proposes a multimodal feature translating embedding based scene graph generation framework, which re-projects the extracted multimodal features, such as visual and semantic features, and uses the re-projected features to predict predicate categories, thereby constructing better relation representations without significantly increasing model complexity. The framework encapsulates and complements the scene-graph implementations of almost all existing translating embedding models: four translating embedding models (TransE, TransH, TransR, TransD) are applied to the scene graph generation task, and the types of visual relations that each model is suited to are elaborated in detail. The proposed framework also extends the traditional way these models are applied: besides serving as an independent scene graph generation model, it is designed as a plug-and-play sub-module that can be inserted into other network models. Experiments are conducted on Visual Genome, a large-scale semantic understanding dataset, and the results fully validate the effectiveness of the proposed framework; moreover, the richer distribution of predicted categories shows that the framework helps alleviate the long-tail bias problem in the dataset.
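For readers unfamiliar with translating embeddings, the sketch below summarizes the standard scoring functions of the four models named in the abstract. These formulations come from the knowledge-graph literature (not from this article); in the scene-graph setting the subject and object embeddings play the roles of the head entity h and tail entity t, and the predicate plays the role of the relation r. How the framework re-projects the multimodal visual and semantic features before applying these scores is not specified in the abstract.

```latex
% Standard translating-embedding score functions from the knowledge-graph
% literature (TransE, TransH, TransR, TransD).  In the scene-graph setting,
% h = subject embedding, t = object embedding, r = predicate embedding.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\begin{align*}
% TransE: the predicate acts as a translation in a shared embedding space,
% so a valid triplet satisfies h + r \approx t.
f_{\mathrm{TransE}}(h,r,t) &= \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert \\[4pt]
% TransH: entities are first projected onto a relation-specific hyperplane
% with unit normal w_r, which helps with one-to-many / many-to-one relations.
\mathbf{h}_{\perp} &= \mathbf{h} - (\mathbf{w}_r^{\top}\mathbf{h})\,\mathbf{w}_r, \qquad
\mathbf{t}_{\perp} = \mathbf{t} - (\mathbf{w}_r^{\top}\mathbf{t})\,\mathbf{w}_r \\
f_{\mathrm{TransH}}(h,r,t) &= \lVert \mathbf{h}_{\perp} + \mathbf{d}_r - \mathbf{t}_{\perp} \rVert \\[4pt]
% TransR: entities are mapped into a separate relation space by a matrix M_r.
f_{\mathrm{TransR}}(h,r,t) &= \lVert \mathbf{M}_r\mathbf{h} + \mathbf{r} - \mathbf{M}_r\mathbf{t} \rVert \\[4pt]
% TransD: the mapping matrices are built dynamically from entity and relation
% projection vectors, giving each entity-relation pair its own mapping.
\mathbf{M}_{rh} &= \mathbf{r}_p\mathbf{h}_p^{\top} + \mathbf{I}, \qquad
\mathbf{M}_{rt} = \mathbf{r}_p\mathbf{t}_p^{\top} + \mathbf{I} \\
f_{\mathrm{TransD}}(h,r,t) &= \lVert \mathbf{M}_{rh}\mathbf{h} + \mathbf{r} - \mathbf{M}_{rt}\mathbf{t} \rVert
\end{align*}
\end{document}
```

A smaller norm indicates a more plausible triplet. The hyperplane and matrix projections of TransH, TransR, and TransD are what allow the same subject or object embedding to participate in one-to-many, many-to-one, and symmetric relations without collapsing the predicate vectors, which is why different visual relation types favor different members of the family.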

     
