FMA-DETR: A Transformer Object Detection Method Without Encoder


    Abstract: DETR is the first visual model to apply a Transformer to object detection. In the DETR architecture, the Transformer encoder re-encodes image features that are already highly encoded, which to some extent duplicates network functionality. Furthermore, the Transformer encoder's deep multi-layer stack and large parameter count make network optimization difficult and slow model convergence. This study designs a Transformer object detection network without an encoder. Because no Transformer encoder is introduced, the proposed model has fewer parameters, lower computational cost, and faster convergence than DETR. However, directly removing the Transformer encoder reduces the network's expressive power: the Transformer decoder can no longer focus on the image features containing objects among the vast number of image features, which sharply degrades detection performance. To alleviate this problem, this paper proposes a fusion-feature mixing attention (FMA) mechanism that compensates for the reduced feature expressiveness of the detection network through adaptive feature mixing and channel cross-attention; applying it in the Transformer decoder mitigates the performance drop caused by removing the Transformer encoder. On the MS-COCO dataset, the proposed model (called FMA-DETR) achieves performance similar to DETR while converging faster with fewer parameters and lower computational cost. Extensive ablation experiments were also conducted to verify the effectiveness of the proposed method.
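The abstract does not give the exact computation of the FMA mechanism. Purely as an illustrative sketch of the two named ingredients, channel cross-attention (attention computed over the channel dimension rather than the spatial one) and a gated adaptive mix of original and attended features, the following NumPy fragment shows one plausible reading. All function names, shapes, and the sigmoid gating are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_cross_attention(q_feat, kv_feat):
    # q_feat: (N, C) query-side features; kv_feat: (N, C) key/value features.
    # The (C, C) affinity is computed across channels, not spatial positions.
    c_affinity = softmax(q_feat.T @ kv_feat / np.sqrt(q_feat.shape[0]), axis=-1)
    return kv_feat @ c_affinity.T  # (N, C) channel-remixed features

def adaptive_feature_mixing(feat, gate_w):
    # A learned per-channel sigmoid gate (gate_w is a hypothetical weight
    # matrix) blends the original features with the attended ones.
    gate = 1.0 / (1.0 + np.exp(-feat @ gate_w))  # (N, C) in (0, 1)
    return gate * feat + (1.0 - gate) * channel_cross_attention(feat, feat)

# Toy usage with random features.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8))     # 4 tokens, 8 channels
gate_w = rng.standard_normal((8, 8))
mixed = adaptive_feature_mixing(feat, gate_w)
print(mixed.shape)  # (4, 8)
```

The channel-wise affinity is what distinguishes this from standard spatial attention: the O(C^2) attention map reweights feature channels globally, which matches the abstract's claim that FMA restores expressiveness lost when the encoder is removed.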

     
