BEVTrack: An End-to-End 3D Multi-Object Tracking Method Based on Hard Example Mining Training
Summary: Multi-object tracking has become a key component of autonomous driving systems. Its goal is to identify, locate, and label every object of interest in continuous video and point cloud streams. Most current 3D multi-object tracking methods depend on manual, multi-stage parameter tuning to maintain overall tracking performance, and they struggle to model complex occlusion or motion. Existing end-to-end 3D multi-object tracking methods, such as MUTR, generally have low accuracy. The core reason is that feature aggregation and perception are harder in 3D space than in 2D images: a simple network struggles to perform complex 3D feature aggregation, and large amounts of noise and hard examples severely interfere with the model's ability to extract features. To address these problems, this paper proposes BEVTrack, an end-to-end multi-object tracking framework trained with hard example mining. For 3D feature association, we design a 3D tracking query based on bird's-eye-view (BEV) position encoding. With 3D tracking queries built on BEV features, the method associates tracking queries with the actual 3D features more effectively, which substantially improves tracking accuracy. Because the model relies on BEV data for feature association, a lightweight network is enough for fast and effective tracking. For data noise, we propose hard example mining training for multi-object tracking: detection hard examples and tracking hard examples are handled separately, which trains the model to remove noise from detection errors and from tracking association and improves its ability to handle noise and hard-example interference in real scenes. We ran extensive comparison and ablation experiments on the nuScenes dataset, and the results show that the method achieves leading performance on this dataset.

Abstract: Multi-object tracking (MOT) has become a crucial component of autonomous driving systems. Its goal is to identify, locate, and label all relevant objects in consecutive video and point cloud streams, and efficient, accurate multi-target tracking is receiving growing research attention in computer vision and autonomous driving. Most existing 3D MOT methods follow a two-stage heuristic approach that relies on detection results and manually tuned parameters to track targets in a scene. In this heuristic paradigm, each object is associated through a carefully tuned Kalman filter, and tracking is split into multiple stages, including matching and re-matching. Every stage needs extensive parameter tuning to track effectively, which makes the overall process cumbersome. These methods also model complex variations poorly and struggle with occlusion. End-to-end tracking methods such as MUTR have recently gained ground in 3D multi-object tracking; they establish temporal correlations implicitly and drop the explicit heuristic strategies used previously. Their accuracy, however, is typically well below that of non-end-to-end heuristic approaches.
The main reason is that feature aggregation and perception are harder in three-dimensional space than in two-dimensional images. The real-time demands of multi-object tracking limit the tracker to a few thin Transformer layers, and so few layers struggle to perform intricate three-dimensional feature aggregation, which substantially reduces overall tracking accuracy. In addition, training is frequently disrupted by large amounts of noise and hard examples, which weakens the model's feature extraction during tracking.

To tackle these challenges, this paper introduces BEVTrack, a novel end-to-end framework for 3D multi-object tracking that is trained with hard example mining. To address 3D feature correlation, we devise a three-dimensional tracking query that uses Bird's Eye View (BEV) position encoding. Through the BEV cross-attention tracking module, the model connects each tracking query with the corresponding three-dimensional features in the BEV view and obtains precise, refined features. The method implicitly models changes in a trajectory's position and appearance, which streamlines the 3D tracking process and associates tracking queries with the actual three-dimensional features more reliably, substantially improving tracking accuracy. Because the model uses BEV data for feature correlation, a lightweight network suffices for fast and efficient tracking: BEV features are cheap to compute and mitigate the effect of minor changes in target position. To combat data noise, this paper presents simulated noise training via hard example mining.
This approach introduces more challenging detections and false targets during training, which strengthens the model's ability to filter out corrupting noise and handle the interference encountered in real-world scenarios. We conducted in-depth comparative and ablation experiments on the nuScenes dataset, where the proposed method achieves the highest tracking accuracy among the compared methods without additional parameter tuning, demonstrating the superiority and efficiency of the proposed approach.
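To make the idea of a BEV position-encoded tracking query more concrete, the sketch below encodes a track's last BEV position and uses it to attend over a handful of BEV cells. It is an illustration under stated assumptions, not the BEVTrack implementation: the abstract does not specify the form of the position encoding or the attention module, so a standard sinusoidal encoding and plain single-head dot-product attention are swapped in, and the names `bev_position_encoding` and `cross_attention` are hypothetical. A real tracker would add learned projections and appearance features.

```python
import math

def bev_position_encoding(x, y, dim=8, temperature=10000.0):
    """Sinusoidal encoding of a BEV position (x, y) in metres.

    Assumption: sinusoidal form; dim must be divisible by 4 so that
    each coordinate gets dim / 2 channels (sin/cos pairs).
    """
    enc = []
    for coord in (x, y):
        for i in range(dim // 4):
            freq = 1.0 / (temperature ** (4 * i / dim))
            enc.append(math.sin(coord * freq))
            enc.append(math.cos(coord * freq))
    return enc

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query.

    Returns the attended feature vector and the attention weights.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights

if __name__ == "__main__":
    # Hypothetical BEV cells (centre positions) and their feature vectors.
    cells = [(0.0, 0.0), (2.0, 3.0), (10.0, -4.0)]
    features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    keys = [bev_position_encoding(x, y) for x, y in cells]
    # Tracking query built from the track's last known BEV position.
    query = bev_position_encoding(2.0, 3.0)
    attended, weights = cross_attention(query, keys, features)
    print(weights, attended)
```

Because every encoding has the same norm, the query's dot product is largest with the cell whose position it shares, so the attention weight concentrates on the BEV location where the tracked object was last seen.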
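The simulated noise training described above can be pictured as a data augmentation step that perturbs ground-truth objects into harder detections and mixes in false targets. The following is a minimal sketch under assumptions: the abstract does not give the noise model, so uniform position jitter and uniformly sampled false targets are used purely for illustration, and `Detection`, `inject_hard_examples`, and all parameter names and default values are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Detection:
    x: float        # BEV x position in metres
    y: float        # BEV y position in metres
    is_real: bool   # True if derived from a ground-truth object

def inject_hard_examples(gt_centers, jitter=0.5, n_false=2,
                         bev_range=50.0, seed=0):
    """Build a noisy training input from ground-truth BEV centres.

    Assumed noise model (not from the paper):
    - each ground-truth centre is shifted by up to `jitter` metres per axis,
      imitating an imprecise, harder detection;
    - `n_false` false targets are sampled uniformly inside the BEV range.
    """
    rng = random.Random(seed)
    detections = []
    for x, y in gt_centers:
        detections.append(Detection(x + rng.uniform(-jitter, jitter),
                                    y + rng.uniform(-jitter, jitter),
                                    True))
    for _ in range(n_false):
        detections.append(Detection(rng.uniform(-bev_range, bev_range),
                                    rng.uniform(-bev_range, bev_range),
                                    False))
    rng.shuffle(detections)  # the model must not rely on input order
    return detections

if __name__ == "__main__":
    ground_truth = [(1.0, 2.0), (-3.0, 4.0), (10.0, -5.0)]
    for det in inject_hard_examples(ground_truth):
        print(det)
```

During training, the `is_real` flag would serve as the supervision signal: the tracker is rewarded for keeping the jittered real objects on their trajectories and for rejecting the false targets.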