ZHANG Hong,WAN Jiaxu,CHEN Haibo,et al. BEVTrack: An end-to-end 3D multi object tracking method based on hard example mining training[J]. Journal of Signal Processing,2024,40(1):152-165. DOI: 10.16798/j.issn.1003-0530.2024.01.010

BEVTrack: An End-to-end 3D Multi Object Tracking Method Based on Hard Example Mining Training

Multi-Object Tracking (MOT) has emerged as a crucial component of autonomous driving systems; its goal is to identify, locate, and label all relevant objects in consecutive video and point cloud streams. Research in computer vision and autonomous driving increasingly emphasizes efficient and accurate multi-object tracking. Most existing 3D MOT methods adopt a two-stage heuristic approach that relies on detection results and manually tuned parameters to track targets in a scene. In this heuristic paradigm, each object is associated using a meticulously tuned Kalman filter, and tracking is divided into multiple stages, including matching and re-matching. Extensive parameter tuning is therefore necessary at each stage, which makes the overall process cumbersome. Furthermore, these methods model complex variations poorly and struggle with occlusion. Recently, end-to-end 3D tracking methods such as MUTR have emerged; they implicitly establish temporal correlations and avoid the explicit heuristic strategies used previously. Nonetheless, their accuracy typically falls significantly short of non-end-to-end heuristic approaches. The main reason is that feature aggregation and perception are harder in three-dimensional space than in two-dimensional images. The real-time demands of multi-object tracking restrict the tracker to a few thin Transformer layers, and so few layers make intricate three-dimensional feature aggregation difficult, substantially reducing overall tracking accuracy.
Moreover, the model's training process is frequently disrupted by extensive noisy and challenging information, which compromises its capacity for feature extraction during tracking. To tackle these challenges, this paper introduces BEVTrack, a novel end-to-end framework for 3D multi-object tracking that is trained with hard example mining. To address 3D feature correlation, the paper devises a three-dimensional tracking query that uses Bird's Eye View (BEV) position encoding. Through the BEV cross-attention tracking module, the model connects each tracking query with the corresponding three-dimensional features in the BEV view, yielding precise and refined features. The proposed method implicitly models changes in the trajectory's position and appearance, thereby streamlining the 3D tracking process. It consequently associates tracking queries with the true three-dimensional features more reliably, which substantially improves tracking accuracy. Because BEV features have low computation cost and reduce sensitivity to minor changes in target position, a lightweight network suffices for fast and efficient tracking. To combat data noise, the paper presents simulated noise training via hard example mining: more challenging detections and false targets are introduced during training to strengthen the model's ability to filter out corrupting noise and handle the interference encountered in real-world scenarios.
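As an illustration of how a tracking query with BEV position encoding might attend to BEV features, the following is a minimal NumPy sketch and not the paper's implementation. The function names, the sinusoidal form of the encoding, the single attention head, the absence of learned projections, and the 51.2 m range are all assumptions made for this example.

```python
import numpy as np

def bev_position_encoding(xy, dim, bev_range=51.2):
    """Sinusoidal encoding of BEV (x, y) positions in metres (assumed form).

    dim must be divisible by 4; bev_range is a hypothetical half-extent.
    """
    xy = np.asarray(xy, dtype=np.float64) / bev_range   # scale to about [-1, 1]
    freqs = 2.0 ** np.arange(dim // 4)                  # geometric frequencies
    angles = xy[:, :, None] * freqs * np.pi             # (N, 2, dim / 4)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(xy), dim)                    # (N, dim)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)             # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bev_cross_attention(track_xy, track_feat, bev_feat, cell_xy):
    """Single-head cross-attention from track queries to flattened BEV cells.

    A real module would add learned query/key/value projections.
    """
    dim = track_feat.shape[1]
    q = track_feat + bev_position_encoding(track_xy, dim)   # (T, dim)
    k = bev_feat + bev_position_encoding(cell_xy, dim)      # (C, dim)
    attn = softmax(q @ k.T / np.sqrt(dim))                  # (T, C)
    return attn @ bev_feat, attn                            # refined features

# Toy usage: an 8 x 8 BEV grid and two tracked objects.
rng = np.random.default_rng(0)
xs = np.linspace(-50.0, 50.0, 8)
cell_xy = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)   # (64, 2)
bev_feat = rng.normal(size=(64, 32))
track_xy = np.array([[10.0, -5.0], [-30.0, 20.0]])
track_feat = rng.normal(size=(2, 32))
refined, attn = bev_cross_attention(track_xy, track_feat, bev_feat, cell_xy)
```

Each row of `attn` is a distribution over BEV cells, so `refined` holds one aggregated BEV feature per track query.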
In the experiments, an in-depth comparative analysis and model ablation studies were conducted on the nuScenes dataset. BEVTrack achieves the highest tracking accuracy among the compared methods without additional parameter tuning, highlighting the superiority and efficiency of the proposed approach.
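The simulated noise training described in the abstract can be pictured with a small sketch. This is a hypothetical augmentation, not the paper's procedure: the function name `add_hard_examples`, the box layout `[x, y, z, w, l, h, yaw]`, the Gaussian jitter, and all numeric ranges are assumptions chosen only to show the idea of adding harder detections and false targets to the training input.

```python
import numpy as np

def add_hard_examples(boxes, scores, rng, jitter_std=0.5, n_false=3,
                      bev_range=51.2):
    """Perturb real detections and append random false targets.

    boxes: (N, 7) array laid out as [x, y, z, w, l, h, yaw] (assumed layout).
    Returns the noisy boxes, their scores, and a mask marking real objects.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    # Harder detections: jitter the BEV centre of every real box.
    noisy = boxes.copy()
    noisy[:, :2] += rng.normal(0.0, jitter_std, size=(len(noisy), 2))

    # False targets: random boxes that the tracker should learn to reject.
    false = np.zeros((n_false, 7))
    false[:, :2] = rng.uniform(-bev_range, bev_range, size=(n_false, 2))
    false[:, 3:6] = rng.uniform(0.5, 5.0, size=(n_false, 3))
    false[:, 6] = rng.uniform(-np.pi, np.pi, size=n_false)
    false_scores = rng.uniform(0.3, 0.9, size=n_false)

    out_boxes = np.concatenate([noisy, false])
    out_scores = np.concatenate([scores, false_scores])
    is_real = np.concatenate([np.ones(len(noisy), bool),
                              np.zeros(n_false, bool)])
    return out_boxes, out_scores, is_real

# Toy usage with two ground-truth-like detections.
rng = np.random.default_rng(0)
dets = np.array([[5.0, 2.0, 0.0, 1.9, 4.5, 1.6, 0.1],
                 [-12.0, 8.0, 0.0, 0.7, 0.8, 1.7, 1.2]])
noisy_boxes, noisy_scores, is_real = add_hard_examples(
    dets, np.array([0.9, 0.8]), rng)
```

During training, the `is_real` mask would tell the loss which inputs correspond to true objects, so the model is penalized for tracking the injected false targets.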