CL-FusionBEV: 3D object detection method with camera-LiDAR fusion in Bird’s Eye View

Bibliographic Details
Published in	Complex & intelligent systems Vol. 10; no. 6; pp. 7681 - 7696
Main Authors	Shi, Peicheng; Liu, Zhiqiang; Dong, Xinlong; Yang, Aixi
Format	Journal Article
Language	English
Published	Cham: Springer International Publishing, 01.12.2024; Springer Nature B.V; Springer
Subjects	3D object detection; Attention; Attention mechanism; Autonomous driving; Bird’s Eye View (BEV) perception; Cameras; Cartesian coordinates; Complexity; Computational Intelligence; Data integration; Data Structures and Information Theory; Datasets; Engineering; Lidar; Modal data; Modules; Multisensor fusion; Object recognition; Original Article; Pedestrians; Three dimensional models
Summary:	In the wave of research on autonomous driving, 3D object detection from the Bird’s Eye View (BEV) perspective has emerged as a pivotal area of focus. The essence of this challenge is the effective fusion of camera and LiDAR data into the BEV. Current approaches predominantly train and predict within the front view and Cartesian coordinate system, often overlooking the inherent structural and operational differences between cameras and LiDAR sensors. This paper introduces CL-FusionBEV, an innovative 3D object detection methodology tailored for sensor data fusion in the BEV perspective. Our approach initiates with a view transformation, facilitated by an implicit learning module that transitions the camera’s perspective to the BEV space, thereby aligning the prediction module. Subsequently, to achieve modal fusion within the BEV framework, we employ voxelization to convert the LiDAR point cloud into BEV space, thereby generating LiDAR BEV spatial features. Moreover, to integrate the BEV spatial features from both camera and LiDAR, we have developed a multi-modal cross-attention mechanism and an implicit multi-modal fusion network, designed to enhance the synergy and application of dual-modal data. To counteract potential deficiencies in global reasoning and feature interaction arising from multi-modal cross-attention, we propose a BEV self-attention mechanism that facilitates comprehensive global feature operations. Our methodology has undergone rigorous evaluation on a substantial dataset within the autonomous driving domain, the nuScenes dataset. The outcomes demonstrate that our method achieves a mean Average Precision (mAP) of 73.3% and a nuScenes Detection Score (NDS) of 75.5%, particularly excelling in the detection of cars and pedestrians with high accuracies of 89% and 90.7%, respectively. Additionally, CL-FusionBEV exhibits superior performance in identifying occluded and distant objects, surpassing existing comparative methods.
ISSN:	2199-4536 2198-6053
DOI:	10.1007/s40747-024-01567-0
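
The summary describes two steps that a small example can make concrete: converting a LiDAR point cloud into BEV space by voxelization, and letting camera BEV features attend to LiDAR BEV features through cross-attention. The following is a minimal NumPy sketch of those two ideas, not the paper's implementation. The function names (`points_to_bev`, `cross_attention`), the grid ranges, the cell size, and the two-channel feature (point count and maximum height) are all assumptions chosen for illustration, and the attention shown is a single head with no learned projections.

```python
import numpy as np


def points_to_bev(points, x_range=(-4.0, 4.0), y_range=(-4.0, 4.0), cell=1.0):
    """Bin LiDAR points of shape (N, 3) into a BEV grid.

    Returns an (H, W, 2) array holding, per cell, the point count and the
    maximum point height (0 for empty cells). Hypothetical feature choice.
    """
    w = int(round((x_range[1] - x_range[0]) / cell))
    h = int(round((y_range[1] - y_range[0]) / cell))
    counts = np.zeros((h, w), dtype=np.float32)
    heights = np.full((h, w), -np.inf, dtype=np.float32)

    # Map metric x/y coordinates to integer cell indices.
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)

    for x, y, z in zip(ix[keep], iy[keep], points[keep, 2]):
        counts[y, x] += 1.0
        heights[y, x] = max(heights[y, x], z)

    heights = np.where(counts > 0, heights, 0.0).astype(np.float32)
    return np.stack([counts, heights], axis=-1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats):
    """Single-head scaled dot-product cross-attention without learned weights.

    query_feats: (Nq, C), e.g. flattened camera BEV features.
    context_feats: (Nk, C), e.g. flattened LiDAR BEV features.
    Returns the attended features (Nq, C) and the attention weights (Nq, Nk).
    """
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context_feats, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic point cloud: x, y in metres around the sensor, z as height.
    cloud = rng.uniform(-4.0, 4.0, size=(200, 3))
    lidar_bev = points_to_bev(cloud)                      # (8, 8, 2)
    lidar_tokens = lidar_bev.reshape(-1, 2)               # (64, 2)
    # Stand-in for camera BEV features produced by a view transformation.
    camera_tokens = rng.normal(size=lidar_tokens.shape).astype(np.float32)
    fused, attn = cross_attention(camera_tokens, lidar_tokens)
    print(lidar_bev.shape, fused.shape, attn.shape)
```

In a real detector the queries, keys, and values would pass through learned linear projections and multiple heads, and the BEV features would come from trained camera and LiDAR backbones; the sketch only shows the data flow between the two modalities.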