LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection

Bibliographic Details
Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17151-17160
Main Authors: Zeng, Yihan; Zhang, Da; Wang, Chunwei; Miao, Zhenwei; Liu, Ting; Zhan, Xin; Hao, Dayang; Ma, Chao
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2022

Summary: LiDAR and camera are two common sensors for collecting data over time for 3D object detection in the autonomous driving context. Although the complementary information across sensors and time has great potential to benefit 3D perception, taking full advantage of sequential cross-sensor data remains challenging. In this paper, we propose a novel LiDAR Image Fusion Transformer (LIFT) to model the mutual interaction of cross-sensor data over time. LIFT learns to align the input 4D sequential cross-sensor data to achieve multi-frame, multi-modal information aggregation. To alleviate the computational load, we project both point clouds and images into bird's-eye-view maps and compute sparse grid-wise self-attention. LIFT also benefits from a cross-sensor and cross-time data augmentation scheme. We evaluate the proposed approach on the challenging nuScenes and Waymo datasets, where LIFT outperforms the state-of-the-art and strong baselines.
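To make the sparse grid-wise self-attention idea in the summary concrete, the following is a minimal, hypothetical sketch: features (e.g. LiDAR features already fused with image features in the same frame) are scattered into a bird's-eye-view grid, and self-attention is computed only over the occupied cells rather than all grid positions. All names, shapes, and the averaging-based scatter are illustrative assumptions, not the authors' implementation.

```python
# Sketch of BEV scattering + sparse grid-wise self-attention (assumed details).
import torch
import torch.nn as nn


def points_to_bev(points, feats, grid_size=128, extent=50.0):
    """Scatter per-point features into a (grid_size x grid_size) BEV grid,
    averaging the features of all points that fall into the same cell."""
    # points: (N, 2) x/y coordinates in metres; feats: (N, C) point features
    ij = ((points + extent) / (2 * extent) * grid_size).long().clamp(0, grid_size - 1)
    flat = ij[:, 0] * grid_size + ij[:, 1]              # (N,) flattened cell index
    C = feats.shape[1]
    bev = torch.zeros(grid_size * grid_size, C)
    cnt = torch.zeros(grid_size * grid_size, 1)
    bev.index_add_(0, flat, feats)
    cnt.index_add_(0, flat, torch.ones(len(flat), 1))
    occupied = cnt.squeeze(1) > 0                       # mask of non-empty cells
    bev[occupied] /= cnt[occupied]
    return bev, occupied                                # (G*G, C), (G*G,) bool


class SparseGridAttention(nn.Module):
    """Self-attention restricted to occupied BEV cells, so cost scales with
    the number of non-empty cells M rather than the full grid G*G."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, bev, occupied):
        tokens = bev[occupied].unsqueeze(0)             # (1, M, C) with M << G*G
        out, _ = self.attn(tokens, tokens, tokens)
        fused = bev.clone()
        fused[occupied] = out.squeeze(0)                # write attended cells back
        return fused


# Toy usage: 1000 random points with 64-dim features in a 100 m x 100 m area.
pts = (torch.rand(1000, 2) - 0.5) * 100.0
ft = torch.randn(1000, 64)
bev, occ = points_to_bev(pts, ft)
fused = SparseGridAttention(64)(bev, occ)
print(fused.shape, int(occ.sum()), "occupied cells attended")
```

Attending only over occupied cells is what keeps the BEV attention tractable: point clouds are sparse in the ground plane, so M is typically a small fraction of the full grid.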
ISSN: 2575-7075
DOI: 10.1109/CVPR52688.2022.01666