TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

Bibliographic Details
Published in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1368-1378
Main Authors: Fang, Shaoheng; Wang, Zi; Zhong, Yiqi; Ge, Junhao; Chen, Siheng
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2023
Summary: Vision-centric joint perception and prediction (PnP) has become an emerging trend in autonomous driving research. It predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, it is still a critical challenge to synchronize features obtained at multiple camera views and timestamps due to inevitable geometric distortions, and to further exploit those spatial-temporal features. To address this issue, we propose a temporal bird's-eye-view pyramid transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder is proposed to map raw image inputs with any camera pose at any time to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer is introduced to comprehensively extract multi-scale BEV features and predict future BEV states with the support of spatial priors. Extensive experiments on the nuScenes dataset show that our proposed framework overall outperforms all state-of-the-art vision-based prediction methods. Code is available at: https://github.com/MediaBrain-SJTU/TBP-Former
ISSN: 2575-7075
DOI: 10.1109/CVPR52729.2023.00138