Delving into the Pre-training Paradigm of Monocular 3D Object Detection
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 07.06.2022 |
Subjects | |
Summary: | Labels for monocular 3D object detection (M3OD) are expensive to obtain.
Meanwhile, practical applications usually offer large amounts of unlabeled
data, and pre-training is an efficient way of exploiting the knowledge in that
data. However, the pre-training paradigm for M3OD has hardly been studied.
We aim to bridge this gap in this work. To this end, we first draw two
observations: (1) The guideline for devising pre-training tasks is to imitate
the representation of the target task. (2) Combining depth estimation and 2D
object detection is a promising M3OD pre-training baseline. Following this
guideline, we then propose several strategies to further improve the baseline,
chiefly target-guided semi-dense depth estimation, keypoint-aware
2D object detection, and class-level loss adjustment. Combining all of these
techniques yields a pre-training framework whose pre-trained backbones
significantly improve M3OD performance on both the KITTI-3D and
nuScenes benchmarks. For example, by applying a DLA34 backbone to a naive
center-based M3OD detector, the moderate ${\rm AP}_{3D}70$ score for Car on the
KITTI-3D testing set is boosted by 18.71% and the NDS score on the nuScenes
validation set improves by 40.41% in relative terms. |
---|---|
DOI: | 10.48550/arxiv.2206.03657 |