OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection

Bibliographic Details
Published in	IEEE Transactions on Image Processing, Vol. 32, pp. 6570-6581
Main Authors	Huang, Chenxi; He, Tong; Ren, Haidong; Wang, Wenxiao; Lin, Binbin; Cai, Deng
Format	Journal Article
Language	English
Published	New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023
Subjects	3D object detection; camera project principles; Convolution; depth ambiguity; Detectors; Feature extraction; Labels; monocular images; Object detection; Object recognition; Three-dimensional displays; Training; Visual discrimination; Visualization
Summary:	Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects at different depths can appear with the same bounding boxes and similar visual features in the 2D image. The network cannot accurately distinguish different depths from such non-discriminative visual features, which makes depth training unstable. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To keep the pseudo-3D labels reasonable, we design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin: under the moderate setting on the KITTI validation set, the improvements are 1.82% to 10.91% mAP in BEV and 1.18% to 9.36% mAP in 3D. Code has been released at https://github.com/mrsempress/OBMO.
ISSN:	1057-7149 (print); 1941-0042 (electronic)
DOI:	10.1109/TIP.2023.3333225
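
The Summary describes adding pseudo labels by shifting the 3D bounding box along the viewing frustum and attaching a quality score to each. The following is a minimal sketch of that idea, not the authors' released implementation. It assumes a pinhole camera at the origin, and the names (`PseudoLabel`, `shift_along_ray`, `make_pseudo_labels`, `project`), the offset range, the intrinsics, and the linear score decay are all hypothetical choices made for illustration. The record does not specify the paper's two scoring strategies, so the linear decay here stands in for them only as a placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in camera coordinates, metres


@dataclass
class PseudoLabel:
    center: Point3D  # shifted 3D box center
    score: float     # soft quality score; 1.0 marks the original label


def shift_along_ray(center: Point3D, new_depth: float) -> Point3D:
    """Slide a point along the ray from the camera origin to a new depth z."""
    x, y, z = center
    scale = new_depth / z
    return (x * scale, y * scale, new_depth)


def make_pseudo_labels(center: Point3D, max_offset: float = 1.0,
                       step: float = 0.5, min_score: float = 0.5) -> List[PseudoLabel]:
    """Generate depth-shifted copies of a label with linearly decaying scores.

    Assumption: the score falls linearly from 1.0 at zero offset to
    `min_score` at `max_offset`. This is an illustrative stand-in, not the
    scoring used in the paper.
    """
    if max_offset <= 0 or step <= 0:
        raise ValueError("max_offset and step must be positive")
    _, _, z = center
    n = int(round(max_offset / step))
    labels = []
    for k in range(-n, n + 1):
        offset = k * step
        new_z = z + offset
        if new_z <= 0:  # skip depths behind the camera
            continue
        score = 1.0 - (1.0 - min_score) * abs(offset) / max_offset
        labels.append(PseudoLabel(shift_along_ray(center, new_z), score))
    return labels


def project(center: Point3D, focal: float = 700.0,
            cx: float = 600.0, cy: float = 180.0) -> Tuple[float, float]:
    """Pinhole projection of a 3D point to pixel coordinates (assumed intrinsics)."""
    x, y, z = center
    return (focal * x / z + cx, focal * y / z + cy)


labels = make_pseudo_labels((2.0, 1.0, 20.0))
```

Every pseudo label produced this way projects to the same image point as the original center, which is the ambiguity the Summary refers to. The sketch shifts only the box center; it does not model how the projected box size changes with depth.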