VoxT-GNN: A 3D object detection approach from point cloud based on voxel-level transformer and graph neural network
| Published in | Information Processing & Management, Vol. 62, No. 4, p. 104155 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 01.07.2025 |
Summary:

- **Novel 3D Object Detection Framework:** We present VoxT-GNN, a novel framework that synergistically combines Transformer and Graph Neural Network (GNN) architectures for enhanced 3D object detection from LiDAR point clouds. By conceptualizing point cloud processing as a region-to-region transformation that preserves the full resolution of the raw data, we enable end-to-end 3D object detection.
- **Voxel-Level Transformer (VoxelFormer) and GNN Feed-Forward Network (GnnFFN):** The VoxelFormer module samples more points to preserve the original structure of the point cloud, yielding more discriminative local features. The GnnFFN intermediate layer enables information exchange across voxel regions and can scale the global receptive field to adapt to objects of different categories and sizes and to complex scenes, achieving high-quality global feature extraction. Used together, VoxelFormer and GnnFFN fuse local and global features of the point cloud, enhancing 3D detection performance.
- **Optimized for Real-World Applications:** Specifically tailored for autonomous driving, robotics, and augmented reality, domains where precise 3D perception is essential.
- **Versatile Detection Approach:** Supports both single-stage and two-stage detection methodologies, adapting to diverse system requirements.
- **State-of-the-Art Performance:** Achieves competitive results on the KITTI dataset, exceeding current benchmarks particularly in detecting Pedestrians and Cyclists.
Recently, a variety of LiDAR-based methods for 3D detection of single-class objects, large objects, or objects in straightforward scenes have exhibited competitive performance. However, their detection performance in complex scenarios with multi-sized, multi-class objects is limited. We observe that the core problem behind this limitation is insufficient feature learning for small objects in point clouds, which makes it difficult to obtain discriminative features. To address this challenge, we propose VoxT-GNN, a point-cloud-based 3D object detection framework that accounts for small-object detection. The framework comprises two core components: a Voxel-Level Transformer (VoxelFormer) for local feature learning and a Graph Neural Network Feed-Forward Network (GnnFFN) for global feature learning. By embedding GnnFFN as an intermediate layer between the encoder and decoder of VoxelFormer, we achieve flexible scaling of the global receptive field while maximally preserving the original point cloud structure. This design adapts effectively to objects of varying sizes and categories, providing a viable solution for detection across diverse scenarios. Extensive experiments on KITTI and the Waymo Open Dataset (WOD) demonstrate the strong competitiveness of our method, with particularly significant improvements in small-object detection. Notably, our approach achieves the second-highest mAP of 65.44% across three categories (car, pedestrian, and cyclist) on the KITTI benchmark. The source code is available at https://github.com/yujianxinnian/VoxT-GNN.
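The abstract's pipeline of voxel-level local attention followed by cross-voxel graph message passing can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the voxel size, the identity Q/K/V projections, and the fixed neighborhood radius are all simplifying assumptions made here for illustration.

```python
import numpy as np

def voxelize(points, voxel_size=1.0):
    # Group raw points into voxels by their integer grid cell.
    keys = np.floor(points[:, :3] / voxel_size).astype(int)
    voxels = {}
    for p, k in zip(points, map(tuple, keys)):
        voxels.setdefault(k, []).append(p)
    return {k: np.stack(v) for k, v in voxels.items()}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(pts):
    # Toy single-head self-attention over the points of one voxel
    # (a stand-in for the VoxelFormer encoder); identity projections.
    q = k = v = pts
    scores = softmax(q @ k.T / np.sqrt(pts.shape[1]))
    return scores @ v

def gnn_ffn(centers, feats, radius=2.0):
    # Toy message passing between voxel centers (a stand-in for GnnFFN):
    # each voxel's feature is averaged with those of voxels within `radius`,
    # so information flows across voxel regions.
    out = np.empty_like(feats)
    for i, c in enumerate(centers):
        nbr = np.linalg.norm(centers - c, axis=1) <= radius
        out[i] = feats[nbr].mean(axis=0)
    return out

# A toy point cloud with two well-separated clusters.
rng = np.random.default_rng(0)
points = np.concatenate([rng.normal(0.0, 0.3, (20, 3)),
                         rng.normal(5.0, 0.3, (20, 3))])

voxels = voxelize(points, voxel_size=1.0)
centers = np.array([v[:, :3].mean(axis=0) for v in voxels.values()])
# Local features: pooled per-voxel attention output.
local = np.stack([local_attention(v).mean(axis=0) for v in voxels.values()])
# Global features: local features refined by cross-voxel message passing.
global_feats = gnn_ffn(centers, local, radius=2.0)
print(global_feats.shape)
```

The sketch only mirrors the data flow the paper describes (voxel-level local attention, then a graph layer widening the receptive field across voxels); the real model learns projections and graph weights end to end.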
| ISSN | 0306-4573 |
|---|---|
| DOI | 10.1016/j.ipm.2025.104155 |