Fully Decoupling Trajectory and Scene Encoding for Lightweight Heatmap-Oriented Trajectory Prediction


Bibliographic Details
Published in: IEEE Robotics and Automation Letters, Vol. 9, No. 10, pp. 9143–9150
Main Authors: Huang, Renhao; Ding, Jingtao; Pagnucco, Maurice; Song, Yang
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2024

Summary: Recently, heatmap-oriented approaches have demonstrated state-of-the-art performance in pedestrian trajectory prediction by exploiting scene information from input images before running the encoder. To align the image and trajectory information, existing methods centre the scene images on agents' last observed locations or convert trajectory sequences into images. Such alignment processes cause the scene encoder to be executed repeatedly, once for each pedestrian in an input image, and since an image often contains many pedestrians, this leads to significant memory consumption. In this letter, we address this problem by fully decoupling scene and trajectory feature extraction so that the scene information is encoded only once per input image, regardless of the number of pedestrians in the image. To do this, we directly extract temporal information from trajectories in a global pixel coordinate system. Then, we propose a transformer-based heatmap decoder that models the complex interaction between high-level trajectory and image features via trajectory self-attention, trajectory-to-image cross-attention and image-to-trajectory cross-attention layers. We also introduce scene counterfactual learning to alleviate over-focusing on the trajectory features, and knowledge transfer from the Segment Anything Model to simplify training. Our experiments show that our framework achieves highly competitive performance on multiple benchmarks, demonstrating scene-compliant predictions on complex terrains and much lower memory consumption when handling multiple pedestrians.
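The decoupling described in the abstract can be illustrated with a minimal NumPy sketch of the three attention flows: the scene is encoded once per image, while each pedestrian's trajectory tokens attend to it. All dimensions, the `attention` helper, and the heatmap readout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

# Toy sizes: N pedestrians, T observed steps, P image patches, d channels
N, T, P, d = 3, 8, 64, 32
rng = np.random.default_rng(0)
traj = rng.standard_normal((N, T, d))  # per-agent trajectory tokens (global pixel coordinates)
img = rng.standard_normal((1, P, d))   # scene tokens, encoded ONCE for the whole image

# 1) Trajectory self-attention: each agent attends over its own time steps
traj = traj + attention(traj, traj, traj)

# 2) Trajectory-to-image cross-attention: trajectory queries read scene features
img_b = np.broadcast_to(img, (N, P, d))
traj = traj + attention(traj, img_b, img_b)

# 3) Image-to-trajectory cross-attention: scene tokens are refined per agent
#    by its trajectory, then read out as a heatmap over the P spatial locations
scene_per_agent = img_b + attention(img_b, traj, traj)
heatmap_logits = scene_per_agent.mean(-1)  # (N, P) toy readout
heatmap = np.exp(heatmap_logits) / np.exp(heatmap_logits).sum(-1, keepdims=True)
print(heatmap.shape)  # one heatmap per pedestrian, from a single scene encoding
```

The point of the sketch is the cost structure: adding a pedestrian adds only a small set of trajectory tokens, while the expensive scene encoding (`img`) is shared across all agents.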
ISSN: 2377-3766
DOI: 10.1109/LRA.2024.3426376