Instruction-aligned hierarchical waypoint planner for vision-and-language navigation in continuous environments

Bibliographic Details
Published in	Pattern Analysis and Applications : PAA, Vol. 27, no. 4
Main Authors	He, Zongtao, Wang, Naijia, Wang, Liuyi, Liu, Chengju, Chen, Qijun
Format	Journal Article
Language	English
Published	London : Springer London, 01.12.2024 ; Springer Nature B.V.
Subjects	Alignment; Computer Science; Intelligent agents; Misalignment; Navigation; Original Article; Pattern analysis; Pattern Recognition; Vision; Waypoints; Vision-and-language navigation in continuous environments; Hierarchical decision models; Embodied cognition; Multi-modal fusion

More Information
Summary:	Developing agents to follow language instructions is a compelling yet challenging research topic. Recently, vision-and-language navigation in continuous environments has been proposed to explore the multi-modal pattern analysis and mapless navigation abilities of intelligent agents. However, current waypoint-based methods still have shortcomings, such as a coupled decision process and possible misalignment between the shortest path and the instruction. To address these challenges, we propose an instruction-aligned hierarchical waypoint planner (IA-HWP) that ensures fine-grained waypoint prediction and enhances instruction alignment. Our HWP architecture decouples waypoint planning into a coarse view selection phase and a refined waypoint location phase, improving waypoint quality and enabling specialized training supervision for each phase. For instruction-aligned model design, we introduce global action-vision co-grounding and local text-vision co-grounding modules to explicitly improve the understanding of visual landmarks and actions, thereby enhancing the alignment between instructions and trajectories. For instruction-aligned model optimization, we employ reference-waypoint-oriented supervision and a direction-aware loss to strengthen the model's instruction-following and waypoint-execution capabilities. Experiments on the standard benchmark demonstrate the effectiveness of our approach, with an improved success rate compared to existing methods.
ISSN:	1433-7541, 1433-755X
DOI:	10.1007/s10044-024-01339-z