QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Main Authors | , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 22.12.2023 |
Subjects | |
Summary: | An important manifestation of robot intelligence is the ability to interact
naturally and make decisions autonomously. Traditional approaches to robot
control often compartmentalize perception, planning, and decision-making,
which simplifies system design but limits the synergy between different
information streams. This compartmentalization makes it difficult to achieve
seamless autonomous reasoning, decision-making, and action execution. To
address these limitations, this paper introduces a novel paradigm named
Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This
approach tightly integrates visual information and instructions to generate
executable actions, effectively merging perception, planning, and
decision-making. The central idea is to elevate the overall intelligence of the
robot. Within this framework, a notable challenge lies in aligning fine-grained
instructions with visual perception information, so that the robot accurately
interprets and acts upon detailed instructions in harmony with its visual
observations. Consequently, we
propose QUAdruped Robotic Transformer (QUART), a family of VLA models that
integrate visual information and instructions from diverse modalities as input
and generate executable actions for real-world robots. We also present QUAdruped
Robot Dataset (QUARD), a large-scale multi-task dataset including navigation,
complex terrain locomotion, and whole-body manipulation tasks for training
QUART models. Our extensive evaluation (4000 evaluation trials) shows that our
approach leads to performant robotic policies and enables QUART to obtain a
range of emergent capabilities. |
---|---|
DOI: | 10.48550/arxiv.2312.14457 |