OpenEQA: Embodied Question Answering in the Era of Foundation Models

Bibliographic Details
Published in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16488-16498
Main Authors: Majumdar, Arjun; Ajay, Anurag; Zhang, Xiaohan; Putta, Pranav; Yenamandra, Sriram; Henaff, Mikael; Silwal, Sneha; McVay, Paul; Maksymets, Oleksandr; Arnaud, Sergio; Yadav, Karmesh; Li, Qiyang; Newman, Ben; Sharma, Mohit; Berges, Vincent; Zhang, Shiqi; Agrawal, Pulkit; Bisk, Yonatan; Batra, Dhruv; Kalakrishnan, Mrinal; Meier, Franziska; Paxton, Chris; Sax, Alexander; Rajeswaran, Aravind
Format: Conference Proceeding
Language: English
Published: IEEE, 16.06.2024

Summary: We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding by either drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA - the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.01560