Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have ma...
Published in | 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 3674 - 3683 |
---|---|
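Main Authors | Anderson, Peter; Wu, Qi; Teney, Damien; Bruce, Jake; Johnson, Mark; Sunderhauf, Niko; Reid, Ian; Gould, Stephen; van den Hengel, Anton |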
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2018 |
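Subjects | Cameras; Natural languages; Navigation; Robots; Task analysis; Three-dimensional displays; Visualization |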
Abstract | A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator - a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings - the Room-to-Room (R2R) dataset. |
---|---|
Author | Wu, Qi; Johnson, Mark; Sunderhauf, Niko; Anderson, Peter; Reid, Ian; van den Hengel, Anton; Bruce, Jake; Teney, Damien; Gould, Stephen |
Author_xml | – sequence: 1 givenname: Peter surname: Anderson fullname: Anderson, Peter – sequence: 2 givenname: Qi surname: Wu fullname: Wu, Qi – sequence: 3 givenname: Damien surname: Teney fullname: Teney, Damien – sequence: 4 givenname: Jake surname: Bruce fullname: Bruce, Jake – sequence: 5 givenname: Mark surname: Johnson fullname: Johnson, Mark – sequence: 6 givenname: Niko surname: Sunderhauf fullname: Sunderhauf, Niko – sequence: 7 givenname: Ian surname: Reid fullname: Reid, Ian – sequence: 8 givenname: Stephen surname: Gould fullname: Gould, Stephen – sequence: 9 givenname: Anton surname: van den Hengel fullname: van den Hengel, Anton |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IH CBEJK RIE RIO |
DOI | 10.1109/CVPR.2018.00387 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Applied Sciences |
EISBN | 9781538664209 1538664208 |
EISSN | 1063-6919 |
EndPage | 3683 |
ExternalDocumentID | 8578485 |
Genre | orig-research |
GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
ID | FETCH-LOGICAL-i282t-41af3720515534fca85801105a1e74ecfa3adb87aacb1ce5114081704b33ffe83 |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 02:52:16 EDT 2025 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i282t-41af3720515534fca85801105a1e74ecfa3adb87aacb1ce5114081704b33ffe83 |
OpenAccessLink | https://eprints.qut.edu.au/124633/1/124633.pdf |
PageCount | 10 |
ParticipantIDs | ieee_primary_8578485 |
PublicationCentury | 2000 |
PublicationDate | 2018-06 |
PublicationDateYYYYMMDD | 2018-06-01 |
PublicationDate_xml | – month: 06 year: 2018 text: 2018-06 |
PublicationDecade | 2010 |
PublicationTitle | 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition |
PublicationTitleAbbrev | CVPR |
PublicationYear | 2018 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0002683845 ssj0003211698 |
Score | 2.616742 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 3674 |
SubjectTerms | Cameras; Natural languages; Navigation; Robots; Task analysis; Three-dimensional displays; Visualization |
Title | Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments |
URI | https://ieeexplore.ieee.org/document/8578485 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | IEEE |