Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

Bibliographic Details
Published in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3674-3683
Main Authors Anderson, Peter; Wu, Qi; Teney, Damien; Bruce, Jake; Johnson, Mark; Sunderhauf, Niko; Reid, Ian; Gould, Stephen; van den Hengel, Anton
Format Conference Proceeding
Language English
Published IEEE 01.06.2018
Abstract A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator - a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings - the Room-to-Room (R2R) dataset.
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR.2018.00387
Discipline Applied Sciences
EISBN 9781538664209
1538664208
EISSN 1063-6919
EndPage 3683
ExternalDocumentID 8578485
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://eprints.qut.edu.au/124633/1/124633.pdf
PageCount 10
PublicationCentury 2000
PublicationDate 2018-06
PublicationDateYYYYMMDD 2018-06-01
PublicationDate_xml – month: 06
  year: 2018
  text: 2018-06
PublicationDecade 2010
PublicationTitle 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
PublicationTitleAbbrev CVPR
PublicationYear 2018
Publisher IEEE
StartPage 3674
SubjectTerms Cameras
Natural languages
Navigation
Robots
Task analysis
Three-dimensional displays
Visualization
Title Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
URI https://ieeexplore.ieee.org/document/8578485