Referring expression comprehension model with matching detection and linguistic feedback

Bibliographic Details
Published in: IET Computer Vision, Vol. 14, No. 8, pp. 625-633
Main Authors: Wang, Jianming; Cui, Enjie; Liu, Kunliang; Sun, Yukuan; Liang, Jiayu; Yuan, Chunmiao; Duan, Xiaojie; Jin, Guanghao; Chung, Tae-Sun
Format: Journal Article
Language: English
Published: The Institution of Engineering and Technology / Wiley, 01.12.2020

Summary: The task of referring expression comprehension (REC) is to localise an image region containing a specific object described by a natural language expression. All existing REC methods assume that the object described by the referring expression is present in the given image, but this assumption does not hold in some real applications. For example, a visually impaired user might tell a robot 'please take the laptop on the table to me' when the laptop is in fact no longer on the table. To address this problem, the authors propose a novel REC model that handles expression-image mismatches and explains the mismatch through linguistic feedback. The model consists of four modules: the expression parsing module, the entity detection module, the relationship detection module, and the matching detection module. The authors also built a data set called NP-RefCOCO+ from RefCOCO+ containing both positive and negative samples: the positive samples are the original expression-image pairs in RefCOCO+, and the negative samples are RefCOCO+ pairs whose expressions have been replaced so that they no longer describe an object in the image. The model is evaluated on NP-RefCOCO+, and the experimental results show the advantages of the method in dealing with expression-image mismatching.
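
As an illustration of the data set construction described in the summary, the following is a minimal Python sketch of how NP-RefCOCO+-style negative samples could be generated by pairing an image with an expression drawn from a different image. The function name build_np_dataset, the field names image_id and expression, and the negative_ratio parameter are illustrative assumptions, not details taken from the paper.

import random

def build_np_dataset(refcoco_pairs, negative_ratio=1.0, seed=0):
    """refcoco_pairs: list of dicts with keys 'image_id' and 'expression'.
    Returns (image_id, expression, label) tuples, where label 1 means the
    expression matches the image and label 0 means it was replaced."""
    rng = random.Random(seed)

    # Positive samples: the original expression-image pairs (label 1).
    dataset = [(p["image_id"], p["expression"], 1) for p in refcoco_pairs]

    # Negative samples: replace the expression with one taken from a
    # different image, so the description no longer matches (label 0).
    # Assumes the data set contains more than one distinct image.
    n_negatives = int(len(refcoco_pairs) * negative_ratio)
    for pair in rng.sample(refcoco_pairs, n_negatives):
        other = rng.choice(refcoco_pairs)
        while other["image_id"] == pair["image_id"]:
            other = rng.choice(refcoco_pairs)
        dataset.append((pair["image_id"], other["expression"], 0))

    rng.shuffle(dataset)
    return dataset

With negative_ratio=1.0 this yields one mismatched pair per original pair, giving a balanced set of matching and non-matching examples for training and evaluating the matching detection module.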
ISSN:1751-9632
1751-9640
DOI:10.1049/iet-cvi.2019.0483