Learning to Assemble Neural Module Tree Networks for Visual Grounding
Visual grounding, a task to ground (i.e., localize) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplet....
Published in | Proceedings / IEEE International Conference on Computer Vision pp. 4672 - 4681 |
---|---|
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.10.2019 |
ISSN | 2380-7504 |
DOI | 10.1109/ICCV.2019.00477 |