Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding
Format | Journal Article |
Language | English |
Published | 13.12.2022 |
Summary: | From a visual scene containing multiple people, humans are able to
distinguish each individual given context descriptions about what happened
before, their mental/physical states, or their intentions. This ability relies
heavily on human-centric commonsense knowledge and reasoning. For example, if
asked to identify the "person who needs healing" in an image, we need to first
know that such a person usually has injuries or a suffering expression, then
find the corresponding visual clues before finally grounding the person. We
present a new commonsense task, Human-centric Commonsense Grounding, that tests
models' ability to ground individuals given context descriptions about what
happened before and their mental/physical states or intentions. We further
create a benchmark, HumanCog, a dataset with 130k grounded commonsensical
descriptions annotated on 67k images, covering diverse types of commonsense and
visual scenes. We set up a context-object-aware method as a strong baseline
that outperforms previous pre-trained and non-pre-trained models. Further
analysis demonstrates that rich visual commonsense and powerful integration of
multi-modal commonsense are essential, which sheds light on future work. Data
and code will be available at https://github.com/Hxyou/HumanCog. |
DOI: | 10.48550/arxiv.2212.06971 |