Learning Visual Commonsense for Robust Scene Graph Generation
Format | Journal Article |
---|---|
Language | English |
Published | 16.06.2020 |
DOI | 10.48550/arxiv.2006.09623 |
Summary: Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense, such as affordance and intuitive physics, automatically from data, and to use it to improve the robustness of scene understanding. To this end, we extend Transformer models to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our model can be applied to any scene graph generation model and correct its obvious mistakes, resulting in more semantically plausible scene graphs. Through extensive experiments, we show that our model learns commonsense better than any alternative, and improves the accuracy of state-of-the-art scene graph generation methods.
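To make the correction step concrete, below is a minimal PyTorch sketch of the general idea. It is not the paper's Global-Local Attention Transformer: the class name `SceneGraphDenoiser`, the flat sequence encoding of the graph, the random-masking training scheme, and the confidence threshold in `correct` are all illustrative assumptions. The sketch shows how a Transformer trained to reconstruct masked node labels could overwrite low-confidence predictions from an upstream scene graph generator with more plausible ones.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed design, not the paper's GLAT): a Transformer
# encoder reads a scene graph flattened into a sequence of node labels
# (object, predicate, object, ...) and predicts a label for every node.
# Training with randomly masked nodes teaches the model which
# compositions are plausible; at inference, low-confidence upstream
# predictions are masked out and replaced by the model's reconstruction.

class SceneGraphDenoiser(nn.Module):
    def __init__(self, num_labels: int, d_model: int = 128,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.mask_id = num_labels                       # extra [MASK] token
        self.embed = nn.Embedding(num_labels + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, node_ids: torch.Tensor) -> torch.Tensor:
        # node_ids: (batch, num_nodes) integer labels. Attention here is
        # fully global; the paper's model also incorporates the graph
        # structure itself (the "local" part of global-local attention).
        return self.head(self.encoder(self.embed(node_ids)))


def correct(model: SceneGraphDenoiser, node_ids: torch.Tensor,
            confidences: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Replace nodes whose upstream confidence falls below `threshold`
    with the commonsense model's most plausible label."""
    masked = node_ids.clone()
    masked[confidences < threshold] = model.mask_id
    with torch.no_grad():
        plausible = model(masked).argmax(dim=-1)
    return torch.where(confidences < threshold, plausible, node_ids)
```

A call such as `correct(model, pred_labels, pred_scores)` assumes the upstream generator exposes a label and a confidence score per node; the sketch leaves out how the graph's edges constrain attention, which, as the abstract describes, is central to the paper's actual architecture.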