Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus
Main Authors | , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 04.07.2022 |
Subjects | |
Summary: | Referring Video Object Segmentation (R-VOS) is a challenging task that aims
to segment an object in a video based on a linguistic expression. Most existing
R-VOS methods have a critical assumption: the object referred to must appear in
the video. This assumption, which we refer to as semantic consensus, is often
violated in real-world scenarios, where the expression may be queried against
false videos. In this work, we highlight the need for a robust R-VOS model that
can handle semantic mismatches. Accordingly, we propose an extended task called
Robust R-VOS, which accepts unpaired video-text inputs. We tackle this problem
by jointly modeling the primary R-VOS problem and its dual (text
reconstruction). A structural text-to-text cycle constraint is introduced to
discriminate semantic consensus between video-text pairs and impose it in
positive pairs, thereby achieving multi-modal alignment from both positive and
negative pairs. Our structural constraint effectively addresses the challenge
posed by linguistic diversity, overcoming the limitations of previous methods
that relied on the point-wise constraint. A new evaluation dataset,
R²-Youtube-VOS, is constructed to measure model robustness.
Our model achieves state-of-the-art performance on the R-VOS benchmarks
Ref-DAVIS17 and Ref-Youtube-VOS, as well as on our
R²-Youtube-VOS dataset. |
---|---|
DOI: | 10.48550/arxiv.2207.01203 |