OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works heavily rely on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. Simultaneously, the current mask visual language modeling (MVLM) fails t...
Saved in:
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
10.10.2024
|
Subjects | |
Online Access | Get full text |
DOI | 10.48550/arxiv.2410.08021 |
Cover
Loading…