OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works heavily rely on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. Simultaneously, the current mask visual language modeling (MVLM) fails t...

Bibliographic Details
Main Authors	Xiao, Linhui; Yang, Xiaoshan; Peng, Fang; Wang, Yaowei; Xu, Changsheng
Format	Journal Article
Language	English
Published	10.10.2024
Subjects	Computer Science - Computer Vision and Pattern Recognition
DOI	10.48550/arxiv.2410.08021
