OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works heavily rely on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. Simultaneously, the current mask visual language modeling (MVLM) fails t...

Bibliographic Details
Main Authors Xiao, Linhui; Yang, Xiaoshan; Peng, Fang; Wang, Yaowei; Xu, Changsheng
Format Journal Article
Language English
Published 10.10.2024
DOI 10.48550/arxiv.2410.08021
