Decoupling Multimodal Transformers for Referring Video Object Segmentation

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, No. 9, pp. 4518-4528
Main Authors: Gao, Mingqi; Yang, Jinyu; Han, Jungong; Lu, Ke; Zheng, Feng; Montana, Giovanni
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2023
DOI: 10.1109/TCSVT.2023.3284979
ISSN: 1051-8215
EISSN: 1558-2205
CODEN: ITCTEM
Subjects: Alignment; Decoupled multimodal transformers; Decoupling; Referring video object segmentation; Segmentation; Transformers; Vision; Vision-language pre-training
Online Access: https://ieeexplore.ieee.org/document/10147907

Abstract: Referring Video Object Segmentation (RVOS) aims to segment the text-depicted object from video sequences. With excellent capabilities in long-range modelling and information interaction, transformers have been increasingly applied in existing RVOS architectures. To better leverage multimodal data, most efforts focus on the interaction between visual and textual features. However, they ignore the syntactic structure of the text during the interaction, where all textual components are intertwined, resulting in ambiguous vision-language alignment. In this paper, we improve the multimodal interaction by decoupling this interweaving. Specifically, we train a lightweight subject perceptron, which extracts the subject part from the input text. Then, the subject and text features are fed into two parallel branches to interact with visual features. This enables us to perform subject-aware and context-aware interactions, respectively, thus encouraging more explicit and discriminative feature embedding and alignment. Moreover, we find that the decoupled architecture also facilitates incorporating vision-language pre-trained alignment into RVOS, further improving segmentation performance. Experimental results on all RVOS benchmark datasets demonstrate the superiority of our proposed method over the state of the art. The code of our method is available at: https://github.com/gaomingqi/dmformer.
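
The "subject part" here is the noun phrase naming the referred object (e.g., "a man" in "a man in a red shirt riding a skateboard"). The paper trains a lightweight subject perceptron to find it; as a rough, non-authoritative illustration of the idea, the sketch below approximates subject extraction with an off-the-shelf dependency parser instead. The use of spaCy and the example sentence are assumptions, not the authors' tooling.

# Illustration only: approximate the "subject part" of a referring
# expression with a dependency parse. The paper instead trains a
# lightweight perceptron; spaCy here is a stand-in, not their method.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def extract_subject(expression: str) -> str:
    """Return the head noun of the expression plus its left-side modifiers."""
    doc = nlp(expression)
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN") and token.dep_ in ("ROOT", "nsubj", "nsubjpass"):
            # keep determiners/adjectives/compounds directly left of the head
            lefts = [t for t in token.lefts if t.dep_ in ("det", "amod", "compound")]
            start = lefts[0].i if lefts else token.i
            return doc[start : token.i + 1].text
    return expression  # fall back to the full text

print(extract_subject("a man in a red shirt riding a skateboard"))
# typically prints: "a man"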
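Architecturally, the decoupling amounts to running two cross-attention branches in parallel over the same visual features: one restricted to the subject tokens (subject-aware) and one over all text tokens (context-aware), with the two outputs fused. The PyTorch sketch below is a minimal reading of that description; the dimensions, the masking scheme, and fusion by concatenation are assumptions rather than the authors' implementation (see the repository above for that).

# Minimal sketch of a decoupled two-branch vision-language interaction.
# Not the authors' code: shapes and fusion are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # parallel subject-aware and context-aware cross-attention branches
        self.subject_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.context_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual, text, non_subject_mask):
        # visual: (B, N, C) flattened frame features used as queries
        # text:   (B, L, C) token features used as keys/values
        # non_subject_mask: (B, L) bool, True where a token is NOT in the
        # subject, so the subject branch attends to subject tokens only
        subj, _ = self.subject_attn(visual, text, text,
                                    key_padding_mask=non_subject_mask)
        ctx, _ = self.context_attn(visual, text, text)
        return self.fuse(torch.cat([subj, ctx], dim=-1))

# toy usage: batch of 2, 14x14 visual tokens, 10 text tokens
vis = torch.randn(2, 196, 256)
txt = torch.randn(2, 10, 256)
mask = torch.ones(2, 10, dtype=torch.bool)
mask[:, :2] = False  # pretend the first two tokens form the subject
out = DecoupledInteraction()(vis, txt, mask)
print(out.shape)  # torch.Size([2, 196, 256])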

Authors and Affiliations:
– Mingqi Gao (ORCID: 0000-0002-8688-8228), Southern University of Science and Technology, Shenzhen, China
– Jinyu Yang, Southern University of Science and Technology, Shenzhen, China
– Jungong Han (ORCID: 0000-0003-4361-956X), University of Warwick, Coventry, U.K.
– Ke Lu (ORCID: 0000-0003-0176-3088), University of Chinese Academy of Sciences, Beijing, China
– Feng Zheng (ORCID: 0000-0002-1701-9141), Southern University of Science and Technology, Shenzhen, China
– Giovanni Montana (ORCID: 0000-0003-3942-3900), University of Warwick, Coventry, U.K.

Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023

Funding:
– National Natural Science Foundation of China (Grants 61972188 and 62122035)
– National Key Research and Development Program of China (Grant 2022YFF1202903)