Decoupling Multimodal Transformers for Referring Video Object Segmentation

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, No. 9, pp. 4518-4528
Main Authors: Gao, Mingqi; Yang, Jinyu; Han, Jungong; Lu, Ke; Zheng, Feng; Montana, Giovanni
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2023
DOI: 10.1109/TCSVT.2023.3284979
ISSN: 1051-8215
EISSN: 1558-2205
CODEN: ITCTEM
Subjects: Alignment; Decoupled multimodal transformers; Decoupling; Referring video object segmentation; Segmentation; Transformers; Vision; Vision-language pre-training
Online Access: https://ieeexplore.ieee.org/document/10147907

Abstract: Referring Video Object Segmentation (RVOS) aims to segment the text-depicted object from video sequences. With excellent capabilities in long-range modelling and information interaction, transformers have been increasingly applied in existing RVOS architectures. To better leverage multimodal data, most efforts focus on the interaction between visual and textual features. However, they ignore the syntactic structure of the text during the interaction, where all textual components are intertwined, resulting in ambiguous vision-language alignment. In this paper, we improve the multimodal interaction by decoupling this interweaving. Specifically, we train a lightweight subject perceptron, which extracts the subject part from the input text. Then, the subject and text features are fed into two parallel branches to interact with visual features. This enables us to perform subject-aware and context-aware interactions, respectively, thus encouraging more explicit and discriminative feature embedding and alignment. Moreover, we find that the decoupled architecture also facilitates incorporating vision-language pre-trained alignment into RVOS, further improving segmentation performance. Experimental results on all RVOS benchmark datasets demonstrate the superiority of our proposed method over the state of the art. The code of our method is available at: https://github.com/gaomingqi/dmformer.
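
The "subject part" here is the noun phrase naming the referred object (e.g., "a man" in "a man in a red shirt riding a skateboard"). The paper trains a lightweight subject perceptron to find it; as a rough, non-authoritative illustration of the idea, the sketch below approximates subject extraction with an off-the-shelf dependency parser instead. The use of spaCy and the example sentence are assumptions, not the authors' tooling.

# Illustration only: approximate the "subject part" of a referring
# expression with a dependency parse. The paper instead trains a
# lightweight perceptron; spaCy here is a stand-in, not their method.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def extract_subject(expression: str) -> str:
    """Return the head noun of the expression plus its left-side modifiers."""
    doc = nlp(expression)
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN") and token.dep_ in ("ROOT", "nsubj", "nsubjpass"):
            # keep determiners/adjectives/compounds directly left of the head
            lefts = [t for t in token.lefts if t.dep_ in ("det", "amod", "compound")]
            start = lefts[0].i if lefts else token.i
            return doc[start : token.i + 1].text
    return expression  # fall back to the full text

print(extract_subject("a man in a red shirt riding a skateboard"))
# typically prints: "a man"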
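Architecturally, the decoupling amounts to running two cross-attention branches in parallel over the same visual features: one restricted to the subject tokens (subject-aware) and one over all text tokens (context-aware), with the two outputs fused. The PyTorch sketch below is a minimal reading of that description; the dimensions, the masking scheme, and fusion by concatenation are assumptions rather than the authors' implementation (see the repository above for that).

# Minimal sketch of a decoupled two-branch vision-language interaction.
# Not the authors' code: shapes and fusion are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # parallel subject-aware and context-aware cross-attention branches
        self.subject_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.context_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual, text, non_subject_mask):
        # visual: (B, N, C) flattened frame features used as queries
        # text:   (B, L, C) token features used as keys/values
        # non_subject_mask: (B, L) bool, True where a token is NOT in the
        # subject, so the subject branch attends to subject tokens only
        subj, _ = self.subject_attn(visual, text, text,
                                    key_padding_mask=non_subject_mask)
        ctx, _ = self.context_attn(visual, text, text)
        return self.fuse(torch.cat([subj, ctx], dim=-1))

# toy usage: batch of 2, 14x14 visual tokens, 10 text tokens
vis = torch.randn(2, 196, 256)
txt = torch.randn(2, 10, 256)
mask = torch.ones(2, 10, dtype=torch.bool)
mask[:, :2] = False  # pretend the first two tokens form the subject
out = DecoupledInteraction()(vis, txt, mask)
print(out.shape)  # torch.Size([2, 196, 256])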

Authors and Affiliations:
– Mingqi Gao (ORCID: 0000-0002-8688-8228), Southern University of Science and Technology, Shenzhen, China
– Jinyu Yang, Southern University of Science and Technology, Shenzhen, China
– Jungong Han (ORCID: 0000-0003-4361-956X), University of Warwick, Coventry, U.K.
– Ke Lu (ORCID: 0000-0003-0176-3088), University of Chinese Academy of Sciences, Beijing, China
– Feng Zheng (ORCID: 0000-0002-1701-9141), Southern University of Science and Technology, Shenzhen, China
– Giovanni Montana (ORCID: 0000-0003-3942-3900), University of Warwick, Coventry, U.K.

Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023

Funding:
– National Natural Science Foundation of China (Grants 61972188 and 62122035)
– National Key Research and Development Program of China (Grant 2022YFF1202903)