Decoupling Multimodal Transformers for Referring Video Object Segmentation
Referring Video Object Segmentation (RVOS) aims to segment the text-depicted object from video sequences. With excellent capabilities in long-range modelling and information interaction, transformers have been increasingly applied in existing RVOS architectures. To better leverage multimodal data, m...
Published in | IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, no. 9, pp. 4518-4528 |
---|---|
Main Authors | Gao, Mingqi; Yang, Jinyu; Han, Jungong; Lu, Ke; Zheng, Feng; Montana, Giovanni |
Format | Journal Article |
Language | English |
Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2023 |
Subjects | Alignment; Decoupled multimodal transformers; Decoupling; Referring video object segmentation; Segmentation; Transformers; Vision; Vision-language pre-training |
Abstract | Referring Video Object Segmentation (RVOS) aims to segment the text-depicted object from video sequences. With excellent capabilities in long-range modelling and information interaction, transformers have been increasingly applied in existing RVOS architectures. To better leverage multimodal data, most efforts focus on the interaction between visual and textual features. However, they ignore the syntactic structures of the text during the interaction, where all textual components are intertwined, resulting in ambiguous vision-language alignment. In this paper, we improve the multimodal interaction by DECOUPLING the interweave. Specifically, we train a lightweight subject perceptron, which extracts the subject part from the input text. Then, the subject and text features are fed into two parallel branches to interact with visual features. This enables us to perform subject-aware and context-aware interactions, respectively, thus encouraging more explicit and discriminative feature embedding and alignment. Moreover, we find the decoupled architecture also facilitates incorporating the vision-language pre-trained alignment into RVOS, further improving the segmentation performance. Experimental results on all RVOS benchmark datasets demonstrate the superiority of our proposed method over the state-of-the-arts. The code of our method is available at: https://github.com/gaomingqi/dmformer. |
---|---|
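The abstract describes a decoupled two-branch design: a lightweight subject perceptron extracts the subject tokens from the referring expression, and the subject and full-text features then interact with visual features in two parallel branches (subject-aware and context-aware) before fusion. A minimal NumPy sketch of that idea follows; all names, dimensions, the subject-selection stand-in, and the sum-fusion step are illustrative assumptions, not the authors' implementation (the real code is at the linked GitHub repository):

```python
import numpy as np

def cross_attention(queries, keys_values, d):
    # Scaled dot-product cross-attention: each query attends over keys_values.
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# Toy dimensions (hypothetical): d = feature size, N = visual tokens,
# L = text tokens, S = subject tokens.
d, N, L, S = 8, 16, 6, 2
rng = np.random.default_rng(0)
visual = rng.standard_normal((N, d))
text = rng.standard_normal((L, d))
# Stand-in for the subject perceptron: in the paper a trained module
# extracts the subject part; here we just take the first S text tokens.
subject = text[:S]

# Two parallel branches: subject-aware and context-aware interaction.
subject_aware = cross_attention(visual, subject, d)
context_aware = cross_attention(visual, text, d)

# Fuse the decoupled branches (a simple sum here; the actual fusion
# and mask decoding live in the released implementation).
fused = subject_aware + context_aware
print(fused.shape)  # (16, 8)
```

Decoupling the branches keeps the subject tokens from being drowned out by the rest of the expression, which is the ambiguity the abstract attributes to intertwined vision-language alignment.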
Author | Lu, Ke; Han, Jungong; Montana, Giovanni; Zheng, Feng; Yang, Jinyu; Gao, Mingqi |
Author_xml | – sequence: 1, Gao, Mingqi (ORCID 0000-0002-8688-8228), Southern University of Science and Technology, Shenzhen, China – sequence: 2, Yang, Jinyu, Southern University of Science and Technology, Shenzhen, China – sequence: 3, Han, Jungong (ORCID 0000-0003-4361-956X), University of Warwick, Coventry, U.K. – sequence: 4, Lu, Ke (ORCID 0000-0003-0176-3088), University of Chinese Academy of Sciences, Beijing, China – sequence: 5, Zheng, Feng (ORCID 0000-0002-1701-9141), Southern University of Science and Technology, Shenzhen, China – sequence: 6, Montana, Giovanni (ORCID 0000-0003-3942-3900), University of Warwick, Coventry, U.K. |
CODEN | ITCTEM |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
DOI | 10.1109/TCSVT.2023.3284979 |
Discipline | Engineering |
EISSN | 1558-2205 |
EndPage | 4528
Genre | orig-research |
GrantInformation_xml | – fundername: National Natural Science Foundation of China grantid: 61972188; 62122035 funderid: 10.13039/501100001809 – fundername: National Key Research and Development Program of China grantid: 2022YFF1202903 funderid: 10.13039/501100012166 |
ISSN | 1051-8215 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 9 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
ORCID | 0000-0003-0176-3088 0000-0002-1701-9141 0000-0003-4361-956X 0000-0002-8688-8228 0000-0003-3942-3900 |
PageCount | 11
PublicationDate | 2023-09-01 |
PublicationPlace | New York |
PublicationTitle | IEEE transactions on circuits and systems for video technology |
PublicationTitleAbbrev | TCSVT |
PublicationYear | 2023 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 4518
SubjectTerms | Alignment Decoupled multimodal transformers Decoupling Referring video object segmentation Segmentation Transformers Vision Vision-language pre-training |
Title | Decoupling Multimodal Transformers for Referring Video Object Segmentation |
URI | https://ieeexplore.ieee.org/document/10147907 https://www.proquest.com/docview/2861467791 |
Volume | 33 |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Decoupling+Multimodal+Transformers+for+Referring+Video+Object+Segmentation&rft.jtitle=IEEE+transactions+on+circuits+and+systems+for+video+technology&rft.au=Gao%2C+Mingqi&rft.au=Yang%2C+Jinyu&rft.au=Han%2C+Jungong&rft.au=Lu%2C+Ke&rft.date=2023-09-01&rft.issn=1051-8215&rft.eissn=1558-2205&rft.volume=33&rft.issue=9&rft.spage=4518&rft.epage=4528&rft_id=info:doi/10.1109%2FTCSVT.2023.3284979&rft.externalDBID=n%2Fa&rft.externalDocID=10_1109_TCSVT_2023_3284979 |