Human-Centric Spatio-Temporal Video Grounding With Visual Transformers
In this work, we introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given...
Published in | IEEE Transactions on Circuits and Systems for Video Technology Vol. 32; no. 12; pp. 8238-8249 |
Main Authors | Tang, Zongheng; Liao, Yue; Liu, Si; Li, Guanbin; Jin, Xiaojie; Jiang, Hongxu; Yu, Qian; Xu, Dong |
Format | Journal Article |
Language | English |
Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2022 |
Subjects | |
Abstract | In this work, we introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security related applications, where the surveillance videos can be extremely long but only a specific person during a specific period is concerned. HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization. Unfortunately, the existing grounding methods cannot handle this task well. We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT), which utilizes Visual Transformers to extract cross-modal representations for video-sentence matching and temporal localization. To facilitate this task, we also contribute an HC-STVG dataset (available at https://github.com/tzhhhh123/HC-STVG ) consisting of 5,660 video-sentence pairs on complex multi-person scenes. Specifically, each video lasts for 20 seconds and is paired with a natural query sentence with an average of 17.25 words. Extensive experiments are conducted on this dataset, demonstrating that the newly-proposed method outperforms the existing baseline methods. |
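The abstract describes localizing a spatio-temporal "tube" of a person, i.e. a bounding-box track over a span of frames. As an illustrative sketch only (this code is not from the paper), results of this kind are commonly scored with vIoU, the spatial IoU averaged over the temporal union of the predicted and ground-truth tubes. All names here (`box_iou`, `tube_viou`) are hypothetical.

```python
def box_iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def tube_viou(pred, gt):
    """vIoU: sum of per-frame box IoU over frames where both tubes exist,
    divided by the number of frames covered by either tube."""
    # pred, gt: dicts mapping frame index -> box (x1, y1, x2, y2).
    frames_union = set(pred) | set(gt)
    frames_inter = set(pred) & set(gt)
    if not frames_union:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) for f in frames_inter) / len(frames_union)
```

For example, a predicted tube covering frames 0-1 against a ground-truth tube covering frames 1-2 with an identical box on the shared frame scores 1/3: one perfectly-matched frame out of three frames in the temporal union.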
Author | Li, Guanbin Liao, Yue Yu, Qian Liu, Si Xu, Dong Tang, Zongheng Jin, Xiaojie Jiang, Hongxu |
Author_xml | – sequence: 1 givenname: Zongheng orcidid: 0000-0002-9903-802X surname: Tang fullname: Tang, Zongheng organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 2 givenname: Yue surname: Liao fullname: Liao, Yue organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 3 givenname: Si orcidid: 0000-0002-9180-2935 surname: Liu fullname: Liu, Si email: liusi@buaa.edu.cn organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 4 givenname: Guanbin orcidid: 0000-0002-4805-0926 surname: Li fullname: Li, Guanbin organization: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China – sequence: 5 givenname: Xiaojie surname: Jin fullname: Jin, Xiaojie organization: ByteDance AI Lab, Beijing, China – sequence: 6 givenname: Hongxu surname: Jiang fullname: Jiang, Hongxu organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 7 givenname: Qian surname: Yu fullname: Yu, Qian organization: School of Software, Beihang University, Beijing, China – sequence: 8 givenname: Dong orcidid: 0000-0003-2775-9730 surname: Xu fullname: Xu, Dong organization: School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW, Australia |
CODEN | ITCTEM |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
DOI | 10.1109/TCSVT.2021.3085907 |
DatabaseName | IEEE Xplore (IEEE) IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
DatabaseTitle | CrossRef Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Technology Research Database |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering |
EISSN | 1558-2205 |
EndPage | 8249 |
ExternalDocumentID | 10_1109_TCSVT_2021_3085907 9446308 |
Genre | orig-research |
GrantInformation_xml | – fundername: Fundamental Research Funds for the Central Universities, Zhejiang Lab grantid: 2019KD0AB04 funderid: 10.13039/501100012226 – fundername: Basic and Applied Basic Research Foundation of Guangdong Province; Guangdong Basic and Applied Basic Research Foundation grantid: 2020B1515020048 funderid: 10.13039/501100021171 – fundername: National Key Research and Development Project of China grantid: 2018AAA0101900 – fundername: Beijing Natural Science Foundation grantid: 4202034 funderid: 10.13039/501100004826 – fundername: National Natural Science Foundation of China grantid: 61876177 funderid: 10.13039/501100001809 |
IEDL.DBID | RIE |
ISSN | 1051-8215 |
IngestDate | Mon Jun 30 06:30:14 EDT 2025 Thu Apr 24 23:08:00 EDT 2025 Tue Jul 01 00:41:15 EDT 2025 Wed Aug 27 02:29:09 EDT 2025 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 12 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
LinkModel | DirectLink |
ORCID | 0000-0002-9180-2935 0000-0002-9903-802X 0000-0002-4805-0926 0000-0003-2775-9730 |
PQID | 2747611722 |
PQPubID | 85433 |
PageCount | 12 |
ParticipantIDs | crossref_citationtrail_10_1109_TCSVT_2021_3085907 ieee_primary_9446308 proquest_journals_2747611722 crossref_primary_10_1109_TCSVT_2021_3085907 |
PublicationCentury | 2000 |
PublicationDate | 2022-12-01 |
PublicationDateYYYYMMDD | 2022-12-01 |
PublicationDate_xml | – month: 12 year: 2022 text: 2022-12-01 day: 01 |
PublicationDecade | 2020 |
PublicationPlace | New York |
PublicationPlace_xml | – name: New York |
PublicationTitle | IEEE transactions on circuits and systems for video technology |
PublicationTitleAbbrev | TCSVT |
PublicationYear | 2022 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
SourceID | proquest crossref ieee |
SourceType | Aggregation Database Enrichment Source Index Database Publisher |
StartPage | 8238 |
SubjectTerms | dataset Datasets Electron tubes Grounding Localization Location awareness Power transformers Spatial temporal resolution Spatio-temporal grounding transformer Transformers Video Visualization |
Title | Human-Centric Spatio-Temporal Video Grounding With Visual Transformers |
URI | https://ieeexplore.ieee.org/document/9446308 https://www.proquest.com/docview/2747611722 |
Volume | 32 |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV07T8MwELZKJxh4FUShoAxs4DR2nh5RRVUhlaVp6Rb5FVFRtYgmC7-es5NUFSDEFiW2ZPkc3_fZ990hdOslEmA4UZhzaUqYEYm5iAUGV8uBPXhUWpHY-DkaTYOneThvofutFkZrbYPPtGse7V2-WsvSHJX1GXAX3yh794C4VVqt7Y1BkNhiYgAXCE7AjzUCGY_108FklgIVpMT1TT4vUzp2xwnZqio_tmLrX4ZHaNyMrAoreXPLQrjy81vSxv8O_Rgd1kDTeahWxglq6dUpOthJP9hBQ3uCj-0B70I6ExtcjdMqWdXSmS2UXjvmcMoqX5yXRfEKLzclfEsbvAvo8QxNh4_pYITrugpYUhYWWAVxwCPFfWBbhCdBSLXRUwO1Eb4IqYxzH3BA7gUqjKUOhCKS5YJEnID1lKb-OWqv1it9gRwKaFFyCjtFHgWcx0kYcZ8zzgQROfVYF5FmojNZJx03tS-WmSUfHsuscTJjnKw2Thfdbfu8Vyk3_mzdMbO9bVlPdBf1Gntm9V-5yQwDjwhANnr5e68rtE-NvMGGq_RQu_go9TWAjkLc2NX2BUEQ0U8 |
linkProvider | IEEE |
linkToHtml | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwzV1LT8MwDLYQHIADb8R49gAnlNGk7wMHBEzjtcu6wa0kaSom0IZYJwS_hb_Cf8NJ2wkB4obErWpTqY0t-7PjzwbYtUOJMJymhHOpR5hRSbgIBEFXyzF6sJk0JLGrlt_suOc33s0EvI25MEopU3ym6vrSnOWnAznSqbKDCGMXxw7LEsoL9fKMAdrw8OwEpbnHWOM0Pm6ScoYAkSzycpK6gcv9lDsYWVAeuh5TmjuMMF44wmMyyBz0eZntpl4glStSKqNMUJ9T_NJU6bYGaOCnEGd4rGCHjc8o3NCML0OAQkmInrOi5NjRQXzc7sYYfDJad3QHMT2s9pPbM3Ncvhl_49Ea8_Be7UVRyHJfH-WiLl-_tIn8r5u1AHMllLaOCt1fhAnVX4LZTw0Wl6FhziiISWH3pNU25eMkLtpxPVjdXqoGlk6_GW6Pdd3L7_DmcITP4grRIz5egc6f_MgqTPYHfbUGFkM8LDlDW5j5LudB6Pnc4RGPBBUZs6Ma0EqwiSzbquvpHg-JCa_sKDHKkGhlSEplqMH--J3HoqnIr6uXtXTHK0vB1mCz0p-ktDvDROcYfIqglK3__NYOTDfjq8vk8qx1sQEzTJM5THHOJkzmTyO1hRArF9tG0y24_Wtt-QCtni1C |