Human-Centric Spatio-Temporal Video Grounding With Visual Transformers
In this work, we introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given...
Published in | IEEE Transactions on Circuits and Systems for Video Technology Vol. 32; no. 12; pp. 8238-8249 |
Main Authors | Tang, Zongheng; Liao, Yue; Liu, Si; Li, Guanbin; Jin, Xiaojie; Jiang, Hongxu; Yu, Qian; Xu, Dong |
Format | Journal Article |
Language | English |
Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2022 |
Subjects | |
Abstract | In this work, we introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security related applications, where the surveillance videos can be extremely long but only a specific person during a specific period is concerned. HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization. Unfortunately, the existing grounding methods cannot handle this task well. We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT), which utilizes Visual Transformers to extract cross-modal representations for video-sentence matching and temporal localization. To facilitate this task, we also contribute an HC-STVG dataset (available at https://github.com/tzhhhh123/HC-STVG ) consisting of 5,660 video-sentence pairs on complex multi-person scenes. Specifically, each video lasts for 20 seconds and is paired with a natural query sentence with an average of 17.25 words. Extensive experiments are conducted on this dataset, demonstrating that the newly-proposed method outperforms the existing baseline methods. |
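The abstract describes localizing a spatio-temporal "tube" of a person, i.e. a bounding-box track over a span of frames. As an illustrative sketch only (this code is not from the paper), results of this kind are commonly scored with vIoU, the spatial IoU averaged over the temporal union of the predicted and ground-truth tubes. All names here (`box_iou`, `tube_viou`) are hypothetical.

```python
def box_iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def tube_viou(pred, gt):
    """vIoU: sum of per-frame box IoU over frames where both tubes exist,
    divided by the number of frames covered by either tube."""
    # pred, gt: dicts mapping frame index -> box (x1, y1, x2, y2).
    frames_union = set(pred) | set(gt)
    frames_inter = set(pred) & set(gt)
    if not frames_union:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) for f in frames_inter) / len(frames_union)
```

For example, a predicted tube covering frames 0-1 against a ground-truth tube covering frames 1-2 with an identical box on the shared frame scores 1/3: one perfectly-matched frame out of three frames in the temporal union.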
Author | Li, Guanbin Liao, Yue Yu, Qian Liu, Si Xu, Dong Tang, Zongheng Jin, Xiaojie Jiang, Hongxu |
Author_xml | – sequence: 1 givenname: Zongheng orcidid: 0000-0002-9903-802X surname: Tang fullname: Tang, Zongheng organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 2 givenname: Yue surname: Liao fullname: Liao, Yue organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 3 givenname: Si orcidid: 0000-0002-9180-2935 surname: Liu fullname: Liu, Si email: liusi@buaa.edu.cn organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 4 givenname: Guanbin orcidid: 0000-0002-4805-0926 surname: Li fullname: Li, Guanbin organization: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China – sequence: 5 givenname: Xiaojie surname: Jin fullname: Jin, Xiaojie organization: ByteDance AI Lab, Beijing, China – sequence: 6 givenname: Hongxu surname: Jiang fullname: Jiang, Hongxu organization: School of Computer Science and Engineering, Beihang University, Beijing, China – sequence: 7 givenname: Qian surname: Yu fullname: Yu, Qian organization: School of Software, Beihang University, Beijing, China – sequence: 8 givenname: Dong orcidid: 0000-0003-2775-9730 surname: Xu fullname: Xu, Dong organization: School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW, Australia |
CODEN | ITCTEM |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
DOI | 10.1109/TCSVT.2021.3085907 |
DatabaseName | IEEE Xplore (IEEE) IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
DatabaseTitle | CrossRef Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Technology Research Database |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering |
EISSN | 1558-2205 |
EndPage | 8249 |
ExternalDocumentID | 10_1109_TCSVT_2021_3085907 9446308 |
Genre | orig-research |
GrantInformation_xml | – fundername: Fundamental Research Funds for the Central Universities, Zhejiang Lab grantid: 2019KD0AB04 funderid: 10.13039/501100012226 – fundername: Basic and Applied Basic Research Foundation of Guangdong Province; Guangdong Basic and Applied Basic Research Foundation grantid: 2020B1515020048 funderid: 10.13039/501100021171 – fundername: National Key Research and Development Project of China grantid: 2018AAA0101900 – fundername: Beijing Natural Science Foundation grantid: 4202034 funderid: 10.13039/501100004826 – fundername: National Natural Science Foundation of China grantid: 61876177 funderid: 10.13039/501100001809 |
IEDL.DBID | RIE |
ISSN | 1051-8215 |
IngestDate | Mon Jun 30 06:30:14 EDT 2025 Thu Apr 24 23:08:00 EDT 2025 Tue Jul 01 00:41:15 EDT 2025 Wed Aug 27 02:29:09 EDT 2025 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 12 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
LinkModel | DirectLink |
ORCID | 0000-0002-9180-2935 0000-0002-9903-802X 0000-0002-4805-0926 0000-0003-2775-9730 |
PQID | 2747611722 |
PQPubID | 85433 |
PageCount | 12 |
ParticipantIDs | crossref_citationtrail_10_1109_TCSVT_2021_3085907 ieee_primary_9446308 proquest_journals_2747611722 crossref_primary_10_1109_TCSVT_2021_3085907 |
PublicationCentury | 2000 |
PublicationDate | 2022-12-01 |
PublicationDateYYYYMMDD | 2022-12-01 |
PublicationDate_xml | – month: 12 year: 2022 text: 2022-12-01 day: 01 |
PublicationDecade | 2020 |
PublicationPlace | New York |
PublicationPlace_xml | – name: New York |
PublicationTitle | IEEE transactions on circuits and systems for video technology |
PublicationTitleAbbrev | TCSVT |
PublicationYear | 2022 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
SourceID | proquest crossref ieee |
SourceType | Aggregation Database Enrichment Source Index Database Publisher |
StartPage | 8238 |
SubjectTerms | dataset Datasets Electron tubes Grounding Localization Location awareness Power transformers Spatial temporal resolution Spatio-temporal grounding transformer Transformers Video Visualization |
Title | Human-Centric Spatio-Temporal Video Grounding With Visual Transformers |
URI | https://ieeexplore.ieee.org/document/9446308 https://www.proquest.com/docview/2747611722 |
Volume | 32 |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV07T8MwELZKJxh4FUShoAxs4DR2nh5RRVUhlaVp6Rb5FVFRtYgmC7-es5NUFSDEFiW2ZPkc3_fZ990hdOslEmA4UZhzaUqYEYm5iAUGV8uBPXhUWpHY-DkaTYOneThvofutFkZrbYPPtGse7V2-WsvSHJX1GXAX3yh794C4VVqt7Y1BkNhiYgAXCE7AjzUCGY_108FklgIVpMT1TT4vUzp2xwnZqio_tmLrX4ZHaNyMrAoreXPLQrjy81vSxv8O_Rgd1kDTeahWxglq6dUpOthJP9hBQ3uCj-0B70I6ExtcjdMqWdXSmS2UXjvmcMoqX5yXRfEKLzclfEsbvAvo8QxNh4_pYITrugpYUhYWWAVxwCPFfWBbhCdBSLXRUwO1Eb4IqYxzH3BA7gUqjKUOhCKS5YJEnID1lKb-OWqv1it9gRwKaFFyCjtFHgWcx0kYcZ8zzgQROfVYF5FmojNZJx03tS-WmSUfHsuscTJjnKw2Thfdbfu8Vyk3_mzdMbO9bVlPdBf1Gntm9V-5yQwDjwhANnr5e68rtE-NvMGGq_RQu_go9TWAjkLc2NX2BUEQ0U8 |
linkProvider | IEEE |
linkToHtml | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwzV1LT8MwDLYQHIADb8R49gAnlNGk7wMHBEzjtcu6wa0kaSom0IZYJwS_hb_Cf8NJ2wkB4obErWpTqY0t-7PjzwbYtUOJMJymhHOpR5hRSbgIBEFXyzF6sJk0JLGrlt_suOc33s0EvI25MEopU3ym6vrSnOWnAznSqbKDCGMXxw7LEsoL9fKMAdrw8OwEpbnHWOM0Pm6ScoYAkSzycpK6gcv9lDsYWVAeuh5TmjuMMF44wmMyyBz0eZntpl4glStSKqNMUJ9T_NJU6bYGaOCnEGd4rGCHjc8o3NCML0OAQkmInrOi5NjRQXzc7sYYfDJad3QHMT2s9pPbM3Ncvhl_49Ea8_Be7UVRyHJfH-WiLl-_tIn8r5u1AHMllLaOCt1fhAnVX4LZTw0Wl6FhziiISWH3pNU25eMkLtpxPVjdXqoGlk6_GW6Pdd3L7_DmcITP4grRIz5egc6f_MgqTPYHfbUGFkM8LDlDW5j5LudB6Pnc4RGPBBUZs6Ma0EqwiSzbquvpHg-JCa_sKDHKkGhlSEplqMH--J3HoqnIr6uXtXTHK0vB1mCz0p-ktDvDROcYfIqglK3__NYOTDfjq8vk8qx1sQEzTJM5THHOJkzmTyO1hRArF9tG0y24_Wtt-QCtni1C |