Human-Centric Spatio-Temporal Video Grounding With Visual Transformers

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 32, No. 12, pp. 8238–8249
Main Authors: Tang, Zongheng; Liao, Yue; Liu, Si; Li, Guanbin; Jin, Xiaojie; Jiang, Hongxu; Yu, Qian; Xu, Dong
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2022

Abstract In this work, we introduce a novel task - Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security related applications, where the surveillance videos can be extremely long but only a specific person during a specific period is of concern. HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization. Unfortunately, the existing grounding methods cannot handle this task well. We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT), which utilizes Visual Transformers to extract cross-modal representations for video-sentence matching and temporal localization. To facilitate this task, we also contribute an HC-STVG dataset (available at https://github.com/tzhhhh123/HC-STVG) consisting of 5,660 video-sentence pairs on complex multi-person scenes. Specifically, each video lasts 20 seconds and is paired with a natural query sentence averaging 17.25 words. Extensive experiments are conducted on this dataset, demonstrating that the newly-proposed method outperforms the existing baseline methods.
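The record describes the task but not its implementation, so the sketch below shows only how results for this kind of joint (where + when) localization are typically scored: the spatio-temporal tube IoU (vIoU), which averages per-frame box overlap over the union of the predicted and ground-truth frame spans. This is a minimal illustration, not code from the paper; the tube representation (a dict mapping frame index to an (x1, y1, x2, y2) box) and all names are assumptions.

    # Minimal vIoU sketch (illustrative, not the paper's code).
    # A "tube" is assumed to be {frame_index: (x1, y1, x2, y2)}.

    def box_iou(a, b):
        """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def tube_viou(pred_tube, gt_tube):
        """Average per-frame IoU over the union of both frame spans.

        Frames covered by only one tube contribute 0, so the score
        penalizes temporal misalignment as well as spatial error.
        """
        all_frames = set(pred_tube) | set(gt_tube)
        if not all_frames:
            return 0.0
        shared = set(pred_tube) & set(gt_tube)
        total = sum(box_iou(pred_tube[t], gt_tube[t]) for t in shared)
        return total / len(all_frames)

Under this measure, a prediction that finds the right person over the wrong time span scores low even with perfect boxes, which is what makes HC-STVG harder than image-level referring expression grounding.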
Authors and Affiliations
1. Tang, Zongheng (ORCID: 0000-0002-9903-802X), School of Computer Science and Engineering, Beihang University, Beijing, China
2. Liao, Yue, School of Computer Science and Engineering, Beihang University, Beijing, China
3. Liu, Si (ORCID: 0000-0002-9180-2935; email: liusi@buaa.edu.cn), School of Computer Science and Engineering, Beihang University, Beijing, China
4. Li, Guanbin (ORCID: 0000-0002-4805-0926), School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
5. Jin, Xiaojie, ByteDance AI Lab, Beijing, China
6. Jiang, Hongxu, School of Computer Science and Engineering, Beihang University, Beijing, China
7. Yu, Qian, School of Software, Beihang University, Beijing, China
8. Xu, Dong (ORCID: 0000-0003-2775-9730), School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW, Australia
CODEN ITCTEM
Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
DOI 10.1109/TCSVT.2021.3085907
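The DOI above is the record's stable identifier, and the core bibliographic fields can be retrieved programmatically from the public Crossref REST API (endpoint /works/{doi}). A minimal sketch using only the Python standard library; the field layout accessed below is Crossref's usual response shape, not something this record guarantees:

    import json
    import urllib.request

    DOI = "10.1109/TCSVT.2021.3085907"  # DOI from this record

    # Crossref serves per-work metadata as JSON under the "message" key.
    url = f"https://api.crossref.org/works/{DOI}"
    with urllib.request.urlopen(url) as resp:
        work = json.load(resp)["message"]

    print(work["title"][0])            # article title
    print(work["container-title"][0])  # journal name
    print(", ".join(a["family"] for a in work.get("author", [])))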
Discipline Engineering
EISSN 1558-2205
Genre orig-research
GrantInformation
– Fundamental Research Funds for the Central Universities, Zhejiang Lab (Grant 2019KD0AB04; funder ID 10.13039/501100012226)
– Guangdong Basic and Applied Basic Research Foundation (Grant 2020B1515020048; funder ID 10.13039/501100021171)
– National Key Research and Development Project of China (Grant 2018AAA0101900)
– Beijing Natural Science Foundation (Grant 4202034; funder ID 10.13039/501100004826)
– National Natural Science Foundation of China (Grant 61876177; funder ID 10.13039/501100001809)
ISSN 1051-8215
IsPeerReviewed true
IsScholarly true
PublicationTitleAbbrev TCSVT
SubjectTerms dataset; Datasets; Electron tubes; Grounding; Localization; Location awareness; Power transformers; Spatial temporal resolution; Spatio-temporal grounding; transformer; Transformers; Video; Visualization
URI https://ieeexplore.ieee.org/document/9446308
https://www.proquest.com/docview/2747611722