Selective Structured State-Spaces for Long-Form Video Understanding

Bibliographic Details
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6387-6397
Main Authors Wang, Jue; Zhu, Wentao; Wang, Pichao; Yu, Xiang; Liu, Linda; Omar, Mohamed; Hamid, Raffay
Format Conference Proceeding
Language English
Published IEEE 01.06.2023

Abstract Effective modeling of complex spatiotemporal dependencies in long-form videos remains an open problem. The recently proposed Structured State-Space Sequence (S4) model with its linear complexity offers a promising direction in this space. However, we demonstrate that treating all image tokens equally, as done by the S4 model, can adversely affect its efficiency and accuracy. To address this limitation, we present a novel Selective S4 (i.e., S5) model that employs a lightweight mask generator to adaptively select informative image tokens, resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos. Unlike previous mask-based token reduction methods used in transformers, our S5 model avoids the dense self-attention calculation by making use of the guidance of the momentum-updated S4 model. This enables our model to efficiently discard less informative tokens and adapt to various long-form video understanding tasks more effectively. However, as is the case for most token reduction methods, informative image tokens may be dropped incorrectly. To improve the robustness and the temporal horizon of our model, we propose a novel long-short masked contrastive learning (LSMCL) approach that enables our model to predict longer temporal context using shorter input videos. We present extensive comparative results on three challenging long-form video understanding datasets (LVU, COIN and Breakfast), demonstrating that our approach consistently outperforms the previous state-of-the-art S4 model by up to 9.6% accuracy while reducing its memory footprint by 23%.
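The abstract describes the core S5 mechanism: a lightweight mask generator scores image tokens, the S4 backbone processes only the most informative ones, and a momentum-updated copy of the S4 model guides the selection. The sketch below is an illustrative reading of that description, not the authors' implementation; the module names, the MLP scorer, the top-k selection rule, and the hyperparameters (keep_ratio, momentum) are assumptions made for this example, and the LSMCL pre-training objective is omitted.

```python
# Minimal sketch (not the paper's code) of selective token reduction with a
# momentum-updated teacher, assuming a generic (batch, tokens, dim) layout.
import copy
import torch
import torch.nn as nn


class MaskGenerator(nn.Module):
    """Lightweight per-token scorer (hypothetical two-layer MLP)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.net(tokens).squeeze(-1)


class SelectiveS5Sketch(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int,
                 keep_ratio: float = 0.5, momentum: float = 0.99):
        super().__init__()
        self.backbone = backbone                # stands in for an S4-style sequence model
        self.teacher = copy.deepcopy(backbone)  # momentum-updated copy used for guidance
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.mask_gen = MaskGenerator(dim)
        self.keep_ratio = keep_ratio
        self.momentum = momentum

    @torch.no_grad()
    def update_teacher(self):
        # EMA update: teacher <- m * teacher + (1 - m) * student
        for t, s in zip(self.teacher.parameters(), self.backbone.parameters()):
            t.mul_(self.momentum).add_(s, alpha=1.0 - self.momentum)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. flattened spatiotemporal patch tokens
        scores = self.mask_gen(tokens)
        k = max(1, int(self.keep_ratio * tokens.shape[1]))
        idx = scores.topk(k, dim=1).indices                      # keep the k highest-scoring tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        selected = tokens.gather(1, idx)
        return self.backbone(selected)                           # backbone sees fewer tokens


# Toy usage (hypothetical shapes): with nn.Identity() standing in for the backbone,
# a (2, 196, 64) token tensor is reduced to its 98 highest-scoring tokens.
model = SelectiveS5Sketch(backbone=nn.Identity(), dim=64, keep_ratio=0.5)
out = model(torch.randn(2, 196, 64))   # out.shape == (2, 98, 64)
model.update_teacher()                 # EMA step, e.g. after each optimizer update
```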
Author Wang, Pichao
Wang, Jue
Yu, Xiang
Liu, Linda
Omar, Mohamed
Hamid, Raffay
Zhu, Wentao
Author Details – Wang, Jue (juewangn@amazon.com), Amazon Prime Video
– Zhu, Wentao (zhuwent@amazon.com), Amazon Prime Video
– Wang, Pichao (wpichao@amazon.com), Amazon Prime Video
– Yu, Xiang (xiangnyu@amazon.com), Amazon Prime Video
– Liu, Linda (lindliu@amazon.com), Amazon Prime Video
– Omar, Mohamed (omarmk@amazon.com), Amazon Prime Video
– Hamid, Raffay (raffay@amazon.com), Amazon Prime Video
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52729.2023.00618
EISBN 9798350301298
EISSN 2575-7075
EndPage 6397
ExternalDocumentID 10204675
Genre orig-research
PageCount 11
PublicationTitle 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
PublicationTitleAbbrev CVPR
PublicationYear 2023
Publisher IEEE
StartPage 6387
SubjectTerms Adaptation models
Computational modeling
Generators
Predictive models
Robustness
Spatiotemporal phenomena
Transformers
Video: Action and event understanding
Title Selective Structured State-Spaces for Long-Form Video Understanding
URI https://ieeexplore.ieee.org/document/10204675