Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks

Bibliographic Details
Published in Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 1046-1056
Main Authors Materzynska, Joanna; Xiao, Tete; Herzig, Roei; Xu, Huijuan; Wang, Xiaolong; Darrell, Trevor
Format Conference Proceeding
Language English
Published IEEE 01.06.2020
Abstract Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations. In this paper, we study the compositionality of action by looking into the dynamics of subject-object interactions. We propose a novel model which can explicitly reason about the geometric relations between constituent objects and an agent performing an action. To train our model, we collect dense object box annotations on the Something-Something dataset. We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set. The novel aspects of our model are applicable to activities with prominent object interaction dynamics and to objects which can be tracked using state-of-the-art approaches; for activities without clearly defined spatial object-agent interactions, we rely on baseline scene-level spatio-temporal representations. We show the effectiveness of our approach not only on the proposed compositional action recognition task but also in a few-shot compositional setting which requires the model to generalize across both object appearance and action category.
Author Xiao, Tete
Xu, Huijuan
Wang, Xiaolong
Darrell, Trevor
Herzig, Roei
Materzynska, Joanna
Author_xml – sequence: 1
  givenname: Joanna
  surname: Materzynska
  fullname: Materzynska, Joanna
  organization: University of Oxford, TwentyBN
– sequence: 2
  givenname: Tete
  surname: Xiao
  fullname: Xiao, Tete
  organization: UC Berkeley
– sequence: 3
  givenname: Roei
  surname: Herzig
  fullname: Herzig, Roei
  organization: Tel Aviv University
– sequence: 4
  givenname: Huijuan
  surname: Xu
  fullname: Xu, Huijuan
  organization: UC Berkeley
– sequence: 5
  givenname: Xiaolong
  surname: Wang
  fullname: Wang, Xiaolong
  organization: UC Berkeley
– sequence: 6
  givenname: Trevor
  surname: Darrell
  fullname: Darrell, Trevor
  organization: UC Berkeley
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR42600.2020.00113
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 9781728171685
1728171687
EISSN 1063-6919
EndPage 1056
ExternalDocumentID 9156858
Genre orig-research
GroupedDBID 6IE
6IH
6IL
6IN
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IJVOP
OCL
RIE
RIL
RIO
ID FETCH-LOGICAL-i203t-d7b56239f83d447c9caea1676713c00e2de65da21edc622aafb9a647e01480ea3
IEDL.DBID RIE
IngestDate Wed Aug 27 02:30:34 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i203t-d7b56239f83d447c9caea1676713c00e2de65da21edc622aafb9a647e01480ea3
PageCount 11
ParticipantIDs ieee_primary_9156858
PublicationCentury 2000
PublicationDate 2020-Jun
PublicationDateYYYYMMDD 2020-06-01
PublicationDate_xml – month: 06
  year: 2020
  text: 2020-Jun
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2020
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.563481
Snippet Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training...
SourceID ieee
SourceType Publisher
StartPage 1046
SubjectTerms Cognition
Computational modeling
Detectors
Feature extraction
Task analysis
Training
Videos
Title Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks
URI https://ieeexplore.ieee.org/document/9156858
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=Something-Else%3A+Compositional+Action+Recognition+With+Spatial-Temporal+Interaction+Networks&rft.au=Materzynska%2C+Joanna&rft.au=Xiao%2C+Tete&rft.au=Herzig%2C+Roei&rft.au=Xu%2C+Huijuan&rft.date=2020-06-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=1046&rft.epage=1056&rft_id=info:doi/10.1109%2FCVPR42600.2020.00113&rft.externalDocID=9156858