HOTR: End-to-End Human-Object Interaction Detection with Transformers

Bibliographic Details
Published in Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 74-83
Main Authors Kim, Bumsoo; Lee, Junhyun; Kang, Jaewoo; Kim, Eun-Sol; Kim, Hyunwoo J.
Format Conference Proceeding
Language English
Published IEEE 01.06.2021

Abstract Human-Object Interaction (HOI) detection is the task of identifying "a set of interactions" in an image, which involves i) localizing the subject (i.e., humans) and target (i.e., objects) of each interaction, and ii) classifying the interaction labels. Most existing methods address this task indirectly, by detecting human and object instances and then running inference separately on every pair of detected instances. In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image using a transformer encoder-decoder architecture. Through set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing, which is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.
Author Lee, Junhyun
Kim, Eun-Sol
Kim, Hyunwoo J.
Kim, Bumsoo
Kang, Jaewoo
Author_xml – sequence: 1
  givenname: Bumsoo
  surname: Kim
  fullname: Kim, Bumsoo
  email: bumsoo.brain@kakaobrain.com
  organization: Kakao Brain
– sequence: 2
  givenname: Junhyun
  surname: Lee
  fullname: Lee, Junhyun
  email: ljhyun33@korea.ac.kr
  organization: Korea University
– sequence: 3
  givenname: Jaewoo
  surname: Kang
  fullname: Kang, Jaewoo
  email: kangj@korea.ac.kr
  organization: Korea University
– sequence: 4
  givenname: Eun-Sol
  surname: Kim
  fullname: Kim, Eun-Sol
  email: eunsol.kim@kakaobrain.com
  organization: Kakao Brain
– sequence: 5
  givenname: Hyunwoo J.
  surname: Kim
  fullname: Kim, Hyunwoo J.
  email: hyunwoojkim@korea.ac.kr
  organization: Korea University
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR46437.2021.00014
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 1665445092
9781665445092
EISSN 1063-6919
EndPage 83
ExternalDocumentID 9578076
Genre orig-research
GrantInformation_xml – fundername: National Research Foundation
  funderid: 10.13039/501100001321
IEDL.DBID RIE
IngestDate Wed Aug 27 02:24:15 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 10
ParticipantIDs ieee_primary_9578076
PublicationCentury 2000
PublicationDate 2021-June
PublicationDateYYYYMMDD 2021-06-01
PublicationDate_xml – month: 06
  year: 2021
  text: 2021-June
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.6212382
SourceID ieee
SourceType Publisher
StartPage 74
SubjectTerms Benchmark testing
Detectors
Prediction algorithms
Predictive models
Semantics
Training
Transformers
Title HOTR: End-to-End Human-Object Interaction Detection with Transformers
URI https://ieeexplore.ieee.org/document/9578076
hasFullText 1
inHoldings 1