Multi-Level Knowledge Injecting for Visual Commonsense Reasoning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 31, No. 3, pp. 1042-1054
Main Authors: Wen, Zhang; Peng, Yuxin
Format: Journal Article
Language: English
Published: New York: IEEE, 01.03.2021
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Online Access: Get full text

Abstract When glancing at an image, humans can infer what is hidden in the image beyond what is visually obvious, such as objects' functions and people's intents and mental states. However, such a visual reasoning paradigm is tremendously difficult for computers, as it requires knowledge about how the world works. To address this issue, we propose the Commonsense Knowledge based Reasoning Model (CKRM), which acquires external knowledge to support the Visual Commonsense Reasoning (VCR) task, where the computer is expected to answer challenging visual questions. Our key ideas are: (1) To bridge the gap between recognition-level and cognition-level image understanding, we inject external commonsense knowledge via a multi-level knowledge transfer network, achieving cell-level, layer-level and attention-level joint information transfer. It can effectively capture knowledge from different perspectives and perceive human common sense in advance. (2) To further promote image understanding at the cognitive level, we propose a knowledge-based reasoning approach, which can relate the transferred knowledge to visual content and compose the reasoning cues to derive the final answer. Experiments are conducted on the challenging visual commonsense reasoning dataset VCR to verify the effectiveness of our proposed CKRM approach, which significantly improves reasoning performance and achieves state-of-the-art accuracy.
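The attention-level transfer mentioned in idea (1) can be illustrated with an activation-based attention-transfer loss in the style of the attention-transfer work cited in the reference list (ref47). This is a minimal sketch under assumptions, not the paper's actual formulation: the function names, the NumPy implementation, and the specific normalisation are illustrative choices only.

```python
import numpy as np

def attention_map(feat):
    """Collapse a (C, H, W) feature map to a normalised (H, W) spatial
    attention map by summing squared activations over channels."""
    amap = np.sum(feat ** 2, axis=0)
    return amap / (np.linalg.norm(amap) + 1e-12)

def attention_transfer_loss(student_feat, teacher_feat):
    """Mean squared distance between the two normalised attention maps,
    encouraging the student to attend where the teacher attends."""
    diff = attention_map(student_feat) - attention_map(teacher_feat)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 4, 4))
student = rng.standard_normal((8, 4, 4))

print(attention_transfer_loss(teacher, teacher))      # 0.0 (identical maps)
print(attention_transfer_loss(student, teacher) > 0)  # True
```

In a multi-level scheme like the one the abstract describes, such a term would be one of several losses (e.g. cell-level and layer-level analogues) summed into the transfer objective.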
Author Wen, Zhang
Peng, Yuxin
Author_xml – sequence: 1
  givenname: Zhang
  surname: Wen
  fullname: Wen, Zhang
  organization: Wangxuan Institute of Computer Technology, Peking University, Beijing, China
– sequence: 2
  givenname: Yuxin
  orcidid: 0000-0001-7658-3845
  surname: Peng
  fullname: Peng, Yuxin
  email: pengyuxin@pku.edu.cn
  organization: Wangxuan Institute of Computer Technology, Peking University, Beijing, China
CODEN ITCTEM
CitedBy_id crossref_primary_10_1109_TCSVT_2024_3428487
crossref_primary_10_1007_s10462_024_10825_z
crossref_primary_10_1007_s11390_024_4125_1
crossref_primary_10_1016_j_inffus_2023_102000
crossref_primary_10_1109_TCSVT_2023_3326279
crossref_primary_10_1109_TCSVT_2024_3407785
crossref_primary_10_1109_TMM_2023_3279691
crossref_primary_10_1109_TCSVT_2023_3278492
crossref_primary_10_1109_TCSVT_2023_3281507
crossref_primary_10_1142_S0218348X23401333
crossref_primary_10_1016_j_cviu_2024_104165
crossref_primary_10_1016_j_knosys_2025_113214
crossref_primary_10_1007_s11042_022_13428_4
crossref_primary_10_1016_j_neunet_2022_05_008
crossref_primary_10_1109_TCSVT_2024_3382684
crossref_primary_10_1109_TMM_2023_3275874
crossref_primary_10_1016_j_knosys_2023_111153
crossref_primary_10_1109_TCSVT_2023_3284474
crossref_primary_10_1007_s11042_022_12776_5
crossref_primary_10_1109_TMM_2021_3091882
crossref_primary_10_1109_TNNLS_2023_3323491
Cites_doi 10.1109/CVPR.2018.00636
10.1007/978-3-319-46484-8_44
10.24963/ijcai.2017/179
10.1609/aaai.v33i01.33013027
10.1016/j.neunet.2005.06.042
10.1109/CVPR.2017.215
10.1145/219717.219748
10.1007/s11263-016-0966-6
10.1109/CVPR.2019.00688
10.1109/TPAMI.2017.2754246
10.1109/CVPR.2016.540
10.1109/ICCV.2015.169
10.1007/s11263-018-1116-0
10.1109/TPAMI.2016.2577031
10.1145/2701413
10.1109/CVPR.2018.00895
10.1109/CVPR.2018.00807
10.24963/ijcai.2017/263
10.1007/978-3-319-46484-8_2
10.24963/ijcai.2018/126
10.1109/TKDE.2009.191
10.1109/CVPR.2016.90
10.1109/WACV.2019.00036
10.1109/TCSVT.2018.2808685
10.1109/ICCV.2017.285
10.1109/CVPR.2016.499
10.1016/j.cviu.2017.05.001
10.18653/v1/P18-1224
10.1109/ICCV.2017.362
10.18653/v1/D15-1075
10.1007/978-3-642-29694-9_1
10.1007/978-3-030-01261-8_30
10.1109/CVPR.2015.7298965
10.1109/ICCV.2017.322
10.1007/978-3-540-76298-0_52
10.1109/CVPR.2019.00648
10.1109/CVPR.2019.00857
10.18653/v1/D18-1009
10.1109/CVPR.2017.660
10.18653/v1/P18-1043
10.1109/CVPR.2018.00801
10.1007/s11263-016-0981-7
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
DBID 97E
RIA
RIE
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
DOI 10.1109/TCSVT.2020.2991866
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList Technology Research Database

Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISSN 1558-2205
EndPage 1054
ExternalDocumentID 10_1109_TCSVT_2020_2991866
9083951
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61925201; 61771025
  funderid: 10.13039/501100001809
IEDL.DBID RIE
ISSN 1051-8215
IngestDate Sun Jun 29 16:53:39 EDT 2025
Thu Apr 24 23:12:22 EDT 2025
Tue Jul 01 00:41:13 EDT 2025
Wed Aug 27 02:45:36 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ORCID 0000-0001-7658-3845
PQID 2498871567
PQPubID 85433
PageCount 13
ParticipantIDs crossref_citationtrail_10_1109_TCSVT_2020_2991866
proquest_journals_2498871567
crossref_primary_10_1109_TCSVT_2020_2991866
ieee_primary_9083951
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2021-03-01
PublicationDateYYYYMMDD 2021-03-01
PublicationDate_xml – month: 03
  year: 2021
  text: 2021-03-01
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on circuits and systems for video technology
PublicationTitleAbbrev TCSVT
PublicationYear 2021
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
References ref13
ref12
ref15
lu (ref25) 2016
ref14
ref53
ren (ref20) 2015
ref11
ref54
ref17
ref16
ref19
ref18
zhang (ref42) 2015
ref51
ref50
ref46
ref48
ref41
ref44
maccartney (ref10) 2009
ref8
ref7
ref9
ref4
ref3
ref6
devlin (ref49) 2019
ref5
ref40
jialin pan (ref43) 2010; 22
ref34
ref37
ref31
ref30
ref33
ref32
hinton (ref36) 2015
ref2
zagoruyko (ref47) 2017
ref39
ref38
ref24
ref26
veit (ref55) 2016
ref22
ref21
ref28
ref27
ref29
kim (ref52) 2017
yi (ref23) 2018
speer (ref35) 2017
pan (ref45) 2018
simonyan (ref1) 2015
References_xml – ident: ref14
  doi: 10.1109/CVPR.2018.00636
– ident: ref51
  doi: 10.1007/978-3-319-46484-8_44
– ident: ref32
  doi: 10.24963/ijcai.2017/179
– year: 2015
  ident: ref36
  article-title: Distilling the knowledge in a neural network
  publication-title: ArXiv 1503 02531
– ident: ref40
  doi: 10.1609/aaai.v33i01.33013027
– ident: ref50
  doi: 10.1016/j.neunet.2005.06.042
– ident: ref22
  doi: 10.1109/CVPR.2017.215
– ident: ref31
  doi: 10.1145/219717.219748
– ident: ref12
  doi: 10.1007/s11263-016-0966-6
– ident: ref48
  doi: 10.1109/CVPR.2019.00688
– ident: ref28
  doi: 10.1109/TPAMI.2017.2754246
– ident: ref21
  doi: 10.1109/CVPR.2016.540
– ident: ref3
  doi: 10.1109/ICCV.2015.169
– start-page: 1039
  year: 2018
  ident: ref23
  article-title: Neural-symbolic VQA: Disentangling reasoning from vision and language understanding
  publication-title: Proc Neural Inf Process Syst (NeurIPS)
– ident: ref13
  doi: 10.1007/s11263-018-1116-0
– ident: ref4
  doi: 10.1109/TPAMI.2016.2577031
– ident: ref18
  doi: 10.1145/2701413
– ident: ref37
  doi: 10.1109/CVPR.2018.00895
– ident: ref17
  doi: 10.1109/CVPR.2018.00807
– ident: ref44
  doi: 10.24963/ijcai.2017/263
– ident: ref6
  doi: 10.1007/978-3-319-46484-8_2
– start-page: 1
  year: 2015
  ident: ref1
  article-title: Very deep convolutional networks for large-scale image recognition
  publication-title: Proc Int Conf Learn Represent (ICLR)
– ident: ref16
  doi: 10.24963/ijcai.2018/126
– year: 2015
  ident: ref20
  article-title: Image question answering: A visual semantic embedding model and a new dataset
  publication-title: arXiv 1505 02074v1
– volume: 22
  start-page: 1345
  year: 2010
  ident: ref43
  article-title: A survey on transfer learning
  publication-title: IEEE Trans Knowl Data Eng
  doi: 10.1109/TKDE.2009.191
– year: 2016
  ident: ref55
  article-title: Coco-text: Dataset and benchmark for text detection and recognition in natural images
  publication-title: arXiv 1601 07140
– ident: ref2
  doi: 10.1109/CVPR.2016.90
– start-page: 1
  year: 2017
  ident: ref47
  article-title: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer
  publication-title: Proc Int Conf Learn Represent (ICLR)
– ident: ref34
  doi: 10.1109/WACV.2019.00036
– ident: ref7
  doi: 10.1109/TCSVT.2018.2808685
– ident: ref53
  doi: 10.1109/ICCV.2017.285
– ident: ref24
  doi: 10.1109/CVPR.2016.499
– start-page: 289
  year: 2016
  ident: ref25
  article-title: Hierarchical question-image co-attention for visual question answering
  publication-title: Proc Neural Inf Process Syst (NeurIPS)
– ident: ref29
  doi: 10.1016/j.cviu.2017.05.001
– ident: ref33
  doi: 10.18653/v1/P18-1224
– start-page: 4171
  year: 2019
  ident: ref49
  article-title: BERT: Pre-training of deep bidirectional transformers for language understanding
  publication-title: Proc Conf North Amer Chapter Assoc Comput Linguistics Hum Lang Technol (NAACL-HLT)
– start-page: 1394
  year: 2015
  ident: ref42
  article-title: CORPP: Commonsense reasoning and probabilistic planning, as applied to dialog with a mobile robot
  publication-title: Proc AAAI Conf Artif Intell (AAAI)
– start-page: 1
  year: 2017
  ident: ref52
  article-title: Hadamard product for low-rank bilinear pooling
  publication-title: Proc Int Conf Learn Represent (ICLR)
– start-page: 6095
  year: 2018
  ident: ref45
  article-title: Macnet: Transferring knowledge from machine comprehension to sequence-to-sequence models
  publication-title: Proc Neural Inf Process Syst (NeurIPS)
– ident: ref38
  doi: 10.1109/ICCV.2017.362
– ident: ref54
  doi: 10.18653/v1/D15-1075
– year: 2009
  ident: ref10
  article-title: Natural language inference
– ident: ref11
  doi: 10.1007/978-3-642-29694-9_1
– ident: ref46
  doi: 10.1007/978-3-030-01261-8_30
– start-page: 4444
  year: 2017
  ident: ref35
  article-title: ConceptNet 5.5: An open multilingual graph of general knowledge
  publication-title: Proc AAAI Conf Artif Intell (AAAI)
– ident: ref9
  doi: 10.1109/CVPR.2015.7298965
– ident: ref5
  doi: 10.1109/ICCV.2017.322
– ident: ref30
  doi: 10.1007/978-3-540-76298-0_52
– ident: ref26
  doi: 10.1109/CVPR.2019.00648
– ident: ref27
  doi: 10.1109/CVPR.2019.00857
– ident: ref39
  doi: 10.18653/v1/D18-1009
– ident: ref8
  doi: 10.1109/CVPR.2017.660
– ident: ref41
  doi: 10.18653/v1/P18-1043
– ident: ref15
  doi: 10.1109/CVPR.2018.00801
– ident: ref19
  doi: 10.1007/s11263-016-0981-7
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 1042
SubjectTerms Cognition
Image recognition
Information transfer
Knowledge
Knowledge based systems
Knowledge discovery
Knowledge management
knowledge representation
Object recognition
Reasoning
Task analysis
transfer learning
Visual commonsense reasoning
visual question answering
Visual tasks
Visualization
Title Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
URI https://ieeexplore.ieee.org/document/9083951
https://www.proquest.com/docview/2498871567
Volume 31