Multi-Level Knowledge Injecting for Visual Commonsense Reasoning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 31, No. 3, pp. 1042-1054
Main Authors: Wen, Zhang; Peng, Yuxin
Format: Journal Article
Language: English
Published: New York: IEEE, 01.03.2021
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Online Access: Get full text

Abstract When glancing at an image, humans can infer what is hidden in the image beyond what is visually obvious, such as objects' functions and people's intents and mental states. However, such a visual reasoning paradigm is tremendously difficult for computers, as it requires knowledge about how the world works. To address this issue, we propose the Commonsense Knowledge based Reasoning Model (CKRM), which acquires external knowledge to support the Visual Commonsense Reasoning (VCR) task, where the computer is expected to answer challenging visual questions. Our key ideas are: (1) To bridge the gap between recognition-level and cognition-level image understanding, we inject external commonsense knowledge via a multi-level knowledge transfer network, achieving cell-level, layer-level and attention-level joint information transfer. It can effectively capture knowledge from different perspectives and perceive human common sense in advance. (2) To further promote image understanding at the cognitive level, we propose a knowledge-based reasoning approach, which can relate the transferred knowledge to visual content and compose the reasoning cues to derive the final answer. Experiments are conducted on the challenging visual commonsense reasoning dataset VCR to verify the effectiveness of our proposed CKRM approach, which significantly improves reasoning performance and achieves state-of-the-art accuracy.
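The attention-level transfer mentioned in idea (1) can be illustrated with an activation-based attention-transfer loss in the style of the attention-transfer work cited in the reference list (ref47). This is a minimal sketch under assumptions, not the paper's actual formulation: the function names, the NumPy implementation, and the specific normalisation are illustrative choices only.

```python
import numpy as np

def attention_map(feat):
    """Collapse a (C, H, W) feature map to a normalised (H, W) spatial
    attention map by summing squared activations over channels."""
    amap = np.sum(feat ** 2, axis=0)
    return amap / (np.linalg.norm(amap) + 1e-12)

def attention_transfer_loss(student_feat, teacher_feat):
    """Mean squared distance between the two normalised attention maps,
    encouraging the student to attend where the teacher attends."""
    diff = attention_map(student_feat) - attention_map(teacher_feat)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 4, 4))
student = rng.standard_normal((8, 4, 4))

print(attention_transfer_loss(teacher, teacher))      # 0.0 (identical maps)
print(attention_transfer_loss(student, teacher) > 0)  # True
```

In a multi-level scheme like the one the abstract describes, such a term would be one of several losses (e.g. cell-level and layer-level analogues) summed into the transfer objective.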
Author Wen, Zhang
Peng, Yuxin
Author_xml – sequence: 1
  givenname: Zhang
  surname: Wen
  fullname: Wen, Zhang
  organization: Wangxuan Institute of Computer Technology, Peking University, Beijing, China
– sequence: 2
  givenname: Yuxin
  orcidid: 0000-0001-7658-3845
  surname: Peng
  fullname: Peng, Yuxin
  email: pengyuxin@pku.edu.cn
  organization: Wangxuan Institute of Computer Technology, Peking University, Beijing, China
CODEN ITCTEM
CitedBy_id crossref_primary_10_1109_TCSVT_2024_3428487
crossref_primary_10_1007_s10462_024_10825_z
crossref_primary_10_1007_s11390_024_4125_1
crossref_primary_10_1016_j_inffus_2023_102000
crossref_primary_10_1109_TCSVT_2023_3326279
crossref_primary_10_1109_TCSVT_2024_3407785
crossref_primary_10_1109_TMM_2023_3279691
crossref_primary_10_1109_TCSVT_2023_3278492
crossref_primary_10_1109_TCSVT_2023_3281507
crossref_primary_10_1142_S0218348X23401333
crossref_primary_10_1016_j_cviu_2024_104165
crossref_primary_10_1016_j_knosys_2025_113214
crossref_primary_10_1007_s11042_022_13428_4
crossref_primary_10_1016_j_neunet_2022_05_008
crossref_primary_10_1109_TCSVT_2024_3382684
crossref_primary_10_1109_TMM_2023_3275874
crossref_primary_10_1016_j_knosys_2023_111153
crossref_primary_10_1109_TCSVT_2023_3284474
crossref_primary_10_1007_s11042_022_12776_5
crossref_primary_10_1109_TMM_2021_3091882
crossref_primary_10_1109_TNNLS_2023_3323491
Cites_doi 10.1109/CVPR.2018.00636
10.1007/978-3-319-46484-8_44
10.24963/ijcai.2017/179
10.1609/aaai.v33i01.33013027
10.1016/j.neunet.2005.06.042
10.1109/CVPR.2017.215
10.1145/219717.219748
10.1007/s11263-016-0966-6
10.1109/CVPR.2019.00688
10.1109/TPAMI.2017.2754246
10.1109/CVPR.2016.540
10.1109/ICCV.2015.169
10.1007/s11263-018-1116-0
10.1109/TPAMI.2016.2577031
10.1145/2701413
10.1109/CVPR.2018.00895
10.1109/CVPR.2018.00807
10.24963/ijcai.2017/263
10.1007/978-3-319-46484-8_2
10.24963/ijcai.2018/126
10.1109/TKDE.2009.191
10.1109/CVPR.2016.90
10.1109/WACV.2019.00036
10.1109/TCSVT.2018.2808685
10.1109/ICCV.2017.285
10.1109/CVPR.2016.499
10.1016/j.cviu.2017.05.001
10.18653/v1/P18-1224
10.1109/ICCV.2017.362
10.18653/v1/D15-1075
10.1007/978-3-642-29694-9_1
10.1007/978-3-030-01261-8_30
10.1109/CVPR.2015.7298965
10.1109/ICCV.2017.322
10.1007/978-3-540-76298-0_52
10.1109/CVPR.2019.00648
10.1109/CVPR.2019.00857
10.18653/v1/D18-1009
10.1109/CVPR.2017.660
10.18653/v1/P18-1043
10.1109/CVPR.2018.00801
10.1007/s11263-016-0981-7
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
DBID 97E
RIA
RIE
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
DOI 10.1109/TCSVT.2020.2991866
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList Technology Research Database

Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISSN 1558-2205
EndPage 1054
ExternalDocumentID 10_1109_TCSVT_2020_2991866
9083951
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61925201; 61771025
  funderid: 10.13039/501100001809
IEDL.DBID RIE
ISSN 1051-8215
IngestDate Sun Jun 29 16:53:39 EDT 2025
Thu Apr 24 23:12:22 EDT 2025
Tue Jul 01 00:41:13 EDT 2025
Wed Aug 27 02:45:36 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ORCID 0000-0001-7658-3845
PQID 2498871567
PQPubID 85433
PageCount 13
ParticipantIDs crossref_citationtrail_10_1109_TCSVT_2020_2991866
proquest_journals_2498871567
crossref_primary_10_1109_TCSVT_2020_2991866
ieee_primary_9083951
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2021-03-01
PublicationDateYYYYMMDD 2021-03-01
PublicationDate_xml – month: 03
  year: 2021
  text: 2021-03-01
  day: 01
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on circuits and systems for video technology
PublicationTitleAbbrev TCSVT
PublicationYear 2021
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
References ref13
ref12
ref15
lu (ref25) 2016
ref14
ref53
ren (ref20) 2015
ref11
ref54
ref17
ref16
ref19
ref18
zhang (ref42) 2015
ref51
ref50
ref46
ref48
ref41
ref44
maccartney (ref10) 2009
ref8
ref7
ref9
ref4
ref3
ref6
devlin (ref49) 2019
ref5
ref40
jialin pan (ref43) 2010; 22
ref34
ref37
ref31
ref30
ref33
ref32
hinton (ref36) 2015
ref2
zagoruyko (ref47) 2017
ref39
ref38
ref24
ref26
veit (ref55) 2016
ref22
ref21
ref28
ref27
ref29
kim (ref52) 2017
yi (ref23) 2018
speer (ref35) 2017
pan (ref45) 2018
simonyan (ref1) 2015
References_xml – ident: ref14
  doi: 10.1109/CVPR.2018.00636
– ident: ref51
  doi: 10.1007/978-3-319-46484-8_44
– ident: ref32
  doi: 10.24963/ijcai.2017/179
– year: 2015
  ident: ref36
  article-title: Distilling the knowledge in a neural network
  publication-title: ArXiv 1503 02531
– ident: ref40
  doi: 10.1609/aaai.v33i01.33013027
– ident: ref50
  doi: 10.1016/j.neunet.2005.06.042
– ident: ref22
  doi: 10.1109/CVPR.2017.215
– ident: ref31
  doi: 10.1145/219717.219748
– ident: ref12
  doi: 10.1007/s11263-016-0966-6
– ident: ref48
  doi: 10.1109/CVPR.2019.00688
– ident: ref28
  doi: 10.1109/TPAMI.2017.2754246
– ident: ref21
  doi: 10.1109/CVPR.2016.540
– ident: ref3
  doi: 10.1109/ICCV.2015.169
– start-page: 1039
  year: 2018
  ident: ref23
  article-title: Neural-symbolic VQA: Disentangling reasoning from vision and language understanding
  publication-title: Proc Neural Inf Process Syst (NeurIPS)
– ident: ref13
  doi: 10.1007/s11263-018-1116-0
– ident: ref4
  doi: 10.1109/TPAMI.2016.2577031
– ident: ref18
  doi: 10.1145/2701413
– ident: ref37
  doi: 10.1109/CVPR.2018.00895
– ident: ref17
  doi: 10.1109/CVPR.2018.00807
– ident: ref44
  doi: 10.24963/ijcai.2017/263
– ident: ref6
  doi: 10.1007/978-3-319-46484-8_2
– start-page: 1
  year: 2015
  ident: ref1
  article-title: Very deep convolutional networks for large-scale image recognition
  publication-title: Proc Int Conf Learn Represent (ICLR)
– ident: ref16
  doi: 10.24963/ijcai.2018/126
– year: 2015
  ident: ref20
  article-title: Image question answering: A visual semantic embedding model and a new dataset
  publication-title: arXiv 1505 02074v1
– volume: 22
  start-page: 1345
  year: 2010
  ident: ref43
  article-title: A survey on transfer learning
  publication-title: IEEE Trans Knowl Data Eng
  doi: 10.1109/TKDE.2009.191
– year: 2016
  ident: ref55
  article-title: Coco-text: Dataset and benchmark for text detection and recognition in natural images
  publication-title: arXiv 1601 07140
– ident: ref2
  doi: 10.1109/CVPR.2016.90
– start-page: 1
  year: 2017
  ident: ref47
  article-title: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer
  publication-title: Proc Int Conf Learn Represent (ICLR)
– ident: ref34
  doi: 10.1109/WACV.2019.00036
– ident: ref7
  doi: 10.1109/TCSVT.2018.2808685
– ident: ref53
  doi: 10.1109/ICCV.2017.285
– ident: ref24
  doi: 10.1109/CVPR.2016.499
– start-page: 289
  year: 2016
  ident: ref25
  article-title: Hierarchical question-image co-attention for visual question answering
  publication-title: Proc Neural Inf Process Syst (NeurIPS)
– ident: ref29
  doi: 10.1016/j.cviu.2017.05.001
– ident: ref33
  doi: 10.18653/v1/P18-1224
– start-page: 4171
  year: 2019
  ident: ref49
  article-title: BERT: Pre-training of deep bidirectional transformers for language understanding
  publication-title: Proc Conf North Amer Chapter Assoc Comput Linguistics Hum Lang Technol (NAACL-HLT)
– start-page: 1394
  year: 2015
  ident: ref42
  article-title: CORPP: Commonsense reasoning and probabilistic planning, as applied to dialog with a mobile robot
  publication-title: Proc AAAI Conf Artif Intell (AAAI)
– start-page: 1
  year: 2017
  ident: ref52
  article-title: Hadamard product for low-rank bilinear pooling
  publication-title: Proc Int Conf Learn Represent (ICLR)
– start-page: 6095
  year: 2018
  ident: ref45
  article-title: Macnet: Transferring knowledge from machine comprehension to sequence-to-sequence models
  publication-title: Proc Neural Inf Process Syst (NeurIPS)
– ident: ref38
  doi: 10.1109/ICCV.2017.362
– ident: ref54
  doi: 10.18653/v1/D15-1075
– year: 2009
  ident: ref10
  article-title: Natural language inference
– ident: ref11
  doi: 10.1007/978-3-642-29694-9_1
– ident: ref46
  doi: 10.1007/978-3-030-01261-8_30
– start-page: 4444
  year: 2017
  ident: ref35
  article-title: ConceptNet 5.5: An open multilingual graph of general knowledge
  publication-title: Proc AAAI Conf Artif Intell (AAAI)
– ident: ref9
  doi: 10.1109/CVPR.2015.7298965
– ident: ref5
  doi: 10.1109/ICCV.2017.322
– ident: ref30
  doi: 10.1007/978-3-540-76298-0_52
– ident: ref26
  doi: 10.1109/CVPR.2019.00648
– ident: ref27
  doi: 10.1109/CVPR.2019.00857
– ident: ref39
  doi: 10.18653/v1/D18-1009
– ident: ref8
  doi: 10.1109/CVPR.2017.660
– ident: ref41
  doi: 10.18653/v1/P18-1043
– ident: ref15
  doi: 10.1109/CVPR.2018.00801
– ident: ref19
  doi: 10.1007/s11263-016-0981-7
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 1042
SubjectTerms Cognition
Image recognition
Information transfer
Knowledge
Knowledge based systems
Knowledge discovery
Knowledge management
knowledge representation
Object recognition
Reasoning
Task analysis
transfer learning
Visual commonsense reasoning
visual question answering
Visual tasks
Visualization
Title Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
URI https://ieeexplore.ieee.org/document/9083951
https://www.proquest.com/docview/2498871567
Volume 31