Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
When glancing at an image, humans can infer what is hidden in the image beyond what is visually obvious, such as objects' functions, people's intents and mental states. However, such a visual reasoning paradigm is tremendously difficult for computers, requiring knowledge about how the world...
Published in | IEEE Transactions on Circuits and Systems for Video Technology, Vol. 31, No. 3, pp. 1042–1054 |
Main Authors | Wen, Zhang; Peng, Yuxin |
Format | Journal Article |
Language | English |
Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2021 |
Subjects | Visual commonsense reasoning; visual question answering; knowledge representation; transfer learning; Cognition; Image recognition; Reasoning |
Abstract | When glancing at an image, humans can infer what is hidden in the image beyond what is visually obvious, such as objects' functions, people's intents and mental states. However, such a visual reasoning paradigm is tremendously difficult for computers, as it requires knowledge about how the world works. To address this issue, we propose the Commonsense Knowledge based Reasoning Model (CKRM), which acquires external knowledge to support the Visual Commonsense Reasoning (VCR) task, where the computer is expected to answer challenging visual questions. Our key ideas are: (1) To bridge the gap between recognition-level and cognition-level image understanding, we inject external commonsense knowledge via a multi-level knowledge transfer network, achieving cell-level, layer-level and attention-level joint information transfer. It can effectively capture knowledge from different perspectives and acquire human common sense in advance. (2) To further promote image understanding at the cognitive level, we propose a knowledge based reasoning approach, which relates the transferred knowledge to visual content and composes the reasoning cues to derive the final answer. Experiments on the challenging visual commonsense reasoning dataset VCR verify the effectiveness of our proposed CKRM approach, which significantly improves reasoning performance and achieves state-of-the-art accuracy. |
Authors | Wen, Zhang (Wangxuan Institute of Computer Technology, Peking University, Beijing, China); Peng, Yuxin (Wangxuan Institute of Computer Technology, Peking University, Beijing, China; email: pengyuxin@pku.edu.cn) |
CODEN | ITCTEM |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
DOI | 10.1109/TCSVT.2020.2991866 |
DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
Discipline | Engineering |
EISSN | 1558-2205 |
EndPage | 1054 |
ExternalDocumentID | 10_1109_TCSVT_2020_2991866 9083951 |
Genre | orig-research |
GrantInformation | National Natural Science Foundation of China, grants 61925201 and 61771025 (funder ID: 10.13039/501100001809) |
ISSN | 1051-8215 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
ORCID | 0000-0001-7658-3845 |
PQID | 2498871567 |
PQPubID | 85433 |
PageCount | 13 |
PublicationDate | 2021-03-01 |
PublicationPlace | New York |
PublicationTitle | IEEE transactions on circuits and systems for video technology |
PublicationTitleAbbrev | TCSVT |
PublicationYear | 2021 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 1042 |
SubjectTerms | Cognition; Image recognition; Information transfer; Knowledge; Knowledge based systems; Knowledge discovery; Knowledge management; knowledge representation; Object recognition; Reasoning; Task analysis; transfer learning; Visual commonsense reasoning; visual question answering; Visual tasks; Visualization |
Title | Multi-Level Knowledge Injecting for Visual Commonsense Reasoning |
URI | https://ieeexplore.ieee.org/document/9083951 https://www.proquest.com/docview/2498871567 |
Volume | 31 |