VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

Bibliographic Details
Main Authors: Wang, Chao; Zhang, Chunbai; Tian, Yongxiao; Zhou, Yang; Peng, Yan
Format: Journal Article
Language: English
Published: 02.09.2025
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.00711
DOI: 10.48550/arxiv.2502.00711
Copyright: http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract: Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability, while being hindered by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase the question with underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks. Moreover, VIKSER achieves performance on par with leading proprietary models, such as the latest ChatGPT-5.
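
The abstract outlines a four-stage pipeline: fine-grained visual knowledge extraction via visual relationship detection, paraphrasing of the underspecified question, Chain-of-Evidence (CoE) prompting, and self-reflection. The following is a minimal Python sketch of that control flow only, under stated assumptions; the stage implementations are hypothetical callables supplied by the caller, not the authors' released code.

from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) visual relationship


def vikser_answer(
    image: object,
    question: str,
    detect_relations: Callable[[object], List[Triple]],                 # assumed stand-in: fine-grained visual knowledge
    paraphrase: Callable[[str, List[Triple]], str],                     # assumed stand-in: resolves underspecification
    chain_of_evidence: Callable[[str, List[Triple]], Tuple[str, str]],  # assumed stand-in: answer plus cited evidence
    evidence_ok: Callable[[str, str], bool],                            # assumed stand-in: self-reflection check
    max_reflections: int = 2,
) -> Tuple[str, str]:
    """Run the stages named in the abstract, retrying while reflection rejects the evidence."""
    relations = detect_relations(image)                   # 1. visual relationship detection
    clear_question = paraphrase(question, relations)      # 2. paraphrase the underspecified question
    answer, evidence = chain_of_evidence(clear_question, relations)  # 3. evidence-backed reasoning
    for _ in range(max_reflections):                      # 4. self-reflection on mistakes
        if evidence_ok(answer, evidence):
            break
        answer, evidence = chain_of_evidence(clear_question, relations)
    return answer, evidence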