VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

Bibliographic Details
Main Authors: Wang, Chao; Zhang, Chunbai; Tian, Yongxiao; Zhou, Yang; Peng, Yan
Format: Journal Article
Language: English
Published: 02.09.2025
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.00711
DOI: 10.48550/arxiv.2502.00711
Copyright: http://arxiv.org/licenses/nonexclusive-distrib/1.0

Abstract: Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability, while being hindered by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes fine-grained visual knowledge to paraphrase the question with underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks. Moreover, VIKSER achieves performance on par with leading proprietary models, such as the latest ChatGPT-5.
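
The abstract outlines a four-stage pipeline: fine-grained visual knowledge extraction via visual relationship detection, paraphrasing of the underspecified question, Chain-of-Evidence (CoE) prompting, and self-reflection. The following is a minimal Python sketch of that control flow only, under stated assumptions; the stage implementations are hypothetical callables supplied by the caller, not the authors' released code.

from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) visual relationship


def vikser_answer(
    image: object,
    question: str,
    detect_relations: Callable[[object], List[Triple]],                 # assumed stand-in: fine-grained visual knowledge
    paraphrase: Callable[[str, List[Triple]], str],                     # assumed stand-in: resolves underspecification
    chain_of_evidence: Callable[[str, List[Triple]], Tuple[str, str]],  # assumed stand-in: answer plus cited evidence
    evidence_ok: Callable[[str, str], bool],                            # assumed stand-in: self-reflection check
    max_reflections: int = 2,
) -> Tuple[str, str]:
    """Run the stages named in the abstract, retrying while reflection rejects the evidence."""
    relations = detect_relations(image)                   # 1. visual relationship detection
    clear_question = paraphrase(question, relations)      # 2. paraphrase the underspecified question
    answer, evidence = chain_of_evidence(clear_question, relations)  # 3. evidence-backed reasoning
    for _ in range(max_reflections):                      # 4. self-reflection on mistakes
        if evidence_ok(answer, evidence):
            break
        answer, evidence = chain_of_evidence(clear_question, relations)
    return answer, evidence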