VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework
Main Authors | Wang, Chao; Zhang, Chunbai; Tian, Yongxiao; Zhou, Yang; Peng, Yan |
---|---|
Format | Journal Article |
Language | English |
Published | 02.09.2025 |
Subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
Online Access | https://arxiv.org/abs/2502.00711 |
DOI | 10.48550/arxiv.2502.00711 |
Abstract | Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and hindered by the phenomenon of underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes this fine-grained visual knowledge to paraphrase underspecified questions. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks. Moreover, VIKSER achieves performance on par with leading proprietary models, such as the latest ChatGPT-5. |
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
Source | arXiv.org (Open Access Repository) |
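The abstract describes a pipeline of fine-grained visual knowledge extraction, paraphrasing of underspecified questions, Chain-of-Evidence (CoE) prompting, and self-reflection. The snippet below is a minimal, hypothetical sketch of how a CoE-style prompt with a self-reflection retry loop could be wired up; the prompt wording, the `ask_model` and `verify` hooks, and the function names are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: a Chain-of-Evidence style prompt plus a self-reflection
# retry, loosely following the components named in the abstract. The templates and
# the user-supplied callables are assumptions, not code from the VIKSER paper.
from typing import Callable, Sequence

COE_TEMPLATE = (
    "Facts extracted from the image (fine-grained visual knowledge):\n{facts}\n\n"
    "Question (paraphrased to remove underspecification): {question}\n\n"
    "Answer the question. For every reasoning step, cite the fact(s) it relies on "
    "as evidence, then state the final answer on the last line."
)

REFLECT_TEMPLATE = (
    "Your previous answer was judged incorrect:\n{previous}\n\n"
    "Reflect on which cited evidence was misused or missing, then produce a revised, "
    "evidence-backed answer. State the final answer on the last line."
)


def chain_of_evidence_answer(
    ask_model: Callable[[str], str],   # user-supplied VLM/LLM text interface
    visual_facts: Sequence[str],       # e.g. relation triples from a visual relationship detector
    question: str,
    verify: Callable[[str], bool],     # user-supplied answer check driving self-reflection
    max_reflections: int = 2,
) -> str:
    """Ask for an evidence-cited answer; on failure, prompt the model to self-reflect."""
    facts = "\n".join(f"- {fact}" for fact in visual_facts)
    answer = ask_model(COE_TEMPLATE.format(facts=facts, question=question))
    for _ in range(max_reflections):
        if verify(answer):
            break
        answer = ask_model(REFLECT_TEMPLATE.format(previous=answer))
    return answer


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model access.
    demo = chain_of_evidence_answer(
        ask_model=lambda prompt: "Evidence: fact 1 (man holding leash). Final answer: walking a dog",
        visual_facts=["man holding leash", "leash attached to dog"],
        question="What is the man doing?",
        verify=lambda ans: "dog" in ans.lower(),
    )
    print(demo)
```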