De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

Bibliographic Details
Published in	arXiv.org
Main Authors	Gao, Minghe; Li, Juncheng; Fei, Hao; Pang, Liang; Ji, Wei; Wang, Guoming; Lv, Zheqi; Zhang, Wenqiao; Tang, Siliang; Zhuang, Yueting
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 05.08.2024
Subjects	Benders decomposition; Feedback; Reasoning; Task complexity; Visual tasks

Summary:	Visual programming is a modular, generalizable paradigm that combines different modules and Python operators to solve a variety of vision-language tasks. Unlike end-to-end models, which require task-specific data, it performs visual processing and reasoning in an unsupervised manner. Current visual programming methods generate the program for each task in a single pass and lack the ability to evaluate and optimize that program based on feedback, which limits their effectiveness on complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more robust programs. Moreover, viewing each feedback module as an independent agent offers fresh prospects for agent research.
ISSN:	2331-8422
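
Illustrative sketch (not from the paper): the summary describes decomposing a task into subtasks and refining each generated program through automatic feedback. The Python below is a minimal, hypothetical rendering of that control flow only. Every name in it (refine_with_feedback, toy_decompose, toy_generate, toy_evaluate) is an assumption made for illustration, and the toy functions stand in for the models and feedback modules that a real system would use; it is not De-fine's implementation.

```python
from typing import Callable, List, Tuple


def refine_with_feedback(
    task: str,
    decompose: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[str]:
    """Return one program per subtask, revised until it passes or rounds run out."""
    programs = []
    for subtask in decompose(task):
        feedback: List[str] = []  # feedback messages collected for this subtask
        program = generate(subtask, feedback)
        for _ in range(max_rounds):
            ok, message = evaluate(program)
            if ok:
                break
            # Feed the evaluator's message back into the next generation attempt.
            feedback.append(message)
            program = generate(subtask, feedback)
        programs.append(program)
    return programs


# Hypothetical toy stand-ins so the loop can run without any model.
def toy_decompose(task: str) -> List[str]:
    return [part.strip() for part in task.split(" then ")]


def toy_generate(subtask: str, feedback: List[str]) -> str:
    body = f"step({subtask!r})"
    # The first draft omits `return`; any feedback triggers the corrected form.
    return f"return {body}" if feedback else body


def toy_evaluate(program: str) -> Tuple[bool, str]:
    if program.startswith("return "):
        return True, ""
    return False, "program must return a value"


result = refine_with_feedback(
    "find the dog then count its legs",
    toy_decompose,
    toy_generate,
    toy_evaluate,
)
```

In this toy run each subtask needs exactly one round of feedback before its program passes the check. Passing decompose, generate, and evaluate as arguments keeps the loop independent of any particular model, which loosely mirrors the model-agnostic framing in the summary.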