Multimodal feature fusion and exploitation with dual learning and reinforcement learning for recipe generation


Bibliographic Details
Published in Applied Soft Computing Vol. 126; p. 109281
Main Authors Zhang, Mengyang, Tian, Guohui, Gao, Huanbing, Liu, Shaopeng, Zhang, Ying
Format Journal Article
Language English
Published Elsevier B.V. 01.09.2022

More Information
Summary: Recipes are long paragraphs organized by a cooking logic. Generating recipes from images and food names is a more challenging task in VQA (Visual Question Answering) due to the gap between images and texts. Although multimodal feature fusion, a typical solver in VQA, is adopted in most situations to enhance accuracy, fused features obtained in this way can hardly provide guidance for keeping the logic in the produced texts. In this paper, ingredients are introduced to enhance the relationship between food images and recipes, since they reflect the cooking logic to a great extent, and dual learning is adopted to provide a complementary view by reconstructing ingredients from the produced recipes. To fully exploit ingredients for producing effective recipes, ingredients are fused with images and food names through an attention mechanism in the forward flow, while in the backward flow a reconstructor is designed to reproduce ingredients from recipes. In addition, reinforcement learning is employed to guide ingredient reconstruction so that effective features in the fused information are preserved explicitly. Extensive experiments demonstrate that more attention is allocated to producing effective recipes, and an ablation study shows the rationality of the different components of the proposed method.
•The problem of logic loss in sequential information generation is addressed in this paper.
•Multimodal features in both the forward and backward flows are exploited efficiently.
•Dual learning provides a complementary view to enhance the connection between inputs and outputs.
•Reinforcement learning ties multimodal feature exploitation together in a specific manner.
•Extensive experiments and ablation studies demonstrate the superiority of our method.
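
Illustrative sketch (not taken from the paper): one plausible way to realize the forward attention-based fusion of ingredient features with image and food-name features, and the backward ingredient reconstructor used by the dual flow, written in PyTorch. All module names, feature dimensions, and the ingredient-vocabulary size are assumptions for illustration only; the reinforcement-learning reward mentioned in the closing comment is likewise a hypothetical coupling, not the authors' exact formulation.

# Hypothetical sketch of the forward fusion and backward reconstruction flows.
import torch
import torch.nn as nn

class IngredientFusion(nn.Module):
    """Forward flow: a query built from image and food-name features attends
    over per-ingredient features to produce a fused context for the recipe decoder."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_feat, name_feat, ingr_feats):
        # img_feat, name_feat: (B, dim); ingr_feats: (B, N_ingredients, dim)
        query = self.proj(torch.cat([img_feat, name_feat], dim=-1)).unsqueeze(1)
        fused, attn_weights = self.attn(query, ingr_feats, ingr_feats)
        return fused.squeeze(1), attn_weights  # fused context and attention map

class IngredientReconstructor(nn.Module):
    """Backward flow: predict the ingredient set back from the hidden states
    of the generated recipe (multi-label classification)."""
    def __init__(self, dim=512, num_ingredients=1000):  # vocabulary size is illustrative
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.cls = nn.Linear(dim, num_ingredients)

    def forward(self, recipe_states):
        # recipe_states: (B, T, dim) hidden states of the produced recipe
        pooled = self.pool(recipe_states.transpose(1, 2)).squeeze(-1)
        return self.cls(pooled)  # multi-label ingredient logits

# A reinforcement-learning signal could then be derived, for example, from the
# overlap (e.g., F1) between reconstructed and ground-truth ingredient sets and
# used as a policy-gradient reward for the recipe decoder; this is one common
# way to couple the dual reconstruction with the generation objective.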
ISSN: 1568-4946
1872-9681
DOI: 10.1016/j.asoc.2022.109281