Multimodal feature fusion and exploitation with dual learning and reinforcement learning for recipe generation


Bibliographic Details
Published in Applied Soft Computing Vol. 126; p. 109281
Main Authors Zhang, Mengyang, Tian, Guohui, Gao, Huanbing, Liu, Shaopeng, Zhang, Ying
Format Journal Article
Language English
Published Elsevier B.V. 01.09.2022

More Information
Summary: Recipes are long paragraphs organized by a cooking logic. Generating recipes from images and food names is a more challenging task in VQA (Visual Question Answering) due to the gap between images and texts. Although multimodal feature fusion, a typical solver in VQA, is adopted in most situations to enhance accuracy, fused features obtained in this way can hardly provide guidance for keeping the logic in the produced texts. In this paper, ingredients are introduced to enhance the relationship between food images and recipes, since they reflect the cooking logic to a great extent, and dual learning is adopted to provide a complementary view by reconstructing ingredients from the produced recipes. To fully exploit ingredients for producing effective recipes, ingredients are fused with images and food names through an attention mechanism in the forward flow, while in the backward flow a reconstructor is designed to reproduce ingredients from recipes. In addition, reinforcement learning is employed to guide ingredient reconstruction so that effective features in the fused information are preserved explicitly. Extensive experiments demonstrate that more attention is allocated to producing effective recipes, and an ablation study shows the rationality of the different components of the proposed method.
•The problem of logic loss in sequential information generation is addressed in this paper.
•Multimodal features in both the forward and backward flows are exploited efficiently.
•Dual learning provides a complementary view to enhance the connection between inputs and outputs.
•Reinforcement learning ties multimodal feature exploitation together in a specific manner.
•Extensive experiments and ablation studies demonstrate the superiority of our method.
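
Illustrative sketch (not taken from the paper): one plausible way to realize the forward attention-based fusion of ingredient features with image and food-name features, and the backward ingredient reconstructor used by the dual flow, written in PyTorch. All module names, feature dimensions, and the ingredient-vocabulary size are assumptions for illustration only; the reinforcement-learning reward mentioned in the closing comment is likewise a hypothetical coupling, not the authors' exact formulation.

# Hypothetical sketch of the forward fusion and backward reconstruction flows.
import torch
import torch.nn as nn

class IngredientFusion(nn.Module):
    """Forward flow: a query built from image and food-name features attends
    over per-ingredient features to produce a fused context for the recipe decoder."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_feat, name_feat, ingr_feats):
        # img_feat, name_feat: (B, dim); ingr_feats: (B, N_ingredients, dim)
        query = self.proj(torch.cat([img_feat, name_feat], dim=-1)).unsqueeze(1)
        fused, attn_weights = self.attn(query, ingr_feats, ingr_feats)
        return fused.squeeze(1), attn_weights  # fused context and attention map

class IngredientReconstructor(nn.Module):
    """Backward flow: predict the ingredient set back from the hidden states
    of the generated recipe (multi-label classification)."""
    def __init__(self, dim=512, num_ingredients=1000):  # vocabulary size is illustrative
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.cls = nn.Linear(dim, num_ingredients)

    def forward(self, recipe_states):
        # recipe_states: (B, T, dim) hidden states of the produced recipe
        pooled = self.pool(recipe_states.transpose(1, 2)).squeeze(-1)
        return self.cls(pooled)  # multi-label ingredient logits

# A reinforcement-learning signal could then be derived, for example, from the
# overlap (e.g., F1) between reconstructed and ground-truth ingredient sets and
# used as a policy-gradient reward for the recipe decoder; this is one common
# way to couple the dual reconstruction with the generation objective.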
ISSN: 1568-4946
1872-9681
DOI: 10.1016/j.asoc.2022.109281