Multimodal feature fusion and exploitation with dual learning and reinforcement learning for recipe generation
Published in: Applied Soft Computing, Vol. 126, p. 109281
Main Authors: , , , ,
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.09.2022
Summary: Recipes are long paragraphs organized by a cooking logic. Generating recipes from images and food names is more challenging than typical VQA (Visual Question Answering) tasks because of the gap between images and texts. Although multimodal feature fusion, a typical solution in VQA, is adopted in most situations to enhance accuracy, the fused features obtained this way can hardly guide the preservation of logic in the produced texts. In this paper, ingredients are introduced to strengthen the relationship between food images and recipes, since they reflect the cooking logic to a great extent, and dual learning is adopted to provide a complementary view by reconstructing ingredients from the produced recipes. To fully exploit ingredients for producing effective recipes, they are fused with images and food names via an attention mechanism in the forward flow, while in the backward flow a reconstructor is designed to reproduce ingredients from recipes. In addition, reinforcement learning is employed to guide ingredient reconstruction so that effective features in the fused information are preserved explicitly. Extensive experiments demonstrate that more attention is allocated to producing effective recipes, and ablation studies show the rationality of the different components of the proposed method.
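The forward flow described in the summary fuses ingredient representations into the image and food-name features with an attention mechanism. Below is a minimal NumPy sketch of one plausible scaled dot-product fusion; the shapes, the mean-pooling, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_ingredients(image_feats, ingredient_embs):
    """Attention-fuse ingredient embeddings into image region features.

    image_feats: (R, d) region features from the food image.
    ingredient_embs: (K, d) embeddings of the K ingredients.
    Each ingredient attends over the R image regions; the attended
    contexts are pooled and concatenated with the pooled image feature.
    (Pooling choice is an assumption for illustration.)
    """
    d = image_feats.shape[1]
    scores = ingredient_embs @ image_feats.T / np.sqrt(d)  # (K, R)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    context = weights @ image_feats                        # (K, d)
    fused = np.concatenate([image_feats.mean(axis=0), context.mean(axis=0)])
    return fused, weights
```

The attention weights indicate which image regions each ingredient draws on, which is what lets the fused feature carry ingredient-level cooking logic into the decoder.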
Highlights:
- The problem of logic loss in sequential information generation is addressed.
- Multimodal features in both the forward and backward flows are exploited efficiently.
- Dual learning provides a complementary view that strengthens the connection between inputs and outputs.
- Reinforcement learning associates multimodal feature exploitation in a specific manner.
- Extensive experiments and ablation studies demonstrate the superiority of the method.
ISSN: 1568-4946, 1872-9681
DOI: 10.1016/j.asoc.2022.109281
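The backward flow rewards the generator when ingredients can be reconstructed from the produced recipe. The sketch below shows one common way such a signal could be wired in: a REINFORCE-style gradient scaled by a set-level ingredient F1 reward. The reward definition and the baseline are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def ingredient_f1(pred, gold):
    """Set-level F1 between reconstructed and ground-truth ingredients,
    used here as the scalar reward for one sampled recipe."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def reinforce_grad(log_probs, reward, baseline):
    """REINFORCE gradient scale per token: -(reward - baseline) * log p.

    log_probs: per-token log-probabilities of the sampled recipe.
    A higher-than-baseline reconstruction reward pushes the generator
    toward recipes whose ingredients can be recovered."""
    advantage = reward - baseline
    return -advantage * np.asarray(log_probs)
```

In practice the baseline is often the reward of a greedy-decoded recipe (self-critical training), which keeps the gradient variance low without a learned value function.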