Leveraging Efficient Training and Feature Fusion in Transformers for Multimodal Classification

Bibliographic Details
Published in: 2023 IEEE International Conference on Image Processing (ICIP), pp. 1420-1424
Main Authors: Ak, Kenan Emir; Lee, Gwang-Gook; Xu, Yan; Shen, Mingwei
Format: Conference Proceeding
Language: English
Published: IEEE, 08.10.2023

Summary: People navigate a world that involves many different modalities and make decisions based on what they observe. Many of the classification problems we face in the modern digital world are also multimodal in nature: textual information on the web rarely occurs alone and is often accompanied by images, sounds, or videos. Transformers have proven highly effective across deep learning tasks; however, how best to model the relationship between different modalities remains unclear. This paper investigates ways to simultaneously apply self-attention over both the text and vision modalities. We propose a novel architecture that combines the strengths of both modalities and show that combining a text model with a fixed image model yields the best classification performance. Additionally, we incorporate a late fusion technique to enhance the architecture's ability to capture multiple modalities. Our experiments demonstrate that the proposed method outperforms state-of-the-art baselines on the Food101, MM-IMDB, and FashionGen datasets.
DOI: 10.1109/ICIP49359.2023.10223098
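
The abstract describes pairing a trainable text transformer with a fixed (frozen) image model and fusing their outputs late in the network for classification. Below is a minimal PyTorch sketch of that general idea, assuming a BERT text encoder, a frozen ViT image encoder, and a simple concatenation-based fusion head; the specific model checkpoints, pooling choice, and classifier head are illustrative assumptions, not the authors' exact architecture.

# Minimal late-fusion sketch: trainable text encoder + frozen image encoder.
# Model names and the fusion head are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class LateFusionClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        # Freeze the image model; only the text encoder and fusion head are trained.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.image_encoder.config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        with torch.no_grad():
            image_out = self.image_encoder(pixel_values=pixel_values)
        # Late fusion: concatenate the pooled [CLS] representations of each modality.
        text_feat = text_out.last_hidden_state[:, 0]
        image_feat = image_out.last_hidden_state[:, 0]
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)

Because the image encoder receives no gradients, only the text encoder and the small fusion head are updated during training, which keeps the trainable parameter count close to that of a text-only model while still exploiting visual features.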