Learning Multimodal Representations by Symmetrically Transferring Local Structures


Bibliographic Details
Published in: Symmetry (Basel), Vol. 12, No. 9, p. 1504
Main Authors: Dong, Bin; Jian, Songlei; Lu, Kai
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 01.09.2020
Summary: Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary information across the modalities, such as the intra-modal local structures. In other words, they focus only on object-level alignment and ignore structure-level alignment. To tackle this problem, we propose a novel symmetric multimodal representation learning framework, MTLS, which transfers local structures across different modalities. A customized soft metric learning strategy and an iterative parameter learning process are designed to symmetrically transfer local structures and enhance the cluster structures in intra-modal representations. A bidirectional retrieval loss based on multi-layer neural networks is used to align the two modalities. MTLS is instantiated with image and text data and shows superior performance on image-text retrieval and image clustering. MTLS outperforms state-of-the-art multimodal learning methods by up to 32% in terms of R@1 on text-image retrieval and 16.4% in terms of AMI on clustering.
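The abstract mentions a bidirectional retrieval loss built on multi-layer neural networks to align the image and text spaces. As a rough illustration only, and not the authors' formulation, the sketch below shows a generic bidirectional hinge ranking loss of the kind commonly used for image-text alignment; the function name, the margin value, the use of cosine similarity, and the omission of MTLS's structure-transfer and soft metric learning terms are all assumptions.

```python
# Illustrative sketch of a generic bidirectional (image->text and text->image)
# hinge ranking loss for aligning two modalities in a shared space.
# NOT the MTLS loss from the paper; names and margin are assumptions.
import torch
import torch.nn.functional as F

def bidirectional_retrieval_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) embeddings of paired images and texts."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()            # cosine similarities, shape (B, B)
    pos = scores.diag().view(-1, 1)           # similarity of each matched pair
    # image -> text: every non-matching caption competes against the matched one
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # text -> image: every non-matching image competes against the matched one
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)  # ignore the positive pairs themselves
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()

# Example: random tensors standing in for image/text encoder outputs
loss = bidirectional_retrieval_loss(torch.randn(8, 128), torch.randn(8, 128))
```

In the paper this alignment objective is combined with the soft metric learning and iterative parameter learning components that transfer intra-modal local structures; those terms are not reproduced here.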
ISSN: 2073-8994
DOI: 10.3390/sym12091504