Learning Multimodal Representations by Symmetrically Transferring Local Structures
| Published in | Symmetry (Basel), Vol. 12, No. 9, p. 1504 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | Basel: MDPI AG, 01.09.2020 |
| Subjects | |
Summary: Multimodal representations play an important role in multimodal learning tasks such as cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building a single common space by aligning different modalities and ignore the complementary information across modalities, such as intra-modal local structures; in other words, they address only object-level alignment and neglect structure-level alignment. To tackle this problem, we propose MTLS, a novel symmetric multimodal representation learning framework that transfers local structures across modalities. A customized soft metric learning strategy and an iterative parameter learning process are designed to symmetrically transfer local structures and enhance the cluster structures in the intra-modal representations. A bidirectional retrieval loss based on multi-layer neural networks is used to align the two modalities. MTLS is instantiated with image and text data and shows superior performance on image-text retrieval and image clustering, outperforming state-of-the-art multimodal learning methods by up to 32% in R@1 on text-image retrieval and 16.4% in AMI on clustering.
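The abstract does not spell out the exact form of the bidirectional retrieval loss. A common formulation in image-text retrieval is a hinge-based triplet ranking loss applied symmetrically in both directions (image-to-text and text-to-image). The sketch below is a minimal PyTorch illustration of that idea under this assumption; the class name, margin value, and embedding dimensions are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalRankingLoss(nn.Module):
    """Hinge-based triplet loss applied in both retrieval directions
    (image-to-text and text-to-image) over a mini-batch.
    Illustrative sketch only; not the paper's exact loss."""

    def __init__(self, margin=0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb, txt_emb):
        # Cosine similarity matrix between all image/text pairs in the batch.
        img_emb = F.normalize(img_emb, dim=1)
        txt_emb = F.normalize(txt_emb, dim=1)
        scores = img_emb @ txt_emb.t()            # (B, B)
        pos = scores.diag().view(-1, 1)           # scores of matching pairs

        # Image -> text: each image should score its own caption higher
        # than any other caption in the batch by at least `margin`.
        cost_i2t = (self.margin + scores - pos).clamp(min=0)
        # Text -> image: the symmetric constraint.
        cost_t2i = (self.margin + scores - pos.t()).clamp(min=0)

        # Mask out the diagonal (the positive pairs themselves).
        mask = torch.eye(scores.size(0), dtype=torch.bool,
                         device=scores.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0)
        cost_t2i = cost_t2i.masked_fill(mask, 0)
        return cost_i2t.sum() + cost_t2i.sum()


# Toy usage: embeddings produced by two modality-specific networks.
if __name__ == "__main__":
    img = torch.randn(8, 128)   # hypothetical image embeddings
    txt = torch.randn(8, 128)   # hypothetical text embeddings
    loss = BidirectionalRankingLoss(margin=0.2)(img, txt)
    print(loss.item())
```

Summing the two hinge terms penalizes misranked pairs in both retrieval directions, which is one standard way to realize the symmetric alignment the abstract describes.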
ISSN: 2073-8994
DOI: 10.3390/sym12091504