Learning Multimodal VAEs through Mutual Supervision
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 23.06.2021 |
Subjects | |
Online Access | Get full text |
Summary | Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g. vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing -- something that most existing approaches either cannot handle, or do so to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image-image) and CUB (image-text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data. |
DOI | 10.48550/arxiv.2106.12570 |
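For context on the "explicit products" the summary contrasts MEME against, the sketch below is a hypothetical illustration (not taken from the paper or its code) of how prior multimodal VAEs typically fuse per-modality diagonal-Gaussian posteriors in the recognition model via a product of experts; MEME is described as avoiding exactly this kind of explicit combination by instead letting the modalities supervise each other.

```python
# Hypothetical sketch of the product-of-experts (PoE) fusion used by many
# prior multimodal VAEs: each modality's encoder outputs a diagonal
# Gaussian q_m(z | x_m), and the joint posterior is taken to be their
# (renormalised) product, which is again Gaussian.
import numpy as np

def product_of_gaussian_experts(mus, variances):
    """Fuse diagonal-Gaussian experts into a single Gaussian.

    mus, variances: arrays of shape (num_experts, latent_dim).
    Under a product of Gaussians, precisions add and the mean is the
    precision-weighted average of the expert means.
    """
    precisions = 1.0 / variances                 # per-expert precisions
    joint_precision = precisions.sum(axis=0)     # precisions add under a product
    joint_var = 1.0 / joint_precision
    joint_mu = (precisions * mus).sum(axis=0) * joint_var
    return joint_mu, joint_var

# Toy usage: fuse an "image" posterior and a "text" posterior over a
# 4-dimensional latent (made-up numbers, purely for illustration).
mu_img, var_img = np.zeros(4), np.full(4, 0.5)
mu_txt, var_txt = np.ones(4), np.full(4, 2.0)
mu_joint, var_joint = product_of_gaussian_experts(
    np.stack([mu_img, mu_txt]), np.stack([var_img, var_txt]))
print(mu_joint, var_joint)
```

Note that this fusion requires both modalities' encoders to be evaluated jointly, which is part of why such approaches handle entirely missing modalities only to a limited extent, whereas the mutual-supervision formulation summarised above sidesteps the explicit combination altogether.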