The Surprising Effectiveness of Representation Learning for Visual Imitation
Format: Journal Article
Language: English
Published: 02.12.2021
Summary: While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task-specific priors, or large, hard-to-train parametric models. One reason such complexities arise is that standard visual imitation frameworks try to solve two coupled problems at once: learning a succinct but good representation from the diverse visual data while simultaneously learning to associate the demonstrated actions with such representations. This joint learning creates an interdependence between the two problems, which often means large amounts of demonstrations are needed for learning. To address this challenge, we instead propose to decouple representation learning from behavior learning for visual imitation. First, we learn a visual representation encoder from offline data using standard supervised and self-supervised learning methods. Once the representations are trained, we use non-parametric Locally Weighted Regression to predict the actions. We experimentally show that this simple decoupling improves the performance of visual imitation models on both offline demonstration datasets and real-robot door opening compared to prior work in visual imitation. All of our generated data, code, and robot videos are publicly available at https://jyopari.github.io/VINN/.
DOI: 10.48550/arxiv.2112.01511
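The decoupled pipeline the summary describes boils down to two steps: embed observations with a frozen, separately trained visual encoder, then predict actions non-parametrically with a locally weighted regression over the nearest demonstration embeddings. The sketch below is a minimal illustration under assumptions, not the authors' released implementation: `demo_embeddings` and `demo_actions` are hypothetical arrays of pre-computed demonstration embeddings and their paired actions, `query_embedding` is the encoder output for the current observation, and the softmax weighting over negative Euclidean distances is one common choice of locally weighted kernel.

```python
import numpy as np

def predict_action(query_embedding, demo_embeddings, demo_actions, k=5):
    """Non-parametric action prediction via locally weighted regression.

    Assumes all embeddings come from a frozen visual encoder trained offline
    (e.g., with self-supervised learning); nothing here is learned jointly
    with action prediction.
    """
    # Euclidean distance from the query to every demonstration embedding.
    dists = np.linalg.norm(demo_embeddings - query_embedding, axis=1)

    # Indices of the k closest demonstration frames.
    nn_idx = np.argsort(dists)[:k]

    # Softmax over negative distances: closer neighbors get larger weights.
    weights = np.exp(-dists[nn_idx])
    weights /= weights.sum()

    # Distance-weighted average of the neighbors' demonstrated actions.
    return weights @ demo_actions[nn_idx]
```

In this sketch, setting k=1 reduces to copying the action of the single closest demonstration frame, while larger k smooths the prediction across several neighbors.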