LipNet: End-to-End Sentence-level Lipreading
Main Authors: Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas
Format: Journal Article
Language: English
Published: 05.11.2016
Summary: Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy in the sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
DOI: 10.48550/arxiv.1611.01599
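The pipeline described in the abstract (spatiotemporal convolutions feeding a recurrent network, trained end-to-end with the connectionist temporal classification loss) can be sketched as follows. This is a minimal illustrative sketch assuming PyTorch; the `LipNetSketch` name, layer sizes, and pooling choices are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    """Sketch of the abstract's pipeline: spatiotemporal convolutions ->
    recurrent network -> per-frame character distributions trained with
    CTC. Layer sizes here are illustrative, not the paper's."""

    def __init__(self, vocab_size=28):  # e.g. 26 letters + space + CTC blank
        super().__init__()
        # 3D (spatiotemporal) convolutions over (time, height, width);
        # stride 1 in time preserves the frame count for CTC alignment.
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 8)),  # keep time, shrink space
        )
        # Recurrent network over the frame sequence (the paper uses Bi-GRUs).
        self.gru = nn.GRU(64 * 4 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, vocab_size)

    def forward(self, video):  # video: (batch, 3, time, height, width)
        feats = self.stcnn(video)                      # (B, C, T, 4, 8)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)                       # (B, T, 512)
        return self.fc(out).log_softmax(-1)            # (B, T, vocab)

model = LipNetSketch()
video = torch.randn(2, 3, 75, 50, 100)                # two 75-frame clips
log_probs = model(video).permute(1, 0, 2)              # (T, B, vocab) for CTCLoss
targets = torch.randint(1, 28, (2, 20))                # dummy character labels
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 75),
    target_lengths=torch.full((2,), 20),
)
```

The CTC loss is what enables sentence-level prediction: the network emits a character distribution per frame, and training aligns that per-frame sequence with the unsegmented target sentence, so no frame-level labels are required.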