Using Lip Reading Recognition to Predict Daily Mandarin Conversation

Bibliographic Details
Published in: IEEE Access, Vol. 10, pp. 53481-53489
Main Authors: Haq, Muhamad Amirul; Ruan, Shanq-Jang; Cai, Wen-Jie; Li, Lieber Po-Hung
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
More Information
Summary: Audio-based automatic speech recognition as a hearing aid is susceptible to background noise and overlapping speech. Consequently, audio-visual speech recognition has been developed to complement the audio input with additional visual information. However, major improvements in neural networks for visual tasks have made lip reading frameworks robust and reliable enough to recognize speech from visual input alone. In this work, we propose a lip reading recognition model to predict daily Mandarin conversation and collect a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset, consisting of 1,000 videos of 100 daily conversations spoken by ten speakers. Our model consists of a spatiotemporal convolution layer, an SE-ResNet-18 network, and a back-end module composed of bi-directional gated recurrent unit (Bi-GRU), 1D convolution, and fully-connected layers. The model reaches 94.2% accuracy on the DMCLR dataset, a level of performance that makes practical Mandarin lip reading applications feasible. Additionally, we achieve 86.6% and 57.2% accuracy on Lip Reading in the Wild (LRW) and LRW-1000 (Mandarin), respectively. These results show that our method achieves state-of-the-art performance on these two challenging datasets.
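To make the described pipeline concrete, the sketch below is a minimal PyTorch rendering of such a front-end/back-end lip reading model, not the authors' implementation: a 3D spatiotemporal convolution, a per-frame 2D backbone (a plain ResNet-18 standing in for the paper's SE-ResNet-18), and a Bi-GRU / 1D-convolution / fully-connected back-end. The hidden width, frame count, crop size, and 100-class output are illustrative assumptions drawn from the abstract, not from the paper's actual configuration.

```python
# Minimal sketch of a front-end/back-end lip reading model as described in the
# abstract. Not the authors' code; hyperparameters are illustrative guesses.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class LipReadingNet(nn.Module):
    def __init__(self, num_classes: int = 100, hidden: int = 256):
        super().__init__()
        # Spatiotemporal front-end: grayscale mouth crops -> 64 feature channels.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame 2D backbone (plain ResNet-18 here; the paper uses SE-ResNet-18).
        backbone = resnet18(weights=None)
        backbone.conv1 = nn.Identity()   # front-end already produced 64 channels
        backbone.bn1 = nn.Identity()
        backbone.relu = nn.Identity()
        backbone.maxpool = nn.Identity()
        backbone.fc = nn.Identity()      # keep the 512-d pooled feature
        self.backbone = backbone
        # Temporal back-end: Bi-GRU -> 1D convolution -> classifier.
        self.gru = nn.GRU(512, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width) grayscale mouth-region clips.
        x = self.frontend(x)                           # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into batch
        feats = self.backbone(x).view(b, t, -1)        # per-frame 512-d features
        seq, _ = self.gru(feats)                       # (B, T, 2*hidden)
        seq = self.temporal_conv(seq.transpose(1, 2)).transpose(1, 2)
        return self.fc(seq.mean(dim=1))                # average over time, classify


if __name__ == "__main__":
    # Example: a batch of 2 clips, 29 frames of 88x88 mouth crops (assumed sizes).
    model = LipReadingNet(num_classes=100)
    dummy = torch.randn(2, 1, 29, 88, 88)
    print(model(dummy).shape)  # torch.Size([2, 100])
```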
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3175867