Multi modal spatio temporal co-trained CNNs with single modal testing on RGB–D based sign language gesture recognition

Bibliographic Details
Published in Journal of Computer Languages (Online), Vol. 52, pp. 88-102
Main Authors Ravi, Sunitha; Suman, Maloji; Kishore, P.V.V.; Kumar E, Kiran; Kumar M, Teja Kiran; Kumar D, Anil
Format Journal Article
Language English
Published Elsevier Ltd, 01.06.2019
Summary:
•An RGB-D based Indian sign language recognition model is developed.
•RGB and depth data are used to train the proposed convolutional neural network with a data-sharing architecture.
•A four-stream CNN architecture with a multi-modal data-sharing mechanism is proposed.
•Uni-modal testing is performed on the constructed CNN with only RGB video data.
•Results show that the proposed CNN can handle missing depth data during testing with only RGB video data.

Extracting hand movements from a single RGB video camera is a necessary attribute of an automated sign language recognition system. Local spatio-temporal methods have shown encouraging outcomes for hand extraction using color cues. However, color intensities do not behave as an independent entity during video capture in real environments, which has become a roadblock in developing sign language machine translators that process video data in real-world conditions. Not surprisingly, recognition is more accurate when additional information is provided in the form of depth. In this paper, we make use of a multi-modal feature-sharing mechanism with a four-stream convolutional neural network (CNN) for RGB-D based sign language recognition. Unlike multi-stream CNNs, where the output class prediction is based on two or three independently operated modal streams because of scale variations, we propose a feature-sharing multi-stream CNN on multi-modal data. The proposed four-stream CNN divides its inputs into two data groupings for the training and testing spaces. The training space uses four inputs: RGB spatial data in the main stream, and depth spatial, RGB temporal, and depth temporal data in the region-of-interest mapping (ROIM) stream. The testing space uses only RGB spatial and RGB temporal data for prediction from the trained model. The ROIM stream shares the multi-modal data to generate ROI maps of the human subject, which are used to regulate the feature maps in the RGB stream. The scale variation between the streams is managed by translating the depth map to fit the RGB data. Sharing multi-modal features with RGB spatial features during training circumvents overfitting on RGB video data. To validate the proposed CNN architecture, the accuracy of the classifier is investigated on RGB-D sign language data and three benchmark action datasets. The results show remarkable behaviour of the classifier in handling the missing depth modality during testing. The robustness of the system against state-of-the-art action recognition methods is studied using contrasting datasets.
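The abstract describes the architecture only at a high level, so the following is a minimal, hypothetical sketch (PyTorch, not the authors' published code) of a four-stream CNN in which a depth-derived ROI map regulates the RGB feature maps during training, while testing runs on the RGB streams alone. All layer sizes, module names, and the gating scheme are illustrative assumptions.

# Hypothetical sketch of a feature-sharing four-stream CNN; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Small convolutional unit shared by all four streams.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class FourStreamSharedCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Main stream: RGB spatial frames.
        self.rgb_spatial = conv_block(3, 32)
        # ROIM stream inputs: depth spatial, RGB temporal, depth temporal.
        self.depth_spatial = conv_block(1, 32)
        self.rgb_temporal = conv_block(3, 32)
        self.depth_temporal = conv_block(1, 32)
        # 1x1 conv fuses the three ROIM feature maps into a single soft ROI map.
        self.roi_fuse = nn.Conv2d(32 * 3, 1, kernel_size=1)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)
        )

    def forward(self, rgb, rgb_flow, depth=None, depth_flow=None):
        feat = self.rgb_spatial(rgb)
        if depth is not None and depth_flow is not None:
            # Training space: resize ("translate") depth maps to the RGB scale,
            # then let the ROIM stream produce a map that regulates the RGB features.
            depth = F.interpolate(depth, size=rgb.shape[-2:], mode="bilinear",
                                  align_corners=False)
            depth_flow = F.interpolate(depth_flow, size=rgb.shape[-2:],
                                       mode="bilinear", align_corners=False)
            roim = torch.cat([self.depth_spatial(depth),
                              self.rgb_temporal(rgb_flow),
                              self.depth_temporal(depth_flow)], dim=1)
            roi_map = torch.sigmoid(self.roi_fuse(roim))
            feat = feat * roi_map  # feature sharing: ROI map gates the RGB feature maps
        else:
            # Testing space: only RGB spatial and RGB temporal data are available.
            feat = feat * torch.sigmoid(
                self.rgb_temporal(rgb_flow).mean(dim=1, keepdim=True))
        return self.classifier(feat)

In this sketch the ROI map acts as a soft attention mask over the RGB features, so omitting the depth inputs at test time leaves the RGB pathway intact, which is consistent with the abstract's claim that the trained model tolerates the missing depth modality.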
ISSN: 2590-1184
DOI: 10.1016/j.cola.2019.04.002